Shard query with duplicated documents causes inaccurate pagination
When we have duplicated documents (same uniqueID) among the shards, the query results can be non-deterministic; this is a known issue. The consequence when we display the search results on our UI page with pagination is: if the user clicks 'last page', an empty page can be displayed, since the total doc count returned by the query is not accurate (it apparently includes the dups). Is there a known workaround for this problem? We tried the following 2 approaches, but each of them has a problem:

1) Use a query like:

  curl -d 'q=*:*&fl=message_id&rows=1&start=1999' http://[hostname]:8080/mywebapp/shards/[coreid]/select

Since I am using a very large number for the 'rows', it will return the accurate doc count, but it takes about 20 seconds to run this query for an average customer with a little over 1 million rows returned, so the performance is not acceptable.

2) Use a facet query:

  curl -d 'q=*:*&fl=message_id&facet=true&facet.mincount=2&rows=0&facet.field=message_id&indent=on' http://[hostname]:8080/[mywebapp]/shards/[coreid]/select

Our tests show this might not return accurate doc counts from time to time.

Any suggestions on the best workaround to get an accurate doc count for a sharded query with dups, one that is also efficient on a large data set? Thanks, Jie
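For reference, a minimal SolrJ 4.x sketch (the endpoint URL is an assumption) of the usual cheap way to read a hit count without fetching documents; note that across shards with duplicate uniqueKeys this numFound is exactly the inflated total described above:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;

  SolrServer server = new HttpSolrServer("http://localhost:8080/mywebapp/shards/core0"); // hypothetical URL
  SolrQuery q = new SolrQuery("*:*");
  q.setRows(0); // fetch no documents, just read the (dup-inflated) numFound
  long total = server.query(q).getResults().getNumFound();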
Re: Selectively hiding SOLR facets.
You could use the facet.mincount parameter. The default value is 0; setting it to N requires a facet value to appear at least N times in the result set. Iker

2014-04-29 4:56 GMT+02:00 atuldj.jadhav atuldj.jad...@gmail.com: Yes, but with my query country:USA it is returning me languages belonging to countries other than USA. Is there any way I can avoid such languages appearing in my facet filters?

--
/** @author imartinez */
Person me = new Developer();
me.setName("Iker Mtz de Apellaniz Anzuola");
me.setTwit("@mitxino77 https://twitter.com/mitxino77");
me.setLocations({"St Cugat", "Barcelona", "Kanpezu", "Euskadi", "*", "World"});
me.setSkills({"SoftwareDeveloper", "Curious", "AmateurCook"});
return me;
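A minimal SolrJ 4.x sketch of this suggestion (the field name "language" is an assumption):

  SolrQuery q = new SolrQuery("country:USA");
  q.setFacet(true);
  q.addFacetField("language");
  q.setFacetMinCount(1); // default is 0, which also returns values with zero hits in the result set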
Re: Selectively hiding SOLR facets.
How would you know it if you did this manually? Solr does not know that Dutch is not valid for USA; you need to give it some sort of signal. One way would be to have a dynamic field for the facet which includes the country name, so you have language_USA, language_Belgium, etc. Then, when you query country:USA, you also facet against language_USA. You could even get super fancy with field aliases and other tricks, though this may get messy if you do need it for each country. Regards, Alex. Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency

On Tue, Apr 29, 2014 at 9:56 AM, atuldj.jadhav atuldj.jad...@gmail.com wrote: Yes, but with my query country:USA it is returning me languages belonging to countries other than USA. Is there any way I can avoid such languages appearing in my facet filters?
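A rough SolrJ 4.x sketch of this per-country field idea; the language_USA field name follows the hypothetical naming scheme above:

  SolrQuery q = new SolrQuery("*:*");
  q.addFilterQuery("country:USA");   // restrict to USA documents
  q.setFacet(true);
  q.addFacetField("language_USA");   // facet only on the USA-specific language field
  q.setFacetMinCount(1);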
How to reduce enumerating docs
Hi all, My doc has two fields, namely length and fingerprint, which stand for the length and the text of the doc. I have a custom SearchComponent that enumerates all the docs for a given term to search the fingerprint. That can be very slow, because the number of docs is very large and the per-doc operation is time-consuming. Since I only care about the docs whose length is within a close range around the length specified in the query, what's the right way to accelerate this? Thanks

  DocsEnum docsEnum = sub_reader.termDocsEnum(term);
  if (docsEnum == null) {
    continue;
  }
  while ((doc = docsEnum.nextDoc()) != DocsEnum.NO_MORE_DOCS) {
    // do something expensive
  }
Stored vs non-stored very large text fields
Dear reader, I'm trying to use Solr for a hierarchical search: metadata from the higher-level elements is copied to the lower ones, and each element has the complete OCR text which belongs to it. At volume level, of course, we will have the complete OCR text in one doc, and we need to store it for highlighting. My Solr instance is started like this:

  java -Xms12000m -Xmx12000m -jar start.jar

[ imported with 4.7.0, performance tests with 4.8.0 ]

Solr index files are of this size:

  0.013gb  .tip  The index into the Term Dictionary
  0.017gb  .nvd  Encodes length and boost factors for docs and fields
  0.546gb  .tim  The term dictionary, stores term info
  1.332gb  .doc  Contains the list of docs which contain each term along with frequency
  4.943gb  .pos  Stores position information about where a term occurs in the index
  12.743gb .tvd  Contains information about each document that has term vectors
  17.340gb .fdt  The stored fields for documents (ocr)

Configuring the ocr field as non-stored, I get these performance measures (see docs/s) after warmup:

  jb@serv7:~ perl solr-performance.pl zeit 6 http://127.0.0.1:58983/solr/collection1/select ?wt=json q={%21q.op%3dAND}ocr%3A%28zeit%29 fq=mashed_b%3Afalse fl=id sort=sort_name_s asc,id+asc rows=100
  time: 3.96 s bytes: 1.878 MB
  64768 docs found; got 64768 docs
  16353 docs/s; 0.474 MB/s

... and with ocr stored, even when _not_ requesting ocr with fl=..., with the documentCache disabled (<documentCache class="solr.LRUCache" ... />) and <enableLazyFieldLoading>false</enableLazyFieldLoading> [ with documentCache and enableLazyFieldLoading the results are even worse ] ...

... using solr-4.7.0 and ubuntu12.04 openjdk7 (...u51):

  jb@serv7:~ perl solr-performance.pl zeit 6 http://127.0.0.1:58983/solr/collection1/select ?wt=json q={%21q.op%3dAND}ocr%3A%28zeit%29 fq=mashed_b%3Afalse fl=id sort=sort_name_s asc,id+asc rows=100
  time: 61.58 s bytes: 1.878 MB
  64768 docs found; got 64768 docs
  1052 docs/s; 0.030 MB/s

... using solr-4.8.0 and oracle-jdk1.7.0_55:

  jb@serv7:~ perl solr-performance.pl zeit 6 http://127.0.0.1:58983/solr/collection1/select ?wt=json q={%21q.op%3dAND}ocr%3A%28zeit%29 fq=mashed_b%3Afalse fl=id sort=sort_name_s asc,id+asc rows=100
  time: 58.80 s bytes: 1.878 MB
  64768 docs found; got 64768 docs
  1102 docs/s; 0.032 MB/s

Is there any reason why stored vs non-stored is 16 times slower? Is there a way to store the ocr field in a separate index or something like this? Kind regards, J. Barth -- J. Barth * IT, Universitaetsbibliothek Heidelberg * 06221 / 54-2580 pgp public key: http://digi.ub.uni-heidelberg.de/barth%40ub.uni-heidelberg.de.asc
Re: How to reduce enumerating docs
Can't you just specify the length range as a filter query? If your length type is tint/tlong, Solr already has optimized code (trie fields indexed at multiple precision steps) to efficiently filter through the numbers. Regards, Alex. Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency

On Tue, Apr 29, 2014 at 3:23 PM, 郑华斌 huabin.zh...@qq.com wrote: [snip]
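As a sketch, this filter query from a SolrJ 4.x client, assuming length is a tint field and a +/-10 window around a target length of 130:

  SolrQuery q = new SolrQuery("fingerprint:abc");   // hypothetical fingerprint term
  q.addFilterQuery("length:[120 TO 140]");          // trie-range filter, computed and cached separately from q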
Re: Stored vs non-stored very large text fields
Couple of random thoughts: 1) The latest (4.8) Solr has support for nested documents, as well as for the expand component. Maybe that will let you have a more efficient architecture: http://heliosearch.org/expand-block-join/ 2) Do you return the OCR text to the client, or just search it? If you just search it, you don't need to store it. 3) If you do need to store it and return it, do you always have to return it? If not, you could look at lazy-loading the field (a setting in solrconfig.xml). 4) Is the OCR text or image? The stored fields are compressed by default; I wonder if the compression/decompression of a large image is an issue. 5) JDK 8 apparently makes Lucene much happier (speed of some operations). Might be something to test if all else fails. Regards, Alex. Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency

On Tue, Apr 29, 2014 at 3:28 PM, Jochen Barth ba...@ub.uni-heidelberg.de wrote: [snip]
Apache Solr - Pdf Indexing.
Hi Team, I am indexing PDFs using Apache Solr 3.6, passing around 3000 keywords using the OR operator, and I am able to get the files containing the keywords. Kindly guide me on how to get the list of matching keywords for each .PDF file. Note: In schema.xml I have declared a unique tag "id". Thanks & Regards, Vignesh.V Ninestars Information Technologies Limited., 72, Greams Road, Thousand Lights, Chennai - 600 006. India. Landline : +91 44 2829 4226 / 36 / 56 X: 144 www.ninestars.in
Re: How to reduce enumerating docs
Will the filter query execute before or after my custom search component? In fact, that is what I care about: for example, if the following docsEnum would contain 1M docs for term "aterm" without the filter query, will it contain fewer than 1M docs when the filter query is present?

  DocsEnum docsEnum = sub_reader.termDocsEnum(aterm);

------------------ Original ------------------ From: Alexandre Rafalovitch arafa...@gmail.com Send time: Tuesday, Apr 29, 2014 5:13 PM To: solr-user solr-user@lucene.apache.org Subject: Re: How to reduce enumerating docs [snip]
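If the component has to keep enumerating on its own, one alternative, sketched against the Lucene 4.x FieldCache API (the 10-unit window and the targetLength variable are made up): the enumeration stays the same, but out-of-range docs are skipped before the expensive work.

  import org.apache.lucene.search.FieldCache;

  // Cached per-doc length values for the segment (assumes a numeric "length" field).
  FieldCache.Ints lengths = FieldCache.DEFAULT.getInts(sub_reader, "length", false);
  DocsEnum docsEnum = sub_reader.termDocsEnum(term);
  if (docsEnum != null) {
    while ((doc = docsEnum.nextDoc()) != DocsEnum.NO_MORE_DOCS) {
      if (Math.abs(lengths.get(doc) - targetLength) > 10) {
        continue; // outside the close range: skip the expensive part
      }
      // do something expensive
    }
  }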
Re: Apache Solr - Pdf Indexing.
Your question is not terribly clear. Are you having trouble indexing PDFs in general? Try the tutorial, and specifically look for the extract handler. Or have you already got the PDFs into the system, but your 3000-keyword query does not match them? In that case it might just be that PDF extraction is limited by definition. Try to have the extracted content stored (not just indexed) and see whether the extracted text matches your expectations. Otherwise, rephrase the question: say what you expected, what you got instead, and where you are stuck. Regards, Alex. Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency

On Tue, Apr 29, 2014 at 4:23 PM, vignesh vignes...@ninestars.in wrote: [snip]
Re: Stored vs non-stored very large text fields
Am 29.04.2014 11:19, schrieb Alexandre Rafalovitch:

> Couple of random thoughts: 1) The latest (4.8) Solr has support for nested documents, as well as for the expand component. Maybe that will let you have a more efficient architecture: http://heliosearch.org/expand-block-join/

Yes, I've seen this, but as far as I understood you have to know on which nesting level you do your query. My search should work on any level, say:

  volume            title: 1986
    chapter 1.1       author: marc
      chapter 1.1.3     title: (does not matter)
        chapter 1.1.3.1   title: abc
        chapter 1.1.3.2   title: xyz

should match by querying +author:marc +title:abc // or // +author:marc +title:xyz, but // not // +title:abc +title:xyz (we'll have an unknown number of levels).

> 2) Do you return OCR text to the client? Or just search it? If just search it, you don't need to store it

I'll want to get highlighted snippets.

> 3) If you do need to store it and return it, do you always have to return it? If not, you could look at lazy-loading the field (setting in solrconfig.xml).

Let's see; perhaps this is a sorting problem which could be solved by setting the id and sort_... fields to docValues=true.

> 4) Is OCR text or image? The stored fields are compressed by default, I wonder if the compression/decompression of a large image is an issue.

Text.

> 5) JDK 8 apparently makes Lucene much happier (speed of some operations). Might be something to test if all else fails.

Ok... Thanks, J. Barth

[snip]

-- J. Barth * IT, Universitaetsbibliothek Heidelberg * 06221 / 54-2580 pgp public key: http://digi.ub.uni-heidelberg.de/barth%40ub.uni-heidelberg.de.asc
Re: Stored vs non-stored very large text fields
BTW, regarding stored field compression: are all stored fields within a document put into one compressed chunk, or compressed on a per-field basis? Kind regards, J. Barth

[snip]

-- J. Barth * IT, Universitaetsbibliothek Heidelberg * 06221 / 54-2580 pgp public key: http://digi.ub.uni-heidelberg.de/barth%40ub.uni-heidelberg.de.asc
Re: Apache Solr - Pdf Indexing.
On Apr 29, 2014 2:52 PM, vignesh vignes...@ninestars.in wrote: [snip]

What do you mean? Do you want Solr search results in a PDF file? Why would a search engine provide such functionality? You can take the Solr XML/JSON results and generate a PDF yourself if you need that. Regards, Gora
Re: Delete fields from document using a wildcard
Thanks, Alex, for the input. Let me provide a better example of what I'm trying to achieve. I have documents like this:

  <doc>
    <field name="id">100</field>
    <field name="2_1600_i">1</field>
    <field name="5_1601_i">5</field>
    <field name="112_1602_i">7</field>
  </doc>

The schema looks the usual way:

  <dynamicField name="*_i" type="int" indexed="true" stored="true"/>

The dynamic field pattern I'm using is this: <id>_<day>_i. Each day I want to add new fields for the current day and remove the fields for the oldest one:

  <add><doc>
    <field name="id">100</field>
    <!-- add fields for current day -->
    <field name="251_1603_i" update="set">25</field>
    <!-- remove fields for oldest day -->
    <field name="2_1600_i" update="set" null="true">1</field>
  </doc></add>

The problem is, I don't know the exact names of the fields I want to remove. All I know is that they end in *_1600_i. When removing fields from a document, I want to avoid querying SOLR to see which fields are actually present for the specific document; in this way, hopefully, I can speed up the process. Querying to see the schema.xml is not going to help me much, since the field is defined as a dynamic field *_i. This makes me think that expanding the documents client-side is not the best way to do it. Regarding the second approach, expanding the documents server-side: I took a look over the SOLR code and came upon the UpdateRequestProcessor.java class, which has this interesting javadoc: "This is a good place for subclassed update handlers to process the document before it is indexed. You may wish to add/remove fields or check if the requested user is allowed to update the given document..." As you can imagine, I have no expertise in SOLR code. How would you say it would be possible to retrieve the document and its fields for the given id, and update the update/delete command to include the fields that match the pattern I'm giving (e.g. *_1600_i)? Thanks, Costi

On Tue, Apr 29, 2014 at 6:41 AM, Alexandre Rafalovitch arafa...@gmail.com wrote: Not out of the box, as far as I know. A custom UpdateRequestProcessor could possibly do some sort of expansion of the field name by verifying the actual schema; I'm not sure if the API supports that level of flexibility. Or, for the latest Solr, you can request the list of known field names via REST and do client-side expansion instead. Regards, Alex. Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency

On Tue, Apr 29, 2014 at 12:20 AM, Costi Muraru costimur...@gmail.com wrote: Hi guys, would it be possible, using Atomic Updates in SOLR4, to remove all fields matching a pattern? For instance something like:

  <add><doc>
    <field name="id">100</field>
    <field name="*_name_i" update="set" null="true"/>
  </doc></add>

Or something similar to remove certain fields in all documents. Thanks, Costi
Apache Solr - Pdf Indexing.
Hi Team, I am indexing PDFs using Apache Solr 3.6, passing around 3000 keywords using the OR operator (gardens OR flowers OR time OR train OR trees OR etc.), and I am able to get the files containing these keywords. But not every .PDF file will contain all the keywords; some may contain (gardens, flowers and time) and some may contain only (trees). Kindly guide me on how to get the list of keywords matching every file. For example (required output):

  Id: xyz.pdf  Matching Keywords : gardens, flowers, time.
  Id: abc.pdf  Matching Keywords : train, trees.
  Id: ghi.pdf  Matching Keywords : train, trees, time.

Thanks & Regards, Vignesh.V Ninestars Information Technologies Limited., 72, Greams Road, Thousand Lights, Chennai - 600 006. India. Landline : +91 44 2829 4226 / 36 / 56 X: 144 www.ninestars.in
Solr does not recognize language
Dear all, I'm a new user of Solr. I've managed to index a bunch of documents (in fact, they are tweets) and everything works quite smoothly. Nevertheless, it looks like Solr doesn't detect the language of my documents, nor remove stopwords accordingly, so that I can extract the most frequent terms. I've added this piece of XML to my solrconfig.xml, as well as the Tika lib jars:

  <updateRequestProcessorChain name="langid">
    <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
      <lst name="defaults">
        <str name="langid.fl">text</str>
        <str name="langid.langField">lang</str>
      </lst>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

There is no error in the tomcat log file, so I have no clue why this isn't working. Any hint on how to solve this problem will be much appreciated!
Re: Stored vs non-stored very large text fields
On 4/29/2014 4:20 AM, Jochen Barth wrote:
> BTW, regarding stored field compression: are all stored fields within a document put into one compressed chunk, or compressed on a per-field basis?

Here's the issue that added the compression to Lucene: https://issues.apache.org/jira/browse/LUCENE-4226 It was made the default stored field format for Lucene, which also made it the default for Solr. At this time, there is no way to turn off compression in Solr without writing custom code. I filed an issue to make it configurable, but I don't know how to do it, and nobody else has offered a solution either. One day I might find some time to take a look at the issue and see if I can solve it myself. https://issues.apache.org/jira/browse/SOLR-4375 Here's the author's blog post, which goes into more detail than the LUCENE issue: http://blog.jpountz.net/post/33247161884/efficient-compressed-stored-fields-with-lucene Thanks, Shawn
Re: Solr does not recognize language
Hi, did you attach your chain to an UpdateRequestHandler? You can do it by adding update.chain=langid to the URL, or by defining it in a defaults section as follows:

  <lst name="defaults">
    <str name="update.chain">langid</str>
  </lst>

On Tuesday, April 29, 2014 3:18 PM, Victor Pascual vic...@mobilemediacontent.com wrote: [snip]
Re: Solr does not recognize language
Hi Ahmet, thanks for your reply. Adding update.chain=langid to my query doesn't work: IP:8080/solr/select/?q=*%3A*&update.chain=langid Regarding defining the chain in an UpdateRequestHandler... sorry for the lame question, but shall I paste those three lines into solrconfig.xml, or shall I add them somewhere else? There is no UpdateRequestHandler in my solrconfig. Thanks!

On Tue, Apr 29, 2014 at 3:13 PM, Ahmet Arslan iori...@yahoo.com wrote: [snip]
Re: Delete fields from document using a wildcard
On 4/29/2014 5:25 AM, Costi Muraru wrote:
> The problem is, I don't know the exact names of the fields I want to remove. All I know is that they end in *_1600_i. When removing fields from a document, I want to avoid querying SOLR to see which fields are actually present for the specific document; in this way, hopefully, I can speed up the process. Querying to see the schema.xml is not going to help me much, since the field is defined as a dynamic field *_i. This makes me think that expanding the documents client-side is not the best way to do it.

Unfortunately, at this time you'll have to query the document, go through the list of fields to determine which need to be deleted, and then build a request that deletes them. I don't know how hard this would be to implement in Solr itself. Getting it implemented might require a bunch of people standing up and saying "we want this!" Thanks, Shawn
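For what it's worth, a client-side sketch of that flow in SolrJ 4.x (the core URL, document id, and day suffix are illustrative assumptions): fetch the document, collect the field names matching the pattern, and send one atomic update that sets each of them to null.

  import java.util.HashMap;
  import java.util.Map;
  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrDocument;
  import org.apache.solr.common.SolrInputDocument;

  SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

  // Fetch the document to discover which dynamic fields it actually carries.
  SolrDocument found = server.query(new SolrQuery("id:100")).getResults().get(0);

  SolrInputDocument update = new SolrInputDocument();
  update.addField("id", found.getFieldValue("id"));
  for (String name : found.getFieldNames()) {
    if (name.endsWith("_1600_i")) {              // fields belonging to the oldest day
      Map<String, Object> op = new HashMap<String, Object>();
      op.put("set", null);                       // atomic update: "set" to null removes the field
      update.addField(name, op);
    }
  }
  server.add(update);
  server.commit();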
Re: [ANNOUNCE] Apache Solr 4.8.0 released
In which sense are <fields> and <types> now deprecated in schema.xml? Where can I find any pointer about this?

On Mon, Apr 28, 2014 at 6:54 PM, Uwe Schindler uschind...@apache.org wrote:

28 April 2014, Apache Solr™ 4.8.0 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.8.0.

Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites.

Solr 4.8.0 is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

See the CHANGES.txt file included with the release for a full list of details.

Solr 4.8.0 Release Highlights:

* Apache Solr now requires Java 7 or greater (recommended is Oracle Java 7 or OpenJDK 7, minimum update 55; earlier versions have known JVM bugs affecting Solr).
* Apache Solr is fully compatible with Java 8.
* <fields> and <types> tags have been deprecated from schema.xml. There is no longer any reason to keep them in the schema file; they may be safely removed. This allows intermixing of <fieldType>, <field> and <copyField> definitions if desired.
* The new {!complexphrase} query parser supports wildcards, ORs etc. inside Phrase Queries.
* New Collections API CLUSTERSTATUS action reports the status of collections, shards, and replicas, and also lists collection aliases and cluster properties.
* Added managed synonym and stopword filter factories, which enable synonym and stopword lists to be dynamically managed via REST API.
* JSON updates now support nested child documents, enabling {!child} and {!parent} block join queries.
* Added ExpandComponent to expand results collapsed by the CollapsingQParserPlugin, as well as the parent/child relationship of nested child documents.
* Long-running Collections API tasks can now be executed asynchronously; the new REQUESTSTATUS action provides status.
* Added a hl.qparser parameter to allow you to define a query parser for hl.q highlight queries.
* In Solr single-node mode, cores can now be created using named configsets.
* New DocExpirationUpdateProcessorFactory supports computing an expiration date for documents from the TTL expression, as well as automatically deleting expired documents on a periodic basis.

Solr 4.8.0 also includes many other new features as well as numerous optimizations and bugfixes of the corresponding Apache Lucene release.

Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access.

- Uwe Schindler uschind...@apache.org Apache Lucene PMC Chair / Committer Bremen, Germany http://lucene.apache.org/
Re: [ANNOUNCE] Apache Solr 4.8.0 released
https://issues.apache.org/jira/browse/SOLR-5228

On Apr 29, 2014, at 10:27 AM, Flavio Pompermaier pomperma...@okkam.it wrote: [snip]
Re: [ANNOUNCE] Apache Solr 4.8.0 released
Hello! You don't need the <fields> and <types> sections anymore; you can just include type or field definitions anywhere in schema.xml. You can find more in https://issues.apache.org/jira/browse/SOLR-5228 -- Regards, Rafał Kuć Performance Monitoring * Log Analytics * Search Analytics Solr & Elasticsearch Support * http://sematext.com/

> In which sense are <fields> and <types> now deprecated in schema.xml? Where can I find any pointer about this? [snip]
Re: [ANNOUNCE] Apache Solr 4.8.0 released
Earlier, all <fieldType> tags were required to be nested inside a <types> tag. Similarly, all <field> and <copyField> tags were required to be nested inside a <fields> tag. Such nesting is no longer required, and you can intermix <field>, <fieldType> and <copyField> tags as you like. Therefore, the <fields> and <types> tags are no longer required and can be removed. Even if you don't remove them, things will continue to work for some time, until a major 5.0 release is made.

On Tue, Apr 29, 2014 at 7:57 PM, Flavio Pompermaier pomperma...@okkam.it wrote: [snip]

-- Regards, Shalin Shekhar Mangar.
Solr Server Infrastructure Config
Hi, can someone share, or point me to, information on the SOLR server environment for production? We have approx. 40 collections, with sizes from 300MB to 8GB each and approx. 100GB in total. The average increase of the total size may be 2-5GB/year. We want to get the best performance for at least 1000-1 concurrent users. Thanks Ravi
Re: [ANNOUNCE] Apache Solr 4.8.0 released
On 4/29/2014 8:27 AM, Flavio Pompermaier wrote:
> In which sense are fields and types now deprecated in schema.xml? Where can I find any pointer about this?

https://issues.apache.org/jira/browse/SOLR-5936 Here is the patch for 4.8: https://issues.apache.org/jira/secure/attachment/12637716/SOLR-5936.branch_4x.patch This is the full list of fieldType classes that have been deprecated by this issue:

  BCDIntField
  BCDLongField
  BCDStrField
  DateField
  DoubleField
  FloatField
  IntField
  LongField
  SortableDoubleField
  SortableFloatField
  SortableIntField
  SortableLongField

In schema.xml, these would show up preceded by "solr." in the class attribute of a <fieldType>. None of these types are used in the *main* example for 4.x versions. They do show up in some of the other examples in earlier releases; those examples have been reworked to use the newer field types. Here's the javadoc for one of the types listed above, which shows the deprecation notice: http://lucene.apache.org/solr/4_8_0/solr-core/org/apache/solr/schema/LongField.html Thanks, Shawn
Re: Stored vs non-stored very large text fields
Dear Shawn, see attachment for my first brute-force no-compression attempt. Kind regards, Jochen

Zitat von Shawn Heisey s...@elyograg.org: [snip]

diff -c -r solr-4.8.0.original/lucene/core/src/java/org/apache/lucene/codecs/lucene41/Lucene41Codec.java solr-4.8.0/lucene/core/src/java/org/apache/lucene/codecs/lucene41/Lucene41Codec.java
*** solr-4.8.0.original/lucene/core/src/java/org/apache/lucene/codecs/lucene41/Lucene41Codec.java 2013-11-01 07:03:52.0 +0100
--- solr-4.8.0/lucene/core/src/java/org/apache/lucene/codecs/lucene41/Lucene41Codec.java 2014-04-29 13:58:27.0 +0200
***************
*** 38,43 ****
--- 38,44 ----
  import org.apache.lucene.codecs.lucene40.Lucene40NormsFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40SegmentInfoFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40TermVectorsFormat;
+ import org.apache.lucene.codecs.lucene40.Lucene40StoredFieldsFormat;
  import org.apache.lucene.codecs.perfield.PerFieldPostingsFormat;
  import org.apache.lucene.index.SegmentInfo;
  import org.apache.lucene.store.Directory;
***************
*** 56,62 ****
  @Deprecated
  public class Lucene41Codec extends Codec {
    // TODO: slightly evil
!   private final StoredFieldsFormat fieldsFormat = new CompressingStoredFieldsFormat("Lucene41StoredFields", CompressionMode.FAST, 1 << 14) {
      @Override
      public StoredFieldsWriter fieldsWriter(Directory directory, SegmentInfo si, IOContext context) throws IOException {
        throw new UnsupportedOperationException("this codec can only be used for reading");
--- 57,63 ----
  @Deprecated
  public class Lucene41Codec extends Codec {
    // TODO: slightly evil
!   private final StoredFieldsFormat fieldsFormat = new Lucene40StoredFieldsFormat() {
      @Override
      public StoredFieldsWriter fieldsWriter(Directory directory, SegmentInfo si, IOContext context) throws IOException {
        throw new UnsupportedOperationException("this codec can only be used for reading");
diff -c -r solr-4.8.0.original/lucene/core/src/java/org/apache/lucene/codecs/lucene42/Lucene42Codec.java solr-4.8.0/lucene/core/src/java/org/apache/lucene/codecs/lucene42/Lucene42Codec.java
*** solr-4.8.0.original/lucene/core/src/java/org/apache/lucene/codecs/lucene42/Lucene42Codec.java 2013-11-01 07:03:52.0 +0100
--- solr-4.8.0/lucene/core/src/java/org/apache/lucene/codecs/lucene42/Lucene42Codec.java 2014-04-29 13:57:08.0 +0200
***************
*** 32,38 ****
  import org.apache.lucene.codecs.TermVectorsFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40LiveDocsFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40SegmentInfoFormat;
! import org.apache.lucene.codecs.lucene41.Lucene41StoredFieldsFormat;
  import org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat;
  import org.apache.lucene.codecs.perfield.PerFieldPostingsFormat;
  import org.apache.lucene.index.SegmentWriteState;
--- 32,38 ----
  import org.apache.lucene.codecs.TermVectorsFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40LiveDocsFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40SegmentInfoFormat;
! import org.apache.lucene.codecs.lucene40.Lucene40StoredFieldsFormat;
  import org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat;
  import org.apache.lucene.codecs.perfield.PerFieldPostingsFormat;
  import org.apache.lucene.index.SegmentWriteState;
***************
*** 53,59 ****
  // (it writes a minor version, etc).
  @Deprecated
  public class Lucene42Codec extends Codec {
!   private final StoredFieldsFormat fieldsFormat = new Lucene41StoredFieldsFormat();
    private final TermVectorsFormat vectorsFormat = new Lucene42TermVectorsFormat();
    private final FieldInfosFormat fieldInfosFormat = new Lucene42FieldInfosFormat();
    private final SegmentInfoFormat infosFormat = new Lucene40SegmentInfoFormat();
--- 53,59 ----
  // (it writes a minor version, etc).
  @Deprecated
  public class Lucene42Codec extends Codec {
!   private final StoredFieldsFormat fieldsFormat = new Lucene40StoredFieldsFormat();
    private final TermVectorsFormat vectorsFormat
Re: [ANNOUNCE] Apache Solr 4.8.0 released
On 4/29/2014 8:48 AM, Shawn Heisey wrote: [snip]

And now, seeing the other replies, I see that I didn't interpret your question properly. Thanks, Shawn
Re: Stemming not working with wildcard search
Can someone help me out with this issue?
Re: Wildcard search not working with search term having special characters and digits
Can someone help me out with this issue, please?
Re: Solr does not recognize language
Hi, /solr/update should be used, not /solr/select:

  curl 'http://localhost:8983/solr/update?commit=true&update.chain=langid'

By the way, don't you have the following definition in your solrconfig.xml?

  <requestHandler name="/update" class="solr.UpdateRequestHandler">
    <lst name="defaults">
      <str name="update.chain">langid</str>
    </lst>
  </requestHandler>

On Tuesday, April 29, 2014 4:50 PM, Victor Pascual vic...@mobilemediacontent.com wrote: [snip]
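If the documents are indexed through SolrJ rather than curl, the same parameter can also be passed per request - a sketch, assuming SolrJ 4.x, a previously built SolrInputDocument doc, and an HttpSolrServer instance server:

  import org.apache.solr.client.solrj.request.UpdateRequest;

  UpdateRequest req = new UpdateRequest();
  req.setParam("update.chain", "langid"); // select the processor chain for this request
  req.add(doc);
  req.process(server);
  server.commit();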
Re: Delete fields from document using a wildcard
I think this would be useful as well. Can you open an issue?

On Tue, Apr 29, 2014 at 7:53 PM, Shawn Heisey s...@elyograg.org wrote: [snip]

-- Regards, Shalin Shekhar Mangar.
Re: saving user actions on item in solr for later retrieval
Thank you, it was interesting and I have learned some new things about Solr :) But the External File Field isn't a good option for us, because the field is unsearchable, and searchability is very important to us. We are thinking about the first option (updating the document in Solr) but performing a commit only every 10 minutes - if we need to retrieve the value in real time we can use RealTimeGet. Maybe you have another suggestion?
Re: saving user actions on item in solr for later retrieval
Hi Nolim, actually EFF is searchable. See my comments at the end of the page https://cwiki.apache.org/confluence/display/solr/Working+with+External+Files+and+Processes

Ahmet

On Tuesday, April 29, 2014 9:07 PM, nolim alony...@gmail.com wrote:

Thank you, it was interesting and I have learned some new things about Solr :) But the External File Field isn't a good option for us, because the field is unsearchable, and searchability is very important to us. We are thinking about the first option (updating the document in Solr) but performing a commit only every 10 minutes - if we need to retrieve the value in real time we can use RealTimeGet. Maybe you have another suggestion?
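A rough sketch of the RealTimeGet approach mentioned above (hypothetical core URL and document id; it assumes the /get handler and the updateLog are enabled in solrconfig.xml): /get serves the latest version of a document from the update log, so it reflects updates made since the last commit.

import requests  # pip install requests

SOLR = "http://localhost:8983/solr/collection1"  # hypothetical core URL

# Real-time get: returns the current version of the document, including
# uncommitted updates that a normal /select query would not yet see.
doc = requests.get(f"{SOLR}/get", params={"id": "item-7", "wt": "json"}).json()["doc"]
print(doc)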
Re: Solr data directory contains index backups
None that I'm aware of. A bit of googling shows the accepted solution to be an external script run via cron or something similar. I think I saw an open issue about this in Apache's Jira, but I can't find it now.

Thanks, Greg

On Apr 25, 2014, at 4:37 PM, solr2020 psgoms...@gmail.com wrote: Thanks Greg. Is there any Solr configuration to do this periodically, if an unused index copy or snapshot exists in the data directory?
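The external-script approach usually boils down to something like this hypothetical sketch, run from cron (the data directory path and retention period are assumptions; try it against a copy before pointing it at a live index): delete snapshot.* directories older than a cutoff and leave the live index directory alone.

import shutil
import time
from pathlib import Path

DATA_DIR = Path("/var/solr/data")  # hypothetical Solr data directory
MAX_AGE_DAYS = 7                   # hypothetical retention period

cutoff = time.time() - MAX_AGE_DAYS * 86400
for snap in DATA_DIR.glob("snapshot.*"):
    # Remove only snapshot directories whose last modification time is
    # older than the cutoff; the live 'index' directory is never matched.
    if snap.is_dir() and snap.stat().st_mtime < cutoff:
        shutil.rmtree(snap)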
Re: Stored vs non-stored very large text fields
Something is really strange here: even when configuring the fields id + sort_... with docValues=true -- so there's nothing to get from the stored-documents file -- performance is still terrible with ocr stored=true, _even_ with my patch which stores uncompressed like Solr 4.0.0 (checked with strings -a on *.fdt). Just reading http://lucene.472066.n3.nabble.com/Can-Solr-handle-large-text-files-td3439504.html ... perhaps things will clear up soon (I will check whether splitting into indexed+non-stored and non-indexed+stored fields could help here).

Kind regards, J. Barth

Quoting Shawn Heisey s...@elyograg.org:

On 4/29/2014 4:20 AM, Jochen Barth wrote: BTW, stored field compression: are all stored fields within a document put into one compressed chunk, or is it done on a per-field basis?

Here's the issue that added the compression to Lucene: https://issues.apache.org/jira/browse/LUCENE-4226 It was made the default stored field format for Lucene, which also made it the default for Solr. At this time, there is no way to remove compression in Solr without writing custom code. I filed an issue to make it configurable, but I don't know how to do it. Nobody else has offered a solution either. One day I might find some time to take a look at the issue and see if I can solve it myself. https://issues.apache.org/jira/browse/SOLR-4375 Here's the author's blog post that goes into more detail than the LUCENE issue: http://blog.jpountz.net/post/33247161884/efficient-compressed-stored-fields-with-lucene

Thanks, Shawn
Re: Stored vs non-stored very large text fields
Ok, https://wiki.apache.org/solr/SolrPerformanceFactors states that: Retrieving the stored fields of a query result can be a significant expense. This cost is affected largely by the number of bytes stored per document -- the higher the byte count, the sparser the documents will be distributed on disk and the more I/O is necessary to retrieve the fields (usually this is a concern when storing large fields, like the entire contents of a document). But in my case (with docValues=true) there should be no reason to access *.fdt.

Kind regards, Jochen

Quoting Jochen Barth ba...@ub.uni-heidelberg.de:

Something is really strange here: even when configuring the fields id + sort_... with docValues=true -- so there's nothing to get from the stored-documents file -- performance is still terrible with ocr stored=true, _even_ with my patch which stores uncompressed like Solr 4.0.0 (checked with strings -a on *.fdt). Just reading http://lucene.472066.n3.nabble.com/Can-Solr-handle-large-text-files-td3439504.html ... perhaps things will clear up soon (I will check whether splitting into indexed+non-stored and non-indexed+stored fields could help here).

Kind regards, J. Barth

Quoting Shawn Heisey s...@elyograg.org:

On 4/29/2014 4:20 AM, Jochen Barth wrote: BTW, stored field compression: are all stored fields within a document put into one compressed chunk, or is it done on a per-field basis?

Here's the issue that added the compression to Lucene: https://issues.apache.org/jira/browse/LUCENE-4226 It was made the default stored field format for Lucene, which also made it the default for Solr. At this time, there is no way to remove compression in Solr without writing custom code. I filed an issue to make it configurable, but I don't know how to do it. Nobody else has offered a solution either. One day I might find some time to take a look at the issue and see if I can solve it myself. https://issues.apache.org/jira/browse/SOLR-4375 Here's the author's blog post that goes into more detail than the LUCENE issue: http://blog.jpountz.net/post/33247161884/efficient-compressed-stored-fields-with-lucene

Thanks, Shawn
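One thing worth ruling out while chasing this (a sketch under assumed core and field names, not a confirmed fix): make sure the queries themselves never list the large field in fl, since any stored field that is returned still has to be read and decompressed from *.fdt even when the sort itself runs off docValues.

import requests  # pip install requests

SOLR = "http://localhost:8983/solr/collection1"  # hypothetical core URL

# Request only the small fields; leaving 'ocr' out of fl means the large
# stored field is not fetched for the response itself.
resp = requests.get(
    f"{SOLR}/select",
    params={
        "q": "*:*",
        "fl": "id,sort_title",  # hypothetical field names
        "sort": "sort_title asc",
        "rows": 10,
        "wt": "json",
    },
).json()
print(resp["response"]["numFound"])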
Re: Delete fields from document using a wildcard
I've opened an issue: https://issues.apache.org/jira/browse/SOLR-6034 Feedback in Jira is appreciated.

On Tue, Apr 29, 2014 at 8:34 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: I think this is useful as well. Can you open an issue?

On Tue, Apr 29, 2014 at 7:53 PM, Shawn Heisey s...@elyograg.org wrote:

On 4/29/2014 5:25 AM, Costi Muraru wrote: The problem is, I don't know the exact names of the fields I want to remove. All I know is that they end in *_1600_i. When removing fields from a document, I want to avoid querying SOLR to see which fields are actually present for the specific document; that way I can hopefully speed up the process. Querying the schema.xml is not going to help me much, since the field is defined as a dynamic field *_i. This makes me think that expanding the documents client-side is not the best way to do it.

Unfortunately, at this time you'll have to query the document and go through the list of fields to determine which need to be deleted, then build a request that deletes them. I don't know how hard it would be to accomplish this in Solr. Getting it implemented might require a bunch of people standing up and saying we want this!

Thanks, Shawn

-- Regards, Shalin Shekhar Mangar.
Re: Raw query parameters
You saved my life, Shawn! Thanks!

On Mon, Apr 28, 2014 at 11:54 PM, Shawn Heisey s...@elyograg.org wrote:

On 4/28/2014 7:54 PM, Xavier Morera wrote: Would anyone be so kind as to explain what the Raw Query Parameters in Solr's admin UI are? I can't find an explanation in the reference guide, the wiki, or a web search.

The query API supports many more parameters than are shown on the admin UI. For instance, if you are doing a faceted search, there are only boxes for facet.query, facet.field, and facet.prefix ... but faceted search supports a lot more parameters (like facet.method, facet.limit, facet.mincount, facet.sort, etc.). Raw Query Parameters gives you a way to use the entire query API, not just the few things that have UI input boxes.

Thanks, Shawn

-- *Xavier Morera* email: xav...@familiamorera.com CR: +(506) 8849 8866 US: +1 (305) 600 4919 skype: xmorera
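To make Shawn's point concrete (the core URL and facet field below are hypothetical): whatever goes into the Raw Query Parameters box is simply appended to the request as extra key=value pairs, so the same query can be issued directly with the full facet parameter set.

import requests  # pip install requests

SOLR = "http://localhost:8983/solr/collection1"  # hypothetical core URL

# facet.limit, facet.mincount and facet.sort have no input boxes in the
# admin UI, but they are ordinary query parameters like any other.
resp = requests.get(
    f"{SOLR}/select",
    params={
        "q": "*:*",
        "rows": 0,
        "facet": "true",
        "facet.field": "category",  # hypothetical facet field
        "facet.limit": 20,
        "facet.mincount": 1,
        "facet.sort": "count",
        "wt": "json",
    },
).json()
print(resp["facet_counts"]["facet_fields"]["category"])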
Re: Indexing Big Data With or Without Solr
mark.
timeAllowed is not honored
Hi, I am using Solr 4.2 with an index size of 40GB. While querying my index, there are some queries which take a significant amount of time, about 22 seconds, *in the case of minmatch of 50%*. So I added the parameter timeAllowed=2000 to my query, but it doesn't seem to work. Please help me out. With Regards Aman Tandon
Re: timeAllowed is not honored
On 4/29/2014 10:05 PM, Aman Tandon wrote: I am using Solr 4.2 with an index size of 40GB. While querying my index, there are some queries which take a significant amount of time, about 22 seconds, *in the case of minmatch of 50%*. So I added the parameter timeAllowed=2000 to my query, but it doesn't seem to work. Please help me out.

I remember reading that timeAllowed has some limitations regarding which stages of a query it can limit, particularly in the distributed case. These limitations mean that it cannot always limit the total time for a query. I do not remember precisely what those limitations are, and I cannot find whatever it was that I was reading.

When I looked through my local list archive to see if you had ever mentioned how much RAM you have and what the size of your Solr heap is, there didn't seem to be anything. There's also not enough information for me to know whether that 40GB is the amount of index data on a single SolrCloud server, or the total size of the index across all servers.

If we leave timeAllowed alone for a moment and treat this purely as a performance problem, my questions usually revolve around figuring out whether you have enough RAM. Here's where that conversation ends up: http://wiki.apache.org/solr/SolrPerformanceProblems I think I've probably mentioned this to you before on another thread.

Thanks, Shawn
search result not correct in solr
Hi, I am trying to search with the word Ribbing, and I am also getting results which have R-B or RB in their description; but when I search with Ribbin I get the correct results. I have no clue what to change in my Solr schema.xml. Any guidance will be helpful. Thanks
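Behaviour like this usually comes from the field's analysis chain (an aggressive stemmer or a WordDelimiterFilter, for example, though that is a guess without seeing the schema). One way to see exactly which tokens Ribbing produces is the field analysis handler, sketched below (core URL and field name are assumptions; the Analysis screen in the admin UI shows the same information).

import requests  # pip install requests

SOLR = "http://localhost:8983/solr/collection1"  # hypothetical core URL

# The field analysis handler shows how a value is tokenized at index and
# query time, which usually explains surprising matches like this one.
resp = requests.get(
    f"{SOLR}/analysis/field",
    params={
        "analysis.fieldname": "description",  # hypothetical field name
        "analysis.fieldvalue": "Ribbing",
        "wt": "json",
    },
).json()
print(resp["analysis"])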
Re: timeAllowed is not honored
Shawn, this is the first time I have raised this problem. My heap size is 14GB, and I am not using SolrCloud currently; the 40GB index is replicated from a master to two slaves. I read that it returns the partial results computed by the query within the amount of time defined by the timeAllowed parameter, but that doesn't seem to happen. Here is the link: http://wiki.apache.org/solr/CommonQueryParameters#timeAllowed *The time allowed for a search to finish. This value only applies to the search and not to requests in general. Time is in milliseconds. Values <= 0 mean no time restriction. Partial results may be returned (if there are any).*

With Regards Aman Tandon

On Wed, Apr 30, 2014 at 10:05 AM, Shawn Heisey s...@elyograg.org wrote:

On 4/29/2014 10:05 PM, Aman Tandon wrote: I am using Solr 4.2 with an index size of 40GB. While querying my index, there are some queries which take a significant amount of time, about 22 seconds, *in the case of minmatch of 50%*. So I added the parameter timeAllowed=2000 to my query, but it doesn't seem to work. Please help me out.

I remember reading that timeAllowed has some limitations regarding which stages of a query it can limit, particularly in the distributed case. These limitations mean that it cannot always limit the total time for a query. I do not remember precisely what those limitations are, and I cannot find whatever it was that I was reading.

When I looked through my local list archive to see if you had ever mentioned how much RAM you have and what the size of your Solr heap is, there didn't seem to be anything. There's also not enough information for me to know whether that 40GB is the amount of index data on a single SolrCloud server, or the total size of the index across all servers.

If we leave timeAllowed alone for a moment and treat this purely as a performance problem, my questions usually revolve around figuring out whether you have enough RAM. Here's where that conversation ends up: http://wiki.apache.org/solr/SolrPerformanceProblems I think I've probably mentioned this to you before on another thread.

Thanks, Shawn
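A quick way to check whether timeAllowed is taking effect at all, as a sketch (core URL and query are hypothetical): when Solr stops a search early, it sets a partialResults flag in the response header next to the usual status and QTime. If the flag never appears on a slow query, the expensive phase is presumably outside what timeAllowed can interrupt.

import requests  # pip install requests

SOLR = "http://localhost:8983/solr/collection1"  # hypothetical core URL

resp = requests.get(
    f"{SOLR}/select",
    params={"q": "some expensive query", "timeAllowed": 2000, "wt": "json"},
).json()

header = resp["responseHeader"]
# When the search is cut short, Solr reports partialResults=true here;
# QTime shows how long the request actually took on the server.
print(header.get("partialResults", False), header["QTime"])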