Shard query with duplicated documents causes inaccurate pagination

2014-04-29 Thread Jie Sun
When we have duplicated documents (same unique ID) across the shards, the query
results can be non-deterministic; this is a known issue.

The consequence when we display the search results on our UI with
pagination is: if the user clicks 'last page', it can display an empty page,
since the total doc count returned by the query is not accurate (it
includes the duplicates).

Is there a known work around for this problem?

We tried the following two approaches, but each of them has a problem:
1) use a query like:
curl -d 'q=*:*&fl=message_id&rows=1&start=1999' \
  http://[hostname]:8080/mywebapp/shards/[coreid]/select
Since I am using a very large number for 'rows', it will return the
accurate doc count, but it takes about 20 seconds to run this query on an
average customer with a little over 1 million rows returned, so the
performance is not acceptable.

2) use a facet query:
curl -d 'q=*:*&fl=message_id&facet=true&facet.mincount=2&rows=0&facet.field=message_id&indent=on' \
  http://[hostname]:8080/[mywebapp]/shards/[coreid]/select
Our tests show this does not always return accurate doc counts.

Any suggestions on the best workaround to get an accurate doc count for a
sharded query with duplicates, one that is also efficient on a large data set?

thanks
Jie





Re: Selectively hiding SOLR facets.

2014-04-29 Thread Iker Mtnz. Apellaniz
You could use the facet.mincount parameter. The default value is 0; setting
it to N requires a facet value to appear at least N times in the result set.
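
For example, a hypothetical query (the facet field name 'language' is
illustrative):

curl 'http://localhost:8983/solr/select?q=country:USA&rows=0&facet=true&facet.field=language&facet.mincount=1'

With facet.mincount=1, facet values that never occur in the country:USA
result set are dropped from the facet list.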

Iker


2014-04-29 4:56 GMT+02:00 atuldj.jadhav atuldj.jad...@gmail.com:

 Yes, but with my query country:USA it is returning languages
 belonging to countries other than USA.

 Is there any way I can avoid such languages appearing in my facet filters?








-- 
/** @author imartinez */
Person me = new Developer();
me.setName("Iker Mtz de Apellaniz Anzuola");
me.setTwit("@mitxino77"); // https://twitter.com/mitxino77
me.setLocations({"St Cugat", "Barcelona", "Kanpezu", "Euskadi", "*", "World"});
me.setSkills({"SoftwareDeveloper", "Curious", "AmateurCook"});
return me;


Re: Selectively hiding SOLR facets.

2014-04-29 Thread Alexandre Rafalovitch
How would you know it if you did this manually? Solr does not know
that Dutch is not valid for USA. You need to give it some sort of
signal.

One way could be to have a dynamic field for a facet which includes
country name. So, you have language_USA, language_Belgium, etc. Then,
when you do country:USA, you also do facet count against language_USA.
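
A rough sketch of that layout (all field names are illustrative):

doc:   <field name="country">USA</field>
       <field name="language_USA">English</field>
query: q=country:USA&rows=0&facet=true&facet.field=language_USA

Because the facet runs against the USA-specific copy of the language field,
only languages that actually occur in USA documents can show up.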

You could even get super fancy with field aliases and other tricks,
though this may get messy if you do need it for each country.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Tue, Apr 29, 2014 at 9:56 AM, atuldj.jadhav atuldj.jad...@gmail.com wrote:
 Yes, but with my query country:USA it is returning languages
 belonging to countries other than USA.

 Is there any way I can avoid such languages appearing in my facet filters?






How to reduce enumerating docs

2014-04-29 Thread 郑华斌
Hi all,


My doc has two fields, namely length and fingerprint, which stand for 
the length and the text of the doc. I have a custom SearchComponent that 
enumerates all the docs for a term in order to search the fingerprint. That 
can be very slow, because the number of docs is huge and the per-doc operation 
is time-consuming. Since I only care about docs whose length is within a close 
range around the one specified in the query, what's the right way to accelerate this? Thanks


DocsEnum docsEnum = sub_reader.termDocsEnum(term);
if (docsEnum == null) {
  continue;
}
while ((doc = docsEnum.nextDoc()) != DocsEnum.NO_MORE_DOCS) {
  // do something expensive with each matching doc
}

Stored vs non-stored very large text fields

2014-04-29 Thread Jochen Barth
Dear reader,

I'm trying to use Solr for a hierarchical search:
metadata from the higher-level elements is copied to the lower ones,
and each element has the complete OCR text that belongs to it.

At the volume level, of course, we will have the complete OCR text in one
doc, and we need to store it for highlighting.

My solr instance is configured like this:
java -Xms12000m -Xmx12000m -jar start.jar
[ imported with 4.7.0, performance tests with 4.8.0 ]

Solr index files are of this size:
  0.013gb .tip The index into the Term Dictionary
  0.017gb .nvd Encodes length and boost factors for docs and fields
  0.546gb .tim The term dictionary, stores term info
  1.332gb .doc Contains the list of docs which contain each term along
with frequency
  4.943gb .pos Stores position information about where a term occurs in
the index
 12.743gb .tvd Contains information about each document that has term
vectors
 17.340gb .fdt The stored fields for the documents (OCR)

Configuring the ocr field as non-stored, I get these performance
measures (see docs/s) after warmup:

jb@serv7:~ perl solr-performance.pl zeit 6
http://127.0.0.1:58983/solr/collection1/select
?wt=json
q={%21q.op%3dAND}ocr%3A%28zeit%29
fq=mashed_b%3Afalse
fl=id
sort=sort_name_s asc,id+asc
rows=100
time: 3.96 s
bytes: 1.878 MB
64768 docs found; got 64768 docs
16353 docs/s; 0.474 MB/s

... and with ocr stored, even when _not_ requesting ocr via fl=..., with
<documentCache class="solr.LRUCache" ... /> disabled and
<enableLazyFieldLoading>false</enableLazyFieldLoading>:
[ with documentCache and enableLazyFieldLoading enabled, results are even worse ]

... using solr-4.7.0 and ubuntu12.04 openjdk7 (...u51):
jb@serv7:~ perl solr-performance.pl zeit 6
http://127.0.0.1:58983/solr/collection1/select
?wt=json
q={%21q.op%3dAND}ocr%3A%28zeit%29
fq=mashed_b%3Afalse
fl=id
sort=sort_name_s asc,id+asc
rows=100
time: 61.58 s
bytes: 1.878 MB
64768 docs found; got 64768 docs
1052 docs/s; 0.030 MB/s

... using solr-4.8.0 and oracle-jdk1.7.0_55 :
jb@serv7:~ perl solr-performance.pl zeit 6
http://127.0.0.1:58983/solr/collection1/select
?wt=json&q={%21q.op%3dAND}ocr%3A%28zeit%29
fq=mashed_b%3Afalse
fl=id
sort=sort_name_s asc,id+asc
rows=100
time: 58.80 s
bytes: 1.878 MB
64768 docs found; got 64768 docs
1102 docs/s; 0.032 MB/s

Is there any reason why stored vs. non-stored is 16 times slower?
Is there a way to store the ocr field in a separate index or something
like this?

Kind regards,
J. Barth




-- 
J. Barth * IT, Universitaetsbibliothek Heidelberg * 06221 / 54-2580

pgp public key:
http://digi.ub.uni-heidelberg.de/barth%40ub.uni-heidelberg.de.asc


Re: How to reduce enumerating docs

2014-04-29 Thread Alexandre Rafalovitch
Can't you just specify the length range as a filter query? If your
length type is tint/tlong, Solr already has optimized code that uses
multiple precision levels (trie encoding) to efficiently filter through the numbers.
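
For instance, a hypothetical sketch (field name and bounds are illustrative),
with a target length of 100 and a tolerance of 5:

q=fingerprint:someterm&fq=length:[95 TO 105]

Since fq results are cached in the filterCache, repeated queries with the
same length window reuse the computed filter.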

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Tue, Apr 29, 2014 at 3:23 PM, 郑华斌 huabin.zh...@qq.com wrote:
 [...]


Re: Stored vs non-stored very large text fields

2014-04-29 Thread Alexandre Rafalovitch
Couple of random thoughts:
1) The latest (4.8) Solr has support for nested documents, as well as
the new ExpandComponent. Maybe that will let you have a more efficient
architecture: http://heliosearch.org/expand-block-join/

2) Do you return OCR text to the client, or just search it? If you just
search it, you don't need to store it.

3) If you do need to store it and return it, do you always have to
return it? If not, you could look at lazy-loading the field (a setting
in solrconfig.xml; see the snippet after this list).

4) Is the OCR content text or an image? The stored fields are compressed by
default; I wonder if the compression/decompression of a large image is an
issue.

5) JDK 8 apparently makes Lucene much happier (speed of some
operations). Might be something to test if all else fails.
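
For reference, the lazy-loading setting mentioned in point 3 lives in the
<query> section of solrconfig.xml:

<query>
  <enableLazyFieldLoading>true</enableLazyFieldLoading>
</query>

When enabled, stored fields that are not requested via fl are only read from
disk if something later asks for them.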

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Tue, Apr 29, 2014 at 3:28 PM, Jochen Barth
ba...@ub.uni-heidelberg.de wrote:
 [...]


Apache Solr - Pdf Indexing.

2014-04-29 Thread vignesh
Hi Team,

 

 I am indexing PDFs using Apache Solr 3.6, passing around 3000
keywords using the OR operator, and I am able to get the files containing the
keywords. Kindly guide me on how to get the list of keywords matched in a .PDF file.

 

Note: In schema.xml I have declared a unique key field "id".

 

 

 

Thanks & Regards.

Vignesh.V

Ninestars Information Technologies Limited.,
72, Greams Road, Thousand Lights, Chennai - 600 006. India.
Landline : +91 44 2829 4226 / 36 / 56   X: 144
www.ninestars.in



Re: How to reduce enumerating docs

2014-04-29 Thread 郑华斌
Will the filter query execute before or after my custom search component?


In fact, what I care about is this: for example, if the following docsEnum 
would contain 1M docs for the term "aterm" without the filter query, will it 
contain fewer than 1M docs when the filter query is present?


DocsEnum docsEnum = sub_reader.termDocsEnum(aterm);








-- Original --
From: Alexandre Rafalovitch <arafa...@gmail.com>
Send time: Tuesday, Apr 29, 2014 5:13 PM
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: How to reduce enumerating docs



[...]

Re: Apache Solr - Pdf Indexing.

2014-04-29 Thread Alexandre Rafalovitch
Your question is not terribly clear. Are you having trouble indexing PDFs
in general? Try the tutorial, and specifically look for the extract handler.

Or have you already got the PDFs into the system, but your 3000-keyword query
does not match them? In that case it might just be that PDF extraction is
limited by definition. Try having the extracted content stored (not just
indexed) and see whether the extracted text matches your expectations.

Otherwise, rephrase the query. Say what you expected, what you got instead
and where you are stuck.

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr
proficiency


On Tue, Apr 29, 2014 at 4:23 PM, vignesh vignes...@ninestars.in wrote:

 [...]

Re: Stored vs non-stored very large text fields

2014-04-29 Thread Jochen Barth
Am 29.04.2014 11:19, schrieb Alexandre Rafalovitch:
 Couple of random thoughts:
 1) The latest (4.8) Solr has support for nested documents, as well as
 for expand components. Maybe that will let you have more efficient
 architecture: http://heliosearch.org/expand-block-join/

Yes, I've seen this, but as far as I understood, you have to know at
which nesting level you run your query.
My search should work on any level, say,

volume title 1986
chapter 1.1 author marc
chapter 1.1.3 title does not matter
chapter 1.1.3.1 title abc
chapter 1.1.3.2 title xyz

should match when querying +author:marc +title:abc, or +author:marc
+title:xyz, but not +title:abc +title:xyz.

(we'll have an unknown number of levels)


 2) Do you return OCR text to the client? Or just search it? If just
 search it, you don't need to store it

I'll want to get highlighted snippets.

 3) If you do need to store it and return it, do you always have to
 return it? If not, you could look at lazy-loading the field (setting
 in solrconfig.xml).

Let's see; perhaps this is a sorting problem, which could be solved by
setting the id and sort_... fields to docValues=true.

 4) Is OCR text or image? The stored fields are compressed by default,
 I wonder if the compression/decompression of a large image is an
 issue.

Text.


 5) JDK 8 apparently makes Lucene much happier (speed of some
 operations). Might be something to test if all else fails.

Ok...

Thanks,
J. Barth


 [...]

-- 
J. Barth * IT, Universitaetsbibliothek Heidelberg * 06221 / 54-2580

pgp public key:
http://digi.ub.uni-heidelberg.de/barth%40ub.uni-heidelberg.de.asc


Re: Stored vs non-stored very large text fields

2014-04-29 Thread Jochen Barth
BTW: stored field compression:
are all stored fields within a document put into one compressed chunk,
or is it on a per-field basis?

Kind regards,
J. Barth



 
 [...]

-- 
J. Barth * IT, Universitaetsbibliothek Heidelberg * 06221 / 54-2580

pgp public key:
http://digi.ub.uni-heidelberg.de/barth%40ub.uni-heidelberg.de.asc


Re: Apache Solr - Pdf Indexing.

2014-04-29 Thread Gora Mohanty
On Apr 29, 2014 2:52 PM, vignesh vignes...@ninestars.in wrote:

 Hi Team,



  I am indexing PDFs using Apache Solr 3.6, passing around
3000 keywords using the OR operator, and I am able to get the files containing
the keywords. Kindly guide me on how to get the list of keywords matched in a .PDF file.

What do you mean? Do you want Solr search results in a PDF file? Why would
a search engine provide such functionality? You can take the Solr XML/JSON
results, and generate a PDF if you need that.

Regards,
Gora


Re: Delete fields from document using a wildcard

2014-04-29 Thread Costi Muraru
Thanks, Alex, for the input.

Let me provide a better example on what I'm trying to achieve. I have
documents like this:

<doc>
  <field name="id">100</field>
  <field name="2_1600_i">1</field>
  <field name="5_1601_i">5</field>
  <field name="112_1602_i">7</field>
</doc>

The schema looks the usual way:
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
The dynamic field pattern I'm using is this: id_day_i.

Each day I want to add new fields for the current day and remove the fields
for the oldest one.

<add><doc>
  <field name="id">100</field>

  <!-- add fields for current day -->
  <field name="251_1603_i" update="set">25</field>

  <!-- remove fields for oldest day -->
  <field name="2_1600_i" update="set" null="true">1</field>
</doc></add>

The problem is, I don't know the exact names of the fields I want to
remove. All I know is that they end in *_1600_i.

When removing fields from a document, I want to avoid querying SOLR to see
which fields are actually present for the specific document. In this way,
hopefully I can speed up the process. Querying the schema.xml is not
going to help me much, since the field is defined as a dynamic field *_i. This
makes me think that expanding the documents client-side is not the best way
to do it.

Regarding the second approach, to expand the documents server-side: I took
a look at the SOLR code and came upon the UpdateRequestProcessor.java class,
which has this interesting javadoc:

  "This is a good place for subclassed update handlers to process the
   document before it is indexed. You may wish to add/remove fields or check
   if the requested user is allowed to update the given document..."

As you can imagine, I have no expertise in SOLR code. How would you say it
would be possible to retrieve the document and its fields for the given id,
and extend the update/delete command to include the fields that match the
pattern I'm giving (e.g. *_1600_i)?
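
Something along these lines is what I imagine: a rough, untested sketch
against the Solr 4.x APIs. The class name and suffix convention are
hypothetical, it assumes the uniqueKey field is "id", and it would have to
run before RunUpdateProcessorFactory in the chain:

import java.io.IOException;
import java.util.Collections;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.index.Term;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

// Hypothetical processor: for every incoming update, also null out all
// existing fields of the stored document whose names end in a given suffix.
public class SuffixDeleteProcessor extends UpdateRequestProcessor {
  private final String suffix; // e.g. "_1600_i" -- illustrative

  public SuffixDeleteProcessor(String suffix, UpdateRequestProcessor next) {
    super(next);
    this.suffix = suffix;
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument in = cmd.getSolrInputDocument();
    String id = (String) in.getFieldValue("id");
    SolrIndexSearcher searcher = cmd.getReq().getSearcher();

    int docId = searcher.getFirstMatch(new Term("id", id));
    if (docId >= 0) {
      Document stored = searcher.doc(docId);
      for (IndexableField f : stored.getFields()) {
        if (f.name().endsWith(suffix) && !in.containsKey(f.name())) {
          // the atomic-update "set to null" convention removes the field
          in.setField(f.name(), Collections.singletonMap("set", null));
        }
      }
    }
    super.processAdd(cmd);
  }
}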

Thanks,
Costi


On Tue, Apr 29, 2014 at 6:41 AM, Alexandre Rafalovitch
arafa...@gmail.com wrote:

 Not out of the box, as far as I know.

 A custom UpdateRequestProcessor could possibly do some sort of expansion
 of the field name by verifying the actual schema. Not sure if the API
 supports that level of flexibility. Or, for the latest Solr, you can
 request the list of known field names via REST and do client-side
 expansion instead.

 Regards,
Alex.
 Personal website: http://www.outerthoughts.com/
 Current project: http://www.solr-start.com/ - Accelerating your Solr
 proficiency


 On Tue, Apr 29, 2014 at 12:20 AM, Costi Muraru costimur...@gmail.com
 wrote:
  Hi guys,
 
  Would be possible, using Atomic Updates in SOLR4, to remove all fields
  matching a pattern? For instance something like:
 
  <add><doc>
    <field name="id">100</field>
    <field name="*_name_i" update="set" null="true"/>
  </doc></add>
 
  Or something similar to remove certain fields in all documents.
 
  Thanks,
  Costi



Apache Solr - Pdf Indexing.

2014-04-29 Thread vignesh
Hi Team,

 

 I am indexing PDFs using Apache Solr 3.6, passing around 3000
keywords using the OR operator (gardens OR flowers OR time OR train OR trees
OR etc.), and I am able to get the files containing these keywords. But not
every .PDF file will contain all the keywords; some may contain (gardens,
flowers and time) and some .pdf files may contain only (trees). Kindly guide
me on how to get the list of keywords matching each file.

 

For example (required output):

Id: xyz.pdf
Matching Keywords : gardens, flowers, time.

Id: abc.pdf
Matching Keywords : train, trees.

Id: ghi.pdf
Matching Keywords : train, trees, time.

 

 

Thanks & Regards.

Vignesh.V

Ninestars Information Technologies Limited.,
72, Greams Road, Thousand Lights, Chennai - 600 006. India.
Landline : +91 44 2829 4226 / 36 / 56   X: 144
www.ninestars.in

 



Solr does not recognize language

2014-04-29 Thread Victor Pascual
Dear all,

I'm a new user of Solr. I've managed to index a bunch of documents (in
fact, they are tweets) and everything works quite smoothly.

Nevertheless, it looks like Solr doesn't detect the language of my documents,
nor remove stopwords accordingly, so that I can extract the most frequent terms.

I've added this piece of XML to my solrconfig.xml as well as the Tika lib
jars.

<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <str name="langid.fl">text</str>
      <str name="langid.langField">lang</str>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

There is no error in the Tomcat log file, so I have no clue why this
isn't working.
Any hint on how to solve this problem would be much appreciated!


Re: Stored vs non-stored very large text fields

2014-04-29 Thread Shawn Heisey
On 4/29/2014 4:20 AM, Jochen Barth wrote:
 BTW: stored field compression:
 are all stored fields within a document put into one compressed chunk,
 or is it on a per-field basis?

Here's the issue that added the compression to Lucene:

https://issues.apache.org/jira/browse/LUCENE-4226

It was made the default stored field format for Lucene, which also made
it the default for Solr.  At this time, there is no way to disable
compression in Solr without writing custom code.  I filed an issue to
make it configurable, but I don't know how to do it.  Nobody else has
offered a solution either.  One day I might find some time to take a
look at the issue and see if I can solve it myself.

https://issues.apache.org/jira/browse/SOLR-4375

Here's the author's blog post that goes into more detail than the LUCENE
issue:

http://blog.jpountz.net/post/33247161884/efficient-compressed-stored-fields-with-lucene

Thanks,
Shawn



Re: Solr does not recognize language

2014-04-29 Thread Ahmet Arslan
Hi,

Did you attach your chain to an UpdateRequestHandler?

You can do it by adding update.chain=langid to the URL, or by defining it in a
defaults section as follows:

<lst name="defaults">
  <str name="update.chain">langid</str>
</lst>



On Tuesday, April 29, 2014 3:18 PM, Victor Pascual 
vic...@mobilemediacontent.com wrote:
[...]


Re: Solr does not recognize language

2014-04-29 Thread Victor Pascual
Hi Ahmet,

thanks for your reply. Adding update.chain=langid to my query doesn't
work: IP:8080/solr/select/?q=*%3A*&update.chain=langid
Regarding defining the chain in an UpdateRequestHandler... sorry for the
lame question, but shall I paste those three lines into solrconfig.xml, or
shall I add them somewhere else?

There is no UpdateRequestHandler in my solrconfig.

Thanks!


On Tue, Apr 29, 2014 at 3:13 PM, Ahmet Arslan iori...@yahoo.com wrote:

 [...]



Re: Delete fields from document using a wildcard

2014-04-29 Thread Shawn Heisey
On 4/29/2014 5:25 AM, Costi Muraru wrote:
 The problem is, I don't know the exact names of the fields I want to
 remove. All I know is that they end in *_1600_i.
 
 When removing fields from a document, I want to avoid querying SOLR to see
 which fields are actually present for the specific document. In this way,
 hopefully I can speed up the process. Querying the schema.xml is not
 going to help me much, since the field is defined as a dynamic field *_i. This
 makes me think that expanding the documents client-side is not the best way
 to do it.

Unfortunately, at this time you'll have to query the document and
go through the list of fields to determine which need to be deleted,
then build a request that deletes them.

I don't know how hard it would be to accomplish this in Solr.  Getting it
implemented might require a bunch of people standing up and saying "we
want this!"
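
For what it's worth, here is a rough client-side sketch of that flow with
SolrJ 4.x (untested; the URL, id, and suffix are illustrative, and the
Map value {"set": null} is the atomic-update convention for removing a field):

import java.util.Collections;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class RemoveFieldsBySuffix {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server =
        new HttpSolrServer("http://localhost:8983/solr/collection1");

    // fetch the document to learn which fields it actually has
    SolrDocument doc =
        server.query(new SolrQuery("id:100")).getResults().get(0);

    // build an atomic update that nulls out every field matching the suffix
    SolrInputDocument update = new SolrInputDocument();
    update.addField("id", "100");
    for (String name : doc.getFieldNames()) {
      if (name.endsWith("_1600_i")) {
        update.addField(name, Collections.singletonMap("set", null));
      }
    }
    server.add(update);
    server.commit();
  }
}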

Thanks,
Shawn


Re: [ANNOUNCE] Apache Solr 4.8.0 released

2014-04-29 Thread Flavio Pompermaier
In what sense are <fields> and <types> now deprecated in schema.xml? Where can
I find any pointer about this?

On Mon, Apr 28, 2014 at 6:54 PM, Uwe Schindler uschind...@apache.org wrote:

 28 April 2014, Apache Solr™ 4.8.0 available

 The Lucene PMC is pleased to announce the release of Apache Solr 4.8.0

 Solr is the popular, blazing fast, open source NoSQL search platform
 from the Apache Lucene project. Its major features include powerful
 full-text search, hit highlighting, faceted search, dynamic
 clustering, database integration, rich document (e.g., Word, PDF)
 handling, and geospatial search.  Solr is highly scalable, providing
 fault tolerant distributed search and indexing, and powers the search
 and navigation features of many of the world's largest internet sites.

 Solr 4.8.0 is available for immediate download at:
   http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

 See the CHANGES.txt file included with the release for a full list of
 details.

 Solr 4.8.0 Release Highlights:

 * Apache Solr now requires Java 7 or greater (recommended is
   Oracle Java 7 or OpenJDK 7, minimum update 55; earlier versions
   have known JVM bugs affecting Solr).

 * Apache Solr is fully compatible with Java 8.

 * <fields> and <types> tags have been deprecated from schema.xml.
   There is no longer any reason to keep them in the schema file;
   they may be safely removed. This allows intermixing of <fieldType>,
   <field> and <copyField> definitions if desired.

 * The new {!complexphrase} query parser supports wildcards, ORs etc.
   inside Phrase Queries.

 * New Collections API CLUSTERSTATUS action reports the status of
   collections, shards, and replicas, and also lists collection
   aliases and cluster properties.

 * Added managed synonym and stopword filter factories, which enable
   synonym and stopword lists to be dynamically managed via REST API.

 * JSON updates now support nested child documents, enabling {!child}
   and {!parent} block join queries.

 * Added ExpandComponent to expand results collapsed by the
   CollapsingQParserPlugin, as well as the parent/child relationship
   of nested child documents.

 * Long-running Collections API tasks can now be executed
   asynchronously; the new REQUESTSTATUS action provides status.

 * Added a hl.qparser parameter to allow you to define a query parser
   for hl.q highlight queries.

 * In Solr single-node mode, cores can now be created using named
   configsets.

 * New DocExpirationUpdateProcessorFactory supports computing an
   expiration date for documents from the TTL expression, as well as
   automatically deleting expired documents on a periodic basis.

 Solr 4.8.0 also includes many other new features as well as numerous
 optimizations and bugfixes of the corresponding Apache Lucene release.

 Please report any feedback to the mailing lists
 (http://lucene.apache.org/solr/discussion.html)

 Note: The Apache Software Foundation uses an extensive mirroring network
 for distributing releases.  It is possible that the mirror you are using
 may not have replicated the release yet.  If that is the case, please
 try another mirror.  This also goes for Maven access.

 -
 Uwe Schindler
 uschind...@apache.org
 Apache Lucene PMC Chair / Committer
 Bremen, Germany
 http://lucene.apache.org/





Re: [ANNOUNCE] Apache Solr 4.8.0 released

2014-04-29 Thread Steve Rowe
https://issues.apache.org/jira/browse/SOLR-5228

On Apr 29, 2014, at 10:27 AM, Flavio Pompermaier pomperma...@okkam.it wrote:

 In what sense are <fields> and <types> now deprecated in schema.xml? Where can
 I find any pointer about this?
 
 On Mon, Apr 28, 2014 at 6:54 PM, Uwe Schindler uschind...@apache.org wrote:
 [...]



Re: [ANNOUNCE] Apache Solr 4.8.0 released

2014-04-29 Thread Rafał Kuć
Hello!

You don't need the <fields> and <types> sections anymore; you can just
include type or field definitions anywhere in schema.xml.
You can find more in https://issues.apache.org/jira/browse/SOLR-5228

-- 
Regards,
 Rafał Kuć
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


 In what sense are <fields> and <types> now deprecated in schema.xml? Where can
 I find any pointer about this?

 On Mon, Apr 28, 2014 at 6:54 PM, Uwe Schindler uschind...@apache.org wrote:
 [...]


Re: [ANNOUNCE] Apache Solr 4.8.0 released

2014-04-29 Thread Shalin Shekhar Mangar
Earlier, all <fieldType> tags were required to be nested inside a <types>
tag. Similarly, all <field> and <copyField> tags were required to be nested
inside a <fields> tag. Such nesting is no longer required, and you can
intermix <field>, <fieldType> and <copyField> tags as you like. Therefore,
the <fields> and <types> tags are no longer required and can be removed.
Even if you don't remove them, things will continue to work for some time,
until a major 5.0 release is made.
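
As a minimal illustration (the names are just examples), a schema like this
is now legal, with no <fields>/<types> wrappers and the definitions intermixed:

<schema name="example" version="1.5">
  <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="category" type="string" indexed="true" stored="true"/>
  <copyField source="category" dest="category_copy"/>
  <field name="category_copy" type="string" indexed="true" stored="true"/>
  <uniqueKey>id</uniqueKey>
</schema>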


On Tue, Apr 29, 2014 at 7:57 PM, Flavio Pompermaier pomperma...@okkam.it wrote:

 In what sense are <fields> and <types> now deprecated in schema.xml? Where can
 I find any pointer about this?

 [...]




-- 
Regards,
Shalin Shekhar Mangar.


Solr Server Infrastructure Config

2014-04-29 Thread EXTERNAL Taminidi Ravi (ETI, Automotive-Service-Solutions)
Hi, can someone share or point me to information on the SOLR server 
environment for production?

Approx.: we have 40 collections, ranging in size from 300MB to 8GB each, 
about 100GB in total. The total size may grow by approximately 2-5GB / year.

We want the best performance for at least 1000-1 concurrent users.

Thanks

Ravi


Re: [ANNOUNCE] Apache Solr 4.8.0 released

2014-04-29 Thread Shawn Heisey
On 4/29/2014 8:27 AM, Flavio Pompermaier wrote:
 In what sense are <fields> and <types> now deprecated in schema.xml? Where can
 I find any pointer about this?

https://issues.apache.org/jira/browse/SOLR-5936

Here is the patch for 4.8:

https://issues.apache.org/jira/secure/attachment/12637716/SOLR-5936.branch_4x.patch

This is the full list of fieldType classes that have been deprecated by
this issue:

BCDIntField
BCDLongField
BCDStrField
DateField
DoubleField
FloatField
IntField
LongField
SortableDoubleField
SortableFloatField
SortableIntField
SortableLongField

In schema.xml, these would show up preceded by "solr." in the class
attribute of a <fieldType>.
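
For example, an illustrative pair, showing a deprecated definition and its
usual Trie replacement:

<fieldType name="long" class="solr.LongField"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>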

None of these types are used in the *main* example for 4.x versions. 
They do show up in some of the other examples in earlier releases. 
Those examples have been reworked to use the newer field types.

Here's the javadoc for one of the types listed above, which shows the
deprecation notice:

http://lucene.apache.org/solr/4_8_0/solr-core/org/apache/solr/schema/LongField.html

Thanks,
Shawn



Re: Stored vs non-stored very large text fields

2014-04-29 Thread Jochen Barth

Dear Shawn,

see the attachment for my first brute-force no-compression attempt.

Kind regards,
Jochen


Zitat von Shawn Heisey s...@elyograg.org:


[...]


diff -c -r solr-4.8.0.original/lucene/core/src/java/org/apache/lucene/codecs/lucene41/Lucene41Codec.java solr-4.8.0/lucene/core/src/java/org/apache/lucene/codecs/lucene41/Lucene41Codec.java
*** solr-4.8.0.original/lucene/core/src/java/org/apache/lucene/codecs/lucene41/Lucene41Codec.java	2013-11-01 07:03:52.0 +0100
--- solr-4.8.0/lucene/core/src/java/org/apache/lucene/codecs/lucene41/Lucene41Codec.java	2014-04-29 13:58:27.0 +0200
***
*** 38,43 
--- 38,44 
  import org.apache.lucene.codecs.lucene40.Lucene40NormsFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40SegmentInfoFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40TermVectorsFormat;
+ import org.apache.lucene.codecs.lucene40.Lucene40StoredFieldsFormat;
  import org.apache.lucene.codecs.perfield.PerFieldPostingsFormat;
  import org.apache.lucene.index.SegmentInfo;
  import org.apache.lucene.store.Directory;
***
*** 56,62 
  @Deprecated
  public class Lucene41Codec extends Codec {
// TODO: slightly evil
!   private final StoredFieldsFormat fieldsFormat = new CompressingStoredFieldsFormat("Lucene41StoredFields", CompressionMode.FAST, 1 << 14) {
  @Override
  public StoredFieldsWriter fieldsWriter(Directory directory, SegmentInfo si, IOContext context) throws IOException {
              throw new UnsupportedOperationException("this codec can only be used for reading");
--- 57,63 
  @Deprecated
  public class Lucene41Codec extends Codec {
// TODO: slightly evil
!   private final StoredFieldsFormat fieldsFormat = new Lucene40StoredFieldsFormat() {
  @Override
  public StoredFieldsWriter fieldsWriter(Directory directory, SegmentInfo si, IOContext context) throws IOException {
              throw new UnsupportedOperationException("this codec can only be used for reading");
diff -c -r solr-4.8.0.original/lucene/core/src/java/org/apache/lucene/codecs/lucene42/Lucene42Codec.java solr-4.8.0/lucene/core/src/java/org/apache/lucene/codecs/lucene42/Lucene42Codec.java
*** solr-4.8.0.original/lucene/core/src/java/org/apache/lucene/codecs/lucene42/Lucene42Codec.java	2013-11-01 07:03:52.0 +0100
--- solr-4.8.0/lucene/core/src/java/org/apache/lucene/codecs/lucene42/Lucene42Codec.java	2014-04-29 13:57:08.0 +0200
***
*** 32,38 
  import org.apache.lucene.codecs.TermVectorsFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40LiveDocsFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40SegmentInfoFormat;
! import org.apache.lucene.codecs.lucene41.Lucene41StoredFieldsFormat;
  import org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat;
  import org.apache.lucene.codecs.perfield.PerFieldPostingsFormat;
  import org.apache.lucene.index.SegmentWriteState;
--- 32,38 
  import org.apache.lucene.codecs.TermVectorsFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40LiveDocsFormat;
  import org.apache.lucene.codecs.lucene40.Lucene40SegmentInfoFormat;
! import org.apache.lucene.codecs.lucene40.Lucene40StoredFieldsFormat;
  import org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat;
  import org.apache.lucene.codecs.perfield.PerFieldPostingsFormat;
  import org.apache.lucene.index.SegmentWriteState;
***
*** 53,59 
  // (it writes a minor version, etc).
  @Deprecated
  public class Lucene42Codec extends Codec {
!   private final StoredFieldsFormat fieldsFormat = new Lucene41StoredFieldsFormat();
private final TermVectorsFormat vectorsFormat = new Lucene42TermVectorsFormat();
private final FieldInfosFormat fieldInfosFormat = new Lucene42FieldInfosFormat();
private final SegmentInfoFormat infosFormat = new Lucene40SegmentInfoFormat();
--- 53,59 
  // (it writes a minor version, etc).
  @Deprecated
  public class Lucene42Codec extends Codec {
!   private final StoredFieldsFormat fieldsFormat = new Lucene40StoredFieldsFormat();
private final TermVectorsFormat vectorsFormat 

Re: [ANNOUNCE] Apache Solr 4.8.0 released

2014-04-29 Thread Shawn Heisey
On 4/29/2014 8:48 AM, Shawn Heisey wrote:
 On 4/29/2014 8:27 AM, Flavio Pompermaier wrote:
 In what sense are <fields> and <types> now deprecated in schema.xml? Where can
 I find any pointer about this?
 https://issues.apache.org/jira/browse/SOLR-5936

 Here is the patch for 4.8:

 https://issues.apache.org/jira/secure/attachment/12637716/SOLR-5936.branch_4x.patch

 This is the full list of fieldType classes that have been deprecated by
 this issue:

And now, seeing the other replies, I see that I didn't interpret your
question properly.

Thanks,
Shawn



Re: Stemming not working with wildcard search

2014-04-29 Thread Geepalem
Can someone help me out with this issue?





Re: Wildcard search not working with search term having special characters and digits

2014-04-29 Thread Geepalem
Can someone help me out with this issue please?





Re: Solr does not recognize language

2014-04-29 Thread Ahmet Arslan
Hi,

/solr/update should be used, not /solr/select.

curl 'http://localhost:8983/solr/update?commit=true&update.chain=langid' 

By the way, don't you have the following definition in your solrconfig.xml?

 <requestHandler name="/update" class="solr.UpdateRequestHandler">
   <lst name="defaults">
     <str name="update.chain">langid</str>
   </lst>
 </requestHandler>
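
For what it's worth, a minimal sketch of exercising that handler - the host,
core and JSON document are made up for illustration, and langid.fl is assumed
to be the text field as in the chain quoted below:

curl 'http://localhost:8983/solr/update?commit=true&update.chain=langid' \
  -H 'Content-Type: application/json' \
  -d '[{"id":"tweet-1","text":"Dit is een voorbeeld"}]'

If the chain runs, querying the document back should show a populated lang
field (presumably nl here).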



On Tuesday, April 29, 2014 4:50 PM, Victor Pascual 
vic...@mobilemediacontent.com wrote:
Hi Ahmet,

thanks for your reply. Adding update.chain=langid to my query doesn't
work: IP:8080/solr/select/?q=*%3A*&update.chain=langid
Regarding defining the chain in an UpdateRequestHandler... sorry for the
lame question but shall I paste those three lines to solrconfig.xml, or
shall I add them somewhere else?

There is no UpdateRequestHandler in my solrconfig.

Thanks!



On Tue, Apr 29, 2014 at 3:13 PM, Ahmet Arslan iori...@yahoo.com wrote:

 Hi,

 Did you attach your chain to a UpdateRequestHandler?

 You can do it by adding update.chain=langid to the URL or by defining it in
 a defaults section as follows:

 <lst name="defaults">
   <str name="update.chain">langid</str>
 </lst>



 On Tuesday, April 29, 2014 3:18 PM, Victor Pascual 
 vic...@mobilemediacontent.com wrote:
 Dear all,

 I'm a new user of Solr. I've managed to index a bunch of documents (in
 fact, they are tweets) and everything works quite smoothly.

 Nevertheless, it looks like Solr doesn't detect the language of my documents
 or remove stopwords accordingly, so I can't extract the most frequent terms.

 I've added this piece of XML to my solrconfig.xml as well as the Tika lib
 jars.

     <updateRequestProcessorChain name="langid">
       <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
         <lst name="defaults">
           <str name="langid.fl">text</str>
           <str name="langid.langField">lang</str>
         </lst>
       </processor>
       <processor class="solr.LogUpdateProcessorFactory" />
       <processor class="solr.RunUpdateProcessorFactory" />
     </updateRequestProcessorChain>

 There is no error in the tomcat log file, so I have no clue why this
 isn't working.
 Any hint on how to solve this problem will be much appreciated!




Re: Delete fields from document using a wildcard

2014-04-29 Thread Shalin Shekhar Mangar
I think this is useful as well. Can you open an issue?


On Tue, Apr 29, 2014 at 7:53 PM, Shawn Heisey s...@elyograg.org wrote:

 On 4/29/2014 5:25 AM, Costi Muraru wrote:
  The problem is, I don't know the exact names of the fields I want to
  remove. All I know is that they end in *_1600_i.
 
  When removing fields from a document, I want to avoid querying SOLR to
 see
  what fields are actually present for the specific document. In this way,
  hopefully I can speed up the process. Querying to see the schema.xml is
 not
  going to help me much, since the field is defined as a dynamic field *_i.
 This
  makes me think that expanding the documents client-side is not the best
 way
  to do it.

 Unfortunately at this time, you'll have to query the document and
 go through the list of fields to determine which need to be deleted,
  then build a request that deletes them.

 I don't know how hard it is to accomplish this in Solr.  Getting it
  implemented might require a bunch of people standing up and saying "we
  want this!"

 Thanks,
 Shawn




-- 
Regards,
Shalin Shekhar Mangar.
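
A rough SolrJ sketch of that query-then-delete workaround, assuming a core at
http://localhost:8983/solr/collection1, a unique key field named id, a
hypothetical document 123, and the *_1600_i suffix from the original
question (atomic updates need the updateLog and stored fields):

import java.util.Collections;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class DeleteFieldsByWildcard {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    try {
      // Step 1: fetch the document to learn which fields it actually has.
      SolrQuery query = new SolrQuery("id:123");
      SolrDocument doc = server.query(query).getResults().get(0);

      // Step 2: build one atomic update; "set" with a null value removes
      // the field, and untouched fields are preserved server-side.
      SolrInputDocument update = new SolrInputDocument();
      update.addField("id", "123");
      for (String name : doc.getFieldNames()) {
        if (name.endsWith("_1600_i")) {
          update.addField(name, Collections.singletonMap("set", null));
        }
      }
      server.add(update);
      server.commit();
    } finally {
      server.shutdown();
    }
  }
}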


Re: saving user actions on item in solr for later retrieval

2014-04-29 Thread nolim
Thank you, it was interesting and I have learned some new things in Solr :)

But the External File Field isn't a good option because the field is
unsearchable, which is very important to us.
We are thinking about the first option (updating the document in Solr) but
performing a commit only every 10 minutes - if we need to retrieve the value
in real time we can use RealTimeGet.

Maybe you have another suggestion?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/saving-user-actions-on-item-in-solr-for-later-retrieval-tp4133558p4133793.html
Sent from the Solr - User mailing list archive at Nabble.com.
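
A hedged illustration of that RealTimeGet idea, assuming the default /get
handler from the example solrconfig.xml and a hypothetical document item42
with a counter field action_count:

curl 'http://localhost:8983/solr/get?id=item42&fl=action_count'

/get reads uncommitted values out of the update log, so the latest count is
visible even before the 10-minute commit fires.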


Re: saving user actions on item in solr for later retrieval

2014-04-29 Thread Ahmet Arslan
Hi Nolim,

Actually EFF is searchable. See my comments at the end of the page 

https://cwiki.apache.org/confluence/display/solr/Working+with+External+Files+and+Processes

Ahmet



On Tuesday, April 29, 2014 9:07 PM, nolim alony...@gmail.com wrote:
Thank you, it was interesting and I have learned some new things in Solr :)

But the External File Field isn't a good option because the field is
unsearchable, which is very important to us.
We are thinking about the first option (updating the document in Solr) but
performing a commit only every 10 minutes - if we need to retrieve the value
in real time we can use RealTimeGet.

Maybe you have another suggestion?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/saving-user-actions-on-item-in-solr-for-later-retrieval-tp4133558p4133793.html

Sent from the Solr - User mailing list archive at Nabble.com.
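
To sketch what searchable means there (the field name user_actions is
invented for the example): external file field values are usable in function
queries, so a function range query can filter on them. The -g flag stops
curl from globbing the braces:

curl -g 'http://localhost:8983/solr/select?q={!frange l=1}field(user_actions)&fl=id'

This would return the documents whose external value is at least 1.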



Re: Solr data directory contains index backups

2014-04-29 Thread Greg Walters
None that I'm aware of. A bit of googling shows the accepted solution to be an 
external script via cron or something similar. I think I saw an issue open on 
Apache's Jira about this but can't find it now.

Thanks,
Greg

On Apr 25, 2014, at 4:37 PM, solr2020 psgoms...@gmail.com wrote:

 Thanks Greg. Is there any Solr configuration to do this periodically if any
 unused index copy or snapshot exists in the data directory?
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-data-directory-contains-index-backups-tp4132590p4133221.html
 Sent from the Solr - User mailing list archive at Nabble.com.
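
A minimal cron sketch of that external-script approach - the path
/var/solr/data and the 7-day retention are assumptions, and replication
snapshots are taken to follow the usual snapshot.* naming:

# prune snapshot directories older than 7 days, every night at 03:00
0 3 * * * find /var/solr/data -maxdepth 1 -type d -name 'snapshot.*' -mtime +7 -exec rm -rf {} +

Double-check the path before running anything like this; removing the wrong
directory under data/ destroys the live index.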



Re: Stored vs non-stored very large text fields

2014-04-29 Thread Jochen Barth

Something is really strange here:

even when configuring the fields id + sort_... with docValues=true -- so
there's nothing to get from the stored documents file -- performance is
still terrible with ocr stored=true, _even_ with my patch which stores
uncompressed like Solr 4.0.0 (checked with strings -a on *.fdt).


Just reading
http://lucene.472066.n3.nabble.com/Can-Solr-handle-large-text-files-td3439504.html
... perhaps things will clear up soon (I will check if splitting into
indexed+non-stored and non-indexed+stored fields could help here)



Kind regards,
J. Barth


Zitat von Shawn Heisey s...@elyograg.org:


On 4/29/2014 4:20 AM, Jochen Barth wrote:

BTW: stored field compression:
are all stored fields within a document put into one compressed chunk,
or on a per-field basis?


Here's the issue that added the compression to Lucene:

https://issues.apache.org/jira/browse/LUCENE-4226

It was made the default stored field format for Lucene, which also made
it the default for Solr.  At this time, there is no way to remove
compression in Solr without writing custom code.  I filed an issue to
make it configurable, but I don't know how to do it.  Nobody else has
offered a solution either.  One day I might find some time to take a
look at the issue and see if I can solve it myself.

https://issues.apache.org/jira/browse/SOLR-4375

Here's the author's blog post that goes into more detail than the LUCENE
issue:

http://blog.jpountz.net/post/33247161884/efficient-compressed-stored-fields-with-lucene

Thanks,
Shawn
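
For anyone who does want to try the custom-code route, a hedged Lucene-level
sketch: a FilterCodec that delegates to the default codec and swaps in only
the old stored fields format. Note that in stock 4.x the Lucene40 stored
fields format is read-only (its fieldsWriter throws
UnsupportedOperationException), which is what the patch quoted earlier in
this thread works around - so this is only usable on top of such a patch:

import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.StoredFieldsFormat;
import org.apache.lucene.codecs.lucene40.Lucene40StoredFieldsFormat;
import org.apache.lucene.codecs.lucene46.Lucene46Codec;

// Delegates everything to the current default codec except the stored
// fields format, which is swapped for the old uncompressed one.
public class Uncompressed40Codec extends FilterCodec {
  private final StoredFieldsFormat storedFields = new Lucene40StoredFieldsFormat();

  public Uncompressed40Codec() {
    super("Uncompressed40", new Lucene46Codec());
  }

  @Override
  public StoredFieldsFormat storedFieldsFormat() {
    return storedFields;
  }
}

The codec also has to be registered through Lucene's SPI (a
META-INF/services/org.apache.lucene.codecs.Codec entry) so that segments
written with it can be opened again later.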





Re: Stored vs non-stored very large text fields

2014-04-29 Thread Jochen Barth

Ok, https://wiki.apache.org/solr/SolrPerformanceFactors

states that: Retrieving the stored fields of a query result can be a  
significant expense. This cost is affected largely by the number of  
bytes stored per document--the higher byte count, the sparser the  
documents will be distributed on disk and more I/O is necessary to  
retrieve the fields (usually this is a concern when storing large  
fields, like the entire contents of a document).


But in my case (with docValues=true) there should be no reason to  
access *.fdt.


Kind regards,
Jochen

Zitat von Jochen Barth ba...@ub.uni-heidelberg.de:


Something is really strange here:

even when configuring the fields id + sort_... with docValues=true -- so
there's nothing to get from the stored documents file -- performance
is still terrible with ocr stored=true, _even_ with my patch which
stores uncompressed like Solr 4.0.0 (checked with strings -a on *.fdt).


Just reading
http://lucene.472066.n3.nabble.com/Can-Solr-handle-large-text-files-td3439504.html
... perhaps things will clear up soon (I will check if splitting into
indexed+non-stored and non-indexed+stored fields could help here)



Kind regards,
J. Barth


Zitat von Shawn Heisey s...@elyograg.org:


On 4/29/2014 4:20 AM, Jochen Barth wrote:

BTW: stored field compression:
are all stored fields within a document put into one compressed chunk,
or on a per-field basis?


Here's the issue that added the compression to Lucene:

https://issues.apache.org/jira/browse/LUCENE-4226

It was made the default stored field format for Lucene, which also made
it the default for Solr.  At this time, there is no way to remove
compression in Solr without writing custom code.  I filed an issue to
make it configurable, but I don't know how to do it.  Nobody else has
offered a solution either.  One day I might find some time to take a
look at the issue and see if I can solve it myself.

https://issues.apache.org/jira/browse/SOLR-4375

Here's the author's blog post that goes into more detail than the LUCENE
issue:

http://blog.jpountz.net/post/33247161884/efficient-compressed-stored-fields-with-lucene

Thanks,
Shawn





Re: Delete fields from document using a wildcard

2014-04-29 Thread Costi Muraru
I've opened an issue: https://issues.apache.org/jira/browse/SOLR-6034
Feedback in Jira is appreciated.


On Tue, Apr 29, 2014 at 8:34 PM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 I think this is useful as well. Can you open an issue?


 On Tue, Apr 29, 2014 at 7:53 PM, Shawn Heisey s...@elyograg.org wrote:

  On 4/29/2014 5:25 AM, Costi Muraru wrote:
   The problem is, I don't know the exact names of the fields I want to
   remove. All I know is that they end in *_1600_i.
  
   When removing fields from a document, I want to avoid querying SOLR to
  see
   what fields are actually present for the specific document. In this
 way,
   hopefully I can speed up the process. Querying to see the schema.xml is
  not
   going to help me much, since the field is defined as a dynamic field *_i.
  This
   makes me think that expanding the documents client-side is not the best
  way
   to do it.
 
  Unfortunately at this time, you'll have to query the document and
  go through the list of fields to determine which need to be deleted,
   then build a request that deletes them.
 
  I don't know how hard it is to accomplish this in Solr.  Getting it
   implemented might require a bunch of people standing up and saying "we
   want this!"
 
  Thanks,
  Shawn
 



 --
 Regards,
 Shalin Shekhar Mangar.



Re: Raw query parameters

2014-04-29 Thread Xavier Morera
You saved my life Shawn! Thanks!


On Mon, Apr 28, 2014 at 11:54 PM, Shawn Heisey s...@elyograg.org wrote:

 On 4/28/2014 7:54 PM, Xavier Morera wrote:
  Would anyone be so kind as to explain what the Raw query parameters are
  in Solr's admin UI? I can't find an explanation in the reference guide,
  the wiki, or a web search.

 The query API supports a lot more parameters than are shown on the admin
 UI.  For instance, if you are doing a faceted search, there are only
 boxes for facet.query, facet.field, and facet.prefix ... but faceted
 search supports a lot more parameters (like facet.method, facet.limit,
 facet.mincount, facet.sort, etc).  Raw Query Parameters gives you a way
 to use the entire query API, not just the few things that have UI input
 boxes.

 Thanks,
 Shawn




-- 
*Xavier Morera*
email: xav...@familiamorera.com
CR: +(506) 8849 8866
US: +1 (305) 600 4919
skype: xmorera
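
A concrete illustration (the field name category and the parameter values
are invented): typing facet.limit=50&facet.mincount=2&facet.sort=count into
the Raw Query Parameters box is equivalent to appending those parameters to
the generated request URL by hand:

curl 'http://localhost:8983/solr/collection1/select?q=*:*&facet=true&facet.field=category&facet.limit=50&facet.mincount=2&facet.sort=count'

Any parameter the query API accepts can be passed this way.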


Re: Indexing Big Data With or Without Solr

2014-04-29 Thread rulinma
mark. 




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-Big-Data-With-or-Without-Solr-tp4131215p4133831.html
Sent from the Solr - User mailing list archive at Nabble.com.


timeAllowed in not honoring

2014-04-29 Thread Aman Tandon
Hi,

I am using Solr 4.2 with an index size of 40GB. When querying my index,
some queries take a significant amount of time, about 22 seconds, *in the
case of a minMatch of 50%*. So I added the parameter timeAllowed=2000 to my
query, but this doesn't seem to work. Please help me out.

With Regards
Aman Tandon


Re: timeAllowed in not honoring

2014-04-29 Thread Shawn Heisey
On 4/29/2014 10:05 PM, Aman Tandon wrote:
 I am using Solr 4.2 with an index size of 40GB. When querying my index,
 some queries take a significant amount of time, about 22 seconds, *in the
 case of a minMatch of 50%*. So I added the parameter timeAllowed=2000 to my
 query, but this doesn't seem to work. Please help me out.

I remember reading that timeAllowed has some limitations about which
stages of a query it can limit, particularly in the distributed case.
These limitations mean that it cannot always limit the total time for a
query.  I do not remember precisely what those limitations are, and I
cannot find whatever it was that I was reading.

When I looked through my local list archive to see if you had ever
mentioned how much RAM you have and what the size of your Solr heap is,
there didn't seem to be anything.  There's not enough information for me
to know whether that 40GB is the amount of index data on a single
SolrCloud server, or whether it's the total size of the index across all
servers.

If we leave timeAllowed alone for a moment and treat this purely as a
performance problem, usually my questions revolve around figuring out
whether you have enough RAM.  Here's where that conversation ends up:

http://wiki.apache.org/solr/SolrPerformanceProblems

I think I've probably mentioned this to you before on another thread.

Thanks,
Shawn



search result not correct in solr

2014-04-29 Thread neha sinha
Hi, I am trying to search with the word Ribbing and I am also getting results
which have R-B or RB in their description, but when I search with Ribbin I
get correct results... I have no clue what to change in my Solr schema.xml.


Any guidance will be helpful.



Thanks




--
View this message in context: 
http://lucene.472066.n3.nabble.com/search-result-not-correct-in-solr-tp4133841.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: timeAllowed in not honoring

2014-04-29 Thread Aman Tandon
Shawn, this is the first time I have raised this problem.

My heap size is 14GB and I am not using SolrCloud currently; the 40GB index
is replicated from a master to two slaves.

I read somewhere that it returns the partial results computed by the query
within the amount of time specified by the timeAllowed parameter, but that
doesn't seem to happen.

Here is the link :
http://wiki.apache.org/solr/CommonQueryParameters#timeAllowed

 *The time allowed for a search to finish. This value only applies to the
search and not to requests in general. Time is in milliseconds. Values <= 0
mean no time restriction. Partial results may be returned (if there are
any).*
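
A hedged way to check whether the limit is firing at all (the query itself
is made up): when timeAllowed truncates a search, Solr flags the response
header with partialResults=true:

curl 'http://localhost:8983/solr/select?q=text:foo&timeAllowed=2000&wt=json'

If "partialResults":true never appears in the responseHeader even on the
22-second queries, the time is probably being spent in a stage that
timeAllowed does not cover - it only bounds the document collection phase,
not query expansion or writing out the response.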



With Regards
Aman Tandon


On Wed, Apr 30, 2014 at 10:05 AM, Shawn Heisey s...@elyograg.org wrote:

 On 4/29/2014 10:05 PM, Aman Tandon wrote:
  I am using Solr 4.2 with an index size of 40GB. When querying my index,
  some queries take a significant amount of time, about 22 seconds, *in the
  case of a minMatch of 50%*. So I added the parameter timeAllowed=2000 to
  my query, but this doesn't seem to work. Please help me out.

 I remember reading that timeAllowed has some limitations about which
 stages of a query it can limit, particularly in the distributed case.
 These limitations mean that it cannot always limit the total time for a
 query.  I do not remember precisely what those limitations are, and I
 cannot find whatever it was that I was reading.

 When I looked through my local list archive to see if you had ever
 mentioned how much RAM you have and what the size of your Solr heap is,
 there didn't seem to be anything.  There's not enough information for me
 to know whether that 40GB is the amount of index data on a single
 SolrCloud server, or whether it's the total size of the index across all
 servers.

 If we leave timeAllowed alone for a moment and treat this purely as a
 performance problem, usually my questions revolve around figuring out
 whether you have enough RAM.  Here's where that conversation ends up:

 http://wiki.apache.org/solr/SolrPerformanceProblems

 I think I've probably mentioned this to you before on another thread.

 Thanks,
 Shawn