IndexDocValues

2014-06-27 Thread Sandeep Khanzode
I came across this type when I checked this blog: 
http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/
 
The blog mentions that IndexDocValues are per-document values indexed specifically 
for sorting, which reduces the overhead created by the FieldCache.

I could not locate this class in the Lucene 4.7.2 hierarchy. Has it been replaced 
by the somewhat similar SortedDocValuesField?

Also, are there any benchmarks that compare the memory usage and sorting time of 
this field against sorting on a regular StringField?
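
For context, this is roughly the usage I have in mind, assuming SortedDocValuesField 
really is the replacement (the field name and value below are made up for 
illustration, and I assume an existing IndexSearcher "searcher" and Query "query"):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.SortedDocValuesField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.util.BytesRef;

    // index time: StringField for term queries, SortedDocValuesField for sorting
    Document doc = new Document();
    doc.add(new StringField("author", "smith", Field.Store.YES));
    doc.add(new SortedDocValuesField("author", new BytesRef("smith")));

    // search time: sort on the doc-values field instead of un-inverting into the FieldCache
    Sort sort = new Sort(new SortField("author", SortField.Type.STRING));
    TopDocs hits = searcher.search(query, 10, sort);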
 
---
Thanks n Regards,
Sandeep Ramesh Khanzode

About lucene memory consumption

2014-06-27 Thread 308181687
Hi, all


   I found that the memory consumption of my Lucene server is abnormal, and 
"jmap -histo ${pid}" shows that the byte[] class consumes almost all of the 
memory. Is there a memory leak in my app? Why are there so many byte[] instances?

The following is the top of the jmap output:


 num     #instances         #bytes  class name
----------------------------------------------
   1:       1786575     1831556144  [B
   2:        704618       80078064  [C
   3:        839932       33597280  java.util.LinkedHashMap$Entry
   4:        686770       21976640  java.lang.String

Thanks & Best Regards!

Searching on Large Indexes

2014-06-27 Thread Sandeep Khanzode
Hi,

I have an index that runs into 200-300GB. It is not frequently updated.

What are the best strategies to query on this index?
1.] Should I, at index time, split the content, e.g. with a hash-based partition, 
into multiple separate smaller indexes and aggregate the results 
programmatically?
2.] Should I replicate this index and provide some sort of document ID, and 
search on each node for a specific range of document IDs?
3.] Is there any way I can split or move individual segments to different nodes 
and aggregate the results?

I am not very familiar with large-scale query strategies. Can you please share 
your findings or experiences? Thanks, 
 
---
Thanks n Regards,
Sandeep Ramesh Khanzode

Re: Searching on Large Indexes

2014-06-27 Thread Jigar Shah
Some points based on my experience.

You can consider a SolrCloud implementation if you want to distribute your
index over multiple servers.

Use MMapDirectory locally for each Solr instance in the cluster.
Hit a warm-up query on server start-up so that most of the documents get
cached; you will then save on disk I/O on subsequent requests.
For example, if you have 4 Solr instances with 64GB RAM each, most of a
200GB index will stay in RAM, and this will give you better performance.

To take advantage of a multi-core system, you can increase the number of
searcher threads, ideally up to the number of cores on a single instance.
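
If you are on plain Lucene rather than Solr, the mmap-plus-warm-up part would
look roughly like this as a sketch (the index path is just a placeholder):

    import java.io.File;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.MMapDirectory;

    // mmap the on-disk index so the OS page cache keeps the hot parts in RAM
    Directory dir = new MMapDirectory(new File("/var/data/index"));
    DirectoryReader reader = DirectoryReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);

    // warm-up query at start-up: touches postings and stored fields so that
    // subsequent requests are served from the cache instead of from disk
    searcher.search(new MatchAllDocsQuery(), 100);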




On Fri, Jun 27, 2014 at 4:03 PM, Sandeep Khanzode 
sandeep_khanz...@yahoo.com.invalid wrote:

 Hi,

 I have an index that runs into 200-300GB. It is not frequently updated.

 What are the best strategies to query on this index?
 1.] Should I, at index time, split the content, like a hash based
 partition, into multiple separate smaller indexes and aggregate the results
 programmatically?
 2.] Should I replicate this index and provide some sort of document ID,
 and search on each node for a specific range of document IDs?
 3.] Is there any way I can split or move individual segments to different
 nodes and aggregate the results?

 I am not fully aware of the large scale query strategies. Can you please
 share your findings or experiences? Thanks,

 ---
 Thanks n Regards,
 Sandeep Ramesh Khanzode


Can a Lucene-based application be made to work with a scaled Elastic Beanstalk environment on Amazon Web Services

2014-06-27 Thread Paul Taylor

Hi

I have a simple WAR-based web application that uses Lucene-created 
indexes to provide search results in an XML format.
It works fine locally, but I want to deploy it using Elastic Beanstalk 
within Amazon Web Services.


Problem 1 is that the WAR definition doesn't seem to provide a location for 
data files (as opposed to config files), so when I deploy the WAR with EB 
it doesn't work at first because it has no access to the data (the Lucene 
indexes). However, I solved this by connecting to the underlying EC2 
instance, copying the Lucene indexes from S3 to the instance, and 
ensuring the file location is defined in the WAR's web.xml file.


Problem 2 is more problematic. I'm looking at AWS and EB because I wanted 
a way to deploy the application with little ongoing admin overhead, and I 
like the way EB does load balancing and auto-scaling for you, starting 
and stopping additional instances as required to meet demand. However, 
these automatically started instances will not have access to the index 
files.


Possible solutions could be

1. Is there a location where I can store the index data within the WAR itself? 
The index is only 5GB, so I do have space on my root disk to store the 
indexes in the WAR if there is a way to use them. Tomcat would also need 
to unwar the file at deployment; I can't see whether Tomcat on AWS does this.


2. A way for EC2 instances to be started with the data preloaded in some way.

(BTW, I'm aware of CloudSearch, but it's not an avenue I want to go down.)

Does anybody have any experience of this, please?

Paul






Re: Searching on Large Indexes

2014-06-27 Thread Toke Eskildsen
On Fri, 2014-06-27 at 12:33 +0200, Sandeep Khanzode wrote:
 I have an index that runs into 200-300GB. It is not frequently updated.

"Not frequently" means different things to different people. Could you
give an approximate time span? If it is updated monthly, you might
consider a full optimization after each update.

 What are the best strategies to query on this index?

 1.] Should I, at index time, split the content, like a hash based
 partition, into multiple separate smaller indexes and aggregate the
 results programmatically?

Assuming you use multiple machines or independent storage for the
multiple indexes, this will bring down latency. Do this if your searches
are too slow.

  2.] Should I replicate this index and provide some
 sort of document ID, and search on each node for a specific range of
 document IDs?

I don't really follow that idea. Are your searches only ID-based?

Anyway, replication increases throughput. Do this if your server has
trouble keeping up with the full amount of work.

  3.] Is there any way I can split or move individual
 segments to different nodes and aggregate the results?

Copy the full index. Delete all documents in copy 1 that match one
half of your ID-hash function, and do the reverse for the other copy. As your
corpus is then semi-randomly distributed, scores should be comparable between
the indexes, so the result sets can be easily merged.
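
If the two halves end up on the same machine, the merging can even be left to
Lucene via a MultiReader; a rough sketch (the index paths are made up):

    import java.io.File;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    DirectoryReader half1 = DirectoryReader.open(FSDirectory.open(new File("/index/half1")));
    DirectoryReader half2 = DirectoryReader.open(FSDirectory.open(new File("/index/half2")));

    // MultiReader exposes both halves as one logical index, so a single
    // IndexSearcher returns one merged, consistently scored result list
    IndexSearcher searcher = new IndexSearcher(new MultiReader(half1, half2));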

But as Jigar says, you should consider switching to SolrCloud (or
Elasticsearch), which does all this for you.

 I am not fully aware of the large scale query strategies. Can you
 please share your findings or experiences?

That depends on what you mean by large scale. You have a running system -
what do you want? Scaling up? Lowering latency? Increasing throughput?
More complex queries?

- Toke Eskildsen, State and University Library, Denmark






RE: About lucene memory consumption

2014-06-27 Thread Uwe Schindler
Hi,

The number of byte[] instances and the total size show that each byte[] is 
approx. 1024 bytes long. This is exactly the block size used by RAMDirectory for 
allocated heap blocks.
So the important question is: do you use RAMDirectory to hold your index? This is 
not recommended; it is better to use MMapDirectory. RAMDirectory is a class 
made for testing Lucene, not for production (it does not scale well, is not 
GC-friendly, and is therefore slow in most cases for large indexes). Also, the 
index is not persisted to disk. If you want an in-memory index, use a Linux 
tmpfs filesystem (ramdisk), write your index to it, and use MMapDirectory 
to access it.

To help you further, please give more information on how you use Lucene and its 
directory implementations.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: 308181687 [mailto:308181...@qq.com]
 Sent: Friday, June 27, 2014 10:42 AM
 To: java-user
 Subject: About lucene memory consumption
 
 Hi, all

 I found that the memory consumption of my Lucene server is abnormal, and
 "jmap -histo ${pid}" shows that the byte[] class consumes almost all of the
 memory. Is there a memory leak in my app? Why are there so many byte[] instances?

 The following is the top of the jmap output:

  num     #instances         #bytes  class name
 ----------------------------------------------
    1:       1786575     1831556144  [B
    2:        704618       80078064  [C
    3:        839932       33597280  java.util.LinkedHashMap$Entry
    4:        686770       21976640  java.lang.String

 Thanks & Best Regards!





Re:RE: About lucene memory consumption

2014-06-27 Thread 308181687
Hi, 
  Thanks very much for your reply.
  Because we need near-real-time search, we decided to use NRTCachingDirectory 
instead of MMapDirectory.


 The code to create the Directory is as follows:


 Directory indexDir = FSDirectory.open(new File(indexDirName));
 NRTCachingDirectory cachedFSDir = new NRTCachingDirectory(indexDir, 5.0, 60.0);

   


 But I think that NRTCachingDirectory will only use RAMDirectory for caching 
and use MMapDirectory to access the index files on disk, right? The `top` command 
seems to prove this: the VIRT memory of the Lucene server is 28.5G, and the RES 
memory is only 5G.


   PID   USER  PR  NI   VIRT   RES   SHR  S  %CPU  %MEM      TIME+  COMMAND
  4004   root  20   0  28.5g  5.0g   49m  S   2.0  65.6  140:34.50  java




  
  Now our Lucene server has indexed 2 million emails and provides a near-real-time 
search service, and sometimes we cannot commit the index because of an 
OutOfMemoryError, so we have to restart the JVM. By the way, we commit the 
index for every 1000 email documents.


 Could you kindly give me some tips to solve this problem?




Thanks & Best Regards!

-- Original --
From: Uwe Schindler u...@thetaphi.de
Date: Fri, Jun 27, 2014 08:36 PM
To: java-user java-user@lucene.apache.org

Subject:  RE: About lucene memory consumption



Hi,

The number of byte[] instances and the total size shows that each byte[] is 
approx. 1024 bytes long. This is exactly the size used by RAMDirectory for 
allocated heap blocks.
So the important question: Do you use RAMDirectory to hold your index? This is 
not recommended, it is better to use MMapDirectory. RAMDirectory is a class 
made for testing lucene, not for production (does not scale well, is not 
GC-friendly, and is therefore slow in most cases for large indexes). Also the 
index is not persisted to disk. If you want an in-memory index, use a linux 
tmpfs filesystem (ramdisk) and  write your index to it (and use MMapDirectory 
to access it).

To help you, give more information on how you use Lucene and its directory 
implementations.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: 308181687 [mailto:308181...@qq.com]
 Sent: Friday, June 27, 2014 10:42 AM
 To: java-user
 Subject: About lucene memory consumption
 
 Hi, all

 I found that the memory consumption of my Lucene server is abnormal, and
 "jmap -histo ${pid}" shows that the byte[] class consumes almost all of the
 memory. Is there a memory leak in my app? Why are there so many byte[] instances?

 The following is the top of the jmap output:

  num     #instances         #bytes  class name
 ----------------------------------------------
    1:       1786575     1831556144  [B
    2:        704618       80078064  [C
    3:        839932       33597280  java.util.LinkedHashMap$Entry
    4:        686770       21976640  java.lang.String

 Thanks & Best Regards!



Re:RE: About lucene memory consumption

2014-06-27 Thread Uwe Schindler
Could it be that you forgot to close older IndexReaders after getting a new NRT 
one? This would be a huge memory leak.

I recommend using SearcherManager to handle near-real-time reopens correctly.
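
As a minimal sketch of the acquire/release pattern (assuming you already have an
IndexWriter "writer" and a Query "query"):

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.SearcherManager;
    import org.apache.lucene.search.TopDocs;

    // one SearcherManager per IndexWriter; it reopens NRT readers and
    // closes the old ones once no search is using them anymore
    SearcherManager mgr = new SearcherManager(writer, true, null);

    // after indexing a batch, make the changes visible to new searches
    mgr.maybeRefresh();

    // per search request: acquire, use, and always release in finally
    IndexSearcher searcher = mgr.acquire();
    try {
        TopDocs hits = searcher.search(query, 10);
    } finally {
        mgr.release(searcher);
    }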

Uwe

On 27 June 2014 16:05:19 MESZ, 308181687 308181...@qq.com wrote:
Hi, 
  Thanks very much for your reply.
  Because we need near-real-time search, we decided to use
NRTCachingDirectory instead of MMapDirectory.

 The code to create the Directory is as follows:

 Directory indexDir = FSDirectory.open(new File(indexDirName));
 NRTCachingDirectory cachedFSDir = new NRTCachingDirectory(indexDir, 5.0, 60.0);

But I think that NRTCachingDirectory will only use RAMDirectory for
caching and use MMapDirectory to access the index files on disk, right?
The `top` command seems to prove this: the VIRT memory of the Lucene
server is 28.5G, and the RES memory is only 5G.

   PID   USER  PR  NI   VIRT   RES   SHR  S  %CPU  %MEM      TIME+  COMMAND
  4004   root  20   0  28.5g  5.0g   49m  S   2.0  65.6  140:34.50  java

Now our Lucene server has indexed 2 million emails and provides a
near-real-time search service, and sometimes we cannot commit the index
because of an OutOfMemoryError, so we have to restart the JVM. By the
way, we commit the index for every 1000 email documents.

 Could you kindly give me some tips to solve this problem?

Thanks & Best Regards!

-- Original --
From: Uwe Schindler u...@thetaphi.de
Date: Fri, Jun 27, 2014 08:36 PM
To: java-user java-user@lucene.apache.org

Subject:  RE: About lucene memory consumption



Hi,

The number of byte[] instances and the total size shows that each
byte[] is approx. 1024 bytes long. This is exactly the size used by
RAMDirectory for allocated heap blocks.
So the important question: Do you use RAMDirectory to hold your index?
This is not recommended, it is better to use MMapDirectory.
RAMDirectory is a class made for testing lucene, not for production
(does not scale well, is not GC-friendly, and is therefore slow in most
cases for large indexes). Also the index is not persisted to disk. If
you want an in-memory index, use a linux tmpfs filesystem (ramdisk) and
 write your index to it (and use MMapDirectory to access it).

To help you, give more information on how you use Lucene and its
directory implementations.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: 308181687 [mailto:308181...@qq.com]
 Sent: Friday, June 27, 2014 10:42 AM
 To: java-user
 Subject: About lucene memory consumption
 
 Hi, all

 I found that the memory consumption of my Lucene server is abnormal, and
 "jmap -histo ${pid}" shows that the byte[] class consumes almost all of the
 memory. Is there a memory leak in my app? Why are there so many byte[] instances?

 The following is the top of the jmap output:

  num     #instances         #bytes  class name
 ----------------------------------------------
    1:       1786575     1831556144  [B
    2:        704618       80078064  [C
    3:        839932       33597280  java.util.LinkedHashMap$Entry
    4:        686770       21976640  java.lang.String

 Thanks & Best Regards!



--
Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen
http://www.thetaphi.de

Re: Can a Lucene-based application be made to work with a scaled Elastic Beanstalk environment on Amazon Web Services

2014-06-27 Thread Tri Cao

I would just use S3 as a data push mechanism. In your servlet's init(), you
could download the index from S3 and unpack it to a local directory, then
initialize your Lucene searcher to that directory. 

Downloading from S3 to EC2 instances is free, and 5GB would take a minute or two.
Also, if you pack the index inside your WAR file, the new instance has to download
that data anyway.

The big advantage is that this also allows you to update your index without 
repacking your deployment .war. Just upload the new index to the same location 
in S3, then restart your webapp :)
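
A rough sketch of that init() flow (the bucket name, key prefix, and local path
are made up, and the download uses the AWS SDK's TransferManager):

    import java.io.File;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;

    import com.amazonaws.services.s3.transfer.MultipleFileDownload;
    import com.amazonaws.services.s3.transfer.TransferManager;

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.MMapDirectory;

    public class SearchServlet extends HttpServlet {
        private volatile IndexSearcher searcher; // used by doGet() (not shown)

        @Override
        public void init() throws ServletException {
            try {
                // pull every object under the "index" prefix onto the instance's local disk;
                // the no-arg TransferManager picks up the instance's IAM role credentials
                File local = new File("/var/lib/myapp");
                TransferManager tm = new TransferManager();
                MultipleFileDownload dl = tm.downloadDirectory("my-index-bucket", "index", local);
                dl.waitForCompletion();
                tm.shutdownNow();

                // open the freshly downloaded local copy for searching
                searcher = new IndexSearcher(
                        DirectoryReader.open(new MMapDirectory(new File(local, "index"))));
            } catch (Exception e) {
                throw new ServletException("Could not fetch index from S3", e);
            }
        }
    }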

Hope this helps,
Tri

On Jun 27, 2014, at 04:13 AM, Paul Taylor paul_t...@fastmail.fm wrote:

Hi

I have a simple WAR-based web application that uses Lucene-created 
indexes to provide search results in an XML format.
It works fine locally, but I want to deploy it using Elastic Beanstalk 
within Amazon Web Services.


Problem 1 is that the WAR definition doesn't seem to provide a location for 
data files (as opposed to config files), so when I deploy the WAR with EB 
it doesn't work at first because it has no access to the data (the Lucene 
indexes). However, I solved this by connecting to the underlying EC2 
instance, copying the Lucene indexes from S3 to the instance, and 
ensuring the file location is defined in the WAR's web.xml file.


Problem 2 is more problematic. I'm looking at AWS and EB because I wanted 
a way to deploy the application with little ongoing admin overhead, and I 
like the way EB does load balancing and auto-scaling for you, starting 
and stopping additional instances as required to meet demand. However, 
these automatically started instances will not have access to the index 
files.


Possible solutions could be

1. Is there a location where I can store the index data within the WAR itself? 
The index is only 5GB, so I do have space on my root disk to store the 
indexes in the WAR if there is a way to use them. Tomcat would also need 
to unwar the file at deployment; I can't see whether Tomcat on AWS does this.


2. A way for EC2 instances to be started with the data preloaded in some way.

(BTW, I'm aware of CloudSearch, but it's not an avenue I want to go down.)

Does anybody have any experience of this, please?

Paul


