Re: How best to handle a reasonable amount to data (25TB+)

2012-02-07 Thread Li Li
It depends on your machines. In our application, we index about
30,000,000 (30M) docs per shard, and the response time is about 150ms. Our
machines have about 48GB of memory; about 25GB is allocated to Solr and the rest
is left for the Linux disk cache.
Scaling from our application's numbers, indexing 1.25T docs would take 40+ machines.
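
As a rough back-of-the-envelope sketch of that kind of estimate (the 30M docs/shard
capacity is the figure above; the corpus size is a placeholder to swap for your real
numbers, and the class is purely illustrative):

// Rough shard/machine count estimate; all numbers are illustrative.
public class ShardEstimate {
    public static void main(String[] args) {
        long docsPerShard = 30000000L;   // ~30M docs/shard at ~150ms response time (from above)
        long totalDocs = 1260000000L;    // ~1.26 billion docs, the figure settled on later in this thread
        long shards = (totalDocs + docsPerShard - 1) / docsPerShard; // round up
        System.out.println("Estimated shards (one per machine): " + shards);
        // 42 shards for ~1.26B docs, in line with the "40+ machines" estimate above;
        // the 1.25 trillion figure from the original question would instead need ~41,700 shards.
    }
}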

On Mon, Feb 6, 2012 at 10:50 AM, Peter Miller <
peter.mil...@objectconsulting.com.au> wrote:

> Hi,
>
> I have a little bit of an unusual set of requirements, and I am looking
> for advice. I have researched the archives, and seen some relevant posts,
> but they are fairly old and not specifically a match, so I thought I would
> give this a try.
>
> We will eventually have about 50TB raw, non-searchable data and 25TB of
> search attributes to handle in Lucene, across about 1.25 trillion
> documents. The app is write once, read many. There are many document types
> involved that have to be able to be searched separately or together, with
> some common attributes, but also unique ones per type. I plan on using a
> JCP implementation that uses Lucene under the covers. The data itself is
> not searchable, only the attributes. I plan to hook the JCP repo
> (ModeShape) up to the OpenStack Object Storage on commodity hardware
> eventually with 5 machines, each with 24 x 2TB drives. This should allow
> for redundancy (3 copies), although I would suppose we would add bigger
> drives as we go on.
>
> Since there is such a lot of data to index (not outrageous amounts for
> these days, but a bit chunky), I was sort of assuming that the Lucene
> indexes would go on the object storage solution too, to handle availability
> and other infrastructure issues. Most of the searches would be
> date-constrained, so I thought that the indexes could be sharded by date.
>
> There would be a local disk index being built near real time on the JCP
> hardware that could be regularly merged in with the main indexes on the
> object storage, I suppose.
>
> Does that make sense, and would it work? Sorry, but this is just
> theoretical at the moment and I'm not experienced in Lucene, as you can no
> doubt tell.
>
> I came across a piece that was talking about Hadoop and distributed Solr,
> http://blog.mgm-tp.com/2010/09/hadoop-log-management-part4/, and I'm now
> wondering if that would be a superior approach? Or any other suggestions?
>
> Many Thanks,
> The Captn
>
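
On the date-sharding idea in the quoted question, a minimal sketch of searching several
date-sharded indexes as one (plain Lucene 3.x with FSDirectory-backed shards; the paths
and the one-shard-per-period layout are only an assumption for illustration):

import java.io.File;
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

// Open only the shards covering the requested date range and present them
// as one logical index via MultiReader.
public class DateShardedSearch {
    public static IndexSearcher openSearcherFor(String... shardPaths) throws IOException {
        IndexReader[] readers = new IndexReader[shardPaths.length];
        for (int i = 0; i < shardPaths.length; i++) {
            readers[i] = IndexReader.open(FSDirectory.open(new File(shardPaths[i])));
        }
        return new IndexSearcher(new MultiReader(readers));
    }
}

A date-constrained query then only touches the shards whose period overlaps the query
range, which keeps the working set small even when the total index is far larger than RAM.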


RE: How best to handle a reasonable amount to data (25TB+)

2012-02-07 Thread Peter Miller
Well, I am sooo embarrassed: I haven't stuffed this badly in quite a while. But 
in the end, 13 shards is the right number. My calculator work was OK, my 
English usage atrocious.

I'm still interested in opinions on using object storage for (static) indexes
big enough that they won't all fit in memory.
 
That's a great resource you reference. Thanks so much, The Captn

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, 8 February 2012 1:38 PM
To: java-user@lucene.apache.org
Subject: Re: How best to handle a reasonable amount to data (25TB+)

I'm all confused. 100M X 13 shards = 1.3G records, not 1.25 T

But I get it: 1.5 x 10^7 x 12 x 7 = 1.26 x 10^9 = 1.26 billion, or am I off
base again? But yes, at 100M records per shard that would be 13 servers.

As for whether 100M documents/shard is reasonable... it depends (tm).
There are so many variables
that the *only* way is to try it with *your* data and *your* queries.
Otherwise it's just guessing. Are you
faceting? Sorting? Do you have 10 unique terms/field? 10M unique terms? 10B 
unique terms?
All that stuff goes into the mix to determine how many documents a shard can 
hold and still get adequate performance.

Not to mention the question "what's the hardware"? A MacBook Air with 4G 
memory? A monster piece of metal with a bazillion gigs of memory and SSDs?

All that said, and especially with trunk, 100M documents/shard is quite 
possible. So is 10M docs/shard. And it's not even, really, the size of the 
documents that solely determines the requirements, it's this weird calculation 
of how many docs, how many unique terms/doc and how you're searching them. I 
expect your documents are quite small, so that may help. Some.

Try filling out the spreadsheet here:
http://www.lucidimagination.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/
and you'll swiftly find out how hard abstract estimations are

Best
Erick

On Tue, Feb 7, 2012 at 9:07 PM, Peter Miller 
 wrote:
> Oops again! Turns out I got to the right result earlier by the wrong means! I 
> found this reference (http://www.dejavutechnologies.com/faq-solr-lucene.html) 
> that states shards can be up to 100,000,000 documents. So, I'm back to 13 
> shards again. Phew!
>
> Now I'm just wondering if Cassandra/Lucandra would be a better option 
> anyways. If Cassandra offers some of the same advantages as OpenStack Swift 
> object store does, then it should be the way to go.
>
> Still looking for thoughts...
>
> Thanks, The Captn
>
> -Original Message-
> From: Peter Miller [mailto:peter.mil...@objectconsulting.com.au]
> Sent: Wednesday, 8 February 2012 12:20 PM
> To: java-user@lucene.apache.org
> Subject: RE: How best to handle a reasonable amount to data (25TB+)
>
> Whoops! Very poor basic maths, I should have written it down. I was thinking 
> 13 shards. But yes, 13,000 is a bit different. Now I'm in even more need of 
> help.
>
> How is "easy" - 15 million audit records a month, coming from several active 
> systems, and a requirement to keep and search across seven years of data.
>
> 
>
> Thanks a lot,
> The Captn
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Wednesday, 8 February 2012 12:39 AM
> To: java-user@lucene.apache.org
> Subject: Re: How best to handle a reasonable amount to data (25TB+)
>
> I'm curious what the nature of your data is such that you have 1.25 trillion 
> documents. Even at 100M/shard, you're still talking 12,500 shards. The 
> "laggard"
> problem will rear its ugly
> head, not to mention the administration of that many machines will be, shall 
> we say, non-trivial...
>
> Best
> Erick
>
> On Mon, Feb 6, 2012 at 11:17 PM, Peter Miller 
>  wrote:
>> Thanks for the response. Actually, I am more concerned with trying to use an 
>> Object Store for the indexes. The next concern is the use of a local index 
>> versus the sharded ones, but I'm more relaxed about that now after thinking 
>> about it. I see that index shards could be up to 100 million documents, so 
>> that makes the 1.25 trillion number look reasonable.
>>
>> Any other thoughts?
>>
>> Thanks,
>> The Captn.
>>
>> -Original Message-
>> From: ppp c [mailto:peter.c.e...@gmail.com]
>> Sent: Monday, 6 February 2012 5:29 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: How best to handle a reasonable amount to data (25TB+)
>>
>> It sounds like this is not an issue with Lucene itself but with the logic of your app.
>> If you're worried about having too many docs in one index, you can make multiple indexes,
>> then search across them and merge the results.
>>
>> On Mon, Feb 6, 2012 at 10:50 AM, Peter Miller < 
>> peter.mil...@objectconsulting.com.au> wrote:
>>
>>> Hi,
>>>
>>> I have a little bit of an unusual set of requirements, and I am 
>>> looking for advice. I have researched the archives, and seen some 
>>> relevant posts, but they are fairly old and not specifically a 
>>> match, so I thought I would give this a try.
>>>
>>> We wi

Re: How best to handle a reasonable amount to data (25TB+)

2012-02-07 Thread Erick Erickson
I'm all confused. 100M X 13 shards = 1.3G records, not 1.25 T

But I get it: 1.5 x 10^7 x 12 x 7 = 1.26 x 10^9 = 1.26 billion, or am
I off base again? But yes, at 100M
records per shard that would be 13 servers.

As for whether 100M documents/shard is reasonable... it depends (tm).
There are so many variables
that the *only* way is to try it with *your* data and *your* queries.
Otherwise it's just guessing. Are you
faceting? Sorting? Do you have 10 unique terms/field? 10M unique
terms? 10B unique terms?
All that stuff goes into the mix to determine how many documents a
shard can hold and still get
adequate performance.

Not to mention the question "what's the hardware"? A MacBook Air with
4G memory? A monster
piece of metal with a bazillion gigs of memory and SSDs?

All that said, and especially with trunk, 100M documents/shard is
quite possible. So is
10M docs/shard. And it's not even, really, the size of the documents
that solely
determines the requirements, it's this weird calculation of how many
docs, how many
unique terms/doc and how you're searching them. I expect your documents are
quite small, so that may help. Some.

Try filling out the spreadsheet here:
http://www.lucidimagination.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/
and you'll swiftly find out how hard abstract estimations are

Best
Erick

On Tue, Feb 7, 2012 at 9:07 PM, Peter Miller
 wrote:
> Oops again! Turns out I got to the right result earlier by the wrong means! I 
> found this reference (http://www.dejavutechnologies.com/faq-solr-lucene.html) 
> that states shards can be up to 100,000,000 documents. So, I'm back to 13 
> shards again. Phew!
>
> Now I'm just wondering if Cassandra/Lucandra would be a better option 
> anyways. If Cassandra offers some of the same advantages as OpenStack Swift 
> object store does, then it should be the way to go.
>
> Still looking for thoughts...
>
> Thanks, The Captn
>
> -Original Message-
> From: Peter Miller [mailto:peter.mil...@objectconsulting.com.au]
> Sent: Wednesday, 8 February 2012 12:20 PM
> To: java-user@lucene.apache.org
> Subject: RE: How best to handle a reasonable amount to data (25TB+)
>
> Whoops! Very poor basic maths, I should have written it down. I was thinking 
> 13 shards. But yes, 13,000 is a bit different. Now I'm in even more need of 
> help.
>
> How is "easy" - 15 million audit records a month, coming from several active 
> systems, and a requirement to keep and search across seven years of data.
>
> 
>
> Thanks a lot,
> The Captn
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Wednesday, 8 February 2012 12:39 AM
> To: java-user@lucene.apache.org
> Subject: Re: How best to handle a reasonable amount to data (25TB+)
>
> I'm curious what the nature of your data is such that you have 1.25 trillion 
> documents. Even at 100M/shard, you're still talking 12,500 shards. The 
> "laggard"
> problem will rear its ugly
> head, not to mention the administration of that many machines will be, shall 
> we say, non-trivial...
>
> Best
> Erick
>
> On Mon, Feb 6, 2012 at 11:17 PM, Peter Miller 
>  wrote:
>> Thanks for the response. Actually, I am more concerned with trying to use an 
>> Object Store for the indexes. The next concern is the use of a local index 
>> versus the sharded ones, but I'm more relaxed about that now after thinking 
>> about it. I see that index shards could be up to 100 million documents, so 
>> that makes the 1.25 trillion number look reasonable.
>>
>> Any other thoughts?
>>
>> Thanks,
>> The Captn.
>>
>> -Original Message-
>> From: ppp c [mailto:peter.c.e...@gmail.com]
>> Sent: Monday, 6 February 2012 5:29 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: How best to handle a reasonable amount to data (25TB+)
>>
>> It sounds like this is not an issue with Lucene itself but with the logic of your app.
>> If you're worried about having too many docs in one index, you can make multiple indexes,
>> then search across them and merge the results.
>>
>> On Mon, Feb 6, 2012 at 10:50 AM, Peter Miller < 
>> peter.mil...@objectconsulting.com.au> wrote:
>>
>>> Hi,
>>>
>>> I have a little bit of an unusual set of requirements, and I am
>>> looking for advice. I have researched the archives, and seen some
>>> relevant posts, but they are fairly old and not specifically a match,
>>> so I thought I would give this a try.
>>>
>>> We will eventually have about 50TB raw, non-searchable data and 25TB
>>> of search attributes to handle in Lucene, across about 1.25 trillion
>>> documents. The app is write once, read many. There are many document
>>> types involved that have to be able to be searched separately or
>>> together, with some common attributes, but also unique ones per type.
>>> I plan on using a JCP implementation that uses Lucene under the
>>> covers. The data itself is not searchable, only the attributes. I
>>> plan to hook the JCP repo
>>> (ModeShape) up to the OpenStack Object Storage on commodity hardware
>>> eventually w

Re: Why read past EOF

2012-02-07 Thread superruiye
import java.io.IOException;
import java.util.List;

import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexDeletionPolicy;

public class PostponeCommitDeletionPolicy implements IndexDeletionPolicy {
    // Postpone window in milliseconds (commit timestamps are in ms).
    private static final long deletionPostPone = 60;

    public void onInit(List<? extends IndexCommit> commits) {
        // Note that commits.size() should normally be 1:
        onCommit(commits);
    }

    /**
     * Delete a commit only once it is more than deletionPostPone ms older
     * than the newest commit; newer commits are left alone.
     */
    public void onCommit(List<? extends IndexCommit> commits) {
        // Note that commits.size() should normally be 2 (if not
        // called by onInit above):
        int size = commits.size();
        try {
            long lastCommitTimestamp = commits.get(size - 1).getTimestamp();
            for (int i = 0; i < size - 1; i++) {
                if (lastCommitTimestamp - commits.get(i).getTimestamp() > deletionPostPone) {
                    commits.get(i).delete();
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
--
indexWriterConfig.setIndexDeletionPolicy(new
PostponeCommitDeletionPolicy());
--
and I use a timer task (every 10 minutes) to reopen the IndexSearcher, but I still
get "read past EOF"... the trace:
java.io.IOException: read past EOF
        at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:207)
        at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39)
        at org.apache.lucene.store.DataInput.readInt(DataInput.java:84)
        at org.apache.lucene.store.BufferedIndexInput.readInt(BufferedIndexInput.java:153)
        at org.apache.lucene.index.TermVectorsReader.checkValidFormat(TermVectorsReader.java:197)
        at org.apache.lucene.index.TermVectorsReader.<init>(TermVectorsReader.java:86)
        at org.apache.lucene.index.SegmentCoreReaders.openDocStores(SegmentCoreReaders.java:221)
        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:117)
        at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:93)
        at org.apache.lucene.index.DirectoryReader.<init>(DirectoryReader.java:113)
        at org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(ReadOnlyDirectoryReader.java:29)
        at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:81)
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:754)
        at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:75)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:421)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:281)
        at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:89)
        at com.ableskysearch.migration.timertask.ReopenIndexSearcherTask.runAsPeriod(ReopenIndexSearcherTask.java:40)
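
For what it's worth, a minimal reopen sketch (assuming Lucene 3.5+, where
IndexReader.openIfChanged is available; on earlier 3.x releases IndexReader.reopen()
plays the same role). This is not the code from the task above, just an illustration of
closing the old reader only after the new one has been opened and no search still uses it:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

// Hypothetical reopen helper (names invented for illustration).
public class SearcherReopener {
    private volatile IndexSearcher current;

    public SearcherReopener(IndexReader initialReader) {
        this.current = new IndexSearcher(initialReader);
    }

    public IndexSearcher getSearcher() {
        return current;
    }

    public synchronized void maybeReopen() throws IOException {
        IndexReader oldReader = current.getIndexReader();
        IndexReader newReader = IndexReader.openIfChanged(oldReader); // null if nothing changed
        if (newReader != null) {
            current = new IndexSearcher(newReader);
            oldReader.close(); // only safe once no in-flight search still uses the old searcher
        }
    }
}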


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Why-read-past-EOF-tp3639401p3724672.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: How best to handle a reasonable amount to data (25TB+)

2012-02-07 Thread Peter Miller
Oops again! Turns out I got to the right result earlier by the wrong means! I 
found this reference (http://www.dejavutechnologies.com/faq-solr-lucene.html) 
that states shards can be up to 100,000,000 documents. So, I'm back to 13 
shards again. Phew!

Now I'm just wondering if Cassandra/Lucandra would be a better option anyways. 
If Cassandra offers some of the same advantages as OpenStack Swift object store 
does, then it should be the way to go.

Still looking for thoughts...

Thanks, The Captn

-Original Message-
From: Peter Miller [mailto:peter.mil...@objectconsulting.com.au] 
Sent: Wednesday, 8 February 2012 12:20 PM
To: java-user@lucene.apache.org
Subject: RE: How best to handle a reasonable amount to data (25TB+)

Whoops! Very poor basic maths, I should have written it down. I was thinking 13 
shards. But yes, 13,000 is a bit different. Now I'm in even more need of help. 

How is "easy" - 15 million audit records a month, coming from several active 
systems, and a requirement to keep and search across seven years of data.



Thanks a lot,
The Captn

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, 8 February 2012 12:39 AM
To: java-user@lucene.apache.org
Subject: Re: How best to handle a reasonable amount to data (25TB+)

I'm curious what the nature of your data is such that you have 1.25 trillion 
> documents. Even at 100M/shard, you're still talking 12,500 shards. The 
> "laggard"
> problem will rear its ugly
head, not to mention the administration of that many machines will be, shall we 
say, non-trivial...

Best
Erick

On Mon, Feb 6, 2012 at 11:17 PM, Peter Miller 
 wrote:
> Thanks for the response. Actually, I am more concerned with trying to use an 
> Object Store for the indexes. The next concern is the use of a local index 
> versus the sharded ones, but I'm more relaxed about that now after thinking 
> about it. I see that index shards could be up to 100 million documents, so 
> that makes the 1.25 trillion number look reasonable.
>
> Any other thoughts?
>
> Thanks,
> The Captn.
>
> -Original Message-
> From: ppp c [mailto:peter.c.e...@gmail.com]
> Sent: Monday, 6 February 2012 5:29 PM
> To: java-user@lucene.apache.org
> Subject: Re: How best to handle a reasonable amount to data (25TB+)
>
> It sounds like this is not an issue with Lucene itself but with the logic of your app.
> If you're worried about having too many docs in one index, you can make multiple indexes,
> then search across them and merge the results.
>
> On Mon, Feb 6, 2012 at 10:50 AM, Peter Miller < 
> peter.mil...@objectconsulting.com.au> wrote:
>
>> Hi,
>>
>> I have a little bit of an unusual set of requirements, and I am 
>> looking for advice. I have researched the archives, and seen some 
>> relevant posts, but they are fairly old and not specifically a match, 
>> so I thought I would give this a try.
>>
>> We will eventually have about 50TB raw, non-searchable data and 25TB 
>> of search attributes to handle in Lucene, across about 1.25 trillion 
>> documents. The app is write once, read many. There are many document 
>> types involved that have to be able to be searched separately or 
>> together, with some common attributes, but also unique ones per type.
>> I plan on using a JCP implementation that uses Lucene under the 
>> covers. The data itself is not searchable, only the attributes. I 
>> plan to hook the JCP repo
>> (ModeShape) up to the OpenStack Object Storage on commodity hardware 
>> eventually with 5 machines, each with 24 x 2TB drives. This should 
>> allow for redundancy (3 copies), although I would suppose we would 
>> add bigger drives as we go on.
>>
>> Since there is such a lot of data to index (not outrageous amounts 
>> for these days, but a bit chunky), I was sort of assuming that the 
>> Lucene indexes would go on the object storage solution too, to handle 
>> availability and other infrastructure issues. Most of the searches 
>> would be date-constrained, so I thought that the indexes could be sharded by 
>> date.
>>
>> There would be a local disk index being built near real time on the 
>> JCP hardware that could be regularly merged in with the main indexes 
>> on the object storage, I suppose.
>>
>> Does that make sense, and would it work? Sorry, but this is just 
>> theoretical at the moment and I'm not experienced in Lucene, as you 
>> can no doubt tell.
>>
>> I came across a piece that was talking about Hadoop and distributed 
>> Solr, http://blog.mgm-tp.com/2010/09/hadoop-log-management-part4/,
>> and I'm now wondering if that would be a superior approach? Or any other 
>> suggestions?
>>
>> Many Thanks,
>> The Captn
>>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For a

RE: How best to handle a reasonable amount to data (25TB+)

2012-02-07 Thread Peter Miller
Whoops! Very poor basic maths, I should have written it down. I was thinking 13 
shards. But yes, 13,000 is a bit different. Now I'm in even more need of help. 

How is "easy" - 15 million audit records a month, coming from several active 
systems, and a requirement to keep and search across seven years of data.



Thanks a lot,
The Captn

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, 8 February 2012 12:39 AM
To: java-user@lucene.apache.org
Subject: Re: How best to handle a reasonable amount to data (25TB+)

I'm curious what the nature of your data is such that you have 1.25 trillion 
documents. Even at 100M/shard, you're still talking 12,500 shards. The 
"laggard"
problem will rear its ugly
head, not to mention the administration of that many machines will be, shall we 
say, non-trivial...

Best
Erick

On Mon, Feb 6, 2012 at 11:17 PM, Peter Miller 
 wrote:
> Thanks for the response. Actually, I am more concerned with trying to use an 
> Object Store for the indexes. The next concern is the use of a local index 
> versus the sharded ones, but I'm more relaxed about that now after thinking 
> about it. I see that index shards could be up to 100 million documents, so 
> that makes the 1.25 trillion number look reasonable.
>
> Any other thoughts?
>
> Thanks,
> The Captn.
>
> -Original Message-
> From: ppp c [mailto:peter.c.e...@gmail.com]
> Sent: Monday, 6 February 2012 5:29 PM
> To: java-user@lucene.apache.org
> Subject: Re: How best to handle a reasonable amount to data (25TB+)
>
> It sounds like this is not an issue with Lucene itself but with the logic of your app.
> If you're worried about having too many docs in one index, you can make multiple indexes,
> then search across them and merge the results.
>
> On Mon, Feb 6, 2012 at 10:50 AM, Peter Miller < 
> peter.mil...@objectconsulting.com.au> wrote:
>
>> Hi,
>>
>> I have a little bit of an unusual set of requirements, and I am 
>> looking for advice. I have researched the archives, and seen some 
>> relevant posts, but they are fairly old and not specifically a match, 
>> so I thought I would give this a try.
>>
>> We will eventually have about 50TB raw, non-searchable data and 25TB 
>> of search attributes to handle in Lucene, across about 1.25 trillion 
>> documents. The app is write once, read many. There are many document 
>> types involved that have to be able to be searched separately or 
>> together, with some common attributes, but also unique ones per type.
>> I plan on using a JCP implementation that uses Lucene under the 
>> covers. The data itself is not searchable, only the attributes. I 
>> plan to hook the JCP repo
>> (ModeShape) up to the OpenStack Object Storage on commodity hardware 
>> eventually with 5 machines, each with 24 x 2TB drives. This should 
>> allow for redundancy (3 copies), although I would suppose we would 
>> add bigger drives as we go on.
>>
>> Since there is such a lot of data to index (not outrageous amounts 
>> for these days, but a bit chunky), I was sort of assuming that the 
>> Lucene indexes would go on the object storage solution too, to handle 
>> availability and other infrastructure issues. Most of the searches 
>> would be date-constrained, so I thought that the indexes could be sharded by 
>> date.
>>
>> There would be a local disk index being built near real time on the 
>> JCP hardware that could be regularly merged in with the main indexes 
>> on the object storage, I suppose.
>>
>> Does that make sense, and would it work? Sorry, but this is just 
>> theoretical at the moment and I'm not experienced in Lucene, as you 
>> can no doubt tell.
>>
>> I came across a piece that was talking about Hadoop and distributed 
>> Solr, http://blog.mgm-tp.com/2010/09/hadoop-log-management-part4/, 
>> and I'm now wondering if that would be a superior approach? Or any other 
>> suggestions?
>>
>> Many Thanks,
>> The Captn
>>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Applying LUCENE-3653 patch to Lucene 3.0.3

2012-02-07 Thread Dhruv
Hi,

My company is using an older version of Lucene (3.0.3). In my profiling
results with 3.0.3, I have found that my app's threads were blocked due to
the issue mentioned at LUCENE-3653. Although I was able to use the 3.6 line
which fixes this problem, we are still in the process of conducting
performance regression testing with Lucene 3.5.3. I would like to apply the
LUCENE-3653 patch to the 3.0.3 line and compare the results. However, I
can't apply the patch because of a missing
"lucene/src/java/org/apache/lucene/index/SegmentCoreReaders.java" file in
the 3.0.3 line.

Can someone please give a rough outline of what is involved in modifying
3.0.3 to just accommodate LUCENE-3653? Can I just strip out the changes
affecting SegmentCoreReaders.java?

Thanks,
-Dhruv


Re: How best to handle a reasonable amount to data (25TB+)

2012-02-07 Thread Erick Erickson
I'm curious what the nature of your data is such that you have 1.25
trillion documents. Even
at 100M/shard, you're still talking 12,500 shards. The "laggard"
problem will rear its ugly
head, not to mention the administration of that many machines will be,
shall we say, non-trivial...

Best
Erick

On Mon, Feb 6, 2012 at 11:17 PM, Peter Miller
 wrote:
> Thanks for the response. Actually, I am more concerned with trying to use an 
> Object Store for the indexes. The next concern is the use of a local index 
> versus the sharded ones, but I'm more relaxed about that now after thinking 
> about it. I see that index shards could be up to 100 million documents, so 
> that makes the 1.25 trillion number look reasonable.
>
> Any other thoughts?
>
> Thanks,
> The Captn.
>
> -Original Message-
> From: ppp c [mailto:peter.c.e...@gmail.com]
> Sent: Monday, 6 February 2012 5:29 PM
> To: java-user@lucene.apache.org
> Subject: Re: How best to handle a reasonable amount to data (25TB+)
>
> It sounds like this is not an issue with Lucene itself but with the logic of your app.
> If you're worried about having too many docs in one index, you can make multiple indexes,
> then search across them and merge the results.
>
> On Mon, Feb 6, 2012 at 10:50 AM, Peter Miller < 
> peter.mil...@objectconsulting.com.au> wrote:
>
>> Hi,
>>
>> I have a little bit of an unusual set of requirements, and I am
>> looking for advice. I have researched the archives, and seen some
>> relevant posts, but they are fairly old and not specifically a match,
>> so I thought I would give this a try.
>>
>> We will eventually have about 50TB raw, non-searchable data and 25TB
>> of search attributes to handle in Lucene, across about 1.25 trillion
>> documents. The app is write once, read many. There are many document
>> types involved that have to be able to be searched separately or
>> together, with some common attributes, but also unique ones per type.
>> I plan on using a JCP implementation that uses Lucene under the
>> covers. The data itself is not searchable, only the attributes. I plan
>> to hook the JCP repo
>> (ModeShape) up to the OpenStack Object Storage on commodity hardware
>> eventually with 5 machines, each with 24 x 2TB drives. This should
>> allow for redundancy (3 copies), although I would suppose we would add
>> bigger drives as we go on.
>>
>> Since there is such a lot of data to index (not outrageous amounts for
>> these days, but a bit chunky), I was sort of assuming that the Lucene
>> indexes would go on the object storage solution too, to handle
>> availability and other infrastructure issues. Most of the searches
>> would be date-constrained, so I thought that the indexes could be sharded by 
>> date.
>>
>> There would be a local disk index being built near real time on the
>> JCP hardware that could be regularly merged in with the main indexes
>> on the object storage, I suppose.
>>
>> Does that make sense, and would it work? Sorry, but this is just
>> theoretical at the moment and I'm not experienced in Lucene, as you
>> can no doubt tell.
>>
>> I came across a piece that was talking about Hadoop and distributed
>> Solr, http://blog.mgm-tp.com/2010/09/hadoop-log-management-part4/, and
>> I'm now wondering if that would be a superior approach? Or any other 
>> suggestions?
>>
>> Many Thanks,
>> The Captn
>>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Custom Payload Analyzer and Query

2012-02-07 Thread Ian Lea
How does searching with PayloadSpanUtil/PayloadTermQuery/etc. work to
exclude/filter the matching terms based on the payload within the query
itself, which was the original question?

The javadocs for PayloadSpanUtil say that the IndexReader should only
contain doc of interest so not much use for a general query on a
normal index.  PayloadTermQuery and PayloadNearQuery factor in the
value of the payloads using Similarity.scorePayload(...).  Can you
return 0 from that to exclude docs?


--
Ian.
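
On the "return 0 from scorePayload" idea, a rough sketch (assuming the Lucene 3.x
Similarity signature and a single-byte payload written at index time; the NEGATED value
and class name are invented). Note that a zero payload score only down-weights the match,
it does not by itself exclude the document from the results:

import org.apache.lucene.search.DefaultSimilarity;

// Sketch: zero out the payload contribution for terms whose payload marks them as negated.
public class NegationAwareSimilarity extends DefaultSimilarity {
    private static final byte NEGATED = 1; // hypothetical value written at index time

    @Override
    public float scorePayload(int docId, String fieldName, int start, int end,
                              byte[] payload, int offset, int length) {
        if (payload != null && length > 0 && payload[offset] == NEGATED) {
            return 0.0f;
        }
        return 1.0f;
    }
}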


On Tue, Feb 7, 2012 at 9:11 AM, Tommaso Teofili
 wrote:
> 2012/2/6 Ian Lea 
>
>> Not sure if you got an answer to this or not.  Don't recall seeing one
>> and gmail threading says not.
>>
>> > Is the use of payloads I've described appropriate?
>>
>> Sounds OK to me, although I'm not sure why you can't store the
>> metadata as a Document Field.
>>
>> > Can I exclude/filter the matching terms based on the payload within a
>> query itself ?
>>
>> I think not.  Could if the metadata was an indexed Field.
>>
>
> What you may do is initially put your metadata inside the token type, then
> use the TypeTokenFilter to filter out some of them, then "copy" them inside
> the payloads using TypeAsPayloadTokenFilter and search with
> PayloadSpanUtil/PayloadTermQuery/etc.
>
> HTH,
> Tommaso
>
>
>>
>>
>>
>> --
>> Ian.
>>
>>
>> On Mon, Jan 30, 2012 at 10:24 PM,   wrote:
>> > I'm working on providing advanced searching for annotated Medical
>> > Documents (using UIMA).  In the context of an annotated document, I
>> > identify relevant medical terms, as well as the negation of certain
>> terms.
>> >  Following what I've read and seen in Lucene examples, I've been able to
>> > provide a search that takes into account the metadata contained in the
>> > payload.  Although very primitive, I've implemented a search which
>> returns
>> > the payloads (using PayloadSpanUtil), and then excludes those terms where
>> > the payload doesn't meet the criteria.
>> >
>> > Is the use of payloads I've described appropriate?  Can I exclude/filter
>> > the matching terms based on the payload within a query itself ?   Are
>> > there any examples that do this?
>> >
>> > Cheers,
>> > Kyley
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Custom Payload Analyzer and Query

2012-02-07 Thread Tommaso Teofili
2012/2/6 Ian Lea 

> Not sure if you got an answer to this or not.  Don't recall seeing one
> and gmail threading says not.
>
> > Is the use of payloads I've described appropriate?
>
> Sounds OK to me, although I'm not sure why you can't store the
> metadata as a Document Field.
>
> > Can I exclude/filter the matching terms based on the payload within a
> query itself ?
>
> I think not.  Could if the metadata was an indexed Field.
>

What you may do is initially put your metadata inside the token type, then
use the TypeTokenFilter to filter out some of them, then "copy" them inside
the payloads using TypeAsPayloadTokenFilter and search with
PayloadSpanUtil/PayloadTermQuery/etc.

HTH,
Tommaso
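
A minimal index-time sketch of that chain (assuming Lucene 3.x with the contrib
analyzers jar on the classpath; the filter that sets the UIMA-derived type is left out
and only hinted at, and the Version constant should match your release):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.payloads.TypeAsPayloadTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Sketch only: a custom filter (not shown) would first set each token's type to the
// annotation metadata (e.g. "negated"); TypeAsPayloadTokenFilter then copies that type
// into the token's payload so it is available at query time.
public class TypePayloadAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(Version.LUCENE_35, reader);
        // ... insert the metadata-to-type filter here ...
        return new TypeAsPayloadTokenFilter(stream);
    }
}

At query time, PayloadTermQuery (org.apache.lucene.search.payloads) can then factor those
payloads into scoring via Similarity.scorePayload.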


>
>
>
> --
> Ian.
>
>
> On Mon, Jan 30, 2012 at 10:24 PM,   wrote:
> > I'm working on providing advanced searching for annotated Medical
> > Documents (using UIMA).  In the context of an annotated document, I
> > identify relevant medical terms, as well as the negation of certain
> terms.
> >  Following what I've read and seen in Lucene examples, I've been able to
> > provide a search that takes into account the metadata contained in the
> > payload.  Although very primitive, I've implemented a search which
> returns
> > the payloads (using PayloadSpanUtil), and then excludes those terms where
> > the payload doesn't meet the criteria.
> >
> > Is the use of payloads I've described appropriate?  Can I exclude/filter
> > the matching terms based on the payload within a query itself ?   Are
> > there any examples that do this?
> >
> > Cheers,
> > Kyley
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>