Re: iterate through each document in Solr

2013-05-06 Thread Dmitry Kan
Hi Ming,

Quoting my answer from a different thread (
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201210.mbox/%3ccaonbidbuzzsaqctdhtlxlgeoori_ghrjbt-84bm0zb-fsps...@mail.gmail.com%3E
):

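> > [code]
> > import java.io.File;
> > import java.util.Collections;
> >
> > import org.apache.lucene.document.Document;
> > import org.apache.lucene.document.FieldSelector;
> > import org.apache.lucene.document.SetBasedFieldSelector;
> > import org.apache.lucene.index.IndexReader;
> > import org.apache.lucene.store.Directory;
> > import org.apache.lucene.store.FSDirectory;
> >
> > // open the index directory read-only (Lucene 3.x API)
> > Directory indexDir = FSDirectory.open(new File(pathToDir));
> > IndexReader input = IndexReader.open(indexDir, true);
> >
> > FieldSelector fieldSelector = new SetBasedFieldSelector(
> >     null, // to retrieve all stored fields
> >     Collections.<String>emptySet());
> >
> > // doc IDs run from 0 to maxDoc() - 1 and include deleted documents
> > int maxDoc = input.maxDoc();
> > for (int i = 0; i < maxDoc; i++) {
> >     if (input.isDeleted(i)) {
> >         // deleted document found, retrieve it
> >         // (to scan live documents instead, negate the check above)
> >         Document document = input.document(i, fieldSelector);
> >         // analyze its field values here...
> >     }
> > }
> > [/code]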

Have a look here for the code of a complete standalone example. It does a
different thing with the Lucene index, so *do not* run it on your
index.

Dmitry



On Mon, May 6, 2013 at 7:36 PM, Mingfeng Yang  wrote:

> Hi Dmitry,
>
> My index is not sharded, and since its size is so big, sharding won't help
> much on the paging issue.  Do you know of any API that can read from the
> Lucene binary index directly? It would be nice if we could just scan
> through the docs directly.
>
> Thanks!
> Ming-
>
>
> On Mon, May 6, 2013 at 3:33 AM, Dmitry Kan  wrote:
>
> > Are you doing it once? Is your index sharded? If so, can you ask each
> > shard individually?
> > Another way would be to do it on Lucene level, i.e. read from the binary
> > indices (API exists).
> >
> > Dmitry
> >
> >
> > On Mon, May 6, 2013 at 5:48 AM, Mingfeng Yang wrote:
> >
> > > Dear Solr Users,
> > >
> > > Does anyone know what is the best way to iterate through each document
> > > in a Solr index with billion entries?
> > >
> > > I tried to use  select?q=*:*&start=xx&rows=500  to get 500 docs each
> > > time and then change start value, but it got very slow after getting
> > > through about 10 million docs.
> > >
> > > Thanks,
> > > Ming-
> > >
> >
>


Re: iterate through each document in Solr

2013-05-06 Thread Mingfeng Yang
Andre,

Thanks for the info!  Unfortunately, my Solr is on version 3.6, and it looks
like those options are not available. :(

Ming-


On Mon, May 6, 2013 at 5:32 AM, Andre Bois-Crettez wrote:

> On 05/06/2013 06:03 AM, Michael Sokolov wrote:
>
>> On 5/5/13 7:48 PM, Mingfeng Yang wrote:
>>
>>> Dear Solr Users,
>>>
>>> Does anyone know what is the best way to iterate through each document
>>> in a Solr index with billion entries?
>>>
>>> I tried to use  select?q=*:*&start=xx&rows=500  to get 500 docs each time
>>> and then change start value, but it got very slow after getting through
>>> about 10 million docs.
>>>
>>> Thanks,
>>> Ming-
>>>
>> You need to use a unique and stable sort key and get documents >
>> sortkey.  For example, if you have a unique key, retrieve documents
>> ordered by the unique key, and for each batch get documents > max (key)
>> from the previous batch
>>
>> -Mike
>>
> There are more details on the wiki:
> http://wiki.apache.org/solr/CommonQueryParameters#pageDoc_and_pageScore
>
>
> --
> André Bois-Crettez
>
> Search technology, Kelkoo
> http://www.kelkoo.com/
>


Re: iterate through each document in Solr

2013-05-06 Thread Mingfeng Yang
Hi Dmitry,

My index is not sharded, and since its size is so big, sharding won't help
much on the paging issue.  Do you know of any API that can read from the
Lucene binary index directly? It would be nice if we could just scan
through the docs directly.

Thanks!
Ming-


On Mon, May 6, 2013 at 3:33 AM, Dmitry Kan  wrote:

> Are you doing it once? Is your index sharded? If so, can you ask each shard
> individually?
> Another way would be to do it on Lucene level, i.e. read from the binary
> indices (API exists).
>
> Dmitry
>
>
> On Mon, May 6, 2013 at 5:48 AM, Mingfeng Yang wrote:
>
> > Dear Solr Users,
> >
> > Does anyone know what is the best way to iterate through each document
> > in a Solr index with billion entries?
> >
> > I tried to use  select?q=*:*&start=xx&rows=500  to get 500 docs each time
> > and then change start value, but it got very slow after getting through
> > about 10 million docs.
> >
> > Thanks,
> > Ming-
> >
>


Re: iterate through each document in Solr

2013-05-06 Thread Andre Bois-Crettez

On 05/06/2013 06:03 AM, Michael Sokolov wrote:

> On 5/5/13 7:48 PM, Mingfeng Yang wrote:
>
>> Dear Solr Users,
>>
>> Does anyone know what is the best way to iterate through each document in a
>> Solr index with billion entries?
>>
>> I tried to use  select?q=*:*&start=xx&rows=500  to get 500 docs each time
>> and then change start value, but it got very slow after getting through
>> about 10 million docs.
>>
>> Thanks,
>> Ming-
>
> You need to use a unique and stable sort key and get documents >
> sortkey.  For example, if you have a unique key, retrieve documents
> ordered by the unique key, and for each batch get documents > max (key)
> from the previous batch
>
> -Mike

There are more details on the wiki:
http://wiki.apache.org/solr/CommonQueryParameters#pageDoc_and_pageScore


--
André Bois-Crettez

Search technology, Kelkoo
http://www.kelkoo.com/


Kelkoo SAS
Simplified joint-stock company (Société par Actions Simplifiée)
Share capital: € 4.168.964,30
Registered office: 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

This message and its attachments are confidential and intended solely for
their addressees. If you are not the intended recipient of this message,
please destroy it and notify the sender.


Re: iterate through each document in Solr

2013-05-06 Thread Dmitry Kan
Are you doing it once? Is your index sharded? If so, can you ask each shard
individually?
Another way would be to do it on the Lucene level, i.e. read from the binary
indices (an API exists).

Dmitry


On Mon, May 6, 2013 at 5:48 AM, Mingfeng Yang  wrote:

> Dear Solr Users,
>
> Does anyone know what is the best way to iterate through each document in a
> Solr index with billion entries?
>
> I tried to use  select?q=*:*&start=xx&rows=500  to get 500 docs each time
> and then change start value, but it got very slow after getting through
> about 10 million docs.
>
> Thanks,
> Ming-
>


Re: iterate through each document in Solr

2013-05-05 Thread Michael Sokolov

On 5/5/13 7:48 PM, Mingfeng Yang wrote:

> Dear Solr Users,
>
> Does anyone know what is the best way to iterate through each document in a
> Solr index with billion entries?
>
> I tried to use  select?q=*:*&start=xx&rows=500  to get 500 docs each time
> and then change start value, but it got very slow after getting through
> about 10 million docs.
>
> Thanks,
> Ming-

You need to sort on a unique and stable key and, rather than increasing
start, request only documents whose key is greater than the last sort key
you have seen.  For example, if you have a unique key, retrieve documents
ordered by the unique key, and for each batch get the documents with
key > max(key) from the previous batch.


-Mike
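
Mike's suggestion could be sketched roughly as follows. This is only an
illustration, not code from the thread: the class and method names are
made up, the unique key field is assumed to be called "id", to be sortable,
and to contain no characters that need query-syntax escaping, and the
exclusive open-ended range form {key TO *} is assumed to be accepted by
your Solr version. The point is that start stays at 0 for every request
and only the lower bound of the range moves.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class KeysetPaging {

    // URL-encode one parameter value; UTF-8 is always available on the JVM.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    /**
     * Builds the request for the next batch.
     * lastKey is the largest key of the previous batch, or null for the
     * very first batch.
     */
    static String nextBatchQuery(String keyField, String lastKey, int rows) {
        // First batch: match everything. Later batches: only keys strictly
        // greater than lastKey (curly braces make the range exclusive).
        String q = (lastKey == null)
                ? "*:*"
                : keyField + ":{" + lastKey + " TO *}";
        // No start parameter: every request reads from the top of its range.
        return "select?q=" + encode(q)
                + "&sort=" + encode(keyField + " asc")
                + "&rows=" + rows;
    }

    public static void main(String[] args) {
        // "doc0000000500" is a hypothetical key value used for illustration.
        System.out.println(nextBatchQuery("id", null, 500));
        System.out.println(nextBatchQuery("id", "doc0000000500", 500));
    }
}
```

A client would call nextBatchQuery with null first, remember the largest
key in each response, pass it in for the next request, and stop when a
response comes back with fewer than rows documents.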


iterate through each document in Solr

2013-05-05 Thread Mingfeng Yang
Dear Solr Users,

Does anyone know the best way to iterate through each document in a
Solr index with a billion entries?

I tried using  select?q=*:*&start=xx&rows=500  to get 500 docs each time
and then changing the start value, but it got very slow after getting through
about 10 million docs.

Thanks,
Ming-
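
A rough back-of-the-envelope sketch of why the start/rows approach slows
down: for a request with a given start, the server has to collect and keep
sorted a window of start + rows results before it can return the last rows
of that window. The class below is a deliberately simplified cost model
(the names and the model itself are assumptions made for illustration, not
measurements) that adds up those window sizes over a full scan and compares
them with a scheme where every request only needs a window of rows.

```java
final class DeepPagingCost {

    // Number of requests needed to cover totalDocs in pages of `rows`.
    static long requests(long totalDocs, long rows) {
        return (totalDocs + rows - 1) / rows; // ceiling division
    }

    /**
     * Sum of result-window sizes when paging with start/rows.
     * Request k (0-based) has start = k * rows, so its window holds
     * (k + 1) * rows entries; summing over all requests gives
     * rows * n * (n + 1) / 2.
     */
    static long offsetPagingWork(long totalDocs, long rows) {
        long n = requests(totalDocs, rows);
        return rows * (n * (n + 1) / 2);
    }

    /** Sum of window sizes when every request only needs `rows` entries. */
    static long keysetPagingWork(long totalDocs, long rows) {
        return requests(totalDocs, rows) * rows;
    }

    public static void main(String[] args) {
        long docs = 1000000000L; // a billion documents, as in the question
        long rows = 500L;
        System.out.println("start/rows windows: " + offsetPagingWork(docs, rows));
        System.out.println("fixed-size windows: " + keysetPagingWork(docs, rows));
    }
}
```

Under this model the total grows quadratically with the number of pages
for start/rows, but only linearly when each request is bounded by rows.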