Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-28 Thread Erick Erickson
Shawn had an interesting idea on another thread. It depends
on having basically an identity field (which I see how to do
manually, but don't see how to make work as a new field type
in a distributed environment). And it's brilliantly simple: just
a range query, identity:{NNN TO *]&sort=identity asc.

Then just keep replacing the NNN with the max value you
got in the last packet. Of course this doesn't work if you
need to rank the results, but as a way to process all documents
in a corpus it seems to work. It's certainly not a general solution
to deep paging, but for the limited data dump case...

You could keep from processing the same doc twice (let's
say a doc gets updated and the identity field gets bumped)
by getting the min and max at the start of the dump.
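Roughly, in SolrJ terms, that dump loop might look like the sketch below
(illustrative only: the "identity" field name, URL, and packet size are
made up; assumes identity is a unique, indexed numeric field):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class IdentityCursorDump {
  public static void main(String[] args) throws Exception {
    SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    long last = Long.MIN_VALUE; // or min(identity) captured at the start of the dump
    while (true) {
      SolrQuery q = new SolrQuery("*:*");
      // exclusive lower bound: only docs above the max of the last packet
      q.addFilterQuery("identity:{" + last + " TO *]");
      q.setSort("identity", SolrQuery.ORDER.asc);
      q.setRows(1000); // packet size
      QueryResponse rsp = server.query(q);
      SolrDocumentList docs = rsp.getResults();
      if (docs.isEmpty()) break; // nothing left above `last`
      for (SolrDocument doc : docs) {
        // process(doc) ...
        last = ((Number) doc.getFieldValue("identity")).longValue();
      }
    }
  }
}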

But life is complicated. Sigh. It doesn't work for M/R jobs
that compose an index from pieces either.

FWIW,
Erick


On Sun, Jul 28, 2013 at 1:28 AM, Mikhail Khludnev wrote:
> On Sun, Jul 28, 2013 at 1:25 AM, Yonik Seeley  wrote:
>
>>
>> Which part is problematic... the creation of the DocList (the search),
>>
> Literally, DocList is a copy of TopDocs. Creating TopDocs is not the
> search, but the ranking.
> And ranking costs log(rows+start) per collected doc, on top of the
> O(numFound) that the search itself takes.
> Interestingly, we still pay that log() even if we ask to collect docs
> as-is with _docid_.
>
>
>> or its memory requirements (an int per doc)?
>>
> TopXxxCollector, as well as the XxxComparators, allocates the same [rows+start] entries.
>
> It's clear that once we have deep paging, we will only need to handle
> heaps sized rows (without start).
> That's fairly OK if we use Solr as a site navigation engine, but it's
> 'sub-optimal' for data-analytics use cases, where we need something like
> SELECT * FROM ... in an RDBMS. In that case, any memory allocation on a
> billion-doc index is a bummer. That's why I'm asking about removing the
> heap-based collector/comparator.
>
>
>> -Yonik
>> http://lucidworks.com
>>
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>


Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Mikhail Khludnev
On Sun, Jul 28, 2013 at 1:25 AM, Yonik Seeley  wrote:

>
> Which part is problematic... the creation of the DocList (the search),
>
Literally, DocList is a copy of TopDocs. Creating TopDocs is not the
search, but the ranking.
And ranking costs log(rows+start) per collected doc, on top of the
O(numFound) that the search itself takes.
Interestingly, we still pay that log() even if we ask to collect docs
as-is with _docid_.


> or its memory requirements (an int per doc)?
>
TopXxxCollector, as well as the XxxComparators, allocates the same [rows+start] entries.

It's clear that once we have deep paging, we will only need to handle
heaps sized rows (without start).
That's fairly OK if we use Solr as a site navigation engine, but it's
'sub-optimal' for data-analytics use cases, where we need something like
SELECT * FROM ... in an RDBMS. In that case, any memory allocation on a
billion-doc index is a bummer. That's why I'm asking about removing the
heap-based collector/comparator.
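To illustrate what removing it could mean: a collector that consumes hits
in index order needs no heap at all. A minimal sketch against the Lucene
4.x Collector API (not the actual patch):

import java.io.IOException;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Heap-free "streaming" collector: O(1) per hit instead of log(rows+start),
// and no [rows+start] allocation; each match can go straight to a writer.
public class StreamingCollector extends Collector {
  private int docBase;

  @Override
  public void setScorer(Scorer scorer) throws IOException {
    // scores are not needed when dumping docs as-is
  }

  @Override
  public void collect(int doc) throws IOException {
    int globalDoc = docBase + doc;
    // stream `globalDoc` (or its stored fields) out immediately
  }

  @Override
  public void setNextReader(AtomicReaderContext context) throws IOException {
    docBase = context.docBase;
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return true; // we do not rank, so any order is fine
  }
}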


> -Yonik
> http://lucidworks.com
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics


 


Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Yonik Seeley
On Sat, Jul 27, 2013 at 5:05 PM, Mikhail Khludnev wrote:
> anyway, even if the writer pulls docs one by one, that doesn't allow
> streaming a billion of them. Solr writes out a DocList, which is really
> problematic even in deep-paging scenarios.

Which part is problematic... the creation of the DocList (the search),
or its memory requirements (an int per doc)?

-Yonik
http://lucidworks.com


Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Mikhail Khludnev
Hello,

Please find my replies inline below.


> Let me just explain better what I found when I dug inside solr: documents
> (results of the query) are loaded before they are passed into a writer - so
> the writers are expecting to encounter the solr documents, but these
> documents were loaded by one of the components before rendering them - so
> it is kinda 'hard-coded'.

there is code at
https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java#L445
which pulls documents into the document cache.
To achieve your goal you can try to remove the document cache, or disable
lazy field loading.
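(For reference, both of those knobs live in the <query> section of
solrconfig.xml - a sketch only; the exact cache attributes vary per config:)

<!-- solrconfig.xml sketch: comment out documentCache so there is no cache
     for QueryComponent to populate, and set enableLazyFieldLoading to
     false to disable lazy field loading, per the suggestion above -->
<query>
  <!--
  <documentCache class="solr.LRUCache" size="512"
                 initialSize="512" autowarmCount="0"/>
  -->
  <enableLazyFieldLoading>false</enableLazyFieldLoading>
</query>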


> But if solr were NOT loading these docs before
> passing them to a writer, the writer could load them instead (hence lazy
> loading; the difference is in the numbers - it could deal with hundreds of
> thousands of docs, instead of the few thousand now).
>

anyway, even if the writer pulls docs one by one, that doesn't allow
streaming a billion of them. Solr writes out a DocList, which is really
problematic even in deep-paging scenarios.


>
>
> roman
>
>
> On Sat, Jul 27, 2013 at 3:52 PM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:
>
> > Roman,
> >
> > Let me briefly explain the design:
> >
> > a special RequestParser stores the servlet output stream into the context
> > https://github.com/m-khl/solr-patches/compare/streaming#L7R22
> >
> > then a special component injects a special PostFilter/DelegatingCollector
> > which writes right into the output
> > https://github.com/m-khl/solr-patches/compare/streaming#L2R146
> >
> > here is how it streams a doc - you can see it's lazy enough
> > https://github.com/m-khl/solr-patches/compare/streaming#L2R181
> >
> > I should mention that it disables the later collectors
> > https://github.com/m-khl/solr-patches/compare/streaming#L2R57
> > hence there are no facets with streaming yet, but also no memory consumption.
> >
> > This test shows how it works
> > https://github.com/m-khl/solr-patches/compare/streaming#L15R115
> >
> > all the other code is there for distributed search.
> >
> >
> >
> > On Sat, Jul 27, 2013 at 4:44 PM, Roman Chyla wrote:
> >
> > > Mikhail,
> > > If your solution gives lazy loading of solr docs /and thus streaming of
> > > huge result lists/ it should be big YES!
> > > Roman
> > > On 27 Jul 2013 07:55, "Mikhail Khludnev" wrote:
> > >
> > > > Otis,
> > > > You gave links to 'deep paging' when I asked about response
> streaming.
> > > > Let me understand. From my POV, deep paging is a special case for
> > regular
> > > > search scenarios. We definitely need it in Solr. However, if we are
> > > talking
> > > > about data analytic like problems, when we need to select an
> "endless"
> > > > stream of responses (or store them in file as Roman did), 'deep
> paging'
> > > is
> > > > a suboptimal hack.
> > > > What's your vision on this?
> > > >
> > >
> >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Principal Engineer,
> > Grid Dynamics
> >
> >
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics


 


Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Yonik Seeley
On Sat, Jul 27, 2013 at 4:30 PM, Roman Chyla  wrote:
> Let me just explain better what I found when I dug inside solr: documents
> (results of the query) are loaded before they are passed into a writer - so
> the writers are expecting to encounter the solr documents, but these
> documents were loaded by one of the components before rendering them

Hmmm, are you saying that it looks like documents are not being streamed?
Solr was designed to stream documents from a single server from day 1...
currently all that is collected up-front is the list of internal
docids (an int[]) and the stored fields are loaded and streamed back
one by one.

Of course it's certainly possible that someone introduced a bug, so we
should investigate if you think you see non-streaming action from a
single server.  Distributed is a different can of worms ;-)
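For what it's worth, that single-server streaming is also reachable from
the client side - a sketch using SolrJ's streaming callback, so the client
never buffers the whole result either (URL and field name illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.StreamingResponseCallback;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class StreamConsumer {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(1000000); // still bounded by rows; the deep-paging caveats apply
    server.queryAndStreamResponse(q, new StreamingResponseCallback() {
      @Override
      public void streamSolrDocument(SolrDocument doc) {
        // called per document as it is read off the wire
        System.out.println(doc.getFieldValue("id"));
      }

      @Override
      public void streamDocListInfo(long numFound, long start, Float maxScore) {
        System.err.println("numFound=" + numFound);
      }
    });
  }
}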

-Yonik
http://lucidworks.com


Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Roman Chyla
Hi Mikhail,

I can see it is lazy-loading, but I can't judge how much more complex it
becomes (presumably the filter dispatching mechanism also does other
things - it is there not only for streaming).

Let me just explain better what I found when I dug inside solr: documents
(results of the query) are loaded before they are passed into a writer - so
the writers are expecting to encounter the solr documents, but these
documents were loaded by one of the components before rendering them - so
it is kinda 'hard-coded'. But if solr were NOT loading these docs before
passing them to a writer, the writer could load them instead (hence lazy
loading; the difference is in the numbers - it could deal with hundreds of
thousands of docs, instead of the few thousand now).
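Server-side, the gist of such a lazy writer is tiny - a sketch using real
Solr types (DocList/DocIterator/SolrIndexSearcher) but with the wiring and
the actual response format omitted:

import java.io.IOException;
import java.io.Writer;
import org.apache.lucene.document.Document;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocList;
import org.apache.solr.search.SolrIndexSearcher;

public class LazyDocDumper {
  // iterate the internal doc ids and load stored fields one doc at a time
  // while writing, instead of a component preloading them all
  public void write(Writer out, SolrIndexSearcher searcher, DocList docList)
      throws IOException {
    DocIterator it = docList.iterator();
    while (it.hasNext()) {
      int id = it.nextDoc();           // internal Lucene doc id
      Document doc = searcher.doc(id); // loaded only now
      out.write(doc.get("id"));        // render however the 'wt' requires
      out.write('\n');
    }
  }
}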

I see one crucial point: this could work without any new handler/servlet -
solr would just gain a new parameter, something like 'lazy=true' ;) and
people could keep using whatever 'wt' they did before.

disclaimer: I don't know whether that would break other stuff, I only know
that I am using the same idea to dump what I need without breaking things
(so far... ;-)) - but obviously, I didn't want to patch solr core

roman


On Sat, Jul 27, 2013 at 3:52 PM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:

> Roman,
>
> Let me briefly explain the design:
>
> a special RequestParser stores the servlet output stream into the context
> https://github.com/m-khl/solr-patches/compare/streaming#L7R22
>
> then a special component injects a special PostFilter/DelegatingCollector
> which writes right into the output
> https://github.com/m-khl/solr-patches/compare/streaming#L2R146
>
> here is how it streams a doc - you can see it's lazy enough
> https://github.com/m-khl/solr-patches/compare/streaming#L2R181
>
> I should mention that it disables the later collectors
> https://github.com/m-khl/solr-patches/compare/streaming#L2R57
> hence there are no facets with streaming yet, but also no memory consumption.
>
> This test shows how it works
> https://github.com/m-khl/solr-patches/compare/streaming#L15R115
>
> all the other code is there for distributed search.
>
>
>
> On Sat, Jul 27, 2013 at 4:44 PM, Roman Chyla wrote:
>
> > Mikhail,
> > If your solution gives lazy loading of solr docs /and thus streaming of
> > huge result lists/ it should be big YES!
> > Roman
> > On 27 Jul 2013 07:55, "Mikhail Khludnev" wrote:
> >
> > > Otis,
> > > You gave links to 'deep paging' when I asked about response streaming.
> > > Let me understand. From my POV, deep paging is a special case for
> regular
> > > search scenarios. We definitely need it in Solr. However, if we are
> > talking
> > > about data analytic like problems, when we need to select an "endless"
> > > stream of responses (or store them in file as Roman did), 'deep paging'
> > is
> > > a suboptimal hack.
> > > What's your vision on this?
> > >
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
>


Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Mikhail Khludnev
Roman,

Let me briefly explain the design:

a special RequestParser stores the servlet output stream into the context
https://github.com/m-khl/solr-patches/compare/streaming#L7R22

then a special component injects a special PostFilter/DelegatingCollector
which writes right into the output
https://github.com/m-khl/solr-patches/compare/streaming#L2R146

here is how it streams a doc - you can see it's lazy enough
https://github.com/m-khl/solr-patches/compare/streaming#L2R181

I should mention that it disables the later collectors
https://github.com/m-khl/solr-patches/compare/streaming#L2R57
hence there are no facets with streaming yet, but also no memory consumption.

This test shows how it works
https://github.com/m-khl/solr-patches/compare/streaming#L15R115

all the other code is there for distributed search.
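For readers without the patch handy, the collector piece has roughly this
shape - a rough sketch of the idea only, not the patch itself; how the
writer reaches the servlet output stream is hand-waved here:

import java.io.IOException;
import java.io.Writer;
import org.apache.solr.search.DelegatingCollector;

// Writes each hit to the response as it is collected and never forwards to
// the delegate - which is what disables the later (e.g. faceting) collectors.
public class WriteThroughCollector extends DelegatingCollector {
  private final Writer out;

  public WriteThroughCollector(Writer out) {
    this.out = out;
  }

  @Override
  public void collect(int doc) throws IOException {
    // docBase is maintained by DelegatingCollector.setNextReader()
    out.write(Integer.toString(docBase + doc));
    out.write('\n');
    // intentionally NOT calling super.collect(doc)
  }
}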



On Sat, Jul 27, 2013 at 4:44 PM, Roman Chyla  wrote:

> Mikhail,
> If your solution gives lazy loading of solr docs /and thus streaming of
> huge result lists/ it should be big YES!
> Roman
> On 27 Jul 2013 07:55, "Mikhail Khludnev" wrote:
>
> > Otis,
> > You gave links to 'deep paging' when I asked about response streaming.
> > Let me understand. From my POV, deep paging is a special case for regular
> > search scenarios. We definitely need it in Solr. However, if we are
> talking
> > about data analytic like problems, when we need to select an "endless"
> > stream of responses (or store them in file as Roman did), 'deep paging'
> is
> > a suboptimal hack.
> > What's your vision on this?
> >
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics


 


Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Roman Chyla
Mikhail,
If your solution gives lazy loading of solr docs /and thus streaming of
huge result lists/ it should be big YES!
Roman
On 27 Jul 2013 07:55, "Mikhail Khludnev"  wrote:

> Otis,
> You gave links to 'deep paging' when I asked about response streaming.
> Let me understand. From my POV, deep paging is a special case for regular
> search scenarios. We definitely need it in Solr. However, if we are talking
> about data analytic like problems, when we need to select an "endless"
> stream of responses (or store them in file as Roman did), 'deep paging' is
> a suboptimal hack.
> What's your vision on this?
>


Re: paging vs streaming. spawn from (Processing a lot of results in Solr)

2013-07-27 Thread Mikhail Khludnev
Otis,
You gave links to 'deep paging' when I asked about response streaming.
Let me understand. From my POV, deep paging is a special case for regular
search scenarios. We definitely need it in Solr. However, if we are talking
about data-analytics-like problems, where we need to select an "endless"
stream of responses (or store them in a file as Roman did), 'deep paging' is
a suboptimal hack.
What's your vision on this?


Re: Processing a lot of results in Solr

2013-07-25 Thread Otis Gospodnetic
Mikhail,

Yes, +1.
This question comes up a few times a year.  Grant created a JIRA issue
for this many moons ago.

https://issues.apache.org/jira/browse/LUCENE-2127
https://issues.apache.org/jira/browse/SOLR-1726

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm



On Wed, Jul 24, 2013 at 9:58 PM, Mikhail Khludnev wrote:
> fwiw,
> i did a prototype with the following differences:
> - it streams straight to the socket output stream
> - it streams on the fly during collecting, without the need to store a
> bitset.
> It might have some limited extreme usage. Is anyone interested?
>
>
> On Wed, Jul 24, 2013 at 7:19 PM, Roman Chyla  wrote:
>
>> On Tue, Jul 23, 2013 at 10:05 PM, Matt Lieber  wrote:
>>
>> > That sounds like a satisfactory solution for the time being -
>> > I am assuming you dump the data from Solr in a csv format?
>> >
>>
>> JSON
>>
>>
>> > How did you implement the streaming processor? (What tool did you use
>> > for this? Not familiar with that.)
>> >
>>
>> this is what dumps the docs:
>>
>> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/response/JSONDumper.java
>>
>> it is called by one of our batch processors, which can pass it a bitset of
>> recs
>>
>> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/handler/batch/BatchProviderDumpIndex.java
>>
>> as far as streaming is concerned, we were all very nicely surprised: a
>> few-GB file (on a local network) took a ridiculously short time - in fact,
>> a colleague of mine assumed it was not working until we looked into the
>> downloaded file ;-). You may want to look at line 463:
>>
>> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/handler/batch/BatchHandler.java
>>
>> roman
>>
>>
>> > You say it takes only a few minutes to dump the data - how long does it
>> > take to stream it back in? Is the performance acceptable (~ within
>> > minutes)?
>> >
>> > Thanks,
>> > Matt
>> >
>> > On 7/23/13 6:57 PM, "Roman Chyla"  wrote:
>> >
>> > >Hello Matt,
>> > >
>> > >You can consider writing a batch processing handler, which receives a
>> > >query and, instead of sending results back, writes them into a file
>> > >which is then available for streaming (it has its own UUID). I am
>> > >dumping many GBs of data from solr in a few minutes - your query +
>> > >streaming writer can go a very long way :)
>> > >
>> > >roman
>> > >
>> > >
>> > >On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber wrote:
>> > >
>> > >> Hello Solr users,
>> > >>
>> > >> Question regarding processing a lot of docs returned from a query; I
>> > >> potentially have millions of documents returned back from a query.
>> > >> What is the common design to deal with this ?
>> > >>
>> > >> 2 ideas I have are:
>> > >> - create a client service that is multithreaded to handle this
>> > >> - Use the Solr "pagination" to retrieve a batch of rows at a time
>> > >> ("start, rows" in Solr Admin console)
>> > >>
>> > >> Any other ideas that I may be missing ?
>> > >>
>> > >> Thanks,
>> > >> Matt
>> > >>
>> > >>
>> > >> 
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> >
>> > 
>> >
>> >
>> >
>> >
>> >
>> >
>>
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>


Re: Processing a lot of results in Solr

2013-07-24 Thread Chris Hostetter

: Subject: Processing a lot of results in Solr
: Message-ID: 
: In-Reply-To: <1374612243070-4079869.p...@n3.nabble.com>

https://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.



-Hoss


Re: Processing a lot of results in Solr

2013-07-24 Thread Mikhail Khludnev
fwiw,
i did a prototype with the following differences:
- it streams straight to the socket output stream
- it streams on the fly during collecting, without the need to store a
bitset.
It might have some limited extreme usage. Is anyone interested?


On Wed, Jul 24, 2013 at 7:19 PM, Roman Chyla  wrote:

> On Tue, Jul 23, 2013 at 10:05 PM, Matt Lieber  wrote:
>
> > That sounds like a satisfactory solution for the time being -
> > I am assuming you dump the data from Solr in a csv format?
> >
>
> JSON
>
>
> > How did you implement the streaming processor? (What tool did you use
> > for this? Not familiar with that.)
> >
>
> this is what dumps the docs:
>
> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/response/JSONDumper.java
>
> it is called by one of our batch processors, which can pass it a bitset of
> recs
>
> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/handler/batch/BatchProviderDumpIndex.java
>
> as far as streaming is concerned, we were all very nicely surprised: a
> few-GB file (on a local network) took a ridiculously short time - in fact,
> a colleague of mine assumed it was not working until we looked into the
> downloaded file ;-). You may want to look at line 463:
>
> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/handler/batch/BatchHandler.java
>
> roman
>
>
> > You say it takes only a few minutes to dump the data - how long does it
> > take to stream it back in? Is the performance acceptable (~ within
> > minutes)?
> >
> > Thanks,
> > Matt
> >
> > On 7/23/13 6:57 PM, "Roman Chyla"  wrote:
> >
> > >Hello Matt,
> > >
> > >You can consider writing a batch processing handler, which receives a
> > >query and, instead of sending results back, writes them into a file
> > >which is then available for streaming (it has its own UUID). I am
> > >dumping many GBs of data from solr in a few minutes - your query +
> > >streaming writer can go a very long way :)
> > >
> > >roman
> > >
> > >
> > >On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber wrote:
> > >
> > >> Hello Solr users,
> > >>
> > >> Question regarding processing a lot of docs returned from a query; I
> > >> potentially have millions of documents returned back from a query.
> > >> What is the common design to deal with this ?
> > >>
> > >> 2 ideas I have are:
> > >> - create a client service that is multithreaded to handle this
> > >> - Use the Solr "pagination" to retrieve a batch of rows at a time
> > >> ("start, rows" in Solr Admin console)
> > >>
> > >> Any other ideas that I may be missing ?
> > >>
> > >> Thanks,
> > >> Matt
> > >>
> > >>
> > >> 
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> >
> >
> > 
> >
> >
> >
> >
> >
> >
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics


 


Re: Processing a lot of results in Solr

2013-07-24 Thread Roman Chyla
On Tue, Jul 23, 2013 at 10:05 PM, Matt Lieber  wrote:

> That sounds like a satisfactory solution for the time being -
> I am assuming you dump the data from Solr in a csv format?
>

JSON


> How did you implement the streaming processor? (What tool did you use for
> this? Not familiar with that.)
>

this is what dumps the docs:
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/response/JSONDumper.java

it is called by one of our batch processors, which can pass it a bitset of
recs
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/handler/batch/BatchProviderDumpIndex.java

as far as streaming is concerned, we were all very nicely surprised: a
few-GB file (on a local network) took a ridiculously short time - in fact, a
colleague of mine assumed it was not working until we looked into the
downloaded file ;-). You may want to look at line 463:
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/handler/batch/BatchHandler.java

roman


> You say it takes only a few minutes to dump the data - how long does it
> take to stream it back in? Is the performance acceptable (~ within minutes)?
>
> Thanks,
> Matt
>
> On 7/23/13 6:57 PM, "Roman Chyla"  wrote:
>
> >Hello Matt,
> >
> >You can consider writing a batch processing handler, which receives a
> >query and, instead of sending results back, writes them into a file which
> >is then available for streaming (it has its own UUID). I am dumping many
> >GBs of data from solr in a few minutes - your query + streaming writer
> >can go a very long way :)
> >
> >roman
> >
> >
> >On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber  wrote:
> >
> >> Hello Solr users,
> >>
> >> Question regarding processing a lot of docs returned from a query; I
> >> potentially have millions of documents returned back from a query. What
> >> is the common design to deal with this ?
> >>
> >> 2 ideas I have are:
> >> - create a client service that is multithreaded to handle this
> >> - Use the Solr "pagination" to retrieve a batch of rows at a time
> >> ("start, rows" in Solr Admin console)
> >>
> >> Any other ideas that I may be missing ?
> >>
> >> Thanks,
> >> Matt
> >>
> >>
> >> 
> >>
> >>
> >>
> >>
> >>
> >>
>
>
> 
>
>
>
>
>
>


Re: Processing a lot of results in Solr

2013-07-24 Thread Roman Chyla
Mikhail,
It is a slightly hacked JSONWriter - actually, while poking around, I
discovered that dumping big hitsets would be possible - the main hurdle
right now is that the writer expects to receive documents with fields
loaded, but if it received something that loads docs lazily, you could
stream thousands and thousands of recs just as is done with the normal
response - standard operation. Well, people may cry that this is not how
SOLR is meant to operate ;-)

roman


On Wed, Jul 24, 2013 at 5:28 AM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:

> Roman,
>
> Can you disclose how that streaming writer works? What does it stream,
> docList or docSet?
>
> Thanks
>
>
> On Wed, Jul 24, 2013 at 5:57 AM, Roman Chyla wrote:
>
> > Hello Matt,
> >
> > You can consider writing a batch processing handler, which receives a
> > query and, instead of sending results back, writes them into a file
> > which is then available for streaming (it has its own UUID). I am
> > dumping many GBs of data from solr in a few minutes - your query +
> > streaming writer can go a very long way :)
> >
> > roman
> >
> >
> > On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber wrote:
> >
> > > Hello Solr users,
> > >
> > > Question regarding processing a lot of docs returned from a query; I
> > > potentially have millions of documents returned back from a query. What
> > > is the common design to deal with this ?
> > >
> > > 2 ideas I have are:
> > > - create a client service that is multithreaded to handle this
> > > - Use the Solr "pagination" to retrieve a batch of rows at a time
> > > ("start, rows" in Solr Admin console)
> > >
> > > Any other ideas that I may be missing ?
> > >
> > > Thanks,
> > > Matt
> > >
> > >
> > > 
> > >
> > >
> > >
> > >
> > >
> > >
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
>


Re: Processing a lot of results in Solr

2013-07-24 Thread Mikhail Khludnev
Roman,

Can you disclose how that streaming writer works? What does it stream,
docList or docSet?

Thanks


On Wed, Jul 24, 2013 at 5:57 AM, Roman Chyla  wrote:

> Hello Matt,
>
> You can consider writing a batch processing handler, which receives a query
> and, instead of sending results back, writes them into a file which is
> then available for streaming (it has its own UUID). I am dumping many GBs
> of data from solr in a few minutes - your query + streaming writer can go
> a very long way :)
>
> roman
>
>
> On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber  wrote:
>
> > Hello Solr users,
> >
> > Question regarding processing a lot of docs returned from a query; I
> > potentially have millions of documents returned back from a query. What
> > is the common design to deal with this ?
> >
> > 2 ideas I have are:
> > - create a client service that is multithreaded to handle this
> > - Use the Solr "pagination" to retrieve a batch of rows at a time
> > ("start, rows" in Solr Admin console)
> >
> > Any other ideas that I may be missing ?
> >
> > Thanks,
> > Matt
> >
> >
> > 
> >
> >
> >
> >
> >
> >
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics


 


Re: Processing a lot of results in Solr

2013-07-23 Thread Matt Lieber
That sounds like a satisfactory solution for the time being -
I am assuming you dump the data from Solr in a csv format?
How did you implement the streaming processor? (What tool did you use for
this? Not familiar with that.)
You say it takes only a few minutes to dump the data - how long does it
take to stream it back in? Is the performance acceptable (~ within minutes)?

Thanks,
Matt

On 7/23/13 6:57 PM, "Roman Chyla"  wrote:

>Hello Matt,
>
>You can consider writing a batch processing handler, which receives a
>query and, instead of sending results back, writes them into a file which
>is then available for streaming (it has its own UUID). I am dumping many
>GBs of data from solr in a few minutes - your query + streaming writer
>can go a very long way :)
>
>roman
>
>
>On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber  wrote:
>
>> Hello Solr users,
>>
>> Question regarding processing a lot of docs returned from a query; I
>> potentially have millions of documents returned back from a query. What
>> is the common design to deal with this ?
>>
>> 2 ideas I have are:
>> - create a client service that is multithreaded to handle this
>> - Use the Solr "pagination" to retrieve a batch of rows at a time
>> ("start, rows" in Solr Admin console)
>>
>> Any other ideas that I may be missing ?
>>
>> Thanks,
>> Matt
>>
>>
>> 
>>
>>
>>
>>
>>
>>











Re: Processing a lot of results in Solr

2013-07-23 Thread Roman Chyla
Hello Matt,

You can consider writing a batch processing handler which receives a query
and, instead of sending results back, writes them into a file which is
then available for streaming (it has its own UUID). I am dumping many GBs
of data from solr in a few minutes - your query + streaming writer can go
a very long way :)
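The core of such a handler can be sketched in a few lines - hypothetical
code (the handler plumbing and real JSON writing are elided; only the
DocIterator/SolrIndexSearcher calls are actual Solr API):

import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.UUID;
import org.apache.lucene.document.Document;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocList;
import org.apache.solr.search.SolrIndexSearcher;

public class BatchDump {
  // run inside the handler: stream hits into a UUID-named file and return
  // the UUID so the client can fetch the file once the dump is finished
  public static String dump(SolrIndexSearcher searcher, DocList hits, File dir)
      throws IOException {
    String uuid = UUID.randomUUID().toString();
    PrintWriter out = new PrintWriter(new File(dir, uuid + ".json"));
    try {
      DocIterator it = hits.iterator();
      while (it.hasNext()) {
        Document doc = searcher.doc(it.nextDoc());
        out.println("{\"id\":\"" + doc.get("id") + "\"}"); // toy JSON line
      }
    } finally {
      out.close();
    }
    return uuid;
  }
}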

roman


On Tue, Jul 23, 2013 at 5:04 PM, Matt Lieber  wrote:

> Hello Solr users,
>
> Question regarding processing a lot of docs returned from a query; I
> potentially have millions of documents returned back from a query. What is
> the common design to deal with this ?
>
> 2 ideas I have are:
> - create a client service that is multithreaded to handle this
> - Use the Solr "pagination" to retrieve a batch of rows at a time ("start,
> rows" in Solr Admin console )
>
> Any other ideas that I may be missing ?
>
> Thanks,
> Matt
>
>
> 
>
>
>
>
>
>


Re: Processing a lot of results in Solr

2013-07-23 Thread Timothy Potter
Hi Matt,

This feature is commonly known as deep paging, and Lucene and Solr have
issues with it ... take a look at
http://solr.pl/en/2011/07/18/deep-paging-problem/ as a potential
starting point; it uses filters to bucketize a result set into
sub result sets.
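In SolrJ terms the bucketizing idea boils down to walking disjoint fq
ranges so that every request stays shallow - a rough sketch (the field
name, bounds, and bucket width are all made up):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class BucketedScan {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    final long bucket = 100000;  // docs per filter bucket (tune for your index)
    final long maxId = 10000000; // known upper bound of the numeric id field
    for (long lo = 0; lo < maxId; lo += bucket) {
      SolrQuery q = new SolrQuery("*:*");
      // each filter selects one disjoint slice, so start stays 0
      q.addFilterQuery("id:[" + lo + " TO " + (lo + bucket - 1) + "]");
      q.setRows((int) bucket);
      for (SolrDocument doc : server.query(q).getResults()) {
        // process(doc) ...
      }
    }
  }
}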

Cheers,
Tim

On Tue, Jul 23, 2013 at 3:04 PM, Matt Lieber  wrote:
> Hello Solr users,
>
> Question regarding processing a lot of docs returned from a query; I
> potentially have millions of documents returned back from a query. What is
> the common design to deal with this ?
>
> 2 ideas I have are:
> - create a client service that is multithreaded to handle this
> - Use the Solr "pagination" to retrieve a batch of rows at a time ("start,
> rows" in Solr Admin console )
>
> Any other ideas that I may be missing ?
>
> Thanks,
> Matt
>
>
> 
>
>
>
>
>
>


Processing a lot of results in Solr

2013-07-23 Thread Matt Lieber
Hello Solr users,

Question regarding processing a lot of docs returned from a query: I
potentially have millions of documents coming back from a query. What is
the common design to deal with this?

2 ideas I have are:
- create a client service that is multithreaded to handle this
- Use the Solr "pagination" to retrieve a batch of rows at a time ("start,
rows" in the Solr Admin console)

Any other ideas that I may be missing?

Thanks,
Matt








