Streaming Docs, Terms, TermVectors

2009-05-30 Thread Grant Ingersoll
Anyone have any thoughts on what is involved with streaming lots of  
results out of Solr?


For instance, if I wanted to get something like 1M docs (or more) out
of Solr via a *:* query, how can I tractably do this?  The same question
applies to returning all the terms in the index or all the Term Vectors.


Obviously, it is impossible to load all of these things into memory  
and then create a response, so I was wondering if anyone had any ideas  
on how to stream them.


Thanks,
Grant


Re: Streaming Docs, Terms, TermVectors

2009-05-30 Thread Dietrich Featherston
I was actually curious about the same thing.  Perhaps an endpoint
reference could be passed in the request, naming a destination such as
a JMS topic to which the documents are sent asynchronously.


solr/query?q=*:*&epr=/my/topic&eprtype=jms

Then we would need to consider how to break up the response, how to  
cancel a running query, etc.


Is this along the lines of what you're looking for?  I would be  
interested in looking at how the request/response contract changes and  
what types of endpoint references would be supported.


Thanks,
D


Re: Streaming Docs, Terms, TermVectors

2009-05-30 Thread Kaktu Chakarabati
For a streaming-like solution, it is in fact possible to have a working
buffer in memory that emits chunks on an HTTP connection which is kept alive
by the server until the full response has been sent.
This is quite similar to how video streaming protocols that operate on top
of HTTP work (cf. a more general discussion at
http://ajaxpatterns.org/HTTP_Streaming#In_A_Blink ).
Another (not mutually exclusive) possibility is to introduce a new binary
format for the transmission of such data (i.e. a new wt=.. type) over
HTTP (or any other communication protocol) so that data can be compressed
more effectively and made to fit better into memory.
One such format, which has been widely circulated and already has many open
source projects implementing it, is Adobe's AMF (
http://osflash.org/documentation/amf ). It is, however, a proprietary format,
so I'm not sure whether it can be incorporated under Apache Foundation terms.
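
To make the working-buffer idea concrete, here is a minimal Python sketch
of a bounded buffer that emits HTTP/1.1 chunked-transfer frames. It is an
illustration only, not Solr code: the names chunked_body and frame, and
the buffer_size default, are assumptions made for this example.

```python
from typing import Iterable, Iterator


def frame(data: bytes) -> bytes:
    """Wrap data as one chunk: hex length, CRLF, payload, CRLF."""
    return b"%x\r\n%s\r\n" % (len(data), data)


def chunked_body(records: Iterable[bytes], buffer_size: int = 8192) -> Iterator[bytes]:
    """Yield chunked-transfer frames from a bounded working buffer.

    Hypothetical helper: records is any iterable of already-serialized
    results, consumed lazily so the full result set is never in memory.
    """
    buffer = bytearray()
    for record in records:
        buffer += record
        # Flush once the buffer reaches the threshold, so memory use stays
        # close to buffer_size plus the size of one record.
        if len(buffer) >= buffer_size:
            yield frame(bytes(buffer))
            buffer.clear()
    if buffer:
        yield frame(bytes(buffer))
    # A zero-length chunk tells the client the response is complete.
    yield b"0\r\n\r\n"
```

A server loop would write each yielded frame to the open connection as
soon as it is produced.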

-Chak


Re: Streaming Docs, Terms, TermVectors

2009-05-30 Thread Walter Underwood
Don't stream; request chunks of 10 or 100 at a time. It works fine and
you don't have to write or test any new code. In addition, it works
well with HTTP caches, so if two clients want the same data,
the second can get it from the cache.

We do that at Netflix. Each front-end box does a series of queries
to get all the movie titles, then loads them into a local index for
autocomplete.
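
A rough Python sketch of that series of queries, using Solr's standard
start and rows parameters. The base URL, the helper names (page_url,
fetch_json, iter_docs) and the choice of wt=json are assumptions for
illustration; this is not the Netflix code.

```python
import json
from typing import Callable, Iterator
from urllib.parse import urlencode
from urllib.request import urlopen


def page_url(base_url: str, query: str, start: int, rows: int) -> str:
    """Build the URL for one page of results."""
    params = urlencode({"q": query, "start": start, "rows": rows, "wt": "json"})
    return f"{base_url}?{params}"


def fetch_json(url: str) -> dict:
    """Fetch one page over HTTP and parse the JSON response."""
    with urlopen(url) as response:
        return json.load(response)


def iter_docs(base_url: str, query: str = "*:*", rows: int = 100,
              fetch: Callable[[str], dict] = fetch_json) -> Iterator[dict]:
    """Yield every matching document, holding one page in memory at a time."""
    start = 0
    while True:
        page = fetch(page_url(base_url, query, start, rows))
        docs = page["response"]["docs"]
        if not docs:
            # An empty page means we have walked past the last result.
            return
        yield from docs
        start += len(docs)
```

Because every page is an ordinary GET with a stable URL, an HTTP cache
in front of Solr can serve repeated requests for the same page.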

wunder


Re: Streaming Docs, Terms, TermVectors

2009-05-30 Thread Yonik Seeley
On a single server, Solr already does streaming of returned
documents... the stored fields of selected docs are retrieved one at a
time as they are written to the socket.  The servlet container already
handles sending out chunked encoding for large responses too.
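
As a rough illustration of that pattern (not Solr's actual response
writer), the Python sketch below loads each document's stored fields
only at the moment it is written to the output stream. The names
write_docs and load_stored_fields, and the JSON layout, are hypothetical.

```python
import io
import json
from typing import Any, BinaryIO, Callable, Dict, Iterable


def write_docs(doc_ids: Iterable[int],
               load_stored_fields: Callable[[int], Dict[str, Any]],
               out: BinaryIO) -> None:
    """Write a JSON list of documents to out, one document at a time."""
    out.write(b'{"docs":[')
    for i, doc_id in enumerate(doc_ids):
        if i:
            out.write(b",")
        # Only this one document's stored fields are in memory here; the
        # previous document has already been handed to the stream.
        doc = load_stored_fields(doc_id)
        out.write(json.dumps(doc).encode("utf-8"))
    out.write(b"]}")


# Usage with an in-memory stream standing in for the socket.
buf = io.BytesIO()
write_docs([1, 2], lambda doc_id: {"id": doc_id}, buf)
```

With a real socket as the stream, the container is then free to send
the bytes out in chunks as they arrive.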

-Yonik
http://www.lucidimagination.com