Re: solr /export handler - behavior during close()

2017-06-27 Thread Joel Bernstein
Ok, I'll fix the ParallelStream to set the stream context when it creates
the SolrStreams, though. Thanks for pointing this out.

Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Jun 27, 2017 at 1:46 PM, Susmit Shukla 
wrote:

> Hi Joel,
>
> I was on the Solr 6.3 branch. I see the HttpClient deprecated methods are
> all fixed in master.
> I had forgotten to mention that I used a custom SolrClientCache to get
> higher limits for the maxConnectionPerHost setting; that's why I saw a
> difference in behavior. SolrClientCache also looks configurable with a new
> constructor on the master branch.
>
> I guess it is all good going forward on master.
>
> Thanks,
> Susmit
>
> On Tue, Jun 27, 2017 at 10:14 AM, Joel Bernstein 
> wrote:
>
> > Ok, I see where it's not setting the stream context. This needs to be fixed.
> >
> > I'm curious about where you're seeing deprecated methods in the
> > HttpClientUtil? I was reviewing the master version of HttpClientUtil and
> > didn't see any deprecations in my IDE.
> >
> > I'm wondering if you're using an older version of HttpClientUtil than I
> > used when I was testing SOLR-10698?
> >
> > You also mentioned that the SolrStream and the SolrClientCache were using
> > the same approach to create the client. In that case changing the
> > ParallelStream to set the streamContext shouldn't have any effect on the
> > close() issue.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Sun, Jun 25, 2017 at 10:48 AM, Susmit Shukla  >
> > wrote:
> >
> > > Hi Joel,
> > >
> > > Looked at the fix for SOLR-10698; there could be two potential issues:
> > >
> > > - ParallelStream does not set the stream context on the newly created
> > > SolrStreams in its open() method.
> > >
> > > - This results in the creation of a new, uncached HttpSolrClient in the
> > > open() method of SolrStream. This client is created using deprecated
> > > methods of the HttpClient library (HttpClientUtil.createClient) and
> > > behaves differently on close() than one created using the
> > > HttpClientBuilder API. SolrClientCache uses the same deprecated API too.
> > >
> > > This test case shows the problem:
> > >
> > > ParallelStream ps = new ParallelStream(tupleStream, ...);
> > > ps.open();
> > > // read only the first 2 tuples, then stop early
> > > ps.read();
> > > ps.read();
> > > ps.close();
> > >
> > > // close() reads through to the end of tupleStream.
> > >
> > > I tried with an HttpClient created by
> > > org.apache.http.impl.client.HttpClientBuilder.create() and close() works
> > > for that.
> > >
> > >
> > > Thanks,
> > >
> > > Susmit
> > >
> > > On Wed, May 17, 2017 at 7:33 AM, Susmit Shukla <
> shukla.sus...@gmail.com>
> > > wrote:
> > >
> > > > Thanks Joel, will try that.
> > > > A binary response would be more performant.
> > > > I observed that the server sends responses in 32 KB chunks and the
> > > > client reads them with an 8 KB buffer on the input stream. I don't know
> > > > whether changing that would impact performance. Even if the buffer size
> > > > is increased on the HttpClient, it can't override the hard-coded 8 KB
> > > > buffer in sun.nio.cs.StreamDecoder.
> > > >
> > > > Thanks,
> > > > Susmit
> > > >
> > > > On Wed, May 17, 2017 at 5:49 AM, Joel Bernstein 
> > > > wrote:
> > > >
> > > >> Susmit,
> > > >>
> > > >> You could wrap a LimitStream around the outside of all the
> relational
> > > >> algebra. For example:
> > > >>
> > > >> parallel(limit(intersect(intersect(search, search), union(search, search))))
> > > >>
> > > >> In this scenario the limit would happen on the workers.
> > > >>
> > > >> As far as the worker/replica ratio goes, this will depend on how
> > > >> heavy the export is. If it's a light export (a small number of fields,
> > > >> mostly numeric, simple sort params), then I've seen a ratio of 5
> > > >> workers to 1 replica work well. This will basically saturate the CPU
> > > >> on the replica. But heavier exports will saturate the replicas with
> > > >> fewer workers.
> > > >>
> > > >> Also, I tend to use Direct DocValues to get the best performance. I'm
> > > >> not sure how much difference this makes, but it should eliminate the
> > > >> compression overhead when fetching the data from the DocValues.
> > > >>
> > > >> Varun's suggestion of using the binary transport will provide a nice
> > > >> performance increase as well. But you'll need to upgrade. You may need
> > > >> to do that anyway, as the fix for the early stream close will be in a
> > > >> later version that was refactored to support the binary transport.
> > > >>
> > > >> Joel Bernstein
> > > >> http://joelsolr.blogspot.com/
> > > >>
> > > >> On Tue, May 16, 2017 at 8:03 PM, Joel Bernstein  >
> > > >> wrote:
> > > >>
> > > >> > Yep, saw it. I'll comment on the ticket for what I believe needs
> to
> > be
> > > >> > done.
> > > >> >
> > > >> > Joel Bernstein
> > > >> > 

Re: solr /export handler - behavior during close()

2017-06-27 Thread Susmit Shukla
Hi Joel,

I was on the Solr 6.3 branch. I see the HttpClient deprecated methods are all
fixed in master.
I had forgotten to mention that I used a custom SolrClientCache to get higher
limits for the maxConnectionPerHost setting; that's why I saw a difference in
behavior. SolrClientCache also looks configurable with a new constructor on
the master branch.
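
Something like this minimal sketch is what I had in mind (the builder
settings are illustrative, and it assumes the new SolrClientCache(HttpClient)
constructor on master):

CloseableHttpClient httpClient = HttpClientBuilder.create()
    .setMaxConnPerRoute(100)   // higher per-host connection limit
    .setMaxConnTotal(500)
    .build();
SolrClientCache clientCache = new SolrClientCache(httpClient);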

I guess it is all good going forward on master.

Thanks,
Susmit

On Tue, Jun 27, 2017 at 10:14 AM, Joel Bernstein  wrote:

> Ok, I see where it's not setting the stream context. This needs to be fixed.
>
> I'm curious about where you're seeing deprecated methods in the
> HttpClientUtil? I was reviewing the master version of HttpClientUtil and
> didn't see any deprecations in my IDE.
>
> I'm wondering if you're using an older version of HttpClientUtil than I
> used when I was testing SOLR-10698?
>
> You also mentioned that the SolrStream and the SolrClientCache were using
> the same approach to create the client. In that case changing the
> ParallelStream to set the streamContext shouldn't have any effect on the
> close() issue.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Sun, Jun 25, 2017 at 10:48 AM, Susmit Shukla 
> wrote:
>
> > Hi Joel,
> >
> > Looked at the fix for SOLR-10698; there could be two potential issues:
> >
> > - ParallelStream does not set the stream context on the newly created
> > SolrStreams in its open() method.
> >
> > - This results in the creation of a new, uncached HttpSolrClient in the
> > open() method of SolrStream. This client is created using deprecated
> > methods of the HttpClient library (HttpClientUtil.createClient) and
> > behaves differently on close() than one created using the
> > HttpClientBuilder API. SolrClientCache uses the same deprecated API too.
> >
> > This test case shows the problem:
> >
> > ParallelStream ps = new ParallelStream(tupleStream, ...);
> > ps.open();
> > // read only the first 2 tuples, then stop early
> > ps.read();
> > ps.read();
> > ps.close();
> >
> > // close() reads through to the end of tupleStream.
> >
> > I tried with an HttpClient created by
> > org.apache.http.impl.client.HttpClientBuilder.create() and close() works
> > for that.
> >
> >
> > Thanks,
> >
> > Susmit
> >
> > On Wed, May 17, 2017 at 7:33 AM, Susmit Shukla 
> > wrote:
> >
> > > Thanks Joel, will try that.
> > > A binary response would be more performant.
> > > I observed that the server sends responses in 32 KB chunks and the client
> > > reads them with an 8 KB buffer on the input stream. I don't know whether
> > > changing that would impact performance. Even if the buffer size is
> > > increased on the HttpClient, it can't override the hard-coded 8 KB buffer
> > > in sun.nio.cs.StreamDecoder.
> > >
> > > Thanks,
> > > Susmit
> > >
> > > On Wed, May 17, 2017 at 5:49 AM, Joel Bernstein 
> > > wrote:
> > >
> > >> Susmit,
> > >>
> > >> You could wrap a LimitStream around the outside of all the relational
> > >> algebra. For example:
> > >>
> > >> parallel(limit(intersect(intersect(search, search), union(search, search))))
> > >>
> > >> In this scenario the limit would happen on the workers.
> > >>
> > >> As far as the worker/replica ratio goes, this will depend on how heavy
> > >> the export is. If it's a light export (a small number of fields, mostly
> > >> numeric, simple sort params), then I've seen a ratio of 5 workers to 1
> > >> replica work well. This will basically saturate the CPU on the replica.
> > >> But heavier exports will saturate the replicas with fewer workers.
> > >>
> > >> Also, I tend to use Direct DocValues to get the best performance. I'm
> > >> not sure how much difference this makes, but it should eliminate the
> > >> compression overhead when fetching the data from the DocValues.
> > >>
> > >> Varun's suggestion of using the binary transport will provide a nice
> > >> performance increase as well. But you'll need to upgrade. You may need
> > >> to do that anyway, as the fix for the early stream close will be in a
> > >> later version that was refactored to support the binary transport.
> > >>
> > >> Joel Bernstein
> > >> http://joelsolr.blogspot.com/
> > >>
> > >> On Tue, May 16, 2017 at 8:03 PM, Joel Bernstein 
> > >> wrote:
> > >>
> > >> > Yep, saw it. I'll comment on the ticket for what I believe needs to
> be
> > >> > done.
> > >> >
> > >> > Joel Bernstein
> > >> > http://joelsolr.blogspot.com/
> > >> >
> > >> > On Tue, May 16, 2017 at 8:00 PM, Varun Thacker 
> > >> wrote:
> > >> >
> > >> >> Hi Joel, Susmit
> > >> >>
> > >> >> I created https://issues.apache.org/jira/browse/SOLR-10698 to
> track
> > >> the
> > >> >> issue
> > >> >>
> > >> >> @Susmit, looking at the stack trace I see the expression is using
> > >> >> JSONTupleStream. I wonder, if you tried using
> > >> >> JavabinTupleStreamParser, could it help improve performance?
> > >> >>
> > >> >> On Tue, May 16, 2017 at 9:39 AM, Susmit 

Re: solr /export handler - behavior during close()

2017-06-27 Thread Joel Bernstein
Ok, I see where it's not setting the stream context. This needs to be fixed.

I'm curious about where you're seeing deprecated methods in the
HttpClientUtil? I was reviewing the master version of HttpClientUtil and
didn't see any deprecations in my IDE.

I'm wondering if you're using an older version of HttpClientUtil than I
used when I was testing SOLR-10698?

You also mentioned that the SolrStream and the SolrClientCache were using
the same approach to create the client. In that case changing the
ParallelStream to set the streamContext shouldn't have any effect on the
close() issue.

Joel Bernstein
http://joelsolr.blogspot.com/

On Sun, Jun 25, 2017 at 10:48 AM, Susmit Shukla 
wrote:

> Hi Joel,
>
> Looked at the fix for SOLR-10698; there could be two potential issues:
>
> - ParallelStream does not set the stream context on the newly created
> SolrStreams in its open() method.
>
> - This results in the creation of a new, uncached HttpSolrClient in the
> open() method of SolrStream. This client is created using deprecated
> methods of the HttpClient library (HttpClientUtil.createClient) and
> behaves differently on close() than one created using the
> HttpClientBuilder API. SolrClientCache uses the same deprecated API too.
>
> This test case shows the problem:
>
> ParallelStream ps = new ParallelStream(tupleStream, ...);
> ps.open();
> // read only the first 2 tuples, then stop early
> ps.read();
> ps.read();
> ps.close();
>
> // close() reads through to the end of tupleStream.
>
> I tried with an HttpClient created by
> org.apache.http.impl.client.HttpClientBuilder.create() and close() works
> for that.
>
>
> Thanks,
>
> Susmit
>
> On Wed, May 17, 2017 at 7:33 AM, Susmit Shukla 
> wrote:
>
> > Thanks Joel, will try that.
> > A binary response would be more performant.
> > I observed that the server sends responses in 32 KB chunks and the client
> > reads them with an 8 KB buffer on the input stream. I don't know whether
> > changing that would impact performance. Even if the buffer size is
> > increased on the HttpClient, it can't override the hard-coded 8 KB buffer
> > in sun.nio.cs.StreamDecoder.
> >
> > Thanks,
> > Susmit
> >
> > On Wed, May 17, 2017 at 5:49 AM, Joel Bernstein 
> > wrote:
> >
> >> Susmit,
> >>
> >> You could wrap a LimitStream around the outside of all the relational
> >> algebra. For example:
> >>
> >> parallel(limit(intersect(intersect(search, search), union(search, search))))
> >>
> >> In this scenario the limit would happen on the workers.
> >>
> >> As far as the worker/replica ratio goes, this will depend on how heavy
> >> the export is. If it's a light export (a small number of fields, mostly
> >> numeric, simple sort params), then I've seen a ratio of 5 workers to 1
> >> replica work well. This will basically saturate the CPU on the replica.
> >> But heavier exports will saturate the replicas with fewer workers.
> >>
> >> Also, I tend to use Direct DocValues to get the best performance. I'm not
> >> sure how much difference this makes, but it should eliminate the
> >> compression overhead when fetching the data from the DocValues.
> >>
> >> Varun's suggestion of using the binary transport will provide a nice
> >> performance increase as well. But you'll need to upgrade. You may need to
> >> do that anyway, as the fix for the early stream close will be in a later
> >> version that was refactored to support the binary transport.
> >>
> >> Joel Bernstein
> >> http://joelsolr.blogspot.com/
> >>
> >> On Tue, May 16, 2017 at 8:03 PM, Joel Bernstein 
> >> wrote:
> >>
> >> > Yep, saw it. I'll comment on the ticket for what I believe needs to be
> >> > done.
> >> >
> >> > Joel Bernstein
> >> > http://joelsolr.blogspot.com/
> >> >
> >> > On Tue, May 16, 2017 at 8:00 PM, Varun Thacker 
> >> wrote:
> >> >
> >> >> Hi Joel, Susmit
> >> >>
> >> >> I created https://issues.apache.org/jira/browse/SOLR-10698 to track
> >> the
> >> >> issue
> >> >>
> >> >> @Susmit, looking at the stack trace I see the expression is using
> >> >> JSONTupleStream. I wonder, if you tried using JavabinTupleStreamParser,
> >> >> could it help improve performance?
> >> >>
> >> >> On Tue, May 16, 2017 at 9:39 AM, Susmit Shukla <
> >> shukla.sus...@gmail.com>
> >> >> wrote:
> >> >>
> >> >> > Hi Joel,
> >> >> >
> >> >> > queries can be arbitrarily nested with AND/OR/NOT joins e.g.
> >> >> >
> >> >> > (intersect(intersect(search, search), union(search, search))). If I
> >> >> > cut off the innermost stream with a limit, the complete intersection
> >> >> > would not happen at upper levels. Also, would the limit stream have
> >> >> > the same effect as using the /select handler with the rows parameter?
> >> >> > I am trying to force input stream close through reflection, just to
> >> see
> >> >> if
> >> >> > it gives performance gains.
> >> >> >
> >> >> > 2) Would experiment with null streams. Is workers = number of
> >> >> > replicas in
> >> >> > 

Re: solr /export handler - behavior during close()

2017-06-25 Thread Susmit Shukla
Hi Joel,

Looked at the fix for SOLR-10698; there could be two potential issues:

- ParallelStream does not set the stream context on the newly created
SolrStreams in its open() method (see the sketch below).

- This results in the creation of a new, uncached HttpSolrClient in the
open() method of SolrStream. This client is created using deprecated methods
of the HttpClient library (HttpClientUtil.createClient) and behaves
differently on close() than one created using the HttpClientBuilder API.
SolrClientCache uses the same deprecated API too.
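
The fix I'd expect is roughly this (a sketch of the idea, not the actual
patch; variable names are illustrative):

// in ParallelStream.open(), for each worker SolrStream it creates:
SolrStream solrStream = new SolrStream(workerUrl, params);
solrStream.setStreamContext(streamContext); // currently missing
solrStream.open();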

This test case shows the problem:

ParallelStream ps = new ParallelStream(tupleStream, ...);
ps.open();
// read only the first 2 tuples, then stop early
ps.read();
ps.read();
ps.close();

// close() reads through to the end of tupleStream.

I tried with an HttpClient created by
org.apache.http.impl.client.HttpClientBuilder.create() and close() works for
that.


Thanks,

Susmit

On Wed, May 17, 2017 at 7:33 AM, Susmit Shukla 
wrote:

> Thanks Joel, will try that.
> A binary response would be more performant.
> I observed that the server sends responses in 32 KB chunks and the client
> reads them with an 8 KB buffer on the input stream. I don't know whether
> changing that would impact performance. Even if the buffer size is
> increased on the HttpClient, it can't override the hard-coded 8 KB buffer
> in sun.nio.cs.StreamDecoder.
>
> Thanks,
> Susmit
>
> On Wed, May 17, 2017 at 5:49 AM, Joel Bernstein 
> wrote:
>
>> Susmit,
>>
>> You could wrap a LimitStream around the outside of all the relational
>> algebra. For example:
>>
>> parallel(limit(intersect(intersect(search, search), union(search, search))))
>>
>> In this scenario the limit would happen on the workers.
>>
>> As far as the worker/replica ratio goes, this will depend on how heavy
>> the export is. If it's a light export (a small number of fields, mostly
>> numeric, simple sort params), then I've seen a ratio of 5 workers to 1
>> replica work well. This will basically saturate the CPU on the replica.
>> But heavier exports will saturate the replicas with fewer workers.
>>
>> Also, I tend to use Direct DocValues to get the best performance. I'm not
>> sure how much difference this makes, but it should eliminate the
>> compression overhead when fetching the data from the DocValues.
>>
>> Varun's suggestion of using the binary transport will provide a nice
>> performance increase as well. But you'll need to upgrade. You may need to
>> do that anyway, as the fix for the early stream close will be in a later
>> version that was refactored to support the binary transport.
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Tue, May 16, 2017 at 8:03 PM, Joel Bernstein 
>> wrote:
>>
>> > Yep, saw it. I'll comment on the ticket for what I believe needs to be
>> > done.
>> >
>> > Joel Bernstein
>> > http://joelsolr.blogspot.com/
>> >
>> > On Tue, May 16, 2017 at 8:00 PM, Varun Thacker 
>> wrote:
>> >
>> >> Hi Joel, Susmit
>> >>
>> >> I created https://issues.apache.org/jira/browse/SOLR-10698 to track
>> the
>> >> issue
>> >>
>> >> @Susmit, looking at the stack trace I see the expression is using
>> >> JSONTupleStream. I wonder, if you tried using JavabinTupleStreamParser,
>> >> could it help improve performance?
>> >>
>> >> On Tue, May 16, 2017 at 9:39 AM, Susmit Shukla <
>> shukla.sus...@gmail.com>
>> >> wrote:
>> >>
>> >> > Hi Joel,
>> >> >
>> >> > queries can be arbitrarily nested with AND/OR/NOT joins e.g.
>> >> >
> >> >> > (intersect(intersect(search, search), union(search, search))). If I
> >> >> > cut off the innermost stream with a limit, the complete intersection
> >> >> > would not happen at upper levels. Also, would the limit stream have
> >> >> > the same effect as using the /select handler with the rows parameter?
>> see
>> >> if
>> >> > it gives performance gains.
>> >> >
> >> >> > 2) Would experiment with null streams. Is workers = number of
> >> >> > replicas in the data collection a good rule of thumb? Is
> >> >> > ParallelStream performance upper-bounded by the number of replicas?
>> >> >
>> >> > Thanks,
>> >> > Susmit
>> >> >
>> >> > On Tue, May 16, 2017 at 5:59 AM, Joel Bernstein 
>> >> > wrote:
>> >> >
>> >> > > Your approach looks OK. The single sharded worker collection is
>> only
>> >> > needed
>> >> > > if you were using CloudSolrStream to send the initial Streaming
>> >> > Expression
> >> >> > > to the /stream handler. You are not doing this, so your approach
> >> >> > > is fine.
>> >> > >
>> >> > > Here are some thoughts on what you described:
>> >> > >
>> >> > > 1) If you are closing the parallel stream after the top 1000
>> results,
>> >> > then
>> >> > > try wrapping the intersect in a LimitStream. This stream doesn't
>> exist
>> >> > yet
>> >> > > so it will be a custom stream. The LimitStream can return the EOF
>> >> tuple
>> >> > > after it reads N tuples. This will cause the worker nodes to 

Re: solr /export handler - behavior during close()

2017-05-17 Thread Susmit Shukla
Thanks Joel, will try that.
A binary response would be more performant.
I observed that the server sends responses in 32 KB chunks and the client
reads them with an 8 KB buffer on the input stream. I don't know whether
changing that would impact performance. Even if the buffer size is increased
on the HttpClient, it can't override the hard-coded 8 KB buffer in
sun.nio.cs.StreamDecoder.
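
To illustrate (a sketch; the HttpResponse handle is hypothetical): even with
a bigger buffer at the stream layer, the InputStreamReader still decodes
through the StreamDecoder's own fixed 8 KB buffer:

InputStream raw = response.getEntity().getContent(); // hypothetical response
Reader reader = new InputStreamReader(
    new BufferedInputStream(raw, 64 * 1024), StandardCharsets.UTF_8);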

Thanks,
Susmit

On Wed, May 17, 2017 at 5:49 AM, Joel Bernstein  wrote:

> Susmit,
>
> You could wrap a LimitStream around the outside of all the relational
> algebra. For example:
>
> parallel(limit(intersect(intersect(search, search), union(search, search))))
>
> In this scenario the limit would happen on the workers.
>
> As far as the worker/replica ratio goes, this will depend on how heavy the
> export is. If it's a light export (a small number of fields, mostly
> numeric, simple sort params), then I've seen a ratio of 5 workers to 1
> replica work well. This will basically saturate the CPU on the replica.
> But heavier exports will saturate the replicas with fewer workers.
>
> Also, I tend to use Direct DocValues to get the best performance. I'm not
> sure how much difference this makes, but it should eliminate the
> compression overhead when fetching the data from the DocValues.
>
> Varun's suggestion of using the binary transport will provide a nice
> performance increase as well. But you'll need to upgrade. You may need to
> do that anyway, as the fix for the early stream close will be in a later
> version that was refactored to support the binary transport.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Tue, May 16, 2017 at 8:03 PM, Joel Bernstein 
> wrote:
>
> > Yep, saw it. I'll comment on the ticket for what I believe needs to be
> > done.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Tue, May 16, 2017 at 8:00 PM, Varun Thacker 
> wrote:
> >
> >> Hi Joel, Susmit
> >>
> >> I created https://issues.apache.org/jira/browse/SOLR-10698 to track the
> >> issue
> >>
> >> @Susmit, looking at the stack trace I see the expression is using
> >> JSONTupleStream. I wonder, if you tried using JavabinTupleStreamParser,
> >> could it help improve performance?
> >>
> >> On Tue, May 16, 2017 at 9:39 AM, Susmit Shukla  >
> >> wrote:
> >>
> >> > Hi Joel,
> >> >
> >> > queries can be arbitrarily nested with AND/OR/NOT joins e.g.
> >> >
> >> > (intersect(intersect(search, search), union(search, search))). If I
> >> > cut off the innermost stream with a limit, the complete intersection
> >> > would not happen at upper levels. Also, would the limit stream have the
> >> > same effect as using the /select handler with the rows parameter?
> >> > I am trying to force input stream close through reflection, just to
> see
> >> if
> >> > it gives performance gains.
> >> >
> >> > 2) Would experiment with null streams. Is workers = number of replicas
> >> > in the data collection a good rule of thumb? Is ParallelStream
> >> > performance upper-bounded by the number of replicas?
> >> >
> >> > Thanks,
> >> > Susmit
> >> >
> >> > On Tue, May 16, 2017 at 5:59 AM, Joel Bernstein 
> >> > wrote:
> >> >
> >> > > Your approach looks OK. The single sharded worker collection is only
> >> > needed
> >> > > if you were using CloudSolrStream to send the initial Streaming
> >> > Expression
> >> > > to the /stream handler. You are not doing this, so your approach
> >> > > is fine.
> >> > >
> >> > > Here are some thoughts on what you described:
> >> > >
> >> > > 1) If you are closing the parallel stream after the top 1000
> results,
> >> > then
> >> > > try wrapping the intersect in a LimitStream. This stream doesn't
> exist
> >> > yet
> >> > > so it will be a custom stream. The LimitStream can return the EOF
> >> tuple
> >> > > after it reads N tuples. This will cause the worker nodes to close
> the
> >> > > underlying stream and cause the Broken Pipe exception to occur at
> the
> >> > > /export handler, which will stop the /export.
> >> > >
> >> > > Here is the basic approach:
> >> > >
> >> > > parallel(limit(intersect(search, search)))
> >> > >
> >> > >
> >> > > 2) It can be tricky to understand where the bottleneck lies when
> using
> >> > the
> >> > > ParallelStream for parallel relational algebra. You can use the
> >> > NullStream
> >> > > to get an understanding of why performance is not increasing when
> you
> >> > > increase the workers. Here is the basic approach:
> >> > >
> >> > > parallel(null(intersect(search, search)))
> >> > >
> >> > > The NullStream will eat all the tuples on the workers and return a
> >> single
> >> > > tuple with the tuple count and the time taken to run the expression.
> >> So
> >> > > you'll get one tuple from each worker. This will eliminate any
> >> bottleneck
> >> > > on tuples returning through the ParallelStream and you can focus on
> >> the
> >> > > performance of the intersect and the /export handler.
> >> > >
> >> > > 

Re: solr /export handler - behavior during close()

2017-05-17 Thread Joel Bernstein
Susmit,

You could wrap a LimitStream around the outside of all the relational
algebra. For example:

parallel(limit(intersect(intersect(search, search), union(search, search))))

In this scenario the limit would happen on the workers.

As far as the worker/replica ratio goes, this will depend on how heavy the
export is. If it's a light export (a small number of fields, mostly numeric,
simple sort params), then I've seen a ratio of 5 workers to 1 replica work
well. This will basically saturate the CPU on the replica. But heavier
exports will saturate the replicas with fewer workers.

Also, I tend to use Direct DocValues to get the best performance. I'm not
sure how much difference this makes, but it should eliminate the
compression overhead when fetching the data from the DocValues.

Varun's suggestion of using the binary transport will provide a nice
performance increase as well. But you'll need to upgrade. You may need to
do that anyway, as the fix for the early stream close will be in a later
version that was refactored to support the binary transport.

Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, May 16, 2017 at 8:03 PM, Joel Bernstein  wrote:

> Yep, saw it. I'll comment on the ticket for what I believe needs to be
> done.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Tue, May 16, 2017 at 8:00 PM, Varun Thacker  wrote:
>
>> Hi Joel, Susmit
>>
>> I created https://issues.apache.org/jira/browse/SOLR-10698 to track the
>> issue
>>
>> @Susmit, looking at the stack trace I see the expression is using
>> JSONTupleStream. I wonder, if you tried using JavabinTupleStreamParser,
>> could it help improve performance?
>>
>> On Tue, May 16, 2017 at 9:39 AM, Susmit Shukla 
>> wrote:
>>
>> > Hi Joel,
>> >
>> > queries can be arbitrarily nested with AND/OR/NOT joins e.g.
>> >
>> > (intersect(intersect(search, search), union(search, search))). If I cut
>> > off the innermost stream with a limit, the complete intersection would
>> > not happen at upper levels. Also, would the limit stream have the same
>> > effect as using the /select handler with the rows parameter?
>> > I am trying to force input stream close through reflection, just to see
>> if
>> > it gives performance gains.
>> >
>> > 2) Would experiment with null streams. Is workers = number of replicas
>> > in the data collection a good rule of thumb? Is ParallelStream
>> > performance upper-bounded by the number of replicas?
>> >
>> > Thanks,
>> > Susmit
>> >
>> > On Tue, May 16, 2017 at 5:59 AM, Joel Bernstein 
>> > wrote:
>> >
>> > > Your approach looks OK. The single sharded worker collection is only
>> > needed
>> > > if you were using CloudSolrStream to send the initial Streaming
>> > Expression
>> > > to the /stream handler. You are not doing this, so your approach is
>> > > fine.
>> > >
>> > > Here are some thoughts on what you described:
>> > >
>> > > 1) If you are closing the parallel stream after the top 1000 results,
>> > then
>> > > try wrapping the intersect in a LimitStream. This stream doesn't exist
>> > yet
>> > > so it will be a custom stream. The LimitStream can return the EOF
>> tuple
>> > > after it reads N tuples. This will cause the worker nodes to close the
>> > > underlying stream and cause the Broken Pipe exception to occur at the
>> > > /export handler, which will stop the /export.
>> > >
>> > > Here is the basic approach:
>> > >
>> > > parallel(limit(intersect(search, search)))
>> > >
>> > >
>> > > 2) It can be tricky to understand where the bottleneck lies when using
>> > the
>> > > ParallelStream for parallel relational algebra. You can use the
>> > NullStream
>> > > to get an understanding of why performance is not increasing when you
>> > > increase the workers. Here is the basic approach:
>> > >
>> > > parallel(null(intersect(search, search)))
>> > >
>> > > The NullStream will eat all the tuples on the workers and return a
>> single
>> > > tuple with the tuple count and the time taken to run the expression.
>> So
>> > > you'll get one tuple from each worker. This will eliminate any
>> bottleneck
>> > > on tuples returning through the ParallelStream and you can focus on
>> the
>> > > performance of the intersect and the /export handler.
>> > >
>> > > Then experiment with:
>> > >
>> > > 1) Increasing the number of parallel workers.
>> > > 2) Increasing the number of replicas in the data collections.
>> > >
>> > > And watch the timing information coming back from the NullStream
>> tuples.
>> > If
>> > > increasing the workers is not improving performance then the
>> bottleneck
>> > may
>> > > be in the /export handler. So try increasing replicas and see if that
>> > > improves performance. Different partitions of the streams will be
>> served
>> > by
>> > > different replicas.
>> > >
>> > > If performance doesn't improve with the NullStream after increasing
>> both
>> > > workers and replicas then we know the bottleneck is the network.
>> > >
>> > > Joel 

Re: solr /export handler - behavior during close()

2017-05-16 Thread Joel Bernstein
Yep, saw it. I'll comment on the ticket for what I believe needs to be done.

Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, May 16, 2017 at 8:00 PM, Varun Thacker  wrote:

> Hi Joel, Susmit
>
> I created https://issues.apache.org/jira/browse/SOLR-10698 to track the
> issue
>
> @Susmit, looking at the stack trace I see the expression is using
> JSONTupleStream. I wonder, if you tried using JavabinTupleStreamParser,
> could it help improve performance?
>
> On Tue, May 16, 2017 at 9:39 AM, Susmit Shukla 
> wrote:
>
> > Hi Joel,
> >
> > queries can be arbitrarily nested with AND/OR/NOT joins e.g.
> >
> > (intersect(intersect(search, search), union(search, search))). If I cut
> > off the innermost stream with a limit, the complete intersection would not
> > happen at upper levels. Also, would the limit stream have the same effect
> > as using the /select handler with the rows parameter?
> > I am trying to force input stream close through reflection, just to see
> if
> > it gives performance gains.
> >
> > 2) Would experiment with null streams. Is workers = number of replicas in
> > the data collection a good rule of thumb? Is ParallelStream performance
> > upper-bounded by the number of replicas?
> >
> > Thanks,
> > Susmit
> >
> > On Tue, May 16, 2017 at 5:59 AM, Joel Bernstein 
> > wrote:
> >
> > > Your approach looks OK. The single sharded worker collection is only
> > needed
> > > if you were using CloudSolrStream to send the initial Streaming
> > Expression
> > > to the /stream handler. You are not doing this, so your approach is
> > > fine.
> > >
> > > Here are some thoughts on what you described:
> > >
> > > 1) If you are closing the parallel stream after the top 1000 results,
> > then
> > > try wrapping the intersect in a LimitStream. This stream doesn't exist
> > yet
> > > so it will be a custom stream. The LimitStream can return the EOF tuple
> > > after it reads N tuples. This will cause the worker nodes to close the
> > > underlying stream and cause the Broken Pipe exception to occur at the
> > > /export handler, which will stop the /export.
> > >
> > > Here is the basic approach:
> > >
> > > parallel(limit(intersect(search, search)))
> > >
> > >
> > > 2) It can be tricky to understand where the bottleneck lies when using
> > the
> > > ParallelStream for parallel relational algebra. You can use the
> > NullStream
> > > to get an understanding of why performance is not increasing when you
> > > increase the workers. Here is the basic approach:
> > >
> > > parallel(null(intersect(search, search)))
> > >
> > > The NullStream will eat all the tuples on the workers and return a
> single
> > > tuple with the tuple count and the time taken to run the expression. So
> > > you'll get one tuple from each worker. This will eliminate any
> bottleneck
> > > on tuples returning through the ParallelStream and you can focus on the
> > > performance of the intersect and the /export handler.
> > >
> > > Then experiment with:
> > >
> > > 1) Increasing the number of parallel workers.
> > > 2) Increasing the number of replicas in the data collections.
> > >
> > > And watch the timing information coming back from the NullStream
> tuples.
> > If
> > > increasing the workers is not improving performance then the bottleneck
> > may
> > > be in the /export handler. So try increasing replicas and see if that
> > > improves performance. Different partitions of the streams will be
> served
> > by
> > > different replicas.
> > >
> > > If performance doesn't improve with the NullStream after increasing
> both
> > > workers and replicas then we know the bottleneck is the network.
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > > On Mon, May 15, 2017 at 10:37 PM, Susmit Shukla <
> shukla.sus...@gmail.com
> > >
> > > wrote:
> > >
> > > > Hi Joel,
> > > >
> > > > Regarding the implementation, I am wrapping the topmost TupleStream
> > > > in a ParallelStream and executing it on the worker cluster (one of the
> > > > joined clusters doubles up as the worker cluster). ParallelStream does
> > > > submit the query to the /stream handler.
> > > > For #2: for example, I am creating 2 CloudSolrStreams, wrapping them
> > > > in an IntersectStream, wrapping that in a ParallelStream, and reading
> > > > the tuples out of the parallel stream; close() is called on the
> > > > ParallelStream. I do have custom streams, but they are similar to
> > > > IntersectStream.
> > > > I am on Solr 6.3.1.
> > > > The 2 Solr clusters serving the join queries have many shards. The
> > > > worker collection is also multi-sharded and is one of the main
> > > > clusters, so do you imply I should be using a single-sharded "worker"
> > > > collection? Would the joins execute faster?
> > > > On a side note, increasing the workers beyond 1 was not improving the
> > > > execution times, and was degrading them if the number was 3 or above.
> > > > That is counterintuitive, since the joins 

Re: solr /export handler - behavior during close()

2017-05-16 Thread Varun Thacker
Hi Joel, Susmit

I created https://issues.apache.org/jira/browse/SOLR-10698 to track the
issue

@Susmit, looking at the stack trace I see the expression is using
JSONTupleStream. I wonder, if you tried using JavabinTupleStreamParser,
could it help improve performance?

On Tue, May 16, 2017 at 9:39 AM, Susmit Shukla 
wrote:

> Hi Joel,
>
> queries can be arbitrarily nested with AND/OR/NOT joins e.g.
>
> (intersect(intersect(search, search), union(search, search))). If I cut off
> the innermost stream with a limit, the complete intersection would not
> happen at upper levels. Also, would the limit stream have the same effect
> as using the /select handler with the rows parameter?
> I am trying to force input stream close through reflection, just to see if
> it gives performance gains.
>
> 2) Would experiment with null streams. Is workers = number of replicas in
> the data collection a good rule of thumb? Is ParallelStream performance
> upper-bounded by the number of replicas?
>
> Thanks,
> Susmit
>
> On Tue, May 16, 2017 at 5:59 AM, Joel Bernstein 
> wrote:
>
> > Your approach looks OK. The single sharded worker collection is only
> needed
> > if you were using CloudSolrStream to send the initial Streaming
> Expression
> > to the /stream handler. You are not doing this, so your approach is
> > fine.
> >
> > Here are some thoughts on what you described:
> >
> > 1) If you are closing the parallel stream after the top 1000 results,
> then
> > try wrapping the intersect in a LimitStream. This stream doesn't exist
> yet
> > so it will be a custom stream. The LimitStream can return the EOF tuple
> > after it reads N tuples. This will cause the worker nodes to close the
> > underlying stream and cause the Broken Pipe exception to occur at the
> > /export handler, which will stop the /export.
> >
> > Here is the basic approach:
> >
> > parallel(limit(intersect(search, search)))
> >
> >
> > 2) It can be tricky to understand where the bottleneck lies when using
> the
> > ParallelStream for parallel relational algebra. You can use the
> NullStream
> > to get an understanding of why performance is not increasing when you
> > increase the workers. Here is the basic approach:
> >
> > parallel(null(intersect(search, search)))
> >
> > The NullStream will eat all the tuples on the workers and return a single
> > tuple with the tuple count and the time taken to run the expression. So
> > you'll get one tuple from each worker. This will eliminate any bottleneck
> > on tuples returning through the ParallelStream and you can focus on the
> > performance of the intersect and the /export handler.
> >
> > Then experiment with:
> >
> > 1) Increasing the number of parallel workers.
> > 2) Increasing the number of replicas in the data collections.
> >
> > And watch the timing information coming back from the NullStream tuples.
> If
> > increasing the workers is not improving performance then the bottleneck
> may
> > be in the /export handler. So try increasing replicas and see if that
> > improves performance. Different partitions of the streams will be served
> by
> > different replicas.
> >
> > If performance doesn't improve with the NullStream after increasing both
> > workers and replicas then we know the bottleneck is the network.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Mon, May 15, 2017 at 10:37 PM, Susmit Shukla  >
> > wrote:
> >
> > > Hi Joel,
> > >
> > > Regarding the implementation, I am wrapping the topmost TupleStream in
> > > a ParallelStream and executing it on the worker cluster (one of the
> > > joined clusters doubles up as the worker cluster). ParallelStream does
> > > submit the query to the /stream handler.
> > > For #2: for example, I am creating 2 CloudSolrStreams, wrapping them in
> > > an IntersectStream, wrapping that in a ParallelStream, and reading the
> > > tuples out of the parallel stream; close() is called on the
> > > ParallelStream. I do have custom streams, but they are similar to
> > > IntersectStream.
> > > I am on Solr 6.3.1.
> > > The 2 Solr clusters serving the join queries have many shards. The
> > > worker collection is also multi-sharded and is one of the main clusters,
> > > so do you imply I should be using a single-sharded "worker" collection?
> > > Would the joins execute faster?
> > > On a side note, increasing the workers beyond 1 was not improving the
> > > execution times, and was degrading them if the number was 3 or above.
> > > That is counterintuitive, since the joins are huge and putting more
> > > workers should have improved the performance.
> > >
> > > Thanks,
> > > Susmit
> > >
> > >
> > > On Mon, May 15, 2017 at 6:47 AM, Joel Bernstein 
> > > wrote:
> > >
> > > > Ok please do report any issues you run into. This is quite a good bug
> > > > report.
> > > >
> > > > I reviewed the code and I believe I see the problem. The problem
> > > > seems to be that the output code from the /stream handler is 

Re: solr /export handler - behavior during close()

2017-05-16 Thread Susmit Shukla
Hi Joel,

queries can be arbitrarily nested with AND/OR/NOT joins e.g.

(intersect(intersect(search, search), union(search, search))). If I cut off
the innermost stream with a limit, the complete intersection would not
happen at upper levels. Also, would the limit stream have the same effect as
using the /select handler with the rows parameter?
I am trying to force input stream close through reflection, just to see if
it gives performance gains.

2) Would experiment with null streams. Is workers = number of replicas in
the data collection a good rule of thumb? Is ParallelStream performance
upper-bounded by the number of replicas?

Thanks,
Susmit

On Tue, May 16, 2017 at 5:59 AM, Joel Bernstein  wrote:

> Your approach looks OK. The single sharded worker collection is only needed
> if you were using CloudSolrStream to send the initial Streaming Expression
> to the /stream handler. You are not doing this, so your approach is fine.
>
> Here are some thoughts on what you described:
>
> 1) If you are closing the parallel stream after the top 1000 results, then
> try wrapping the intersect in a LimitStream. This stream doesn't exist yet
> so it will be a custom stream. The LimitStream can return the EOF tuple
> after it reads N tuples. This will cause the worker nodes to close the
> underlying stream and cause the Broken Pipe exception to occur at the
> /export handler, which will stop the /export.
>
> Here is the basic approach:
>
> parallel(limit(intersect(search, search)))
>
>
> 2) It can be tricky to understand where the bottleneck lies when using the
> ParallelStream for parallel relational algebra. You can use the NullStream
> to get an understanding of why performance is not increasing when you
> increase the workers. Here is the basic approach:
>
> parallel(null(intersect(search, search)))
>
> The NullStream will eat all the tuples on the workers and return a single
> tuple with the tuple count and the time taken to run the expression. So
> you'll get one tuple from each worker. This will eliminate any bottleneck
> on tuples returning through the ParallelStream and you can focus on the
> performance of the intersect and the /export handler.
>
> Then experiment with:
>
> 1) Increasing the number of parallel workers.
> 2) Increasing the number of replicas in the data collections.
>
> And watch the timing information coming back from the NullStream tuples. If
> increasing the workers is not improving performance then the bottleneck may
> be in the /export handler. So try increasing replicas and see if that
> improves performance. Different partitions of the streams will be served by
> different replicas.
>
> If performance doesn't improve with the NullStream after increasing both
> workers and replicas then we know the bottleneck is the network.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Mon, May 15, 2017 at 10:37 PM, Susmit Shukla 
> wrote:
>
> > Hi Joel,
> >
> > Regarding the implementation, I am wrapping the topmost TupleStream in a
> > ParallelStream and executing it on the worker cluster (one of the joined
> > clusters doubles up as the worker cluster). ParallelStream does submit
> > the query to the /stream handler.
> > For #2: for example, I am creating 2 CloudSolrStreams, wrapping them in
> > an IntersectStream, wrapping that in a ParallelStream, and reading the
> > tuples out of the parallel stream; close() is called on the
> > ParallelStream. I do have custom streams, but they are similar to
> > IntersectStream.
> > I am on Solr 6.3.1.
> > The 2 Solr clusters serving the join queries have many shards. The worker
> > collection is also multi-sharded and is one of the main clusters, so do
> > you imply I should be using a single-sharded "worker" collection? Would
> > the joins execute faster?
> > On a side note, increasing the workers beyond 1 was not improving the
> > execution times, and was degrading them if the number was 3 or above.
> > That is counterintuitive, since the joins are huge and putting more
> > workers should have improved the performance.
> > have improved the performance.
> >
> > Thanks,
> > Susmit
> >
> >
> > On Mon, May 15, 2017 at 6:47 AM, Joel Bernstein 
> > wrote:
> >
> > > Ok please do report any issues you run into. This is quite a good bug
> > > report.
> > >
> > > I reviewed the code and I believe I see the problem. The problem seems
> > > to be that the output code from the /stream handler is not properly
> > > accounting for client disconnects and closing the underlying stream.
> > > What I see in the
> > > code is that exceptions coming from read() in the stream do
> automatically
> > > close the underlying stream. But exceptions from the writing of the
> > stream
> > > do not close the stream. This needs to be fixed.
> > >
> > > A few questions about your streaming implementation:
> > >
> > > 1) Are you sending requests to the /stream handler? Or are you
> embedding
> > > CloudSolrStream in your application and bypassing the /stream handler?
> > >
> > > 2) If you're 

Re: solr /export handler - behavior during close()

2017-05-16 Thread Joel Bernstein
Your approach looks OK. The single sharded worker collection is only needed
if you were using CloudSolrStream to send the initial Streaming Expression
to the /stream handler. You are not doing this, so your approach is fine.

Here are some thoughts on what you described:

1) If you are closing the parallel stream after the top 1000 results, then
try wrapping the intersect in a LimitStream. This stream doesn't exist yet
so it will be a custom stream. The LimitStream can return the EOF tuple
after it reads N tuples. This will cause the worker nodes to close the
underlying stream and cause the Broken Pipe exception to occur at the
/export handler, which will stop the /export.

Here is the basic approach:

parallel(limit(intersect(search, search)))
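
To make this concrete, here is a rough sketch of what such a LimitStream
could look like (the class doesn't exist yet, as noted, so everything below
is illustrative and written against the TupleStream API as of Solr 6.x):

import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.comp.StreamComparator;
import org.apache.solr.client.solrj.io.stream.StreamContext;
import org.apache.solr.client.solrj.io.stream.TupleStream;
import org.apache.solr.client.solrj.io.stream.expr.Explanation;
import org.apache.solr.client.solrj.io.stream.expr.StreamExplanation;
import org.apache.solr.client.solrj.io.stream.expr.StreamFactory;

public class LimitStream extends TupleStream {

  private final TupleStream stream;
  private final int limit;
  private int count;

  public LimitStream(TupleStream stream, int limit) {
    this.stream = stream;
    this.limit = limit;
  }

  public void open() throws IOException { stream.open(); }
  public void close() throws IOException { stream.close(); }
  public void setStreamContext(StreamContext context) { stream.setStreamContext(context); }
  public List<TupleStream> children() { return Collections.singletonList(stream); }
  public StreamComparator getStreamSort() { return stream.getStreamSort(); }

  public Tuple read() throws IOException {
    if (++count > limit) {
      // Return the EOF tuple early. The worker then closes the underlying
      // stream, which breaks the pipe to the /export handler and stops it.
      Map<String, Object> eof = new HashMap<>();
      eof.put("EOF", true);
      return new Tuple(eof);
    }
    return stream.read();
  }

  public Explanation toExplanation(StreamFactory factory) throws IOException {
    return new StreamExplanation(getStreamNodeId().toString())
        .withFunctionName("limit")
        .withExpressionType(Explanation.ExpressionType.STREAM_DECORATOR)
        .withImplementingClass(getClass().getName());
  }
}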


2) It can be tricky to understand where the bottleneck lies when using the
ParallelStream for parallel relational algebra. You can use the NullStream
to get an understanding of why performance is not increasing when you
increase the workers. Here is the basic approach:

parallel(null(intersect(search, search)))

The NullStream will eat all the tuples on the workers and return a single
tuple with the tuple count and the time taken to run the expression. So
you'll get one tuple from each worker. This will eliminate any bottleneck
on tuples returning through the ParallelStream and you can focus on the
performance of the intersect and the /export handler.
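
For example, the benchmarking run could look roughly like this (a sketch;
searchA/searchB, hosts, and collections are placeholders, and the
"nullCount"/"timer" tuple field names should be double-checked against your
version):

TupleStream nullStream = new NullStream(
    new IntersectStream(searchA, searchB, new FieldEqualitor("id")));
ParallelStream ps = new ParallelStream(zkHost, workerCollection, nullStream,
    numWorkers, new FieldComparator("timer", ComparatorOrder.ASCENDING));
ps.open();
Tuple t = ps.read();
while (!t.EOF) {
  // one tuple per worker: how many tuples it consumed and the time taken
  System.out.println("nullCount=" + t.get("nullCount") + ", timer=" + t.get("timer"));
  t = ps.read();
}
ps.close();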

Then experiment with:

1) Increasing the number of parallel workers.
2) Increasing the number of replicas in the data collections.

And watch the timing information coming back from the NullStream tuples. If
increasing the workers is not improving performance then the bottleneck may
be in the /export handler. So try increasing replicas and see if that
improves performance. Different partitions of the streams will be served by
different replicas.

If performance doesn't improve with the NullStream after increasing both
workers and replicas then we know the bottleneck is the network.

Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, May 15, 2017 at 10:37 PM, Susmit Shukla 
wrote:

> Hi Joel,
>
> Regarding the implementation, I am wrapping the topmost TupleStream in a
> ParallelStream and executing it on the worker cluster (one of the joined
> clusters doubles up as the worker cluster). ParallelStream does submit the
> query to the /stream handler.
> For #2: for example, I am creating 2 CloudSolrStreams, wrapping them in an
> IntersectStream, wrapping that in a ParallelStream, and reading the tuples
> out of the parallel stream; close() is called on the ParallelStream. I do
> have custom streams, but they are similar to IntersectStream.
> I am on Solr 6.3.1.
> The 2 Solr clusters serving the join queries have many shards. The worker
> collection is also multi-sharded and is one of the main clusters, so do you
> imply I should be using a single-sharded "worker" collection? Would the
> joins execute faster?
> On a side note, increasing the workers beyond 1 was not improving the
> execution times, and was degrading them if the number was 3 or above. That
> is counterintuitive, since the joins are huge and putting more workers
> should have improved the performance.
>
> Thanks,
> Susmit
>
>
> On Mon, May 15, 2017 at 6:47 AM, Joel Bernstein 
> wrote:
>
> > Ok please do report any issues you run into. This is quite a good bug
> > report.
> >
> > I reviewed the code and I believe I see the problem. The problem seems to
> > be that the output code from the /stream handler is not properly
> > accounting for client disconnects and closing the underlying stream. What
> > I see in the
> > code is that exceptions coming from read() in the stream do automatically
> > close the underlying stream. But exceptions from the writing of the
> stream
> > do not close the stream. This needs to be fixed.
> >
> > A few questions about your streaming implementation:
> >
> > 1) Are you sending requests to the /stream handler? Or are you embedding
> > CloudSolrStream in your application and bypassing the /stream handler?
> >
> > 2) If you're sending Streaming Expressions to the /stream handler, are you
> > using SolrStream or CloudSolrStream to send the expression?
> >
> > 3) What version of Solr are you using?
> >
> > 4) Have you implemented any custom streams?
> >
> >
> > #2 is an important question. If you're sending expressions to the /stream
> > handler using CloudSolrStream the collection running the expression would
> > have to be set up a specific way. The collection running the expression
> > will have to be a *single shard collection*. You can have as many replicas as
> > you want but only one shard. That's because CloudSolrStream picks one
> > replica in each shard to forward the request to then merges the results
> > from the shards. So if you send in an expression using CloudSolrStream,
> > that expression will be sent to each shard to be run, and each shard will
> > duplicate the work and return duplicate 

Re: solr /export handler - behavior during close()

2017-05-15 Thread Susmit Shukla
Hi Joel,

Regarding the implementation, I am wrapping the topmost TupleStream in a
ParallelStream and executing it on the worker cluster (one of the joined
clusters doubles up as the worker cluster). ParallelStream does submit the
query to the /stream handler.
For #2: for example, I am creating 2 CloudSolrStreams, wrapping them in an
IntersectStream, wrapping that in a ParallelStream, and reading the tuples
out of the parallel stream; close() is called on the ParallelStream. I do
have custom streams, but they are similar to IntersectStream.
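
In code, the composition looks roughly like this (a sketch; hosts,
collections, params, and the "id" sort/equality field are placeholders):

CloudSolrStream left = new CloudSolrStream(zkHostA, collectionA, paramsA);
CloudSolrStream right = new CloudSolrStream(zkHostB, collectionB, paramsB);
TupleStream intersect = new IntersectStream(left, right, new FieldEqualitor("id"));
ParallelStream ps = new ParallelStream(workerZkHost, workerCollection,
    intersect, numWorkers, new FieldComparator("id", ComparatorOrder.ASCENDING));
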
I am on Solr 6.3.1.
The 2 Solr clusters serving the join queries have many shards. The worker
collection is also multi-sharded and is one of the main clusters, so do you
imply I should be using a single-sharded "worker" collection? Would the
joins execute faster?
On a side note, increasing the workers beyond 1 was not improving the
execution times, and was degrading them if the number was 3 or above. That
is counterintuitive, since the joins are huge and putting more workers
should have improved the performance.

Thanks,
Susmit


On Mon, May 15, 2017 at 6:47 AM, Joel Bernstein  wrote:

> Ok please do report any issues you run into. This is quite a good bug
> report.
>
> I reviewed the code and I believe I see the problem. The problem seems to
> be that the output code from the /stream handler is not properly accounting for
> client disconnects and closing the underlying stream. What I see in the
> code is that exceptions coming from read() in the stream do automatically
> close the underlying stream. But exceptions from the writing of the stream
> do not close the stream. This needs to be fixed.
>
> A few questions about your streaming implementation:
>
> 1) Are you sending requests to the /stream handler? Or are you embedding
> CloudSolrStream in your application and bypassing the /stream handler?
>
> 2) If you're sending Streaming Expressions to the /stream handler, are you
> using SolrStream or CloudSolrStream to send the expression?
>
> 3) What version of Solr are you using?
>
> 4) Have you implemented any custom streams?
>
>
> #2 is an important question. If you're sending expressions to the /stream
> handler using CloudSolrStream the collection running the expression would
> have to be set up a specific way. The collection running the expression will
> have to be a *single shard collection*. You can have as many replicas as
> you want but only one shard. That's because CloudSolrStream picks one
> replica in each shard to forward the request to then merges the results
> from the shards. So if you send in an expression using CloudSolrStream that
> expression will be sent to each shard to be run and each shard will be
> duplicating the work and return duplicate results.
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Sat, May 13, 2017 at 7:03 PM, Susmit Shukla 
> wrote:
>
> > Thanks Joel
> > Streaming is awesome; I just did a huge implementation in my project. I
> > found a couple more issues with streaming and did local hacks for them;
> > I'll raise them too.
> >
> > On Sat, May 13, 2017 at 2:09 PM, Joel Bernstein 
> > wrote:
> >
> > > Ah, then this is unexpected behavior. Can you open a ticket for this?
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > > On Sat, May 13, 2017 at 2:51 PM, Susmit Shukla <
> shukla.sus...@gmail.com>
> > > wrote:
> > >
> > > > Hi Joel,
> > > >
> > > > I was using CloudSolrStream for the above test. Below is the call
> > stack.
> > > >
> > > > at org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:215)
> > > > at org.apache.http.impl.io.ChunkedInputStream.close(ChunkedInputStream.java:316)
> > > > at org.apache.http.impl.execchain.ResponseEntityProxy.streamClosed(ResponseEntityProxy.java:128)
> > > > at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
> > > > at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:174)
> > > > at sun.nio.cs.StreamDecoder.implClose(StreamDecoder.java:378)
> > > > at sun.nio.cs.StreamDecoder.close(StreamDecoder.java:193)
> > > > at java.io.InputStreamReader.close(InputStreamReader.java:199)
> > > > at org.apache.solr.client.solrj.io.stream.JSONTupleStream.close(JSONTupleStream.java:91)
> > > > at org.apache.solr.client.solrj.io.stream.SolrStream.close(SolrStream.java:186)
> > > >
> > > > Thanks,
> > > > Susmit
> > > >
> > > > On Sat, May 13, 2017 at 10:48 AM, Joel Bernstein  >
> > > > wrote:
> > > >
> > > > > I was just reading the Java docs on the ChunkedInputStream.
> > > > >
> > > > > "Note that this class NEVER closes the underlying stream"
> > > > >
> > > > In that scenario the /export would indeed continue to send data. I
> > > > think we can consider this an 

Re: solr /export handler - behavior during close()

2017-05-15 Thread Joel Bernstein
Ok please do report any issues you run into. This is quite a good bug
report.

I reviewed the code and I believe I see the problem. The problem seems to
be that the output code from the /stream handler is not properly accounting for
client disconnects and closing the underlying stream. What I see in the
code is that exceptions coming from read() in the stream do automatically
close the underlying stream. But exceptions from the writing of the stream
do not close the stream. This needs to be fixed.
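
The shape of the fix would be roughly this (an illustrative sketch of the
/stream handler's write loop, not the actual Solr code; serialize() is a
placeholder):

try {
  Tuple tuple = tupleStream.read();
  while (!tuple.EOF) {
    writer.write(serialize(tuple)); // may throw when the client disconnects
    tuple = tupleStream.read();
  }
} finally {
  tupleStream.close(); // close on write failures too, not just read failures
}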

A few questions about your streaming implementation:

1) Are you sending requests to the /stream handler? Or are you embedding
CloudSolrStream in your application and bypassing the /stream handler?

2) If you're sending Streaming Expressions to the /stream handler, are you
using SolrStream or CloudSolrStream to send the expression?

3) What version of Solr are you using?

4) Have you implemented any custom streams?


#2 is an important question. If you're sending expressions to the /stream
handler using CloudSolrStream the collection running the expression would
have to be set up a specific way. The collection running the expression will
have to be a *single shard collection*. You can have as many replicas as
you want, but only one shard. That's because CloudSolrStream picks one
replica in each shard to forward the request to, then merges the results
from the shards. So if you send in an expression using CloudSolrStream, that
expression will be sent to each shard to be run, and each shard will
duplicate the work and return duplicate results.

Joel Bernstein
http://joelsolr.blogspot.com/

On Sat, May 13, 2017 at 7:03 PM, Susmit Shukla 
wrote:

> Thanks Joel
> Streaming is awesome; I just did a huge implementation in my project. I
> found a couple more issues with streaming and did local hacks for them;
> I'll raise them too.
>
> On Sat, May 13, 2017 at 2:09 PM, Joel Bernstein 
> wrote:
>
> > Ah, then this is unexpected behavior. Can you open a ticket for this?
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Sat, May 13, 2017 at 2:51 PM, Susmit Shukla 
> > wrote:
> >
> > > Hi Joel,
> > >
> > > I was using CloudSolrStream for the above test. Below is the call
> stack.
> > >
> > > at
> > > org.apache.http.impl.io.ChunkedInputStream.read(
> > > ChunkedInputStream.java:215)
> > > at
> > > org.apache.http.impl.io.ChunkedInputStream.close(
> > > ChunkedInputStream.java:316)
> > > at
> > > org.apache.http.impl.execchain.ResponseEntityProxy.streamClosed(
> > > ResponseEntityProxy.java:128)
> > > at
> > > org.apache.http.conn.EofSensorInputStream.checkClose(
> > > EofSensorInputStream.java:228)
> > > at
> > > org.apache.http.conn.EofSensorInputStream.close(
> > > EofSensorInputStream.java:174)
> > > at sun.nio.cs.StreamDecoder.implClose(StreamDecoder.java:378)
> > > at sun.nio.cs.StreamDecoder.close(StreamDecoder.java:193)
> > > at java.io.InputStreamReader.close(InputStreamReader.java:199)
> > > at
> > > org.apache.solr.client.solrj.io.stream.JSONTupleStream.
> > > close(JSONTupleStream.java:91)
> > > at
> > > org.apache.solr.client.solrj.io.stream.SolrStream.close(
> > > SolrStream.java:186)
> > >
> > > Thanks,
> > > Susmit
> > >
> > > On Sat, May 13, 2017 at 10:48 AM, Joel Bernstein 
> > > wrote:
> > >
> > > > I was just reading the Java docs on the ChunkedInputStream.
> > > >
> > > > "Note that this class NEVER closes the underlying stream"
> > > >
> > > > In that scenario the /export would indeed continue to send data. I
> > think
> > > we
> > > > can consider this an anti-pattern for the /export handler currently.
> > > >
> > > > I would suggest using one of the Streaming Clients to connect to the
> > > export
> > > > handler. Either CloudSolrStream or SolrStream will both interact with
> > the
> > > > /export handler in a the way that it expects.
> > > >
> > > >
> > > > Joel Bernstein
> > > > http://joelsolr.blogspot.com/
> > > >
> > > > On Sat, May 13, 2017 at 12:28 PM, Susmit Shukla <
> > shukla.sus...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > Hi Joel,
> > > > >
> > > > > I did not observe that. On calling close() on stream, it cycled
> > through
> > > > all
> > > > > the hits that /export handler calculated.
> > > > > e.g. with a *:* query and export handler on a 100k document index,
> I
> > > > could
> > > > > see the 100kth record printed on the http wire debug log although
> > close
> > > > was
> > > > > called after reading 1st tuple. The time taken for the operation
> with
> > > > > close() call was same as that if I had read all the 100k tuples.
> > > > > As I have pointed out, close() on underlying ChunkedInputStream
> calls
> > > > > read() and solr server has probably no way to distinguish it from
> > > read()
> > > > > happening from regular tuple reads..
> > > > > I think there should be an abort() API for solr streams that hooks
> > into
> > > > > httpmethod.abort() . 

Re: solr /export handler - behavior during close()

2017-05-13 Thread Susmit Shukla
Thanks Joel.
Streaming is awesome; I just did a huge implementation of it in my project. I
found a couple more issues with streaming and made local hacks for them; I
will raise those too.

On Sat, May 13, 2017 at 2:09 PM, Joel Bernstein  wrote:


Re: solr /export handler - behavior during close()

2017-05-13 Thread Joel Bernstein
Ah, then this is unexpected behavior. Can you open a ticket for this?

Joel Bernstein
http://joelsolr.blogspot.com/

On Sat, May 13, 2017 at 2:51 PM, Susmit Shukla 
wrote:


Re: solr /export handler - behavior during close()

2017-05-13 Thread Susmit Shukla
Hi Joel,

I was using CloudSolrStream for the above test. Below is the call stack.

at
org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:215)
at
org.apache.http.impl.io.ChunkedInputStream.close(ChunkedInputStream.java:316)
at
org.apache.http.impl.execchain.ResponseEntityProxy.streamClosed(ResponseEntityProxy.java:128)
at
org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
at
org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:174)
at sun.nio.cs.StreamDecoder.implClose(StreamDecoder.java:378)
at sun.nio.cs.StreamDecoder.close(StreamDecoder.java:193)
at java.io.InputStreamReader.close(InputStreamReader.java:199)
at
org.apache.solr.client.solrj.io.stream.JSONTupleStream.close(JSONTupleStream.java:91)
at
org.apache.solr.client.solrj.io.stream.SolrStream.close(SolrStream.java:186)

Thanks,
Susmit

On Sat, May 13, 2017 at 10:48 AM, Joel Bernstein  wrote:


Re: solr /export handler - behavior during close()

2017-05-13 Thread Joel Bernstein
I was just reading the Java docs on the ChunkedInputStream.

"Note that this class NEVER closes the underlying stream"

In that scenario the /export handler would indeed continue to send data. I
think we can consider this an anti-pattern for the /export handler currently.

I would suggest using one of the streaming clients to connect to the /export
handler. Either CloudSolrStream or SolrStream will interact with the /export
handler in the way that it expects.
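
For instance, a minimal sketch of reading from the /export handler with
CloudSolrStream (the ZooKeeper address, collection, and fields are just
placeholders):

import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.CloudSolrStream;
import org.apache.solr.common.params.ModifiableSolrParams;

public class ExportReader {
  public static void main(String[] args) throws Exception {
    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("q", "*:*");
    params.set("fl", "id");
    params.set("sort", "id asc");
    params.set("qt", "/export"); // hit the /export handler rather than /select
    CloudSolrStream stream =
        new CloudSolrStream("localhost:9983", "collection1", params);
    try {
      stream.open();
      for (Tuple tuple = stream.read(); !tuple.EOF; tuple = stream.read()) {
        System.out.println(tuple.getMap());
      }
    } finally {
      stream.close(); // the drain-on-close behavior under discussion happens here
    }
  }
}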


Joel Bernstein
http://joelsolr.blogspot.com/

On Sat, May 13, 2017 at 12:28 PM, Susmit Shukla 
wrote:



Re: solr /export handler - behavior during close()

2017-05-13 Thread Susmit Shukla
Hi Joel,

I did not observe that. On calling close() on the stream, it cycled through
all the hits that the /export handler calculated.
e.g. with a *:* query and the export handler on a 100k document index, I could
see the 100kth record printed in the http wire debug log although close() was
called after reading the 1st tuple. The time taken for the operation with the
close() call was the same as if I had read all 100k tuples.
As I have pointed out, close() on the underlying ChunkedInputStream calls
read(), and the Solr server probably has no way to distinguish it from a
read() happening during regular tuple reads.
I think there should be an abort() API for solr streams that hooks into
httpmethod.abort(). That would enable the client to disconnect early, and
that would probably disconnect the underlying socket so there would be no
leaks.
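
Something like this raw HttpClient sketch illustrates the idea; it is not an
existing SolrJ API, and the URL and parameters are illustrative:

import java.io.InputStream;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class AbortingExportRead {
  public static void main(String[] args) throws Exception {
    CloseableHttpClient client = HttpClients.createDefault();
    HttpGet get = new HttpGet(
        "http://localhost:8983/solr/collection1/export?q=*:*&sort=id+asc&fl=id&wt=json");
    CloseableHttpResponse response = client.execute(get);
    try {
      InputStream in = response.getEntity().getContent();
      byte[] buf = new byte[8192];
      in.read(buf); // read only the first chunk of tuples
      // Abort the request: the connection is torn down immediately instead
      // of close() draining the remaining chunks to the very end of the hits.
      get.abort();
    } finally {
      response.close();
      client.close();
    }
  }
}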

Thanks,
Susmit


On Sat, May 13, 2017 at 7:42 AM, Joel Bernstein  wrote:



Re: solr /export handler - behavior during close()

2017-05-13 Thread Joel Bernstein
If the client closes the connection to the /export handler, then this "Broken
Pipe" exception will occur automatically on the server.

Joel Bernstein
http://joelsolr.blogspot.com/

On Sat, May 13, 2017 at 1:46 AM, Susmit Shukla 
wrote:



Re: solr /export handler - behavior during close()

2017-05-12 Thread Susmit Shukla
Hi Joel,

Thanks for the insight. How can this exception be thrown/forced from the
client side? The client can't do a System.exit() as it is running as a webapp.

Thanks,
Susmit

On Fri, May 12, 2017 at 4:44 PM, Joel Bernstein  wrote:



Re: solr /export handler - behavior during close()

2017-05-12 Thread Joel Bernstein
In this scenario the /export handler continues to export results until it
encounters a "Broken Pipe" exception. This exception is trapped and ignored
rather than logged, as it's not considered an error condition when the client
disconnects early.

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, May 12, 2017 at 2:10 PM, Susmit Shukla 
wrote:



solr /export handler - behavior during close()

2017-05-12 Thread Susmit Shukla
Hi,

I have a question regarding the solr /export handler. Here is the scenario -
I want to use the /export handler - I only need sorted data, and this is the
fastest way to get it. I am doing multiple levels of joins over streams using
the /export handler. I know the number of top-level records to be retrieved,
but not for each individual stream rolling up to the final result.
I observed that calling close() on an /export stream is too expensive: it
reads the stream to the very end of the hits. Assuming there are 100 million
hits for each stream and the first 1k records were found after the joins,
calling close() at that point would take many minutes or even hours to
finish. Currently I have put the close() call on a different thread -
basically fire and forget. But the cluster is very strained because of the
unnecessary reads.
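
The fire and forget close is roughly this sketch (the helper class and its
names are hypothetical):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.solr.client.solrj.io.stream.TupleStream;

public class FireAndForgetCloser {
  private final ExecutorService closer = Executors.newSingleThreadExecutor();

  // Hand the stream off so the request thread returns immediately; close()
  // still drains every remaining hit, just on a background thread.
  public void closeLater(TupleStream stream) {
    closer.submit(() -> {
      try {
        stream.close();
      } catch (Exception e) {
        // ignored - the goal is only to release the connection eventually
      }
    });
  }
}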

Internally, streaming uses HttpClient's ChunkedInputStream, and it has to be
drained in the close() call. But from the server's point of view, it should
stop sending more data once close() has been issued.
There is a read() call in the close() method of ChunkedInputStream that is
indistinguishable from a real read(). If the /export handler stopped sending
more data after close(), that would be very useful.

Another option would be to use the /select handler and get into the business
of managing a custom cursor mark that is based on the stream sort and is
reset until it fetches the required records at the topmost level.
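
A sketch of that /select plus cursorMark alternative with SolrJ (the
collection, uniqueKey field, page size, and stop condition are assumptions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorMarkReader {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client =
        new HttpSolrClient.Builder("http://localhost:8983/solr/collection1").build();
    SolrQuery query = new SolrQuery("*:*");
    query.setRows(1000);
    query.setSort(SolrQuery.SortClause.asc("id")); // sort must include the uniqueKey
    String cursorMark = CursorMarkParams.CURSOR_MARK_START;
    int fetched = 0;
    while (fetched < 1000) { // stop as soon as enough top-level records arrive
      query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
      QueryResponse rsp = client.query(query);
      fetched += rsp.getResults().size();
      String next = rsp.getNextCursorMark();
      if (cursorMark.equals(next)) {
        break; // cursor did not advance: no more results
      }
      cursorMark = next;
    }
    client.close();
  }
}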

Any thoughts?

Thanks,
Susmit