Re: The Streaming API (Solrj.io) : id must have DocValues?

2016-04-26 Thread Joel Bernstein
My blog is pretty out of date at this point unfortunately. I need to get
some better examples published.

Also, a huge amount of work went into the Solr 6 Streaming API and
Streaming Expressions, which makes them much easier to work with. In Solr 6.1
you'll be able to test Streaming Expressions from the Solr admin console,
which should be very helpful.

Since you're planning on using the joins, performance will be very much
driven by the number of shards and replicas pushing the streams and the
number of workers performing the join.

If you're still having problems with Solr 6.0, feel free to post the
Expression you're using and I or other people can help debug the issue.


Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Apr 26, 2016 at 2:29 PM, sudsport s  wrote:

> I see that some work was done to remove the stream handler from the config, so
> is enabling the stream handler still a security issue?
>
> https://issues.apache.org/jira/browse/SOLR-8262

Re: The Streaming API (Solrj.io) : id must have DocValues?

2016-04-26 Thread sudsport s
I see that some work was done to remove the stream handler from the config, so
is enabling the stream handler still a security issue?

https://issues.apache.org/jira/browse/SOLR-8262
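
For context, on 5.x configs that do not already define it, the handler is usually registered in solrconfig.xml along the lines of the sketch below. This is based on the sample configs of that era rather than on anything confirmed in this thread; the invariants shown are assumptions to check against your own Solr version, and the JIRA above is worth reading before exposing the handler.

```xml
<!-- Hypothetical solrconfig.xml fragment: registers the /stream handler
     that a parallel stream calls on the worker nodes. -->
<requestHandler name="/stream" class="solr.StreamHandler">
  <lst name="invariants">
    <str name="wt">json</str>
    <str name="distrib">false</str>
  </lst>
</requestHandler>
```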

On Tue, Apr 26, 2016 at 11:14 AM, sudsport s  wrote:

> I am using Solr 5.3.1 on the server and SolrJ 5.5 on the client. I will try
> with SolrJ 6.0.

Re: The Streaming API (Solrj.io) : id must have DocValues?

2016-04-26 Thread sudsport s
I am using Solr 5.3.1 on the server and SolrJ 5.5 on the client. I will try with
SolrJ 6.0.

On Tue, Apr 26, 2016 at 11:12 AM, Susmit Shukla 
wrote:

> Which SolrJ version are you using? Could you try with SolrJ 6.0?

Re: The Streaming API (Solrj.io) : id must have DocValues?

2016-04-26 Thread Susmit Shukla
Which SolrJ version are you using? Could you try with SolrJ 6.0?

On Tue, Apr 26, 2016 at 10:36 AM, sudsport s  wrote:

> @Joel
> >Can you describe how you're planning on using Streaming?
>
> I am mostly using it for the distributed join case. We were planning to use
> similar logic (hash id and join) in Spark for our use case, but since the data
> is stored in Solr, I will be using Solr streams to perform the same operation.
>
> I have similar use cases to build probabilistic data structures while
> streaming results. I might have to spend some time exploring query
> optimization (deciding the sort order while doing the join, etc.).
>
> Please let me know if you have any feedback.

Re: The Streaming API (Solrj.io) : id must have DocValues?

2016-04-26 Thread sudsport s
@Joel
>Can you describe how you're planning on using Streaming?

I am mostly using it for the distributed join case. We were planning to use
similar logic (hash id and join) in Spark for our use case, but since the data
is stored in Solr, I will be using Solr streams to perform the same operation.

I have similar use cases to build probabilistic data structures while
streaming results. I might have to spend some time exploring query
optimization (deciding the sort order while doing the join, etc.).

Please let me know if you have any feedback.

On Tue, Apr 26, 2016 at 10:30 AM, sudsport s  wrote:

> Thanks @Reth, yes, that was one of my concerns. I will look at the JIRA you
> mentioned.
>
> Thanks Joel,
> I used some of the examples for the streaming client from your blog. I got a basic
> tuple stream working, but I get the following exception while running a parallel
> stream.
>
>
> java.io.IOException: java.util.concurrent.ExecutionException:
> org.noggit.JSONParser$ParseException: JSON Parse Error: char=<,position=0
> BEFORE='<' AFTER='html>
>
> It looks like the parallel stream is trying to access /stream on the shard. Can
> someone tell me how to enable the stream handler? I have the export handler
> enabled. I will look at the latest solrconfig to see if I can turn that on.
>
>
>
> @Joel I am already running sizing exercises; I will run a new one with
> Solr 5.5+ and docValues enabled on id.
>
> BTW Solr streaming has amazing response times, thanks for making it so
> FAST!!!

Re: The Streaming API (Solrj.io) : id must have DocValues?

2016-04-26 Thread sudsport s
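Thanks @Reth, yes, that was one of my concerns. I will look at the JIRA you
mentioned.

Thanks Joel,
I used some of the examples for the streaming client from your blog. I got a basic
tuple stream working, but I get the following exception while running a parallel
stream.


java.io.IOException: java.util.concurrent.ExecutionException:
org.noggit.JSONParser$ParseException: JSON Parse Error: char=<,position=0
BEFORE='<' AFTER='html>

It looks like the parallel stream is trying to access /stream on the shard. Can
someone tell me how to enable the stream handler? I have the export handler
enabled. I will look at the latest solrconfig to see if I can turn that on.

@Joel I am already running sizing exercises; I will run a new one with
Solr 5.5+ and docValues enabled on id.

BTW Solr streaming has amazing response times, thanks for making it so
FAST!!!

On Mon, Apr 25, 2016 at 10:54 AM, Joel Bernstein wrote: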

> Can you describe how you're planning on using Streaming? I can provide some
> feedback on how it will perform for your use case.
>
> When scaling out Streaming you'll get large performance boosts when you
> increase the number of shards, replicas and workers. This is particularly
> true if you're doing parallel relational algebra or map/reduce operations.
>
> As far as DocValues being expensive with unique fields goes, you'll want to do a
> sizing exercise to see how many documents per shard work best for your use
> case. There are different docValues implementations that will allow you to
> trade off memory for performance.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/

Re: The Streaming API (Solrj.io) : id must have DocValues?

2016-04-25 Thread Joel Bernstein
Can you describe how you're planning on using Streaming? I can provide some
feedback on how it will perform for your use case.

When scaling out Streaming you'll get large performance boosts when you
increase the number of shards, replicas and workers. This is particularly
true if you're doing parallel relational algebra or map/reduce operations.

As far as DocValues being expensive with unique fields goes, you'll want to do a
sizing exercise to see how many documents per shard work best for your use
case. There are different docValues implementations that will allow you to
trade off memory for performance.

Joel Bernstein
http://joelsolr.blogspot.com/
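
As a concrete reference point for the sizing exercise, the schema change that the "id must have DocValues" error asks for is small. The sketch below assumes a string-typed uniqueKey field named id; the field type and the other attributes are assumptions, since the thread does not show the actual schema. Existing documents need to be reindexed after the change, because docValues are written at index time.

```xml
<!-- Hypothetical schema.xml fragment: enable docValues on the id field so
     the /export handler can sort on it and return it. -->
<field name="id" type="string" indexed="true" stored="true"
       required="true" multiValued="false" docValues="true"/>
```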

On Mon, Apr 25, 2016 at 3:30 AM, Reth RM  wrote:

> Hi,
>
> So, is the concern that the same field value is stored twice, once with
> stored=true and once with docValues=true? If that is the case, there is a
> relevant JIRA that has been fixed [1]. If you upgrade to version 5.5/6.0, it is
> possible to read non-stored fields from the docValues index; check it out.
>
>
> [1] https://issues.apache.org/jira/browse/SOLR-8220
>
> On Mon, Apr 25, 2016 at 9:44 AM, sudsport s  wrote:
>
> > Thanks Erick for the reply,
> >
> > Since I was storing id (it's a stored field), my guess is that after enabling
> > docValues it will be stored in two places. Also, as per my understanding,
> > docValues are great when you have values which repeat. I am not sure how
> > beneficial it would be for a uniqueId field.
> > I am looking at a collection of a few hundred billion documents, which is the
> > reason I really want to care about expense from the design phase.
> >
> >
> >
> >
> > > On Sun, Apr 24, 2016 at 7:24 PM, Erick Erickson wrote:
> >
> > > In a word, "yes".
> > >
> > > DocValues aren't particularly expensive, or expensive at all. The idea
> > > is that when you sort by a field or facet, the field has to be
> > > "uninverted" which builds the entire structure in Java's JVM (this is
> > > when the field is _not_ DocValues).
> > >
> > > DocValues essentially serialize this structure to disk. So your
> > > on-disk index size is larger, but that size is MMaped rather than
> > > stored on Java's heap.
> > >
> > > Really, the question I'd have to ask though is "why do you care about
> > > the expense?". If you have a functional requirement that has to be
> > > served by returning the id via the /export handler, you really have no
> > > choice.
> > >
> > > Best,
> > > Erick
> > >
> > >
> > > On Sun, Apr 24, 2016 at 9:55 AM, sudsport s 
> > wrote:
> > > > I was trying to use Streaming for reading basic tuple stream. I am
> > using
> > > > sort by id asc ,
> > > > I am getting following exception
> > > >
> > > > I am using export search handler as per
> > > >
> https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets
> > > >
> > > > null:java.io.IOException: id must have DocValues to use this feature.
> > > > at
> > >
> >
> org.apache.solr.response.SortingResponseWriter.getFieldWriters(SortingResponseWriter.java:241)
> > > > at
> > >
> >
> org.apache.solr.response.SortingResponseWriter.write(SortingResponseWriter.java:120)
> > > > at
> > >
> >
> org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:53)
> > > > at
> > >
> org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:742)
> > > > at
> > > org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:471)
> > > > at
> > >
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:214)
> > > > at
> > >
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)
> > > > at
> > >
> >
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
> > > > at
> > >
> >
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
> > > > at
> > >
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> > > > at
> > >
> >
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
> > > > at
> > >
> >
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
> > > > at
> > >
> >
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
> > > > at
> > >
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
> > > > at org.eclipse.jetty.server.session.SessionHandler.doScope(
> > > >
> > > >
> > > > does it make sense to enable docValues for unique field? How
> expensive
> > > is it?
> > > >
> > > >
> > > > if I have existing collection can I update schema and optimize
> > > > collection to get docvalues enabled for id?
> > > >
> > > >
> > > > --
> > > >
> > > > Thanks
> > >
> >
>


Re: The Streaming API (Solrj.io) : id must have DocValues?

2016-04-25 Thread Reth RM
Hi,

So, is the concern that the same field value is stored twice, once with
stored=true and once with docValues=true? If that is the case, there is a
relevant JIRA issue, now fixed [1]. If you upgrade to version 5.5/6.0, it is
possible to read non-stored fields from the docValues index; check it out.


[1] https://issues.apache.org/jira/browse/SOLR-8220
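
To make that concrete, here is a minimal sketch in Python that builds a
Schema API "replace-field" command for the id field. It assumes the
collection uses a managed schema (otherwise the same attributes would go
in schema.xml by hand), that id is of type "string", and that the helper
name replace_field_command is made up for this example. Setting
stored=false is only an option to evaluate on 5.5/6.0 and later, and
whether it suits a uniqueKey depends on the other features in use.
Changing docValues on an existing field also requires reindexing the
documents.

```python
import json


def replace_field_command(name, field_type="string", stored=True, doc_values=True):
    """Build the JSON body for a Schema API 'replace-field' command."""
    return {
        "replace-field": {
            "name": name,
            "type": field_type,  # assumed field type; adjust to your schema
            "indexed": True,
            "stored": stored,
            "docValues": doc_values,
        }
    }


if __name__ == "__main__":
    # Variant that avoids keeping the value twice (docValues only).
    body = replace_field_command("id", stored=False)
    # This body would be POSTed to <solr base URL>/<collection>/schema
    print(json.dumps(body, indent=2))
```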



Re: The Streaming API (Solrj.io) : id must have DocValues?

2016-04-24 Thread sudsport s
Thanks, Erick, for the reply.

Since I am already storing id (it is a stored field), my guess is that after
enabling docValues it will be stored in two places. Also, as I understand
it, docValues are great when you have values that repeat; I am not sure how
beneficial they would be for a uniqueId field.
I am looking at a collection of a few hundred billion documents, which is
why I want to account for the expense from the design phase.






Re: The Streaming API (Solrj.io) : id must have DocValues?

2016-04-24 Thread Erick Erickson
In a word, "yes".

DocValues aren't particularly expensive, or expensive at all. The idea
is that when you sort or facet on a field, the field has to be
"uninverted", which builds the entire structure on the JVM heap (this is
when the field is _not_ DocValues).

DocValues essentially serialize this structure to disk. So your
on-disk index size is larger, but that size is MMapped rather than
stored on Java's heap.

Really, the question I'd have to ask though is "why do you care about
the expense?". If you have a functional requirement that has to be
served by returning the id via the /export handler, you really have no
choice.
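
For reference, a request to the /export handler has to name a sort and a
field list, and every field in either one needs docValues. A minimal
sketch in Python of building such a request URL follows; the base URL
http://localhost:8983/solr, the collection name "collection1" and the
helper name export_url are assumptions for illustration only.

```python
from urllib.parse import urlencode


def export_url(base_url, collection, query="*:*", sort="id asc", fields=("id",)):
    """Build an /export request URL; sort and fl fields must have docValues."""
    params = urlencode({
        "q": query,
        "sort": sort,             # required by the export handler
        "fl": ",".join(fields),   # required by the export handler
    })
    return f"{base_url.rstrip('/')}/{collection}/export?{params}"


if __name__ == "__main__":
    # Hypothetical host and collection name.
    print(export_url("http://localhost:8983/solr", "collection1"))
```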

Best,
Erick




The Streaming API (Solrj.io) : id must have DocValues?

2016-04-24 Thread sudsport s
I was trying to use Streaming to read a basic tuple stream. I am sorting by
id asc, and I am getting the following exception.

I am using the export search handler as per
https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets

null:java.io.IOException: id must have DocValues to use this feature.
    at org.apache.solr.response.SortingResponseWriter.getFieldWriters(SortingResponseWriter.java:241)
    at org.apache.solr.response.SortingResponseWriter.write(SortingResponseWriter.java:120)
    at org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:53)
    at org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:742)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:471)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:214)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(


Does it make sense to enable docValues for a unique field? How expensive is
it?


If I have an existing collection, can I update the schema and optimize the
collection to get docValues enabled for id?


--

Thanks