My blog is pretty out of date at this point, unfortunately. I need to get
some better examples published.

Also, a huge amount of work went into the Solr 6 Streaming API and
Streaming Expressions that makes them much easier to work with. In Solr 6.1
you'll be able to test Streaming Expressions from the Solr admin console,
which should be very helpful.

Since you're planning on using the joins, performance will be very much
driven by the number of shards and replicas pushing the streams and the
number of workers performing the join.
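For reference, a parallel join along these lines is the typical shape of such an expression (a sketch only: the collection names, fields, worker count, and zkHost are placeholders, not from this thread):

```
parallel(workerCollection,
  innerJoin(
    search(people, q="*:*", fl="personId,name", sort="personId asc",
           partitionKeys="personId"),
    search(pets, q="*:*", fl="petId,ownerId", sort="ownerId asc",
           partitionKeys="ownerId"),
    on="personId=ownerId"),
  workers="4",
  zkHost="localhost:9983",
  sort="personId asc")
```

The partitionKeys parameter is what lets each worker pull a consistent hash partition of both streams, so the join runs in parallel and performance scales with the shards, replicas, and workers pushing the streams.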

If you're still having problems with Solr 6.0, feel free to post the
Expression you're using and I, or others, can help debug the issue.


Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Apr 26, 2016 at 2:29 PM, sudsport s <sudssf2...@gmail.com> wrote:

> I see that some work was done to remove the stream handler from the
> config. So is enabling the stream handler still a security issue?
>
> https://issues.apache.org/jira/browse/SOLR-8262
>
> On Tue, Apr 26, 2016 at 11:14 AM, sudsport s <sudssf2...@gmail.com> wrote:
>
>> I am using a Solr 5.3.1 server and SolrJ 5.5 on the client. I will try
>> with SolrJ 6.0.
>>
>> On Tue, Apr 26, 2016 at 11:12 AM, Susmit Shukla <shukla.sus...@gmail.com> wrote:
>>
>>> Which SolrJ version are you using? Could you try with SolrJ 6.0?
>>>
>>> On Tue, Apr 26, 2016 at 10:36 AM, sudsport s <sudssf2...@gmail.com> wrote:
>>>
>>> > @Joel
>>> > >Can you describe how you're planning on using Streaming?
>>> >
>>> > I am mostly using it for the distributed join case. We were planning
>>> > to use similar logic (hash id and join) in Spark for our use case,
>>> > but since the data is stored in Solr, I will be using a Solr stream
>>> > to perform the same operation.
>>> >
>>> > I have similar use cases to build probabilistic data structures
>>> > while streaming results. I might have to spend some time exploring
>>> > query optimization (while doing a join, decide the sort order, etc.).
>>> >
>>> > Please let me know if you have any feedback.
>>> >
>>> > On Tue, Apr 26, 2016 at 10:30 AM, sudsport s <sudssf2...@gmail.com> wrote:
>>> >
>>> > > Thanks @Reth, yes that was one of my concerns. I will look at the
>>> > > JIRA you mentioned.
>>> > >
>>> > > Thanks Joel,
>>> > > I used some of the examples for the streaming client from your
>>> > > blog. I got a basic tuple stream working, but I get the following
>>> > > exception while running a parallel stream.
>>> > >
>>> > >
>>> > > java.io.IOException: java.util.concurrent.ExecutionException:
>>> > > org.noggit.JSONParser$ParseException: JSON Parse Error: char=<,position=0
>>> > > BEFORE='<' AFTER='html> <head> <meta http-equiv="Content-'
>>> > >   at org.apache.solr.client.solrj.io.stream.CloudSolrStream.openStreams(CloudSolrStream.java:332)
>>> > >   at org.apache.solr.client.solrj.io.stream.CloudSolrStream.open(CloudSolrStream.java:231)
>>> > >
>>> > >
>>> > >
>>> > > I tried to look into the Solr logs, and after turning on debug
>>> > > mode I found the following:
>>> > > POST /solr/collection_shard20_replica1/stream HTTP/1.1
>>> > > "HTTP/1.1 404 Not Found[\r][\n]"
>>> > >
>>> > >
>>> > > It looks like the parallel stream is trying to access /stream on
>>> > > each shard. Can someone tell me how to enable the stream handler? I
>>> > > have the export handler enabled. I will look at the latest
>>> > > solrconfig to see if I can turn it on.
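For reference: the 404 above means the /stream handler is not registered in that collection's solrconfig.xml. The stock Solr 5.x+ sample configs register it roughly as follows (a sketch; verify against the solrconfig.xml shipped with your exact version):

```xml
<!-- solrconfig.xml: register the Streaming API handler -->
<requestHandler name="/stream" class="solr.StreamHandler">
  <lst name="invariants">
    <!-- streaming responses are JSON and must not be re-distributed -->
    <str name="wt">json</str>
    <str name="distrib">false</str>
  </lst>
</requestHandler>
```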
>>> > >
>>> > >
>>> > >
>>> > > @Joel, I am running sizing exercises already. I will run a new one
>>> > > with Solr 5.5+ and docValues enabled on id.
>>> > >
>>> > > BTW, Solr streaming has amazing response times. Thanks for making
>>> > > it so FAST!
>>> > >
>>> > > On Mon, Apr 25, 2016 at 10:54 AM, Joel Bernstein <joels...@gmail.com> wrote:
>>> > >
>>> > >> Can you describe how you're planning on using Streaming? I can
>>> > >> provide some feedback on how it will perform for your use case.
>>> > >>
>>> > >> When scaling out Streaming you'll get large performance boosts
>>> > >> when you increase the number of shards, replicas and workers. This
>>> > >> is particularly true if you're doing parallel relational algebra
>>> > >> or map/reduce operations.
>>> > >>
>>> > >> As far as DocValues being expensive with unique fields, you'll
>>> > >> want to do a sizing exercise to see how many documents per shard
>>> > >> work best for your use case. There are different docValues
>>> > >> implementations that will allow you to trade off memory for
>>> > >> performance.
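For reference, the per-field implementation choice mentioned above is made in the schema once the schema-aware codec factory is enabled (a sketch; the field type name is hypothetical, and the set of available format names varies by Lucene version):

```xml
<!-- solrconfig.xml: allow per-field codec/docValuesFormat selection -->
<codecFactory class="solr.SchemaCodecFactory"/>

<!-- schema.xml: a string type whose docValues are held in memory -->
<fieldType name="string_dv_mem" class="solr.StrField"
           docValuesFormat="Memory"/>
```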
>>> > >>
>>> > >> Joel Bernstein
>>> > >> http://joelsolr.blogspot.com/
>>> > >>
>>> > >> On Mon, Apr 25, 2016 at 3:30 AM, Reth RM <reth.ik...@gmail.com> wrote:
>>> > >>
>>> > >> > Hi,
>>> > >> >
>>> > >> > So, is the concern related to the same field value being stored
>>> > >> > twice, with stored=true and docValues=true? If that is the case,
>>> > >> > there is a JIRA relevant to this, now fixed [1]. If you upgrade
>>> > >> > to 5.5/6.0, it is possible to read non-stored fields from the
>>> > >> > docValues index; check it out.
>>> > >> >
>>> > >> >
>>> > >> > [1] https://issues.apache.org/jira/browse/SOLR-8220
>>> > >> >
>>> > >> > On Mon, Apr 25, 2016 at 9:44 AM, sudsport s <sudssf2...@gmail.com> wrote:
>>> > >> >
>>> > >> > > Thanks Erick for the reply.
>>> > >> > >
>>> > >> > > Since I was storing id (it's a stored field), my guess is that
>>> > >> > > after enabling docValues it will be stored in two places. Also,
>>> > >> > > as per my understanding, docValues are great when you have
>>> > >> > > values which repeat; I am not sure how beneficial they would be
>>> > >> > > for a unique id field.
>>> > >> > > I am looking at a collection of a few hundred billion
>>> > >> > > documents; that is the reason I really want to care about the
>>> > >> > > expense from the design phase.
>>> > >> > >
>>> > >> > >
>>> > >> > >
>>> > >> > >
>>> > >> > > On Sun, Apr 24, 2016 at 7:24 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>> > >> > >
>>> > >> > > > In a word, "yes".
>>> > >> > > >
>>> > >> > > > DocValues aren't particularly expensive, or expensive at
>>> > >> > > > all. The idea is that when you sort by a field or facet, the
>>> > >> > > > field has to be "uninverted", which builds the entire
>>> > >> > > > structure in Java's JVM (this is when the field is _not_
>>> > >> > > > docValues).
>>> > >> > > >
>>> > >> > > > DocValues essentially serialize this structure to disk. So
>>> > >> > > > your on-disk index size is larger, but that size is MMapped
>>> > >> > > > rather than stored on Java's heap.
>>> > >> > > >
>>> > >> > > > Really, the question I'd have to ask though is "why do you
>>> > >> > > > care about the expense?". If you have a functional
>>> > >> > > > requirement that has to be served by returning the id via
>>> > >> > > > the /export handler, you really have no choice.
>>> > >> > > >
>>> > >> > > > Best,
>>> > >> > > > Erick
>>> > >> > > >
>>> > >> > > >
>>> > >> > > > On Sun, Apr 24, 2016 at 9:55 AM, sudsport s <sudssf2...@gmail.com> wrote:
>>> > >> > > > > I was trying to use Streaming for reading a basic tuple
>>> > >> > > > > stream, sorting by id asc, and I am getting the following
>>> > >> > > > > exception.
>>> > >> > > > >
>>> > >> > > > > I am using the export search handler as per
>>> > >> > > > > https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets
>>> > >> > > > >
>>> > >> > > > > null:java.io.IOException: id must have DocValues to use this feature.
>>> > >> > > > >         at org.apache.solr.response.SortingResponseWriter.getFieldWriters(SortingResponseWriter.java:241)
>>> > >> > > > >         at org.apache.solr.response.SortingResponseWriter.write(SortingResponseWriter.java:120)
>>> > >> > > > >         at org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:53)
>>> > >> > > > >         at org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:742)
>>> > >> > > > >         at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:471)
>>> > >> > > > >         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:214)
>>> > >> > > > >         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)
>>> > >> > > > >         at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
>>> > >> > > > >         at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
>>> > >> > > > >         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>>> > >> > > > >         at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
>>> > >> > > > >         at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
>>> > >> > > > >         at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>>> > >> > > > >         at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>>> > >> > > > >         at org.eclipse.jetty.server.session.SessionHandler.doScope(
>>> > >> > > > >
>>> > >> > > > >
>>> > >> > > > > Does it make sense to enable docValues for a unique
>>> > >> > > > > field? How expensive is it?
>>> > >> > > > >
>>> > >> > > > > If I have an existing collection, can I update the schema
>>> > >> > > > > and optimize the collection to get docValues enabled for
>>> > >> > > > > id?
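For reference, the schema change itself is a single attribute on the field definition (a sketch; the exact field type is from your schema). Note that docValues are written at index time, so existing documents must be reindexed; an optimize alone will not backfill them:

```xml
<!-- schema.xml: enable docValues on the unique key field -->
<field name="id" type="string" indexed="true" stored="true"
       docValues="true" required="true"/>
```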
>>> > >> > > > >
>>> > >> > > > >
>>> > >> > > > > --
>>> > >> > > > >
>>> > >> > > > > Thanks
>>> > >> > > >
>>> > >> > >
>>> > >> >
>>> > >>
>>> > >
>>> > >
>>> >
>>>
>>
>>
>
