[jira] [Commented] (SOLR-7535) Add UpdateStream to Streaming API and Streaming Expression

Dennis Gove (JIRA) Mon, 28 Dec 2015 19:40:22 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-7535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15073435#comment-15073435
 ]


Dennis Gove commented on SOLR-7535:
-----------------------------------

I haven't looked at the patch yet, but to answer your questions:

1. The name of the collection in the URL path and the collection in any part of 
the expression can absolutely be different. There are a couple of cases where 
this difference will most likely appear. First, during a join or merge of 
multiple collections, only one of the collection names can be contained in the 
URL. For example:
{code}
innerJoin(
  search(people, fl="personId,name", q="*:*", sort="personId asc"),
  search(address, fl="personId,city", q="state:ny", sort="personId asc"),
  on="personId"
)
{code}
Two collections are being hit but only a single one can be included in the URL. 
There aren't any hard and fast rules about which one should be used in the URL 
and that decision could depend on a lot of different things, especially if the 
collections live in different clouds or on different hardware. 
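
To make the point concrete, here is a minimal sketch of how the same join expression could be sent through either collection's /stream handler. The host, port, and the helper name build_stream_url are assumptions made for illustration; only the expression itself comes from the example above.

```python
from urllib.parse import urlencode

# Assumed base URL of a Solr node; adjust for your own cluster.
SOLR_BASE_URL = "http://localhost:8983/solr"

# The join expression from the example above, unchanged.
JOIN_EXPRESSION = """innerJoin(
  search(people, fl="personId,name", q="*:*", sort="personId asc"),
  search(address, fl="personId,city", q="state:ny", sort="personId asc"),
  on="personId"
)"""


def build_stream_url(url_collection, expression, base_url=SOLR_BASE_URL):
    """Return a /stream request URL that carries the expression in 'expr'.

    url_collection only decides which collection receives the HTTP request;
    the collections that are actually read are named inside the expression.
    """
    query = urlencode({"expr": expression})
    return f"{base_url}/{url_collection}/stream?{query}"


# Either collection can appear in the URL path; the expression is identical.
via_people = build_stream_url("people", JOIN_EXPRESSION)
via_address = build_stream_url("address", JOIN_EXPRESSION)
```

Both URLs carry the same expr value, so which one to use is an operational choice rather than something the expression dictates.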

There is also the possibility that the http request is being sent to what is 
effectively an empty collection which only exists to perform parallel work 
using the streaming api. For example, imagine you want to do some heavy metric 
processing but you don't want to use more resources than necessary on the 
servers where the collections live. You could set up an empty collection on 
totally different hardware whose sole purpose is to act as workers for the real 
collection. This would allow you to do the heavy lifting on 
separate hardware from where the collection actually lives. 
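
A rough sketch of that worker setup, assuming an empty collection named "workers" and the parallel(...) expression with its workers and sort parameters, plus partitionKeys on the inner search, as I recall them from the Solr reference guide. Please check those names against your Solr version; the helper parallel_on_workers is hypothetical.

```python
def parallel_on_workers(worker_collection, inner_expression, workers, sort):
    """Wrap an expression in parallel(...) so it runs on the worker collection.

    worker_collection is the (assumed empty) collection whose nodes do the
    work; the data still comes from the collection named inside
    inner_expression.
    """
    return (
        f"parallel({worker_collection}, {inner_expression}, "
        f'workers="{workers}", sort="{sort}")'
    )


# The data lives in "people"; partitionKeys splits it across the workers.
inner = (
    'search(people, fl="personId,name", q="*:*", '
    'sort="personId asc", partitionKeys="personId")'
)

expression = parallel_on_workers("workers", inner, 4, "personId asc")

# The HTTP request goes to the worker collection, not to "people".
request_path = "/solr/workers/stream"
```

Here the collection in the URL path ("workers") and the collection in the search ("people") differ on purpose.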

For these reasons the collection name is a required parameter in the base 
streams (SolrCloudStream and FacetStream).

2. There are three types of parameters: positional, unnamed, and named. 
*Positional parameters* are those which must exist in some specific location in 
the expression. IIRC, the only positional parameters are the collection names 
in the base streams. This is done because the collection name is critical and 
as such we can say it is the first parameter, regardless of anything else 
included. 

*Unnamed parameters* are those whose meaning can be determined by the content 
of the parameter. For example, 
{code}
rollup(
  search(people, fl="personId,name,age", q="*:*", sort="personId asc"),
  max(age),
  min(age),
  avg(age)
)
{code}
In this example we know that search(...) is a stream and max(...), min(...), 
and avg(...) are metrics. Unnamed parameters are also very useful in situations 
where the number of parameters of that type is not fixed in advance. In the example 
above one could provide any number of metrics and by keeping them unnamed the 
user can just keep adding new metrics without worrying about names. Another 
example of this is with the MergeStream where one can merge 2 or more streams 
together.

*Named parameters* are used when you want to be very clear about what a 
particular parameter is being used for. For example, the "on" parameter in a 
join clause is to indicate that the join should be done on some field (or 
fields). The HashJoinStream is an interesting one because we have a named 
parameter "hashed" whose value needs to be a stream. In this case the 
decision to use a named parameter was made so as to be very clear to the user 
which stream is being hashed and which one is not. Generally it comes down to 
whether a parameter name would make things clearer for the user.
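
The three parameter types can be told apart mechanically. The following is only a heuristic sketch for illustration, not how Solr's own expression parser works: the function names are made up, escaped quotes are not handled, and any bare word is labelled positional regardless of where it appears.

```python
def split_top_level(body):
    """Split a parameter list on commas outside quotes and parentheses."""
    parts, current, depth, in_quotes = [], [], 0, False
    for ch in body:
        if ch == '"':
            in_quotes = not in_quotes
        elif not in_quotes:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif ch == "," and depth == 0:
                # A top-level comma ends the current parameter.
                parts.append("".join(current).strip())
                current = []
                continue
        current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def classify_parameters(expression):
    """Label each top-level parameter of a single expression."""
    body = expression[expression.index("(") + 1:expression.rindex(")")]
    labelled = []
    for param in split_top_level(body):
        head = param.split("(", 1)[0]
        if "=" in head:
            kind = "named"       # name=value, e.g. on="personId"
        elif "(" in param:
            kind = "unnamed"     # nested expression: a stream or a metric
        else:
            kind = "positional"  # bare word, e.g. the collection name
        labelled.append((kind, param))
    return labelled


# The join example from earlier in this comment.
join = """innerJoin(
  search(people, fl="personId,name", q="*:*", sort="personId asc"),
  search(address, fl="personId,city", q="state:ny", sort="personId asc"),
  on="personId"
)"""

for kind, param in classify_parameters(join):
    print(kind, "->", param)
```

Run on the join, the two search(...) streams come out as unnamed and on="personId" as named. Run on one of the search(...) expressions by itself, the collection name comes out as positional.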

> Add UpdateStream to Streaming API and Streaming Expression
> ----------------------------------------------------------
>
>                 Key: SOLR-7535
>                 URL: https://issues.apache.org/jira/browse/SOLR-7535
>             Project: Solr
>          Issue Type: New Feature
>          Components: clients - java, SolrJ
>            Reporter: Joel Bernstein
>            Priority: Minor
>         Attachments: SOLR-7535.patch
>
>
> The ticket adds an UpdateStream implementation to the Streaming API and 
> streaming expressions. The UpdateStream will wrap a TupleStream and send the 
> Tuples it reads to a SolrCloud collection to be indexed.
> This will allow users to pull data from different Solr Cloud collections, 
> merge and transform the streams and send the transformed data to another Solr 
> Cloud collection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
