Re: [jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

Joel Bernstein Wed, 20 May 2015 11:29:12 -0700

SOLR-7560 will provide a parallel SQL engine for SolrCloud. It's designed
to run interactive SQL queries across large clusters of servers. This is
one of the core big data use cases.


Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, May 20, 2015 at 7:07 PM, Noble Paul <noble.p...@gmail.com> wrote:

> Joel, is this ticket an attempt to solve that? SOLR-7560
>
> On Wed, May 20, 2015 at 11:08 PM, Joel Bernstein <joels...@gmail.com>
> wrote:
>
>> The Streaming Expressions language is a DSL to process docs and emit
>> processed data. The parallel SQL engine will also fit into this category.
>> Both of these languages compile to the Streaming API which is basically a
>> real-time map-reduce framework that runs on SolrCloud worker nodes.
>>
>> The Streaming API has excellent data locality for a Map-Reduce engine
>> because it performs the map stage and sorting and partitioning of result
>> sets inside of Solr before tuples are streamed.  Sorted and partitioned
>> tuples are then sent directly to the correct worker nodes to be reduced.
>> The Streaming API doesn't follow a strict map/reduce model though. Streams
>> are merged and manipulated by wrapping decorator streams around each other.
>> So the Streaming API is much more flexible than old-style map/reduce.
>>
>> But the Streaming API is not designed for parallel iterative algorithms
>> like gradient descent. For the parallel iterative case it's much faster to
>> leave the data in place and run an embedded algorithm inside of Solr.
>>
>>
>>
>>
>>
>> At this point data must cross the network if you have multiple worker
>> nodes.
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Wed, May 20, 2015 at 5:57 PM, Noble Paul <noble.p...@gmail.com> wrote:
>>
>>>
>>>
>>> On Wed, May 20, 2015 at 10:17 PM, Yonik Seeley <ysee...@gmail.com>
>>> wrote:
>>>
>>>> On Wed, May 20, 2015 at 12:04 PM, Noble Paul <noble.p...@gmail.com>
>>>> wrote:
>>>> >
>>>> > On Wed, May 20, 2015 at 8:41 PM, Yonik Seeley <ysee...@gmail.com>
>>>> wrote:
>>>> >>
>>>> >> On Wed, May 20, 2015 at 11:06 AM, Noble Paul <noble.p...@gmail.com>
>>>> wrote:
>>>> >> > The problem with streaming is data locality. Data needs to be
>>>> >> > transferred across the network to do the processing.
>>>> >>
>>>> >> Nothing saying that you can't process data before it's streamed out,
>>>> >> right?
>>>> >
>>>> > Yes, if our query language is expressive enough. Sometimes you need a
>>>> > little programming language to achieve that.
>>>>
>>>> Right - and different languages can go on top of the base streaming
>>>> stuff... either before or after the streaming step.
>>>> There's no reason we can't stream derived data - it doesn't need to be
>>>> just documents.
>>>>
>>> Yes, but is there a way to do it now? If we can have a DSL which can
>>> process docs and emit the processed data, then the streaming API may be
>>> able to do without data locality.
>>>
>>> I guess the streaming API runs as a standalone program. Can it not be
>>> running somewhere in the Solr cluster itself?
>>>
>>>>
>>>> -Yonik
>>>
>>>
>>> --
>>> -----------------------------------------------------
>>> Noble Paul
>>>
>>
>>
>
>
> --
> -----------------------------------------------------
> Noble Paul
>
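
The flow described in the quoted thread (map, partition, and sort inside each
shard, then merge and reduce on worker nodes by wrapping decorator streams
around each other) can be sketched in a few lines of Python. This is only an
illustration under stated assumptions, not the Solr Streaming API: the names
shard_stream, merge_stream, rollup_stream, and run_worker, the "user" and
"bytes" fields, and the CRC32-based partitioning are all hypothetical.

```python
import heapq
import zlib
from itertools import groupby
from operator import itemgetter

NUM_WORKERS = 2  # hypothetical number of worker nodes


def partition_of(key: str) -> int:
    # Stable hash, so every shard routes a given key to the same worker.
    return zlib.crc32(key.encode("utf-8")) % NUM_WORKERS


def shard_stream(docs, worker):
    """Runs "inside the shard": map, partition, and sort before streaming."""
    tuples = [(d["user"], d["bytes"]) for d in docs]             # map
    mine = [t for t in tuples if partition_of(t[0]) == worker]   # partition
    return iter(sorted(mine, key=itemgetter(0)))                 # sort by key


def merge_stream(streams):
    """Decorator: merge several key-sorted streams into one sorted stream."""
    return heapq.merge(*streams, key=itemgetter(0))


def rollup_stream(stream):
    """Decorator: sum values per key; relies on the input being key-sorted."""
    for key, group in groupby(stream, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


def run_worker(shards, worker):
    """Reduce side: wrap the rollup around the merge around the shard streams."""
    merged = merge_stream(shard_stream(docs, worker) for docs in shards)
    return list(rollup_stream(merged))


if __name__ == "__main__":
    # Two hypothetical shards holding a few made-up documents.
    shards = [
        [{"user": "a", "bytes": 1}, {"user": "b", "bytes": 2}],
        [{"user": "a", "bytes": 3}, {"user": "c", "bytes": 4}],
    ]
    for worker in range(NUM_WORKERS):
        print(worker, run_worker(shards, worker))
```

Because each shard sorts and partitions its own tuples, a worker only has to
merge already-sorted inputs and can reduce them in a single pass, which is the
data-locality point made in the thread.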
