Joel,

Is this ticket an attempt to solve that? SOLR-7560

On Wed, May 20, 2015 at 11:08 PM, Joel Bernstein <[email protected]> wrote:
> The Streaming Expressions language is a DSL to process docs and emit
> processed data. The parallel SQL engine will also fit into this category.
> Both of these languages compile to the Streaming API, which is basically a
> real-time map-reduce framework that runs on SolrCloud worker nodes.
>
> The Streaming API has excellent data locality for a Map-Reduce engine
> because it performs the map stage and the sorting and partitioning of result
> sets inside of Solr before tuples are streamed. Sorted and partitioned
> tuples are then sent directly to the correct worker nodes to be reduced.
> The Streaming API doesn't follow a strict map/reduce model though. Streams
> are merged and manipulated by wrapping decorator streams around each other.
> So the Streaming API is much more flexible than old-style map/reduce.
>
> But the Streaming API is not designed for parallel iterative algorithms
> like gradient descent. For the parallel iterative case it's much faster to
> leave the data in place and run an embedded algorithm inside of Solr.
>
> At this point data must cross the network if you have multiple worker
> nodes.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, May 20, 2015 at 5:57 PM, Noble Paul <[email protected]> wrote:
>
>> On Wed, May 20, 2015 at 10:17 PM, Yonik Seeley <[email protected]> wrote:
>>
>>> On Wed, May 20, 2015 at 12:04 PM, Noble Paul <[email protected]> wrote:
>>> >
>>> > On Wed, May 20, 2015 at 8:41 PM, Yonik Seeley <[email protected]> wrote:
>>> >>
>>> >> On Wed, May 20, 2015 at 11:06 AM, Noble Paul <[email protected]> wrote:
>>> >> > The problem with streaming is data locality. Data needs to be
>>> >> > transferred across the network to do the processing
>>> >>
>>> >> Nothing saying that you can't process data before it's streamed out,
>>> >> right?
>>> >
>>> > yes, if our query language is expressive enough.
>>> > Sometimes you need a little programming language to achieve that
>>>
>>> Right - and different languages can go on top of the base streaming
>>> stuff... either before or after the streaming step.
>>> There's no reason we can't stream derived data - it doesn't need to be
>>> just documents.
>>>
>> Yes, but is there a way to do it now? If we can have a DSL which can
>> process docs and emit the processed data, then the streaming API may be
>> able to do without data locality.
>>
>> I guess the streaming API runs as a standalone program. Can it not be
>> running somewhere in the Solr cluster itself?
>>
>>>
>>> -Yonik
>>
>
--
-----------------------------------------------------
Noble Paul
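
For anyone following the thread, the flow Joel describes (map, sort, and partition at the shard, then reduce on a worker by wrapping decorator streams around each other) can be sketched roughly as below. This is a toy Python illustration, not Solr's Streaming API: the names shard_stream, merge_stream, rollup_stream, and run_worker are hypothetical, and the crc32 hash partitioning and the sum aggregation are assumptions picked for the example.

```python
import heapq
from itertools import groupby
from operator import itemgetter
from zlib import crc32

# Hypothetical stand-ins; none of these names come from Solr.

def shard_stream(docs, key, num_workers, worker_id):
    """Shard side: keep only the tuples that hash to this worker, sorted by key."""
    mine = [d for d in docs
            if crc32(str(d[key]).encode()) % num_workers == worker_id]
    return iter(sorted(mine, key=itemgetter(key)))

def merge_stream(streams, key):
    """Decorator: merge several key-sorted streams into one key-sorted stream."""
    return heapq.merge(*streams, key=itemgetter(key))

def rollup_stream(stream, key, field):
    """Decorator: reduce a key-sorted stream to one summed tuple per key."""
    for k, group in groupby(stream, key=itemgetter(key)):
        yield {key: k, "sum": sum(t[field] for t in group)}

def run_worker(shards, key, field, num_workers, worker_id):
    """Worker side: wrap the per-shard streams in merge and rollup decorators."""
    parts = [shard_stream(s, key, num_workers, worker_id) for s in shards]
    return list(rollup_stream(merge_stream(parts, key), key, field))

# Two toy "shards"; each worker only ever sees the keys that hash to it.
shards = [
    [{"user": "a", "clicks": 1}, {"user": "b", "clicks": 2}],
    [{"user": "a", "clicks": 3}, {"user": "c", "clicks": 4}],
]
results = [t for w in range(2)
           for t in run_worker(shards, "user", "clicks", 2, w)]
```

Because the partitioning happens before anything is streamed, all tuples for a given key arrive at a single worker already sorted, so the reduce step is a single pass over the merged stream.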
