SOLR-7560 will provide a parallel SQL engine for SolrCloud. It's designed to run interactive SQL queries across large clusters of servers. This is one of the core big data use cases.

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, May 20, 2015 at 7:07 PM, Noble Paul <noble.p...@gmail.com> wrote:

> Joel, is this ticket an attempt to solve that? SOLR-7560
>
> On Wed, May 20, 2015 at 11:08 PM, Joel Bernstein <joels...@gmail.com>
> wrote:
>
>> The Streaming Expressions language is a DSL to process docs and emit
>> processed data. The parallel SQL engine will also fit into this category.
>> Both of these languages compile to the Streaming API, which is basically a
>> real-time map-reduce framework that runs on SolrCloud worker nodes.
>>
>> The Streaming API has excellent data locality for a Map-Reduce engine
>> because it performs the map stage and the sorting and partitioning of
>> result sets inside of Solr before tuples are streamed. Sorted and
>> partitioned tuples are then sent directly to the correct worker nodes to
>> be reduced. The Streaming API doesn't follow a strict map/reduce model
>> though. Streams are merged and manipulated by wrapping decorator streams
>> around each other. So the Streaming API is much more flexible than
>> old-style map/reduce.
>>
>> But the Streaming API is not designed for parallel iterative algorithms
>> like gradient descent. For the parallel iterative case it's much faster to
>> leave the data in place and run an embedded algorithm inside of Solr.
>>
>> At this point data must cross the network if you have multiple worker
>> nodes.
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Wed, May 20, 2015 at 5:57 PM, Noble Paul <noble.p...@gmail.com> wrote:
>>
>>> On Wed, May 20, 2015 at 10:17 PM, Yonik Seeley <ysee...@gmail.com>
>>> wrote:
>>>
>>>> On Wed, May 20, 2015 at 12:04 PM, Noble Paul <noble.p...@gmail.com>
>>>> wrote:
>>>> >
>>>> > On Wed, May 20, 2015 at 8:41 PM, Yonik Seeley <ysee...@gmail.com>
>>>> > wrote:
>>>> >>
>>>> >> On Wed, May 20, 2015 at 11:06 AM, Noble Paul <noble.p...@gmail.com>
>>>> >> wrote:
>>>> >> > The problem with streaming is data locality. Data needs to be
>>>> >> > transferred across the network to do the processing.
>>>> >>
>>>> >> Nothing saying that you can't process data before it's streamed out,
>>>> >> right?
>>>> >
>>>> > Yes, if our query language is expressive enough. Sometimes you need a
>>>> > little programming language to achieve that.
>>>>
>>>> Right - and different languages can go on top of the base streaming
>>>> stuff... either before or after the streaming step.
>>>> There's no reason we can't stream derived data - it doesn't need to be
>>>> just documents.
>>>>
>>> Yes, but is there a way to do it now? If we can have a DSL which can
>>> process docs and emit the processed data, then the streaming API may be
>>> able to do without data locality.
>>>
>>> I guess the streaming API runs as a standalone program. Can it not be
>>> running somewhere in the Solr cluster itself?
>>>
>>>>
>>>> -Yonik
>>>>
>>>
>>> --
>>> -----------------------------------------------------
>>> Noble Paul
>>>
>
> --
> -----------------------------------------------------
> Noble Paul
>