[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2018-02-07 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356095#comment-16356095
 ] 

Noble Paul commented on SOLR-5069:
--

The Streaming API is the way to go.

> MapReduce for SolrCloud
> ---
>
> Key: SOLR-5069
> URL: https://issues.apache.org/jira/browse/SOLR-5069
> Project: Solr
>  Issue Type: New Feature
>  Components: SolrCloud
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
>
> Solr currently does not have a way to run long-running computational tasks 
> across the cluster. We can piggyback on the mapreduce paradigm so that users 
> have a smooth learning curve.
>  * The mapreduce component will be written as a RequestHandler in Solr
>  * Works only in SolrCloud mode. (No support for standalone mode) 
>  * Users can write MapReduce programs in Javascript or Java. First cut would 
> be JS ( ? )
> h1. sample word count program
> h2. how to invoke?
> http://host:port/solr/collection-x/mapreduce?map=map-script&reduce=reduce-script&sink=collectionX
> h3. params 
> * map : a Javascript implementation of the map program
> * reduce : a Javascript implementation of the reduce program
> * sink : the collection to which the output is written. If this is not 
> passed, the request waits till completion and the output of the reduce 
> program is emitted as a standard Solr response; the request is redirected to 
> the "reduce node", where it waits till the process is complete. If the sink 
> param is passed, the response will contain an id of the run, which can be 
> used to query the status in another command.
> * reduceNode : node name where the reduce is run. If not passed, an 
> arbitrary node is chosen.
> The node which receives the command first identifies one replica from 
> each slice where the map program is executed. It also identifies another 
> node from the same collection where the reduce program is run. Each run is 
> given an id, and the details of the nodes participating in the run are 
> written to ZK (as an ephemeral node). 
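> For illustration, the two modes of invocation might look like this (the 
> parameter values are just the placeholder names used above; nothing here 
> was ever implemented):
> {code}
> # synchronous: wait for the run and return the reduce output as the response
> http://host:port/solr/collection-x/mapreduce?map=map-script&reduce=reduce-script
> # asynchronous: write results to the sink collection and respond with a run id
> http://host:port/solr/collection-x/mapreduce?map=map-script&reduce=reduce-script&sink=collectionX
> {code}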
> h4. map script 
> {code:JavaScript}
> var res = $.streamQuery($.param("q")); // this is not run across the cluster,
>                                        // only on this index
> while(res.hasMore()){
>   var doc = res.next();
>   map(doc);
> }
> function map(doc) {
>   var txt = doc.get("txt"); // the field on which word count is performed
>   var words = txt.split(" ");
>   for(var i = 0; i < words.length; i++){
>     $.emit(words[i], {'count': 1}); // this sends the pair over to the reduce host
>   }
> }
> {code}
> Essentially two threads are created on the 'map' hosts: one for running the 
> program and the other for coordinating with the 'reduce' host. The emitted 
> maps are streamed live over an HTTP connection to the reduce program.
> h4. reduce script
> This script is run on one node. That node accepts HTTP connections from the 
> map nodes, and the 'maps' that are sent are collected in a queue, which is 
> polled and fed into the reduce program. It also keeps the 'reduced' data in 
> memory till the whole run is complete. It expects a "done" message from all 
> 'map' nodes before it declares the task complete. After the reduce program 
> has been executed for all the input, it proceeds to write the result to the 
> 'sink' collection, or the result is written straight out to the response.
> {code:JavaScript}
> var pair = $.nextMap();
> var reduced = $.getCtx().getReducedMap(); // a hashmap of values reduced so far
> var count = reduced.get(pair.key());
> if(count === null) {
>   count = {'count': 0};
>   reduced.put(pair.key(), count);
> }
> count.count += pair.val().count;
> {code}
> h4. example output
> {code:JavaScript}
> {
>   "result": {
>     "wordx": {
>       "count": 15876765
>     },
>     "wordy": {
>       "count": 24657654
>     }
>   }
> }
> {code}
> TBD
> * The format in which the output is written to the target collection; I 
> assume the reducedMap will have values mapping to the schema of the collection.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2015-05-20 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14552175#comment-14552175
 ] 

Noble Paul commented on SOLR-5069:
--

Is there some low-hanging fruit that we can achieve easily?




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




Re: [jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2015-05-20 Thread Joel Bernstein
If you're going to be shuffling data to multiple worker nodes, then data
will be crossing the network. Shuffling provides the foundation for certain
parallel computing tasks, such as performing large scale parallel
relational algebra.

For machine learning algorithms we'll likely need a parallel iterative
design which leaves the data in place.

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, May 20, 2015 at 4:11 PM, Yonik Seeley ysee...@gmail.com wrote:

 On Wed, May 20, 2015 at 11:06 AM, Noble Paul noble.p...@gmail.com wrote:
  The problem with streaming is data locality. Data needs to be transferred
  across network to do the processing

 Nothing saying that you can't process data before it's streamed out, right?

 -Yonik





Re: [jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2015-05-20 Thread Noble Paul
On Wed, May 20, 2015 at 8:41 PM, Yonik Seeley ysee...@gmail.com wrote:

 On Wed, May 20, 2015 at 11:06 AM, Noble Paul noble.p...@gmail.com wrote:
  The problem with streaming is data locality. Data needs to be transferred
  across network to do the processing

 Nothing saying that you can't process data before it's streamed out, right?

Yes, if our query language is expressive enough. Sometimes you need a
little programming language to achieve that.


 -Yonik





-- 
-
Noble Paul


Re: [jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2015-05-20 Thread Yonik Seeley
On Wed, May 20, 2015 at 12:04 PM, Noble Paul noble.p...@gmail.com wrote:

 On Wed, May 20, 2015 at 8:41 PM, Yonik Seeley ysee...@gmail.com wrote:

 On Wed, May 20, 2015 at 11:06 AM, Noble Paul noble.p...@gmail.com wrote:
  The problem with streaming is data locality. Data needs to be
  transferred
  across network to do the processing

 Nothing saying that you can't process data before it's streamed out,
 right?

 yes, if our query language is expressive enough . Sometimes you need a
 little programming language to achieve that

Right - and different languages can go on top of the base streaming
stuff... either before or after the streaming step.
There's no reason we can't stream derived data - it doesn't need to be
just documents.

-Yonik




Re: [jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2015-05-20 Thread Noble Paul
On Wed, May 20, 2015 at 10:17 PM, Yonik Seeley ysee...@gmail.com wrote:

 On Wed, May 20, 2015 at 12:04 PM, Noble Paul noble.p...@gmail.com wrote:
 
  On Wed, May 20, 2015 at 8:41 PM, Yonik Seeley ysee...@gmail.com wrote:
 
  On Wed, May 20, 2015 at 11:06 AM, Noble Paul noble.p...@gmail.com
 wrote:
   The problem with streaming is data locality. Data needs to be
   transferred
   across network to do the processing
 
  Nothing saying that you can't process data before it's streamed out,
  right?
 
  yes, if our query language is expressive enough . Sometimes you need a
  little programming language to achieve that

 Right - and different languages can go on top of the base streaming
 stuff... either before or after the streaming step.
 There's no reason we can't stream derived data - it doesn't need to be
 just documents.

Yes, but is there a way to do it now? If we can have a DSL which can
process docs and emit the processed data, then the streaming API may be
able to do without data locality.

I guess the streaming API runs as a standalone program. Can it not be
running somewhere in the Solr cluster itself?


 -Yonik





-- 
-
Noble Paul


Re: [jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2015-05-20 Thread Joel Bernstein
The Streaming Expressions language is a DSL to process docs and emit
processed data. The parallel SQL engine will also fit into this category.
Both of these languages compile to the Streaming API, which is basically a
real-time map-reduce framework that runs on SolrCloud worker nodes.

The Streaming API has excellent data locality for a Map-Reduce engine
because it performs the map stage, and the sorting and partitioning of
result sets, inside Solr before tuples are streamed. Sorted and partitioned
tuples are then sent directly to the correct worker nodes to be reduced.
The Streaming API doesn't follow a strict map/reduce model though. Streams
are merged and manipulated by wrapping decorator streams around each other.
So the Streaming API is much more flexible than old-style map/reduce.

But the Streaming API is not designed for parallel iterative algorithms
like gradient descent. For the parallel iterative case it's much faster to
leave the data in place and run the algorithm embedded inside Solr.

At this point data must cross the network if you have multiple worker nodes.
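For concreteness, a rough sketch of the kind of expression described above:
a parallel rollup where the map phase (search, sort, partition) runs inside
Solr and the sorted partitions are reduced on worker nodes. The collection,
field, and worker names here are hypothetical, and the exact syntax settled
in later Solr releases.

{code}
parallel(workerCollection,
  rollup(
    search(collection1,
           q="*:*",
           fl="word_s",
           sort="word_s asc",
           qt="/export",
           partitionKeys="word_s"),
    over="word_s",
    count(*)),
  workers="2",
  sort="word_s asc")
{code}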

Joel Bernstein
http://joelsolr.blogspot.com/




Re: [jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2015-05-20 Thread Noble Paul
Joel, is this ticket an attempt to solve that? SOLR-7560






-- 
-
Noble Paul


Re: [jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2015-05-20 Thread Joel Bernstein
SOLR-7560 will provide a parallel SQL engine for SolrCloud. It's designed
to run interactive SQL queries across large clusters of servers. This is
one of the core big data use cases.
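As a hypothetical example of the kind of query this targets (the collection
and field names are made up), an interactive aggregation pushed down to
SolrCloud might look like:

{code:sql}
SELECT word_s, COUNT(*) AS cnt
FROM collection1
GROUP BY word_s
ORDER BY cnt DESC
LIMIT 10
{code}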

Joel Bernstein
http://joelsolr.blogspot.com/




Re: [jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2015-05-20 Thread Noble Paul
The problem with streaming is data locality. Data needs to be transferred
across the network to do the processing.
On May 20, 2015 8:15 PM, Yonik Seeley (JIRA) j...@apache.org wrote:


 [
 https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14552414#comment-14552414
 ]

 Yonik Seeley commented on SOLR-5069:
 

 Looks like SOLR-6526 (Solr Streaming API) is pretty much map-reduce?
 And then on top is SOLR-7377 (Solr Streaming Expressions) and SOLR-7560
 (Parallel SQL)





Re: [jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2015-05-20 Thread Yonik Seeley
On Wed, May 20, 2015 at 11:06 AM, Noble Paul noble.p...@gmail.com wrote:
 The problem with streaming is data locality. Data needs to be transferred
 across network to do the processing

Nothing saying that you can't process data before it's streamed out, right?

-Yonik




[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2015-05-20 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14552414#comment-14552414
 ] 

Yonik Seeley commented on SOLR-5069:


Looks like SOLR-6526 (Solr Streaming API) is pretty much map-reduce?
And then on top is SOLR-7377 (Solr Streaming Expressions) and SOLR-7560 
(Parallel SQL)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2015-05-18 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14549383#comment-14549383
 ] 

Markus Jelsma commented on SOLR-5069:
-

[~ab] anything new to add to this topic? I am sure interest is still here :)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2014-02-25 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911597#comment-13911597
 ] 

ASF subversion and git services commented on SOLR-5069:
---

Commit 1571702 from [~noble.paul] in branch 'dev/trunk'
[ https://svn.apache.org/r1571702 ]

SOLR-5069 fixing test failure




--
This message was sent by Atlassian JIRA
(v6.1.5#6160)




[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2014-02-25 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911600#comment-13911600
 ] 

ASF subversion and git services commented on SOLR-5069:
---

Commit 1571703 from [~noble.paul] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1571703 ]

SOLR-5069 fixing test failure




--
This message was sent by Atlassian JIRA
(v6.1.5#6160)




[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2013-07-26 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13720869#comment-13720869
 ] 

Andrzej Bialecki  commented on SOLR-5069:
-

See here for an explanation of how this works in MongoDB: 
http://isurues.wordpress.com/2013/05/28/what-is-re-reduce-in-mongodb-map-reduce/
 . CouchDB also uses the same reduce function, only it passes a boolean flag to 
inform the function whether a particular invocation is the first reduce (acting 
on values straight from mappers) or a re-reduce (acting on results of previous 
partial reduces).
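In other words, the reduce function is written so that its output has the
same shape as its input values and can be fed back in. A minimal JavaScript
sketch of that shape (illustrative only, not an actual Solr API):

{code:JavaScript}
// Associative, "re-reducible" word-count reduce: the returned {count: n}
// matches the shape of each input value, so partial results produced by
// one pass can safely be reduced again later (MongoDB-style re-reduce).
function reduce(key, values) {
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    total += values[i].count;
  }
  return {count: total};
}
{code}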


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2013-07-25 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13719540#comment-13719540
 ] 

Otis Gospodnetic commented on SOLR-5069:


This is great to see - I asked about this in SOLR-1301 - 
https://issues.apache.org/jira/browse/SOLR-1301?focusedCommentId=13678948&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13678948
 :)

{quote}
The node which received the command would first identify one replica from each 
slice where the map program is executed . It will also identify one another 
node from the same collection where the reduce program is run. Each run is 
given an id and the details of the nodes participating in the run will be 
written to ZK (as an ephemeral node).
{quote}

Lukas and Andrzej have already addressed my immediate thought when I read the 
above, but they talked about using the cost approach, limiting resource use, 
and such.  But I think we should learn from others' mistakes and choices here.  
Is it good enough to limit resources?  Just limiting resources means that any 
concurrent queries *will* be affected - the question is just how much.  Would 
it be better to mark some nodes as eligible for running analytical/batch/MR 
jobs + search or eligible for running analytical/batch/MR jobs and NO search 
- i.e. nodes that are a part of the SolrCloud cluster, but run ONLY these jobs 
and do NOT handle queries?

I think we saw DataStax do this with Cassandra and Brisk, and we see it with 
people using HBase replication to dedicate one HBase cluster to 
real-time/interactive access and the other cluster to running jobs.



[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2013-07-25 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13719550#comment-13719550
 ] 

Noble Paul commented on SOLR-5069:
--

bq. Would it be better to mark some nodes as eligible for running 
analytical/batch/MR jobs + search

Instead of marking certain nodes as (eligible for X), how about passing the node 
names in the request itself? That way we are not introducing some kind of 
'role' in the system but still get all the benefits?


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2013-07-25 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13719554#comment-13719554
 ] 

Otis Gospodnetic commented on SOLR-5069:


bq. Instead of marking certain nodes as (eligible for X), how about passing the 
node names in the request itself? That way we are not introducing some kind of 
'role' in the system but still get all the benefits?

But if searches are running on *all* nodes, then the above doesn't achieve 
complete separation of search vs. job work.




[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2013-07-25 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13719556#comment-13719556
 ] 

Noble Paul commented on SOLR-5069:
--

bq. But if searches are running on all nodes, then the above doesn't achieve 
complete separation of search vs. job work.

Makes sense...




[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2013-07-25 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13719563#comment-13719563
 ] 

Otis Gospodnetic commented on SOLR-5069:


bq. It should be something we should think of as a feature of Solr. Being a 
part of a cluster but not taking part in certain roles (leader/search/jobs 
etc.)

Yeah, perhaps something like that. We already have Overseer and Leader, which 
are also roles of some sort, though those are completely managed by SolrCloud, 
meaning SolrCloud/ZK do the node election and node assignment for these 
particular roles, AFAIK. For the search vs. job (vs. mixed) roles, though, the 
assignment is likely to come from a human+ZK.




[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2013-07-25 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13719593#comment-13719593
 ] 

Yonik Seeley commented on SOLR-5069:


bq. It should be something we should think of as a feature of Solr.

Right - it's unrelated to this feature.  We've already kicked around the idea 
of roles for nodes for years now (like in SOLR-2765), and they would be 
useful in many contexts.  Someone actually needs to do the work though... 
patches welcome ;-)





[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2013-07-25 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13719612#comment-13719612
 ] 

Andrzej Bialecki  commented on SOLR-5069:
-

bq. some things will be completely streamable w/o any need for buffering... 
think of re-implementing the terms component here - we can access terms in 
sorted order so the reducer would simply need to do a merge sort on the streams 
and then stream that result back!

It could probably be implemented as a special case, because it strongly 
depends on the map() output being sorted. However, in the general case the 
reducer must wait for all mappers to finish, because mappers may produce 
out-of-order and non-unique keys.
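A minimal sketch of that sorted special case, assuming each mapper exposes an 
iterator that returns {key, val} pairs in ascending key order (and null when 
exhausted); since a key can never reappear once every stream has moved past 
it, each merged result can be streamed out immediately:
{code:JavaScript}
// k-way merge of sorted mapper streams; 'emit' streams results back.
function mergeSortedStreams(streams, emit) {
  // Prime one cursor per mapper stream.
  var heads = streams.map(function (s) { return s.next(); });
  for (;;) {
    // Find the smallest key among the streams that still have data.
    var minIdx = -1;
    for (var i = 0; i < heads.length; i++) {
      if (heads[i] !== null && (minIdx === -1 || heads[i].key < heads[minIdx].key)) {
        minIdx = i;
      }
    }
    if (minIdx === -1) return; // every stream is exhausted
    var key = heads[minIdx].key;
    var total = 0;
    // Fold together every stream currently positioned at this key.
    for (var j = 0; j < heads.length; j++) {
      while (heads[j] !== null && heads[j].key === key) {
        total += heads[j].val.count;
        heads[j] = streams[j].next();
      }
    }
    // The key can never reappear, so it is safe to stream it out now.
    emit(key, {count: total});
  }
}
{code}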

+1 on node roles, as a separate issue - it should not hold up this issue.


[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2013-07-25 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13719654#comment-13719654
 ] 

Andrzej Bialecki  commented on SOLR-5069:
-

An alternative solution for minimizing the amount of data held in memory 
during the reduce phase is to use re-reduce - a reduce-side combiner, in 
Hadoop terminology.

This is an additional function that runs on the reducer and periodically 
performs intermediate reductions of the values already accumulated for a key, 
preserving the intermediate results (and discarding the accumulated values). 
This function does not emit anything to the final output. The final reduction 
function then operates on a mix of the values that arrived since the last 
intermediate reduction, plus all results of previous intermediate reductions.

This works well for simple aggregations (where the additional function may in 
fact be a copy of the reduce function) but may not be suitable for all classes 
of problems.
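A sketch of that idea for the wordcount case; the function and variable names 
are illustrative, not part of the proposal. Memory stays bounded by the number 
of distinct keys rather than by the number of emitted pairs:
{code:JavaScript}
var accumulated = {}; // key -> array of raw values received so far
var partials = {};    // key -> result of previous intermediate reductions

// Called for every pair arriving from a mapper.
function accept(key, val) {
  (accumulated[key] = accumulated[key] || []).push(val);
}

// Run periodically on the reduce node: collapse raw values into partials.
function reReduce() {
  for (var key in accumulated) {
    var partial = partials[key] || {count: 0};
    var vals = accumulated[key];
    for (var i = 0; i < vals.length; i++) {
      partial.count += vals[i].count;
    }
    partials[key] = partial;
    delete accumulated[key]; // discard raw values, keep only the partial
  }
}

// After all mappers report "done": fold in the stragglers and emit.
function finalReduce() {
  reReduce();
  return partials;
}
{code}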


[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2013-07-24 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13718058#comment-13718058
 ] 

Andrzej Bialecki  commented on SOLR-5069:
-

bq. why can't reduce start as soon as the mappers start producing?

Because the reducer needs to operate on the complete list of values for a 
given key. Take wordcount, for example: not waiting for all mappers would 
cause the reducer to emit only partial aggregations. In general, mappers 
should be free to emit arbitrary keys, so new values may appear at any moment 
until all mappers are finished.
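To make the point concrete, a tiny sketch of the gate this implies on the 
reduce node (all names invented for illustration): values can be folded 
eagerly as they arrive, but nothing can be declared final until every map node 
has reported "done":
{code:JavaScript}
var doneCount = 0; // how many map nodes have reported "done"

// Called for every message arriving on the reduce node.
function onMapperMessage(msg, totalMappers, reduced, writeOut) {
  if (msg.done) {
    doneCount++;
    if (doneCount === totalMappers) {
      writeOut(reduced); // only now are all aggregations complete
    }
    return;
  }
  // Fold the value eagerly; it just cannot be emitted yet.
  var entry = reduced[msg.key] || (reduced[msg.key] = {count: 0});
  entry.count += msg.val.count;
}
{code}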




[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2013-07-24 Thread Lukas Vlcek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13718068#comment-13718068
 ] 

Lukas Vlcek commented on SOLR-5069:
---

Hello,

Maybe OT, but in spite of the fact that having MapReduce in a (near) real-time 
[clustered] search server sounds very interesting and indeed useful, is this 
something that is good to put into the system?

I might be naive, but as far as I understand, MR tasks can be both RAM and IO 
(disk, network) intensive. How can one tune the system for fast 
indexing/search performance if the additional load put on the system by MR is 
hardly predictable?

Not to mention the fact that MR is like a hammer. And if you put a hammer into 
the hands of users, then everything starts looking like a nail... you know the 
story.

Regards,
Lukas


[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2013-07-24 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13718067#comment-13718067
 ] 

Noble Paul commented on SOLR-5069:
--

I guess I haven't explained it correctly.

The reducer output is available only after all the mappers are done, but the 
reducer is started along with the mappers and works in parallel.




[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2013-07-24 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13718152#comment-13718152
 ] 

Noble Paul commented on SOLR-5069:
--

bq. MR tasks can be both RAM and IO (disk, network) intensive

You are right. We won't recommend using the cluster for MR and search/indexing 
at the same time; users will definitely see degraded performance. But then, 
that is expected, right? And how would it be better to set up another cluster 
(Hadoop) for MR if you need it?




[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2013-07-24 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13718434#comment-13718434
 ] 

Andrzej Bialecki  commented on SOLR-5069:
-

bq. The reducer output is available only after all the mappers are done. But 
the reducer is started along with the mappers and is working in parallel.

[~noble.paul]: Sure, you can start the reducer - but for any given key you 
still have to hold off processing until all values for that key become 
available - and in practice this means that the reducer has to wait until all 
mappers are done.

bq. How can one tune the system for fast indexing/search performance if the 
additional load put on the system from MR is hardly predictable?

[~lukas.vlcek]: that's why I suggested that this framework should have the 
ability to specify a budget for job execution, at least in terms of RAM or 
key-value pair count. Still, for occasional analytic jobs or simple 
aggregations the load should be predictable or bearable, and the performance 
cost of using this tool would be negligible compared to the cost of integrating 
and operating a separate analytic platform.
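A minimal sketch of such a budget in terms of distinct key count, assuming a 
hypothetical per-job limit parameter; nothing like this exists in the proposal 
yet, so both names below are assumptions:
{code:JavaScript}
var MAX_KEYS = 1000000; // hypothetical per-job budget parameter
var keyCount = 0;

// Returns the entry for 'key', creating it only if the budget allows;
// aborts the run instead of letting the reduce node exhaust its heap.
function budgetedGet(reduced, key) {
  if (!(key in reduced)) {
    if (++keyCount > MAX_KEYS) {
      throw new Error("job exceeded its budget of " + MAX_KEYS + " distinct keys");
    }
    reduced[key] = {count: 0};
  }
  return reduced[key];
}
{code}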


[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2013-07-24 Thread Lukas Vlcek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13718518#comment-13718518
 ] 

Lukas Vlcek commented on SOLR-5069:
---

[~porqpine]: Well, I see the point. From the user's point of view this sounds 
very cool, and it will be interesting to see how this feature works out.

Though, this reminds me of a situation that happened at Google a couple of 
years ago (I heard this from one ex-Googler, not sure if there is any official 
evidence) when they introduced the MR platform internally and all the summer 
interns started using it. A lot of non-optimal tasks started eating their 
resources, because it is so easy to translate a lot of problems into MR (but 
that does not mean that the MR solution to the problem is the optimal one).

As for setting up a separate analytical platform, well... the cost of setting 
it up is one thing, but the benefit of existing tooling and experience is 
another. Are you going to reimplement Mahout in Solr? Well, maybe you are not 
aiming at that level of complexity.

You can throttle the thing on many levels, but as a result the task will just 
run longer, right? Isn't this in fact a big challenge? If I understand Lucene 
correctly, the costly part is if you need to keep aged IndexReaders around, 
because this leads to a higher number of open segments and consumption of the 
related resources. And what if the data included in the MR calculation changes 
(reindex/delete) in the meantime? Then you need to be careful in presenting 
the results to the clients, because they may be too used to Hadoop MR, where 
the original data set is still available.

Anyway, I am sure you are already aware of all this. I am just curious :-)


[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2013-07-24 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13718578#comment-13718578
 ] 

Noble Paul commented on SOLR-5069:
--

bq.that's why I suggested that this framework should have the ability to 
specify a budget ...

Yoi are right. Even the version 1.0 should have a way to budget the RAM at 
mapper and reducer for a given task

bq. A lot of non-optimal tasks started eating their resources...

bq. but the benefit of existing tooling and experience is another one

Actually, it cuts both ways. Mahout (or other systems) will have more mature 
support for certain tasks, but more people are familiar with Solr/Lucene, and 
that will help them get up and running with little effort.

bq. as a result the task will just run longer, right?

Well, that is the tradeoff you make: choose expensive hardware or wait longer.

bq. the costly part is if you need to keep aged IndexReaders around ...

Yes, if you have frequent commits and frequent MR tasks running. But you will 
rarely run a long-running process on a very frequently changing dataset.

Lucene does not delete 'data' when segments are cleaned up; live documents are 
just copied over when segment merges happen. Even if deletes happen in 
between, Lucene will behave well because we always operate on the same 
IndexReader, and the results will be consistent with the state of the data at 
the time the task was fired.
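
For reference, that is standard Lucene behavior rather than anything new: hold 
a single reader for the life of the task and every lookup sees the same 
snapshot of the index. A minimal sketch:

{code:Java}
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;

// As long as the task holds this one DirectoryReader, commits and merges
// happening concurrently are invisible to it: every lookup is consistent
// with the index as of the moment the reader was opened.
public class PointInTimeScan {
    public static void scan(Directory dir) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // run the long map task against this searcher only
            System.out.println("docs visible to this task: " + reader.numDocs());
        }
    }
}
{code}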




[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2013-07-24 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13718762#comment-13718762
 ] 

Yonik Seeley commented on SOLR-5069:


Awesome stuff Noble!

bq. why can't reduce start as soon as the mappers start producing? Whatever is 
emitted by a mapper is up for the reducer to chew on.

Right - and some things will be completely streamable without any need for 
buffering... think of re-implementing the terms component here: we can access 
terms in sorted order, so the reducer would simply need to do a merge sort on 
the streams and then stream that result back!
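
A minimal sketch of that streaming merge, assuming each shard's terms arrive 
as an already-sorted Iterator (the heap-based k-way merge below is the 
standard technique; none of these names are Solr API):

{code:Java}
import java.util.*;

// Merges per-shard term streams that each arrive in sorted order, so the
// reducer can emit a globally sorted stream without buffering everything.
public class SortedStreamMerge {
    public static Iterator<String> merge(List<Iterator<String>> shardStreams) {
        PriorityQueue<Map.Entry<String, Iterator<String>>> heap =
            new PriorityQueue<>(Map.Entry.comparingByKey());
        for (Iterator<String> it : shardStreams) {
            if (it.hasNext()) heap.add(new AbstractMap.SimpleEntry<>(it.next(), it));
        }
        return new Iterator<String>() {
            public boolean hasNext() { return !heap.isEmpty(); }
            public String next() {
                Map.Entry<String, Iterator<String>> top = heap.poll();
                Iterator<String> source = top.getValue();
                if (source.hasNext())
                    heap.add(new AbstractMap.SimpleEntry<>(source.next(), source));
                return top.getKey();
            }
        };
    }
}
{code}

A real terms component would additionally merge the counts of equal terms 
seen on different shards; the sketch only shows the ordering part.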



[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2013-07-24 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13718785#comment-13718785
 ] 

Eks Dev commented on SOLR-5069:
---

Wow, this is getting pretty close to collection clustering and other goodies; 
somehow plug in Mahout and it's there.

Great job and a great direction for Solr. End applications not only need to 
find things, they often want to do something with them as well :)

Thanks!



[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2013-07-23 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13716486#comment-13716486
 ] 

Andrzej Bialecki  commented on SOLR-5069:
-

Exciting idea! Almost as exciting as SolrCloud on MapReduce :)

A few comments:
# distributed map-reduce in reality is a sequence of:
## split input and assign splits to M nodes
## apply map() on M nodes in parallel
##* for large datasets the emitted data from mappers is spooled to disk
## shuffle - i.e. partition and ship the emitted data from M mappers to N 
reducers
##* (wait until all mappers are done, so that each partition's key-space is 
complete)
## sort by key in each of N reducers, collecting values for each key
##* again, for large datasets this is a disk-based sort
## apply N reducers in parallel and emit final output (in N parts)
# if I understand it correctly, the model you presented has some limitations:
## as many input splits as there are shards (and consequently as many mappers)
## a single reducer. Theoretically it should be possible to use N nodes as 
reducers if you implement the concept of a partitioner - this would cut down 
the memory load on each reducer node. Of course, streaming back the results 
would be a challenge, but saving them into a collection should work just fine.
## no shuffling - all data from mappers will go to a single reducer
## no intermediate storage of data, all intermediate values need to fit in 
memory
## what about the sorting phase? I assume it's an implicit function in the 
reducedMap (treemap?)
# since all fine-grained emitted values from map end up being sent to a single 
reducer, which has to collect all this data in memory before applying the 
reduce() op, the concept of a map-side combiner seems useful, to quickly 
minimize the amount of data sent to the reducer (see the sketch below).
# it would be very easy to OOM your Solr nodes at the reduce phase. There 
should be some built-in safety mechanism for this.
# what parts of Solr are available in the script's context? Making all Solr API 
available could lead to unpredictable side-effects, so this set of APIs needs 
to be curated. E.g. I think it would make sense to make analyzer factories 
available.

And finally, an observation: regular distributed search can be viewed as a 
special case of map-reduce computation ;)
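
A minimal sketch of that map-side combiner idea for the word-count example 
(all names are hypothetical, and emit() stands in for the framework's network 
emit to the reduce node):

{code:Java}
import java.util.HashMap;
import java.util.Map;

// Aggregates counts locally and flushes them in batches, so far fewer
// key/value pairs cross the wire to the single reducer.
public class WordCountCombiner {
    private final Map<String, Long> partial = new HashMap<>();
    private final int flushThreshold; // assumed tuning knob

    public WordCountCombiner(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    public void map(String word) {
        partial.merge(word, 1L, Long::sum); // combine locally
        if (partial.size() >= flushThreshold) flush();
    }

    /** Ship the combined partial counts to the reducer and reset. */
    public void flush() {
        partial.forEach(this::emit);
        partial.clear();
    }

    private void emit(String word, long count) {
        // placeholder for the real network emit to the reduce node
        System.out.println(word + " -> " + count);
    }
}
{code}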


[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2013-07-23 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13716624#comment-13716624
 ] 

Noble Paul commented on SOLR-5069:
--

Thanks, Andrzej.

I started off with a simple model so that version 1 can be implemented easily.

'N' reducers add implementation complexity. However, it should be done 
eventually.

bq. no intermediate storage of data, all intermediate values need to fit in 
memory

Yes; in my model the mappers will be throttled so that we can cap the amount 
of intermediate data kept in memory. The $.map() call would wait if the size 
threshold is reached.
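
A sketch of what that throttling could look like on the map host, assuming 
emitted pairs flow through a bounded queue that the sender thread drains 
toward the reduce node (class and method names are illustrative, not Solr 
API):

{code:Java}
import java.util.AbstractMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// The map script's emit blocks once 'capacity' pairs are in flight,
// which caps the intermediate data held in memory on the map host.
public class ThrottledEmitter {
    private final BlockingQueue<Map.Entry<String, Long>> buffer;

    public ThrottledEmitter(int capacity) {
        this.buffer = new ArrayBlockingQueue<>(capacity);
    }

    /** Called by the map script; blocks when the buffer is full. */
    public void emit(String key, long value) throws InterruptedException {
        buffer.put(new AbstractMap.SimpleEntry<>(key, value));
    }

    /** Called by the sender thread streaming pairs to the reducer. */
    public Map.Entry<String, Long> nextToSend() throws InterruptedException {
        return buffer.take();
    }
}
{code}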

bq. what about the sorting phase? I assume it's an implicit function in the 
reducedMap (treemap?)

We should have a choice in how the map is sorted. Yes, some kind of sorted map 
should be offered; probably sorted on some key's value in the map.

bq. it would be very easy to OOM your Solr nodes at the reduce phase.

Sure; the idea here is to overflow to disk beyond a threshold. With memory 
getting abundant, we should probably use some off-heap solution so that the 
reduce is not I/O bound.

bq. what parts of Solr are available in the script's context

Good that you asked. We should keep the available APIs limited. For instance, 
anything that can alter the state of the system should not be exposed to the 
script; anything that helps with processing/manipulating data should be 
exposed.



[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2013-07-23 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13716655#comment-13716655
 ] 

Andrzej Bialecki  commented on SOLR-5069:
-

bq. Sure, here the idea is to do some overflow to disk beyond a threshold.

Berkeley DB, db4o, the Apache-licensed MapDB (mapdb.org), and probably others 
all provide persistent implementations of the Java Collections API. We could 
use one of these - you could add a provider mechanism to separate the actual 
implementation from the plain Collections API.
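
A minimal sketch of such a provider mechanism, under the assumption that the 
reduce code only ever sees java.util.Map (the interface and class names are 
made up for illustration):

{code:Java}
import java.util.HashMap;
import java.util.Map;

// The reduce code asks the provider for its reduced map and never learns
// whether the map lives on the heap or in a persistent store.
public interface ReducedMapProvider {
    <V> Map<String, V> newReducedMap(String runId);

    /** Default provider: a plain in-heap map, fine for small runs. */
    class InMemoryProvider implements ReducedMapProvider {
        public <V> Map<String, V> newReducedMap(String runId) {
            return new HashMap<>();
        }
    }
    // A MapDB- or Berkeley DB-backed provider would implement the same
    // interface and return a disk-backed map that spills beyond a RAM
    // threshold.
}
{code}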

bq. $.map() call would wait if the size threshold is reached

Throttling the mappers wouldn't help with OOM on the reduce() side - reduce() 
can start only when all mappers are finished. I think a map-side combiner would 
be much more helpful, if possible (reductions that are not simple aggregations 
usually can't be performed in map-side combiners).


[jira] [Commented] (SOLR-5069) MapReduce for SolrCloud

2013-07-23 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13717922#comment-13717922
 ] 

Noble Paul commented on SOLR-5069:
--

bq. reduce() can start only when all mappers are finished

Why? Why can't reduce start as soon as the mappers start producing? Whatever 
is emitted by a mapper is up for the reducer to chew on.

All said, a map-side combiner is definitely useful and would reduce 
memory/network IO.
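
A sketch of that streaming-reduce loop for the word-count case: reduction 
overlaps with mapping, and a sentinel pair stands in for each mapper's "done" 
message (illustrative only, not Solr API):

{code:Java}
import java.util.AbstractMap;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;

// Folds each pair into the running totals the moment it arrives, instead
// of waiting for all mappers to finish before reducing.
public class StreamingReducer {
    static final Map.Entry<String, Long> DONE =
        new AbstractMap.SimpleEntry<>(null, 0L);

    public static Map<String, Long> reduce(
            BlockingQueue<Map.Entry<String, Long>> incoming,
            int mapperCount) throws InterruptedException {
        Map<String, Long> reduced = new HashMap<>();
        int finished = 0;
        while (finished < mapperCount) {
            Map.Entry<String, Long> pair = incoming.take();
            if (pair == DONE) { finished++; continue; }
            reduced.merge(pair.getKey(), pair.getValue(), Long::sum);
        }
        return reduced;
    }
}
{code}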
