[ https://issues.apache.org/jira/browse/BLUR-18?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13475938#comment-13475938 ]

Aaron McCurry commented on BLUR-18:
-----------------------------------

After thinking about it, we should probably just run one input split per server 
instead of per shard.  That way a single MR program won't overwhelm the shard 
cluster.  In the future we may want to allow this to be configurable.
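To make the "one split per server" idea concrete, here is a minimal, self-contained sketch of the grouping step, assuming we already know which shard lives on which server. The names (ServerSplit, shardLayout, splitsPerServer) are illustrative only, not part of the Blur API:

```java
import java.util.*;

public class SplitPerServer {
  // One split per shard server: the server connection string plus the
  // shards that server hosts (names here are hypothetical).
  static final class ServerSplit {
    final String server;
    final List<String> shards;
    ServerSplit(String server, List<String> shards) {
      this.server = server;
      this.shards = shards;
    }
  }

  // Collapse the shard -> server layout so each server yields exactly one
  // split, instead of one split per shard.
  static List<ServerSplit> splitsPerServer(Map<String, String> shardLayout) {
    Map<String, List<String>> byServer = new TreeMap<>();
    for (Map.Entry<String, String> e : shardLayout.entrySet()) {
      byServer.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
    }
    List<ServerSplit> splits = new ArrayList<>();
    for (Map.Entry<String, List<String>> e : byServer.entrySet()) {
      splits.add(new ServerSplit(e.getKey(), e.getValue()));
    }
    return splits;
  }

  public static void main(String[] args) {
    Map<String, String> layout = new HashMap<>();
    layout.put("shard-0", "server-a:40020");
    layout.put("shard-1", "server-a:40020");
    layout.put("shard-2", "server-b:40020");
    // 3 shards on 2 servers -> 2 splits
    System.out.println(splitsPerServer(layout).size());
  }
}
```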

I think that you are on the right track, I modified your code and added a 
little of my own to describe what I was thinking.

Driver program:

public static void main(String[] args) {
  //This code will execute against the blur controllers
  //'client' is a Blur client connected to the controllers
  Configuration conf = new Configuration();
  Session session = client.openReadSession();
  QuerySession querySession = client.executeQuery(session, "select * from table1");
  Job job = BlurInputFormat.configureJob(conf, querySession);

  //run job

  client.closeReadSession(session);
}
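One thing worth guarding in the driver: if the job throws, closeReadSession is never reached and the read session leaks. A small sketch of the open/run/close lifecycle with try/finally, using a hypothetical Client interface as a stand-in for the real Blur controller client:

```java
import java.util.*;

public class SessionLifecycle {
  // Minimal stand-in for the controller client; not the real Blur/Thrift API.
  interface Client {
    String openReadSession();
    void closeReadSession(String session);
  }

  // Runs the job body with the session guaranteed to close, even on failure.
  static void runWithSession(Client client, Runnable job) {
    String session = client.openReadSession();
    try {
      job.run();
    } finally {
      client.closeReadSession(session);
    }
  }

  public static void main(String[] args) {
    List<String> events = new ArrayList<>();
    Client fake = new Client() {
      public String openReadSession() { events.add("open"); return "s1"; }
      public void closeReadSession(String s) { events.add("close " + s); }
    };
    try {
      runWithSession(fake, () -> { throw new RuntimeException("job failed"); });
    } catch (RuntimeException expected) {
      // the session was still closed
    }
    System.out.println(events);
  }
}
```

The recording fake in main shows that the close happens even when the job body throws.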

public List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException {
  try {
    QuerySession querySession = BlurInputFormat.readQuerySession(context);
    List<InputSplit> splits = new ArrayList<InputSplit>();
    List<String> shardServerConnections = getShardServers(querySession);
    for (String shardServerConnection : shardServerConnections) {
      splits.add(new BlurSplit(shardServerConnection, querySession));
    }
    return splits;
  } catch (Exception e) {
    //throw exceptions
    throw new IOException(e);
  }
}
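Since each BlurSplit gets shipped from the client to the task trackers, it will also need Writable-style serialization. A self-contained sketch of what that round trip might look like, assuming the split carries just the server connection string and an opaque query-session id (the Hadoop InputSplit/Writable interfaces are omitted here so the example runs on its own; field names are hypothetical):

```java
import java.io.*;

public class BlurSplitSketch {
  String shardServerConnection;
  String querySessionId;

  BlurSplitSketch() {}

  BlurSplitSketch(String conn, String sessionId) {
    this.shardServerConnection = conn;
    this.querySessionId = sessionId;
  }

  // Mirrors what Writable.write(DataOutput) would do.
  void write(DataOutput out) throws IOException {
    out.writeUTF(shardServerConnection);
    out.writeUTF(querySessionId);
  }

  // Mirrors what Writable.readFields(DataInput) would do.
  void readFields(DataInput in) throws IOException {
    shardServerConnection = in.readUTF();
    querySessionId = in.readUTF();
  }

  public static void main(String[] args) throws IOException {
    BlurSplitSketch split = new BlurSplitSketch("server-a:40020", "session-42");
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    split.write(new DataOutputStream(bytes));

    BlurSplitSketch copy = new BlurSplitSketch();
    copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
    // prints "server-a:40020 session-42"
    System.out.println(copy.shardServerConnection + " " + copy.querySessionId);
  }
}
```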

private List<String> getShardServers(QuerySession querySession) {
  //add to the query session object what shard cluster the query is executing against
  //we will need to add this into the thrift api
  //then look up the shard servers from the blur controller
  return null; //TODO
}
                
> Rework the MapReduce Library to implement Input/OutputFormats
> -------------------------------------------------------------
>
>                 Key: BLUR-18
>                 URL: https://issues.apache.org/jira/browse/BLUR-18
>             Project: Apache Blur
>          Issue Type: Improvement
>            Reporter: Aaron McCurry
>
> Currently the only way to implement indexing is to use the BlurReducer.  A 
> better way to implement this would be to support Hadoop input/output formats 
> in both the new and old APIs.  This would allow easier integration with 
> other Hadoop projects such as Hive and Pig.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
