Github user nickwallen commented on a diff in the pull request:

    https://github.com/apache/metron/pull/831#discussion_r150314117
  
    --- Diff: metron-deployment/packaging/ambari/metron-mpack/src/main/resources/common-services/METRON/CURRENT/package/scripts/indexing_commands.py ---
    @@ -162,34 +164,77 @@ def start_indexing_topology(self, env):
                                           self.__params.metron_principal_name,
                                           execute_user=self.__params.metron_user)
     
    -            start_cmd_template = """{0}/bin/start_elasticsearch_topology.sh \
    -                                        -s {1} \
    -                                        -z {2}"""
    -            start_cmd = start_cmd_template.format(self.__params.metron_home,
    -                                                  self.__indexing_topology,
    -                                                  self.__params.zookeeper_quorum)
    +            start_cmd_template = """{0}/bin/start_hdfs_topology.sh"""
    +            start_cmd = start_cmd_template.format(self.__params.metron_home)
                 Execute(start_cmd, user=self.__params.metron_user, tries=3, try_sleep=5, logoutput=True)
     
             else:
    -            Logger.info('Indexing topology already running')
    +            Logger.info('Batch Indexing topology already running')
     
    -        Logger.info('Finished starting indexing topology')
    +        Logger.info('Finished starting batch indexing topology')
     
    -    def stop_indexing_topology(self, env):
    -        Logger.info('Stopping ' + self.__indexing_topology)
    +    def start_random_access_indexing_topology(self, env):
    +        Logger.info('Starting ' + self.__random_access_indexing_topology)
    --- End diff --
    
    First off, I think we definitely need to make this happen.  Each index 
destination is going to have very different performance characteristics that 
need to be tuned in isolation.  I think this is a step in the right direction.
    
    As I read this, we have effectively hard-coded two indexing topologies: 
random access and batch. This is definitely the most logical way to get to 
separate topologies given our existing code base, but I am wondering if we 
might think about this in a slightly different way.
    
    What I really like about indexing is that we have the idea of multiple, 
independent destinations.  For example, my indexing configuration could look 
like this.
    ```
    {
       "elasticsearch": {
          "index": "foo",
          "enabled": true
       },
       "hdfs": {
          "index": "foo",
          "batchSize": 1,
          "enabled": true
       }
    }
    ```
    
    What if we introduced logic that consumes the indexing configuration, 
determines that it needs to launch 2 topologies in this case, and then launches 
those 2 separate topologies?  If I had 3 destinations configured, it would 
launch 3 topologies, one for each destination.
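    
    To make the idea a bit more concrete, here is a rough sketch in Python of 
the "consume the configuration, decide what to launch" step.  This is not 
existing Metron code.  The function names and the `indexing-{0}` topology naming 
scheme are made up for illustration, and treating a missing `enabled` flag as 
enabled is an assumption on my part.

```python
import json


def enabled_destinations(indexing_config):
    """Return the sorted names of the destinations enabled in the config.

    Assumption: a destination without an explicit "enabled" flag counts as enabled.
    """
    return sorted(name for name, settings in indexing_config.items()
                  if settings.get("enabled", True))


def plan_topologies(indexing_config, name_template="indexing-{0}"):
    """Map each enabled destination to the (hypothetical) topology to launch for it."""
    return {dest: name_template.format(dest)
            for dest in enabled_destinations(indexing_config)}


# The example indexing configuration from this comment.
config = json.loads("""
{
   "elasticsearch": { "index": "foo", "enabled": true },
   "hdfs": { "index": "foo", "batchSize": 1, "enabled": true }
}
""")

plan = plan_topologies(config)
for destination, topology in plan.items():
    print("would launch {0} for destination {1}".format(topology, destination))
```

    With the two destinations above this plans two topologies; adding a third 
enabled destination to the configuration would plan a third.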
    
    I can definitely see the extra complexity in doing this.  You have to make 
sure the user can independently configure each of the topologies.  You have to 
respond to configuration changes made by the user.  There are probably a few 
other complications as well.
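    
    For the "respond to configuration changes" part, one possible shape is a 
small reconcile step that compares the destinations that currently have a 
running topology against the destinations the configuration now asks for.  
Again, this is only a sketch under my own assumptions: `reconcile` is a 
hypothetical helper, and how we would actually discover the running topologies 
is left out.

```python
def reconcile(running, desired):
    """Work out which per-destination topologies to start and which to stop.

    running: destinations that currently have a topology running
    desired: destinations enabled in the latest indexing configuration
    """
    running, desired = set(running), set(desired)
    return {
        "start": sorted(desired - running),  # newly enabled destinations
        "stop": sorted(running - desired),   # destinations removed or disabled
    }


# Illustrative change: elasticsearch is disabled and solr is added.
actions = reconcile(running=["elasticsearch", "hdfs"], desired=["hdfs", "solr"])
print(actions)
```

    Topologies for destinations that appear on both sides are left alone, so a 
change to one destination would not disturb the others.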
    
    But these are already complications that we need to deal with in Parsing.  
A user can define 1 to N Parsing topologies.  It seems like if we can solve 
these challenges for Parsing, we can do the same for Indexing.
    
    Anywho, I can totally see this PR as a near-term solution to the immediate 
problem, which might lead towards a longer-term solution like the one I propose.  
I just wanted to see if anyone had related thoughts.
    
    
    
    
    
    


