Github user nickwallen commented on a diff in the pull request:

https://github.com/apache/metron/pull/831#discussion_r150314117

    --- Diff: metron-deployment/packaging/ambari/metron-mpack/src/main/resources/common-services/METRON/CURRENT/package/scripts/indexing_commands.py ---
    @@ -162,34 +164,77 @@ def start_indexing_topology(self, env):
                                           self.__params.metron_principal_name,
                                           execute_user=self.__params.metron_user)

    -            start_cmd_template = """{0}/bin/start_elasticsearch_topology.sh \
    -                                    -s {1} \
    -                                    -z {2}"""
    -            start_cmd = start_cmd_template.format(self.__params.metron_home,
    -                                                  self.__indexing_topology,
    -                                                  self.__params.zookeeper_quorum)
    +            start_cmd_template = """{0}/bin/start_hdfs_topology.sh"""
    +            start_cmd = start_cmd_template.format(self.__params.metron_home)
                 Execute(start_cmd,
                         user=self.__params.metron_user,
                         tries=3,
                         try_sleep=5,
                         logoutput=True)

             else:
    -            Logger.info('Indexing topology already running')
    +            Logger.info('Batch Indexing topology already running')

    -        Logger.info('Finished starting indexing topology')
    +        Logger.info('Finished starting batch indexing topology')

    -    def stop_indexing_topology(self, env):
    -        Logger.info('Stopping ' + self.__indexing_topology)
    +    def start_random_access_indexing_topology(self, env):
    +        Logger.info('Starting ' + self.__random_access_indexing_topology)
    --- End diff --

First off, I think we definitely need to make this happen. Each index destination is going to have very different performance characteristics that need to be tuned in isolation. I think this is a step in the right direction.

As I read this we have effectively hard-coded two indexing topologies; random access and batch. This is definitely the most logical way to get to separate topologies based on our existing code base. But I am wondering if we might think about this in a slightly different way.

What I really like about indexing is that we have the idea of multiple, independent destinations. For example, my indexing configuration could look like this.
```
{
  "elasticsearch": {
    "index": "foo",
    "enabled" : true
  },
  "hdfs": {
    "index": "foo",
    "batchSize": 1,
    "enabled" : true
  }
}
```

What if we introduced logic that consumes the indexing configuration, determines that it needs to launch 2 topologies in this case, and then launches those 2 separate topologies? If I had 3 destinations configured, then it would launch 3 topologies; one for each destination?

I can definitely see the extra complexity in doing this. You have to make sure the user can independently configure each of the topologies. You have to respond to configuration changes made by the user. And probably a few other complications.

But these are already complications that we need to deal with in Parsing. A user can define 1 to N Parsing topologies. It seems like if we can solve these challenges for Parsing, we can do the same for Indexing.

Anywho, I can totally see this PR as a near-term solution to the immediate problem, which might lead towards a longer-term solution like I propose. I just wanted to see if anyone had related thoughts.
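
To make the idea a bit more concrete, here is a rough Python sketch of the "one topology per enabled destination" logic. It is only an illustration: the helper names `enabled_destinations` and `plan_topology_changes` are hypothetical and do not exist in the mpack scripts, the list of running topologies is passed in by hand rather than queried from Storm, and treating a missing `enabled` flag as disabled is an assumption that may not match the real default.

```python
import json


def enabled_destinations(indexing_config):
    """Return the sorted names of destinations whose "enabled" flag is true.

    Assumption: a destination without an "enabled" key is treated as disabled.
    """
    return sorted(name for name, settings in indexing_config.items()
                  if settings.get("enabled", False))


def plan_topology_changes(indexing_config, running):
    """Compare the destinations that should run with those already running.

    Returns a (to_start, to_stop) tuple of sorted destination names, so the
    same function covers both the first launch and a later config change.
    """
    desired = set(enabled_destinations(indexing_config))
    current = set(running)
    return sorted(desired - current), sorted(current - desired)


# The example indexing configuration from the comment above.
config = json.loads("""
{
  "elasticsearch": { "index": "foo", "enabled": true },
  "hdfs": { "index": "foo", "batchSize": 1, "enabled": true }
}
""")

# Hypothetical situation: only the HDFS topology is currently running.
to_start, to_stop = plan_topology_changes(config, running=["hdfs"])
print(to_start)  # ['elasticsearch']
print(to_stop)   # []
```

Something along these lines would let the start and stop commands loop over destination names instead of hard-coding one method per topology, though per-topology tuning would still need its own configuration.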
---