GitHub user nickwallen commented on a diff in the pull request:
https://github.com/apache/metron/pull/831#discussion_r150314117
--- Diff:
metron-deployment/packaging/ambari/metron-mpack/src/main/resources/common-services/METRON/CURRENT/package/scripts/indexing_commands.py
---
@@ -162,34 +164,77 @@ def start_indexing_topology(self, env):
self.__params.metron_principal_name,
execute_user=self.__params.metron_user)
- start_cmd_template = """{0}/bin/start_elasticsearch_topology.sh \
- -s {1} \
- -z {2}"""
- start_cmd = start_cmd_template.format(self.__params.metron_home,
- self.__indexing_topology,
- self.__params.zookeeper_quorum)
+ start_cmd_template = """{0}/bin/start_hdfs_topology.sh"""
+ start_cmd = start_cmd_template.format(self.__params.metron_home)
Execute(start_cmd, user=self.__params.metron_user, tries=3,
try_sleep=5, logoutput=True)
else:
- Logger.info('Indexing topology already running')
+ Logger.info('Batch Indexing topology already running')
- Logger.info('Finished starting indexing topology')
+ Logger.info('Finished starting batch indexing topology')
- def stop_indexing_topology(self, env):
- Logger.info('Stopping ' + self.__indexing_topology)
+ def start_random_access_indexing_topology(self, env):
+ Logger.info('Starting ' + self.__random_access_indexing_topology)
--- End diff --
First off, I think we definitely need to make this happen. Each index
destination is going to have very different performance characteristics that
need to be tuned in isolation. I think this is a step in the right direction.
As I read this, we have effectively hard-coded two indexing topologies:
random access and batch. This is definitely the most logical way to get to
separate topologies given our existing code base. But I am wondering if we
might think about this in a slightly different way.
What I really like about indexing is that we have the idea of multiple,
independent destinations. For example, my indexing configuration could look
like this.
```
{
"elasticsearch": {
"index": "foo",
"enabled" : true
},
"hdfs": {
"index": "foo",
"batchSize": 1,
"enabled" : true
}
}
```
What if we introduced logic that consumes the indexing configuration,
determines that it needs to launch 2 topologies in this case, and then launches
those 2 separate topologies? If I had 3 destinations configured, it would
launch 3 topologies, one for each destination.
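As a rough sketch of that idea (not existing Metron code), the logic could read the indexing configuration and derive one topology per enabled destination. The function names and the `indexing_<destination>` naming scheme are hypothetical, and treating a missing `enabled` flag as disabled is an assumption made only for this example.

```python
import json


def enabled_destinations(indexing_config):
    """Return the sorted names of destinations explicitly marked enabled.

    Assumption: a destination without an "enabled" key counts as disabled.
    """
    return sorted(
        name
        for name, settings in indexing_config.items()
        if settings.get("enabled", False)
    )


def topologies_to_launch(indexing_config):
    """Map each enabled destination to a hypothetical topology name."""
    return ["indexing_{0}".format(name)
            for name in enabled_destinations(indexing_config)]


# The example configuration from above, with two enabled destinations.
example_config = json.loads("""
{
  "elasticsearch": {"index": "foo", "enabled": true},
  "hdfs": {"index": "foo", "batchSize": 1, "enabled": true}
}
""")

for topology in topologies_to_launch(example_config):
    print("would launch " + topology)
```

The start logic in `indexing_commands.py` could then loop over that list instead of calling one hard-coded start script per topology.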
I can definitely see the extra complexity in doing this. You have to make
sure the user can independently configure each of the topologies. You have to
respond to configuration changes made by the user. There are probably a few
other complications as well.
But these are already complications that we need to deal with in Parsing.
A user can define 1 to N Parsing topologies. It seems like if we can solve
these challenges for Parsing, we can do the same for Indexing.
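To make the "respond to configuration changes" part concrete, here is a minimal sketch that compares the topologies the configuration asks for against the ones currently running. The function name and the topology names are hypothetical, and how the running list is obtained (for example, by querying Storm) is left out on purpose.

```python
def plan_topology_changes(desired, running):
    """Work out which topologies to start and which to stop.

    desired: topology names implied by the current indexing configuration.
    running: topology names that are currently active.
    """
    desired_set = set(desired)
    running_set = set(running)
    return {
        "start": sorted(desired_set - running_set),
        "stop": sorted(running_set - desired_set),
    }


# Hypothetical scenario: the user disabled Elasticsearch and enabled Solr.
plan = plan_topology_changes(
    desired=["indexing_hdfs", "indexing_solr"],
    running=["indexing_elasticsearch", "indexing_hdfs"],
)
print(plan)
```

Topologies that appear in both lists are left alone, so a configuration change only touches the destinations that actually changed.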
Anywho, I can totally see this PR as a near-term solution to the immediate
problem, which might lead towards a longer-term solution like I propose. I
just wanted to see if anyone had related thoughts.
---