GitHub user nickwallen reopened a pull request:
https://github.com/apache/metron/pull/1191
METRON-1772 Support alternative input formats in the Batch Profiler
[Feature Branch]
By default, the Batch Profiler supports the text/json that Metron lands in
HDFS as the source of the archived telemetry. This is not always the best
option for archiving telemetry, and users may choose to store it in
alternative formats.
Alternatives like ORC should be supported when reading the input telemetry
in the Batch Profiler. The user should be able to customize the profiler based
on how they have chosen to archive their telemetry.
- Updated README to describe how to read alternative input formats.
- Added an additional command line option that allows the user to pass
custom options to the `DataFrameReader`. This may be needed by a user
depending on how the telemetry is archived.
- For example, this allows the user to pass reader options like
`quote`, `nullValue`, etc. needed by
[csv](https://spark.apache.org/docs/2.3.1/api/java/org/apache/spark/sql/DataFrameReader.html#csv-java.lang.String...-)
or `allowSingleQuote`, `allowComments` needed by
[json](https://spark.apache.org/docs/2.3.1/api/java/org/apache/spark/sql/DataFrameReader.html#json-scala.collection.Seq-)
- Added an integration test that validates that the Batch Profiler can read
ORC data.
- Added an integration test that validates that the Batch Profiler can read
CSV data. I added the CSV test to validate that a user can provide
custom options to the `DataFrameReader`.
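As a rough sketch of what a CSV setup might look like: the snippet below writes two small files to a scratch directory. It assumes the reader options are supplied as a Java-style properties file, which is an assumption on my part; the README describes the actual command line option and file format. The file names, the input path, and the option values are all hypothetical. Only `profiler.batch.input.path` and `profiler.batch.input.format` are taken from the properties shown in the testing steps.

```shell
# Sketch only: file names, input path, and option values are illustrative assumptions.
WORK_DIR="$(mktemp -d)"

# Options intended for Spark's DataFrameReader.
# `header` and `nullValue` are standard Spark CSV reader options.
cat > "$WORK_DIR/reader.properties" <<'EOF'
header=true
nullValue=NA
EOF

# Point the Batch Profiler at CSV input instead of the default text/json archive.
# The HDFS path below is a placeholder, not a real location.
cat > "$WORK_DIR/batch-profiler.properties" <<'EOF'
profiler.batch.input.path=hdfs://localhost:8020/path/to/csv/telemetry
profiler.batch.input.format=csv
EOF

echo "Wrote sample properties to $WORK_DIR"
```

Reading ORC should only need `profiler.batch.input.format=orc` and an input path pointing at the ORC files, with no extra reader options.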
This is a pull request against the `METRON-1699-create-batch-profiler`
feature branch.
This is dependent on the following PRs. By filtering on the last commit,
this PR can be reviewed before the others are reviewed and merged.
- [ ] #1189
## Testing
1. Stand up a development environment.
```
cd metron-deployment/development/centos6
vagrant up
vagrant ssh
sudo su -
```
1. Validate the environment by ensuring alerts are visible within the
Alerts UI and that the Metron Service Check in Ambari passes.
1. Allow some telemetry to be archived in HDFS.
```
[root@node1 ~]# hdfs dfs -cat /apps/metron/indexing/indexed/*/* | wc -l
6916
```
1. Shut down the Metron topologies, Storm, Elasticsearch, Kibana, and
MapReduce2 to free up some resources on the VM.
1. Use Ambari to install Spark (version 2.3+): Actions > Add Service >
Spark2.
1. Make sure Spark can talk to HBase.
```
SPARK_HOME=/usr/hdp/current/spark2-client
cp /usr/hdp/current/hbase-client/conf/hbase-site.xml $SPARK_HOME/conf/
```
1. Follow the [Getting Started](https://github.com/apache/metron/tree/feature/METRON-1699-create-batch-profiler/metron-analytics/metron-profiler-spark#getting-started)
section of the README to seed a basic profile using the text/json telemetry
that is archived in HDFS.
1. Create the Profile.
```
[root@node1 ~]# source /etc/default/metron
[root@node1 ~]# cat $METRON_HOME/config/zookeeper/profiler.json
{
"profiles": [
{
"profile": "hello-world",
"foreach": "'global'",
"init":{ "count": "0" },
"update": { "count": "count + 1" },
"result": "count"
}
],
"timestampField": "timestamp"
}
```
1. Edit the Batch Profiler properties to point it at the correct input
path (changed localhost:9000 to localhost:8020).
```
[root@node1 ~]# cat /usr/metron/0.5.1/config/batch-profiler.properties
spark.app.name=Batch Profiler
spark.master=local
spark.sql.shuffle.partitions=8
profiler.batch.input.path=hdfs://localhost:8020/apps/metron/indexing/indexed/*/*
profiler.batch.input.format=text
profiler.period.duration=15
profiler.period.duration.units=MINUTES
```
1. Edit logging as you see fit. For example, set Spark logging to WARN
and Profiler logging to DEBUG. This is described in the README.
1. Run the Batch Profiler.
```
$METRON_HOME/bin/start_batch_profiler.sh
```
1. Launch the Stellar REPL and retrieve the profile data. Save this result
as it will be used for validation in subsequent steps.
```
[root@node1 ~]# $METRON_HOME/bin/stellar -z $ZOOKEEPER
...
Stellar, Go!
Functions are loading lazily in the background and will be unavailable until loaded fully.
...
[Stellar]>>> window