That’s right, ODM (open data model) is planned for the future; none of the Spot
components leverage it yet.
To answer your question, Giacomo, I can only think of two solutions for now:
1. If your current data is not used as a data source by any other process, you
can write a simple Spark job to transform and rename the columns, save the new
data, and delete your original data set. If you need to keep the original data,
you can do the same and live with duplicated data for now.
2. Patch spot-ml: this involves several changes, but it is doable.
a. You need to update ml_ops.sh. ml_ops.sh is the main script running the Spark
job; it receives as parameters the date you want to process, the type of data,
and the results you want to save. Since ml_ops.sh works with a date, you either
need to reorganize your data to follow a structure like
/user/<spot-user>/flow/hive/y=2017/m=03/d=09 so you can keep using this script,
or use ml_test.sh instead. That script is made just for testing data sets that
don’t have the structure I mentioned. If you go that way, you will need to
change some parameters that are hardcoded inside the script (it’s just a test
script) so that it saves results in a dynamic location, returns a specific
number of results, etc.
I’m talking about:
DSOURCE=$1
RAWDATA_PATH=$2
TOL=1.1
MAXRESULTS=20
HPATH=${HUSER}/${DSOURCE}/test/scored_results
b. Depending on the data type you want to implement (netflow, DNS queries,
proxy), you will have to map your columns to our existing columns.
For flow you can check this particular object:
https://github.com/apache/incubator-spot/blob/master/spot-ml/src/main/scala/org/apache/spot/netflow/FlowSchema.scala
There you will see the names of the columns we use, assigned to constants like
this:
val TimeReceived = "treceived"
val TimeReceivedField = StructField(TimeReceived, StringType, nullable = true)
Once you have mapped your columns to Spot’s, you can just change the value of
that String. For the output you need to preserve Spot’s names, so set the
column name in the StructField to the old column name. Your code should look
like this:
val TimeReceived = "mytimecolumn"
val TimeReceivedField = StructField("treceived", StringType, nullable = true)
Here the String constant holds your column name while the StructField keeps the
old name; again, that’s for the output.
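For option 1, the rename job could look roughly like the sketch below. It is
only an illustration, not code from spot-ml. It assumes Spark 2.x
(SparkSession), Parquet input and output, and that spot-ml can read Parquet
from the date-based layout mentioned in 2a. The object name RenameFlowColumns,
the helpers renameColumns and dayPath, and the column "mytimecolumn" are all
hypothetical; only "treceived" comes from FlowSchema as quoted above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical one-off job: copy an existing flow data set into the layout
// ml_ops.sh expects, renaming columns to Spot's names on the way.
object RenameFlowColumns {

  // Your column name (left) -> Spot's column name (right).
  // Only "treceived" is taken from FlowSchema; "mytimecolumn" is a placeholder.
  // Fill in the remaining pairs from FlowSchema.scala.
  val columnMapping: Map[String, String] = Map(
    "mytimecolumn" -> "treceived"
  )

  // Rename every mapped column. withColumnRenamed is a no-op when the source
  // column does not exist, so unmapped columns are left untouched.
  def renameColumns(df: DataFrame, mapping: Map[String, String]): DataFrame =
    mapping.foldLeft(df) { case (acc, (from, to)) =>
      acc.withColumnRenamed(from, to)
    }

  // Build the date-based directory used by ml_ops.sh,
  // following the y=YYYY/m=MM/d=DD pattern.
  def dayPath(base: String, year: Int, month: Int, day: Int): String =
    f"$base/y=$year%04d/m=$month%02d/d=$day%02d"

  def main(args: Array[String]): Unit = {
    // Expected arguments: <input path> <output base path> <year> <month> <day>
    // (fails with a MatchError if the argument count is different).
    val Array(input, outputBase, y, m, d) = args

    val spark = SparkSession.builder().appName("RenameFlowColumns").getOrCreate()
    try {
      val renamed = renameColumns(spark.read.parquet(input), columnMapping)
      renamed.write.parquet(dayPath(outputBase, y.toInt, m.toInt, d.toInt))
    } finally {
      spark.stop()
    }
  }
}
```

You would submit it once per day of data, passing the base directory
(/user/<spot-user>/flow/hive in the example above) as the output base path.
Note that this sketch only renames columns; if your column types differ from
the ones in FlowSchema, you would also need to cast them.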
I’m still trying to think of other places that need changes (for DNS and Proxy
it should be pretty much the same). I’ll write another email if I remember
something else, but that’s it for now.
Let me know how it goes.
Thanks.
On 3/8/17, 11:12 AM, "Giacomo Bernardi" <[email protected]> wrote:
Thanks,
I had seen a couple of references to the ODM in the Spot docs:
http://spot.incubator.apache.org/project-components/open-data-models/
https://github.com/apache/incubator-spot/blob/master/docs/open-data-model/open-data-model.md
but I got confused, as I didn't understand if this is actually used or it's
a future/planned feature. Can anyone clarify, please?
Thanks,
Giacomo
On 7 March 2017 at 16:53, Michael Ridley <[email protected]> wrote:
> Hi Giacomo-
>
> Don't have any advice on what you are trying to do, but I think the end
> goal is to have everything leverage the common data models in Spot. So I
> think the recommendation would be to figure out a way to convert your data
> to the common data model. But I don't think the Spot ML code actually
> leverages the common data model yet, so that's more of a future solution.
>
> If anyone knows better, feel free to correct me.
>
> Michael
>
> On Tue, Mar 7, 2017 at 10:57 AM, Giacomo Bernardi <[email protected]> wrote:
>
> > Hi,
> > let me ask a suggestion on how to proceed:
> >
> > I already have flow data stored in HDFS in Parquet files from an existing
> > netflow receiver system, but with different columns/schema than Spot. I'd
> > like to patch spot-ml and spot-oa to have them run directly on that data
> > without having to store everything twice.
> >
> > I'm still figuring out the parsing code, any hints on how I should do
> > this? Or, even better, how to do it in a sane/modular way that can be
> > useful for everyone?
> >
> > Thanks a lot!
> > Giacomo
> >
>
>
>
> --
> Michael Ridley <[email protected]>
> office: (650) 352-1337
> mobile: (571) 438-2420
> Senior Solutions Architect
> Cloudera, Inc.
>