Re: Reading custom flow data

Barona, Ricardo Thu, 09 Mar 2017 07:59:41 -0800

That’s right: ODM (Open Data Model) is planned for the future, and none of the 
Spot components are leveraging it for now.

To answer your question, Giacomo, I can think of only two solutions for now:

1. If your current data is not used as a data source by any other process, you 
can write a simple Spark job to transform and rename the columns, save the new 
data and delete your original data set. If you need to keep your original data, 
you can do the same and live with duplicated data for now.
2. Patch spot-ml: this involves several changes, but it is doable.
a. You need to update ml_ops.sh. ml_ops.sh is the main script running the Spark 
job; it receives as parameters the date you want to process, the type of data 
and the results you want to save. Since ml_ops.sh works with a date, one option 
is to reorganize your data to follow a structure like 
/user/<spot-user>/flow/hive/y=2017/m=03/d=09 so you can keep using this script.
Another option is to use ml_test.sh. This script is made just for testing data 
sets that do not have the structure I mentioned. If you go that way, you will 
need to change some parameters that are hardcoded inside the script (it’s just 
a test script) so that it saves results in a dynamic location, returns a 
specific number of results, etc.
I’m talking about:
DSOURCE=$1
RAWDATA_PATH=$2
TOL=1.1
MAXRESULTS=20
HPATH=${HUSER}/${DSOURCE}/test/scored_results
b. Depending on the data type you want to implement (netflow, DNS queries, 
proxy), you will have to map your columns to our existing columns.
For flow you can check this particular object: 
https://github.com/apache/incubator-spot/blob/master/spot-ml/src/main/scala/org/apache/spot/netflow/FlowSchema.scala

There you will see the names of the columns we use, assigned to constants like 
this:
val TimeReceived = "treceived"
val TimeReceivedField = StructField(TimeReceived, StringType, nullable = true)
Once you have mapped your columns to Spot’s, just change the value of that 
String to your column name. The output must preserve Spot’s names, so in the 
StructField you need to replace the reference to the constant with the old 
column name as a literal. Your code should look like this:

val TimeReceived = "mytimecolumn"
val TimeReceivedField = StructField("treceived", StringType, nullable = true)

Here the String constant has your column name while the StructField keeps the 
old name; again, that’s for the output.
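
For the script side of 2a, here is a minimal shell sketch of both routes. It is 
not the actual spot-ml code: the function names partition_path and 
test_settings, the RUN_LABEL variable, the extra positional arguments 3 to 5, 
the /user/spot and /data/my_flows paths, and the assumption that the date is 
given as YYYYMMDD are all mine, for illustration only.

```shell
#!/bin/bash
# Illustrative sketch, not the real ml_ops.sh / ml_test.sh.
# partition_path, test_settings and RUN_LABEL are hypothetical names.

# Route 1: build the date-partitioned directory layout mentioned above,
# e.g. <huser>/flow/hive/y=2017/m=03/d=09.
# Assumption: the date is passed as YYYYMMDD.
partition_path() {
    local huser="$1"
    local dsource="$2"
    local fdate="$3"
    # Reject anything that is not exactly eight digits.
    case "$fdate" in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) echo "expected YYYYMMDD, got '$fdate'" >&2; return 1 ;;
    esac
    local year="${fdate%????}"   # first four characters
    local rest="${fdate#????}"   # remaining MMDD
    echo "${huser}/${dsource}/hive/y=${year}/m=${rest%??}/d=${rest#??}"
}

# Route 2: ml_test.sh-style settings where the hardcoded values become
# optional arguments. Defaults are the values quoted in this email.
test_settings() {
    DSOURCE="$1"
    RAWDATA_PATH="$2"
    TOL="${3:-1.1}"
    MAXRESULTS="${4:-20}"
    RUN_LABEL="${5:-test}"       # hypothetical: sub-directory for this run
    HPATH="${HUSER}/${DSOURCE}/${RUN_LABEL}/scored_results"
}

# Example usage with made-up paths.
HUSER=/user/spot
partition_path "$HUSER" flow 20170309   # prints /user/spot/flow/hive/y=2017/m=03/d=09
test_settings flow /data/my_flows 0.5 100 run42
echo "$HPATH"                           # prints /user/spot/flow/run42/scored_results
```

Keeping the original values as defaults means the script behaves as before when 
called with only the two original arguments.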

I’m trying to think of what other places need changes (for DNS and Proxy it 
should be pretty much the same). I’ll write another email if I remember 
something else, but that’s it for now.
Let me know how it goes.

Thanks.

On 3/8/17, 11:12 AM, "Giacomo Bernardi" wrote:

    Thanks,
    I had seen a couple of references to the ODM in the Spot docs:

    http://spot.incubator.apache.org/project-components/open-data-models/

    https://github.com/apache/incubator-spot/blob/master/docs/open-data-model/open-data-model.md

    but I got confused, as I didn't understand if this is actually used or it's
    a future/planned feature. Can anyone clarify, please?

    Thanks,
    Giacomo

    On 7 March 2017 at 16:53, Michael Ridley wrote:

    > Hi Giacomo-
    >
    > Don't have any advice on what you are trying to do, but I think the end
    > goal is to have everything leverage the common data models in Spot.  So I
    > think the recommendation would be to figure out a way to convert your data
    > to the common data model.  But I don't think the Spot ML code actually
    > leverages the common data model yet, so that's more of a future solution.
    >
    > If anyone knows better, feel free to correct me.
    >
    > Michael
    >
    > On Tue, Mar 7, 2017 at 10:57 AM, Giacomo Bernardi wrote:
    >
    > > Hi,
    > > let me ask for a suggestion on how to proceed:
    > >
    > > I already have flow data stored in HDFS in Parquet files from an existing
    > > netflow receiver system, but with different columns/schema than Spot. I'd
    > > like to patch spot-ml and spot-oa to have them run directly on that data
    > > without having to store everything twice.
    > >
    > > I'm still figuring out the parsing code, any hints on how I should do this?
    > > Or, even better, how to do it in a sane/modular way that can be useful for
    > > everyone?
    > >
    > > Thanks a lot!
    > > Giacomo
    > >
    >
    >
    >
    > --
    > Michael Ridley
    > office: (650) 352-1337
    > mobile: (571) 438-2420
    > Senior Solutions Architect
    > Cloudera, Inc.
    >
