Please use the following snippet. I am still working on making it a generic vector, so that the input does not always have to be Vector[String], but String will work fine for now.
import org.apache.spark.SparkContext

def main(args: Array[String]) {
  val sc = new SparkContext("local", "OutlierDetection")
  // <your file path>
  val dir = "hdfs://localhost:54310/train3"
  // each input line is comma-separated; turn it into a Vector[String]
  val data = sc.textFile(dir).map(line => line.split(",").toVector)
  val model = OutlierWithAVFModel.outliers(data, 20, sc)
  model.score.saveAsTextFile("../scores")
  model.trimmed_data.saveAsTextFile(".../trimmed")
}

________________________________
From: Meethu Mathew-2 [via Apache Spark Developers List] <ml-node+s1001551n9352...@n3.nabble.com>
Sent: Friday, November 14, 2014 11:42 AM
To: Ashutosh Trivedi (MT2013030)
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection

Hi,

I have a doubt regarding the input to your algorithm.

  val model = OutlierWithAVFModel.outliers(data: RDD[Vector[String]], percent: Double, sc: SparkContext)

Here our input data is an RDD[Vector[String]]. How can we create this RDD from a file? sc.textFile simply gives us an RDD[String]; how do we make each element a Vector[String]? Could you please share a code snippet of this conversion if you have one.

Regards,
Meethu Mathew

On Friday 14 November 2014 10:02 AM, Meethu Mathew wrote:
> Hi Ashutosh,
>
> Please edit the README file. I think the following function call has
> changed now:
>
>   model = OutlierWithAVFModel.outliers(master: String, input dir: String, percentage: Double)
>
> Regards,
>
> Meethu Mathew
> Engineer
> Flytxt
>
> On Friday 14 November 2014 12:01 AM, Ashutosh wrote:
>> Hi Anant,
>>
>> Please see the changes:
>> https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
>>
>> I have changed the input format to Vector of String. I think we can also
>> make it generic.
>>
>> Lines 59 & 72: that counter will not affect parallelism, since it only
>> works on one data point. It only does the indexing of the column.
>>
>> All other side effects have been removed.
>>
>> Thanks,
>>
>> Ashutosh
>>
>> ________________________________
>> From: slcclimber [via Apache Spark Developers List] <[hidden email]>
>> Sent: Tuesday, November 11, 2014 11:46 PM
>> To: Ashutosh Trivedi (MT2013030)
>> Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection
>>
>> Mayur,
>> LibSVM format sounds good to me. I could work on writing the tests if that
>> helps you?
>> Anant
>>
>> On Nov 11, 2014 11:06 AM, "Ashutosh [via Apache Spark Developers List]"
>> <[hidden email]> wrote:
>>
>> Hi Mayur,
>>
>> Vector data types are implemented using the Breeze library; they live under
>> .../org/apache/spark/mllib/linalg
>>
>> Anant,
>>
>> One restriction I found is that a Vector can only hold 'Double' values, so it
>> actually restricts the user.
>>
>> What are your thoughts on LibSVM format?
>>
>> Thanks for the comments. I was just trying to get away from those
>> increment/decrement functions; they look ugly. Points are noted. I'll try to
>> fix them soon. Tests are also required for the code.
>>
>> Regards,
>>
>> Ashutosh
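For reference, a minimal sketch of the LibSVM route (the path below is just Spark's bundled sample file, used here only for illustration): MLlib's loader already returns an RDD[LabeledPoint] whose features field is a Double-valued Vector, which is exactly the restriction mentioned above.

  import org.apache.spark.SparkContext
  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.mllib.regression.LabeledPoint
  import org.apache.spark.mllib.util.MLUtils
  import org.apache.spark.rdd.RDD

  val sc = new SparkContext("local", "LibSVMExample")
  // Each line "label index1:value1 index2:value2 ..." becomes a LabeledPoint;
  // its features are an org.apache.spark.mllib.linalg.Vector of Double.
  val points: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
  val features: RDD[Vector] = points.map(_.features)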
>> ________________________________
>> From: Mayur Rustagi [via Apache Spark Developers List] <ml-node+[hidden email]>
>> Sent: Saturday, November 8, 2014 12:52 PM
>> To: Ashutosh Trivedi (MT2013030)
>> Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection
>>
>>> We should take a vector instead, giving the user flexibility to decide the
>>> data source/type.
>> What do you mean by vector datatype exactly?
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>
>> On Wed, Nov 5, 2014 at 6:45 AM, slcclimber <[hidden email]> wrote:
>>
>>> Ashutosh,
>>> I still see a few issues.
>>> 1. On line 112 you are counting using a counter. Since this will happen in
>>> an RDD, the counter will cause issues. It is also not good functional style
>>> to use a filter function with a side effect.
>>> You could use randomSplit instead. It does the same thing without the
>>> side effect.
>>> 2. The similar shared usage of j in line 102 is going to be an issue as well.
>>> Also, the hash seed does not need to be sequential; it could be randomly
>>> generated or hashed on the values.
>>> 3. The compute function and trim scores still run on a comma-separated
>>> RDD. We should take a vector instead, giving the user flexibility to decide
>>> the data source/type. What if we want data from Hive tables or Parquet or
>>> JSON or Avro formats? This is a very restrictive format. With vectors the
>>> user has the choice of taking in whatever data format and converting it to
>>> vectors, instead of reading JSON files, creating a CSV file, and then
>>> working on that.
>>> 4. The similar use of counters in lines 54 and 65 is an issue.
>>> Basically the shared-state counters are a huge issue that does not scale,
>>> since the processing of RDDs is distributed and the value j lives on the
>>> master.
>>>
>>> Anant
>>>
>>> On Tue, Nov 4, 2014 at 7:22 AM, Ashutosh [via Apache Spark Developers List]
>>> <[hidden email]> wrote:
>>>
>>>> Anant,
>>>>
>>>> I got rid of those increment/decrement functions and now the code is much
>>>> cleaner. Please check; all your comments have been addressed.
>>>>
>>>> https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
>>>>
>>>> Ashu
>>>>
>>>> ------------------------------
>>>> From: slcclimber [via Apache Spark Developers List] <ml-node+[hidden email]>
>>>> Sent: Friday, October 31, 2014 10:09 AM
>>>> To: Ashutosh Trivedi (MT2013030)
>>>> Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection
>>>>
>>>> You should create a JIRA ticket to go with it as well.
>>>> Thanks
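On the shared-counter points in the review above, a rough sketch (not the actual patch) of the side-effect-free alternatives it mentions; the HDFS path is just the one from the example at the top of this mail.

  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  val sc = new SparkContext("local", "NoSharedCounters")
  // One Vector[String] per record, as in OutlierWithAVFModel
  val data: RDD[Vector[String]] =
    sc.textFile("hdfs://localhost:54310/train3").map(_.split(",").toVector)

  // Per-record indices without a mutable counter on the driver:
  // zipWithIndex assigns a stable Long index to every element.
  val indexed: RDD[(Long, Vector[String])] = data.zipWithIndex().map(_.swap)

  // Splitting without a stateful filter: randomSplit divides the RDD by the
  // given weights, reproducibly when a seed is supplied.
  val Array(kept, trimmed) = data.randomSplit(Array(0.8, 0.2), seed = 42L)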
>>>> On Oct 30, 2014 10:38 PM, "Ashutosh [via Apache Spark Developers List]"
>>>> <[hidden email]> wrote:
>>>>
>>>>> Okay. I'll try it and post it soon with a test case. After that I think
>>>>> we can go ahead with the PR.
>>>>>
>>>>> ------------------------------
>>>>> From: slcclimber [via Apache Spark Developers List] <ml-node+[hidden email]>
>>>>> Sent: Friday, October 31, 2014 10:03 AM
>>>>> To: Ashutosh Trivedi (MT2013030)
>>>>> Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection
>>>>>
>>>>> Ashutosh,
>>>>> A vector would be a good idea; vectors are used very frequently.
>>>>> Test data is usually stored in the spark/data/mllib folder.
>>>>>
>>>>> On Oct 30, 2014 10:31 PM, "Ashutosh [via Apache Spark Developers List]"
>>>>> <[hidden email]> wrote:
>>>>>
>>>>>> Hi Anant,
>>>>>> Sorry for my late reply. Thank you for taking the time to review it.
>>>>>>
>>>>>> I have a few comments on the first issue.
>>>>>>
>>>>>> You are correct about the string (CSV) part, but we cannot take input of
>>>>>> the type you mentioned: we calculate the frequency inside our function;
>>>>>> otherwise the user would have to do all of that computation. I realize
>>>>>> that taking an RDD[Vector] would be general enough for everyone. What do
>>>>>> you say?
>>>>>>
>>>>>> I agree on all the other issues. I will correct them soon and post an
>>>>>> update. I have a doubt about the test cases. Where should I put data for
>>>>>> the test scripts? Or should I generate synthetic data for testing within
>>>>>> the scripts? How does this work?
>>>>>>
>>>>>> Regards,
>>>>>> Ashutosh
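On the synthetic-data question, a bare-bones sketch of a self-contained test that generates its data inline instead of reading a file from spark/data/mllib. The call to OutlierWithAVFModel.outliers uses the signature quoted earlier in this thread; everything else (suite name, data, the one-score-per-row assertion) is illustrative.

  import org.apache.spark.SparkContext
  import org.scalatest.FunSuite

  class OutlierWithAVFModelSuite extends FunSuite {
    test("outliers returns a score for every input row") {
      val sc = new SparkContext("local", "OutlierWithAVFModelSuite")
      try {
        // Synthetic data generated inline: three frequent rows and one rare one.
        val rows = Seq(
          Vector("a", "x", "1"),
          Vector("a", "x", "1"),
          Vector("a", "y", "1"),
          Vector("b", "z", "9")  // the intended outlier
        )
        val data = sc.parallelize(rows)
        val model = OutlierWithAVFModel.outliers(data, 25, sc)
        // Assumes the model emits one score per input row.
        assert(model.score.count() === rows.size.toLong)
      } finally {
        sc.stop()
      }
    }
  }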