Please use the following snippet. I am still working on making it a generic vector, so that the input does not always have to be Vector[String], but String will work fine for now.
import org.apache.spark.SparkContext

def main(args: Array[String]) {
  val sc = new SparkContext("local", "OutlierDetection")
  // <your file path>
  val dir = "hdfs://localhost:54310/train3"
  // each input line is comma-separated; turn it into a Vector[String]
  val data = sc.textFile(dir).map(line => line.split(",").toVector)
  val model = OutlierWithAVFModel.outliers(data, 20, sc)
  model.score.saveAsTextFile("../scores")
  model.trimmed_data.saveAsTextFile(".../trimmed")
}

________________________________
From: Meethu Mathew-2 [via Apache Spark Developers List] <ml-node+s1001551n9352...@n3.nabble.com>
Sent: Friday, November 14, 2014 11:42 AM
To: Ashutosh Trivedi (MT2013030)
Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection

Hi,

I have a doubt regarding the input to your algorithm.

  val model = OutlierWithAVFModel.outliers(data: RDD[Vector[String]], percent: Double, sc: SparkContext)

Here our input data is an RDD[Vector[String]]. How can we create this RDD from a file? sc.textFile simply gives us an RDD[String]; how do we make each element a Vector[String]? Could you please share a code snippet of this conversion if you have one.

Regards,
Meethu Mathew

On Friday 14 November 2014 10:02 AM, Meethu Mathew wrote:
> Hi Ashutosh,
>
> Please edit the README file. I think the following function call has
> changed now:
>
>   model = OutlierWithAVFModel.outliers(master: String, input dir: String, percentage: Double)
>
> Regards,
>
> Meethu Mathew
> Engineer
> Flytxt
>
> On Friday 14 November 2014 12:01 AM, Ashutosh wrote:
>> Hi Anant,
>>
>> Please see the changes:
>> https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
>>
>> I have changed the input format to Vector of String. I think we can also
>> make it generic.
>>
>> Lines 59 & 72: that counter will not affect parallelism, since it only
>> works on one data point. It only does the indexing of the column.
>>
>> All other side effects have been removed.
>>
>> Thanks,
>>
>> Ashutosh
>>
>> ________________________________
>> From: slcclimber [via Apache Spark Developers List] <[hidden email]>
>> Sent: Tuesday, November 11, 2014 11:46 PM
>> To: Ashutosh Trivedi (MT2013030)
>> Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection
>>
>> Mayur,
>> LibSVM format sounds good to me. I could work on writing the tests if that
>> helps you?
>> Anant
>>
>> On Nov 11, 2014 11:06 AM, "Ashutosh [via Apache Spark Developers List]"
>> <[hidden email]> wrote:
>>
>> Hi Mayur,
>>
>> Vector data types are implemented using the Breeze library; they live under
>> .../org/apache/spark/mllib/linalg
>>
>> Anant,
>>
>> One restriction I found is that a Vector can only hold 'Double' values, so it
>> actually restricts the user.
>>
>> What are your thoughts on LibSVM format?
>>
>> Thanks for the comments. I was just trying to get away from those
>> increment/decrement functions; they look ugly. Points are noted. I'll try to
>> fix them soon. Tests are also required for the code.
>>
>> Regards,
>>
>> Ashutosh
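For reference, a minimal sketch of the LibSVM route (the path below is just Spark's bundled sample file, used here only for illustration): MLlib's loader already returns an RDD[LabeledPoint] whose features field is a Double-valued Vector, which is exactly the restriction mentioned above.

  import org.apache.spark.SparkContext
  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.mllib.regression.LabeledPoint
  import org.apache.spark.mllib.util.MLUtils
  import org.apache.spark.rdd.RDD

  val sc = new SparkContext("local", "LibSVMExample")
  // Each line "label index1:value1 index2:value2 ..." becomes a LabeledPoint;
  // its features are an org.apache.spark.mllib.linalg.Vector of Double.
  val points: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
  val features: RDD[Vector] = points.map(_.features)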
>> ________________________________
>> From: Mayur Rustagi [via Apache Spark Developers List] <ml-node+[hidden email]>
>> Sent: Saturday, November 8, 2014 12:52 PM
>> To: Ashutosh Trivedi (MT2013030)
>> Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection
>>
>>> We should take a vector instead, giving the user flexibility to decide the
>>> data source/type.
>> What do you mean by vector datatype exactly?
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>
>> On Wed, Nov 5, 2014 at 6:45 AM, slcclimber <[hidden email]> wrote:
>>
>>> Ashutosh,
>>> I still see a few issues.
>>> 1. On line 112 you are counting using a counter. Since this will happen in
>>> an RDD, the counter will cause issues. It is also not good functional style
>>> to use a filter function with a side effect.
>>> You could use randomSplit instead. It does the same thing without the
>>> side effect.
>>> 2. The similar shared usage of j in line 102 is going to be an issue as well.
>>> Also, the hash seed does not need to be sequential; it could be randomly
>>> generated or hashed on the values.
>>> 3. The compute function and trim scores still run on a comma-separated
>>> RDD. We should take a vector instead, giving the user flexibility to decide
>>> the data source/type. What if we want data from Hive tables or Parquet or
>>> JSON or Avro formats? This is a very restrictive format. With vectors the
>>> user has the choice of taking in whatever data format and converting it to
>>> vectors, instead of reading JSON files, creating a CSV file, and then
>>> working on that.
>>> 4. The similar use of counters in lines 54 and 65 is an issue.
>>> Basically the shared-state counters are a huge issue that does not scale,
>>> since the processing of RDDs is distributed and the value j lives on the
>>> master.
>>>
>>> Anant
>>>
>>> On Tue, Nov 4, 2014 at 7:22 AM, Ashutosh [via Apache Spark Developers List]
>>> <[hidden email]> wrote:
>>>
>>>> Anant,
>>>>
>>>> I got rid of those increment/decrement functions and now the code is much
>>>> cleaner. Please check; all your comments have been addressed.
>>>>
>>>> https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
>>>>
>>>> Ashu
>>>>
>>>> ------------------------------
>>>> From: slcclimber [via Apache Spark Developers List] <ml-node+[hidden email]>
>>>> Sent: Friday, October 31, 2014 10:09 AM
>>>> To: Ashutosh Trivedi (MT2013030)
>>>> Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection
>>>>
>>>> You should create a JIRA ticket to go with it as well.
>>>> Thanks
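On the shared-counter points in the review above, a rough sketch (not the actual patch) of the side-effect-free alternatives it mentions; the HDFS path is just the one from the example at the top of this mail.

  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  val sc = new SparkContext("local", "NoSharedCounters")
  // One Vector[String] per record, as in OutlierWithAVFModel
  val data: RDD[Vector[String]] =
    sc.textFile("hdfs://localhost:54310/train3").map(_.split(",").toVector)

  // Per-record indices without a mutable counter on the driver:
  // zipWithIndex assigns a stable Long index to every element.
  val indexed: RDD[(Long, Vector[String])] = data.zipWithIndex().map(_.swap)

  // Splitting without a stateful filter: randomSplit divides the RDD by the
  // given weights, reproducibly when a seed is supplied.
  val Array(kept, trimmed) = data.randomSplit(Array(0.8, 0.2), seed = 42L)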
>>>> On Oct 30, 2014 10:38 PM, "Ashutosh [via Apache Spark Developers List]"
>>>> <[hidden email]> wrote:
>>>>
>>>>> Okay. I'll try it and post it soon with a test case. After that I think
>>>>> we can go ahead with the PR.
>>>>>
>>>>> ------------------------------
>>>>> From: slcclimber [via Apache Spark Developers List] <ml-node+[hidden email]>
>>>>> Sent: Friday, October 31, 2014 10:03 AM
>>>>> To: Ashutosh Trivedi (MT2013030)
>>>>> Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection
>>>>>
>>>>> Ashutosh,
>>>>> A vector would be a good idea; vectors are used very frequently.
>>>>> Test data is usually stored in the spark/data/mllib folder.
>>>>>
>>>>> On Oct 30, 2014 10:31 PM, "Ashutosh [via Apache Spark Developers List]"
>>>>> <[hidden email]> wrote:
>>>>>
>>>>>> Hi Anant,
>>>>>> Sorry for my late reply. Thank you for taking the time to review it.
>>>>>>
>>>>>> I have a few comments on the first issue.
>>>>>>
>>>>>> You are correct about the string (CSV) part, but we cannot take input of
>>>>>> the type you mentioned: we calculate the frequency inside our function;
>>>>>> otherwise the user would have to do all of that computation. I realize
>>>>>> that taking an RDD[Vector] would be general enough for everyone. What do
>>>>>> you say?
>>>>>>
>>>>>> I agree on all the other issues. I will correct them soon and post an
>>>>>> update. I have a doubt about the test cases. Where should I put data for
>>>>>> the test scripts? Or should I generate synthetic data for testing within
>>>>>> the scripts? How does this work?
>>>>>>
>>>>>> Regards,
>>>>>> Ashutosh
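On the synthetic-data question, a bare-bones sketch of a self-contained test that generates its data inline instead of reading a file from spark/data/mllib. The call to OutlierWithAVFModel.outliers uses the signature quoted earlier in this thread; everything else (suite name, data, the one-score-per-row assertion) is illustrative.

  import org.apache.spark.SparkContext
  import org.scalatest.FunSuite

  class OutlierWithAVFModelSuite extends FunSuite {
    test("outliers returns a score for every input row") {
      val sc = new SparkContext("local", "OutlierWithAVFModelSuite")
      try {
        // Synthetic data generated inline: three frequent rows and one rare one.
        val rows = Seq(
          Vector("a", "x", "1"),
          Vector("a", "x", "1"),
          Vector("a", "y", "1"),
          Vector("b", "z", "9")  // the intended outlier
        )
        val data = sc.parallelize(rows)
        val model = OutlierWithAVFModel.outliers(data, 25, sc)
        // Assumes the model emits one score per input row.
        assert(model.score.count() === rows.size.toLong)
      } finally {
        sc.stop()
      }
    }
  }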