Mayur,

LibSVM format sounds good to me. I could work on writing the tests if that helps you?

Anant

On Nov 11, 2014 11:06 AM, "Ashutosh [via Apache Spark Developers List]" wrote:
> Hi Mayur,
>
> Vector data types are implemented using the Breeze library; they live
> under .../org/apache/spark/mllib/linalg.
>
> Anant,
>
> One restriction I found is that a Vector can only hold Doubles, so it
> actually restricts the user. What are your thoughts on the LibSVM format?
>
> Thanks for the comments. I was just trying to get away from those
> increment/decrement functions; they look ugly. Points are noted, and I'll
> try to fix them soon. Tests are also required for the code.
>
> Regards,
> Ashutosh
>
> From: Mayur Rustagi [via Apache Spark Developers List]
> Sent: Saturday, November 8, 2014 12:52 PM
> To: Ashutosh Trivedi (MT2013030)
> Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection
>
> > We should take a vector instead, giving the user flexibility to decide
> > the data source/type.
>
> What do you mean by vector datatype exactly?
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
> On Wed, Nov 5, 2014 at 6:45 AM, slcclimber wrote:
>
> > Ashutosh,
> > I still see a few issues.
> >
> > 1. On line 112 you are counting using a counter. Since this will happen
> > inside an RDD, the counter will cause issues, and using a filter
> > function with a side effect is not good functional style. You could use
> > randomSplit instead, which does the same thing without the side effect.
> >
> > 2. The similar shared usage of j on line 102 is going to be an issue as
> > well. Also, the hash seed does not need to be sequential; it could be
> > randomly generated or hashed on the values.
> >
> > 3. The compute function and trim scores still run on a comma-separated
> > RDD. We should take a vector instead, giving the user flexibility to
> > decide the data source/type. What if we want data from Hive tables, or
> > in Parquet, JSON, or Avro formats? This is a very restrictive format.
> > With vectors the user has the choice of taking in whatever data format
> > and converting it to vectors, instead of reading JSON files, creating a
> > CSV file, and then working on that.
> >
> > 4. The similar use of counters on lines 54 and 65 is an issue.
> >
> > Basically, the shared-state counters are a huge issue that does not
> > scale: the processing of RDDs is distributed, while the value j lives
> > on the master.
> >
> > Anant
> >
> > On Tue, Nov 4, 2014 at 7:22 AM, "Ashutosh [via Apache Spark Developers List]" wrote:
> >
> > > Anant,
> > >
> > > I got rid of those increment/decrement functions and now the code is
> > > much cleaner. Please check; all your comments have been looked after.
> > >
> > > https://github.com/codeAshu/Outlier-Detection-with-AVF-Spark/blob/master/OutlierWithAVFModel.scala
> > >
> > > _Ashu
> > >
> > > From: slcclimber [via Apache Spark Developers List]
> > > Sent: Friday, October 31, 2014 10:09 AM
> > > To: Ashutosh Trivedi (MT2013030)
> > > Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection
> > >
> > > You should create a JIRA ticket to go with it as well.
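A minimal sketch of the randomSplit approach suggested in point 1 of the review above (Spark Scala; the weights, seed, and object name are illustrative, not taken from the code under discussion):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RandomSplitSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("randomSplit-sketch").setMaster("local[2]"))
    val data = sc.parallelize(1 to 100)
    // randomSplit decides each element's destination by per-element sampling,
    // so no mutable counter on the driver is shared across executors.
    val Array(train, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)
    println(s"train: ${train.count()}, test: ${test.count()}")
    sc.stop()
  }
}
```

Because the split is decided during sampling rather than by incrementing shared state, it avoids the problem the review describes: a counter captured in a filter closure is serialized to each executor, so the copies diverge and the side effect never reaches the driver.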
> > > Thanks
> > >
> > > On Oct 30, 2014 10:38 PM, "Ashutosh [via Apache Spark Developers List]" wrote:
> > >
> > > > Okay, I'll try it and post it soon with a test case. After that I
> > > > think we can go ahead with the PR.
> > > >
> > > > From: slcclimber [via Apache Spark Developers List]
> > > > Sent: Friday, October 31, 2014 10:03 AM
> > > > To: Ashutosh Trivedi (MT2013030)
> > > > Subject: Re: [MLlib] Contributing Algorithm for Outlier Detection
> > > >
> > > > Ashutosh,
> > > > A vector would be a good idea; vectors are used very frequently.
> > > > Test data is usually stored in the spark/data/mllib folder.
> > > >
> > > > On Oct 30, 2014 10:31 PM, "Ashutosh [via Apache Spark Developers List]" wrote:
> > > >
> > > > > Hi Anant,
> > > > > Sorry for my late reply, and thank you for taking the time to
> > > > > review it.
> > > > >
> > > > > I have a few comments on the first issue. You are correct on the
> > > > > string (CSV) part, but we cannot take input of the type you
> > > > > mentioned: we calculate the frequency in our function, and
> > > > > otherwise the user would have to do all this computation. I
> > > > > realize that taking an RDD[Vector] would be general enough for
> > > > > all. What do you say?
> > > > >
> > > > > I agree on all the rest of the issues. I will correct them soon
> > > > > and post it. I have a doubt on test cases: where should I put data
> > > > > while giving test scripts, or should I generate synthetic data for
> > > > > testing within the scripts? How does this work?
> > > > >
> > > > > Regards,
> > > > > Ashutosh
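The RDD[Vector] interface discussed in the thread could look roughly like the following AVF sketch, where each row is scored by the mean frequency of its attribute values and a low score marks a likely outlier. The object and method names are hypothetical; this is not the code under review:

```scala
import org.apache.spark.SparkContext._ // pair-RDD implicits (reduceByKey)
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

object AvfSketch {
  // Attribute Value Frequency: count how often each (column, value) pair
  // occurs, then score every row by the average frequency of its values.
  def score(data: RDD[Vector]): RDD[(Vector, Double)] = {
    val freqs: Map[(Int, Double), Long] = data
      .flatMap(v => v.toArray.zipWithIndex.map { case (x, i) => ((i, x), 1L) })
      .reduceByKey(_ + _)
      .collectAsMap()
      .toMap
    // Ship the frequency table to executors once instead of per task.
    val bFreqs = data.sparkContext.broadcast(freqs)
    data.map { v =>
      val vals = v.toArray.zipWithIndex
      val s = vals.map { case (x, i) => bFreqs.value((i, x)).toDouble }.sum / vals.length
      (v, s)
    }
  }
}
```

Note the assumption the thread raises: MLlib's Vector holds only Doubles, so categorical attributes would have to be encoded numerically before scoring.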
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9287.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
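For the LibSVM format proposed in the thread, MLlib already ships a loader; a small sketch (the sample path is the one bundled in the Spark repo's data/mllib folder, where the thread says test data usually lives):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.util.MLUtils

object LibSvmLoadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("libsvm-sketch").setMaster("local[2]"))
    // loadLibSVMFile parses "label index:value index:value ..." lines into
    // an RDD[LabeledPoint], so any source that can be converted to LibSVM
    // text (Hive, Parquet, JSON, Avro, CSV) feeds the same entry point.
    val points = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    points.take(5).foreach(p => println(s"label=${p.label} features=${p.features}"))
    sc.stop()
  }
}
```

This is the indirection the reviewers are arguing for: the algorithm consumes vectors, and the choice of on-disk format stays with the user.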