Let's feature this project on our site and in our wiki.

2012/9/11 Leonidas Fegaras <[email protected]>
> I created a project on Github:
> https://github.com/fegaras/mrql.git
>
> Thank you for your help
> Leonidas Fegaras
>
> On Sep 7, 2012, at 11:20 AM, Thomas Jungblut wrote:
>
>> Yep, a subproject would be the alternative.
>> In this case we would give you PMC and committer rights so you can
>> actively work on that.
>> However this would make the mapreduce part more or less useless, so
>> if you want to go the hybrid way, feel free to submit an incubation
>> request.
>>
>> 2012/9/7 Suraj Menon <[email protected]>
>>
>>> I think Thomas has a point. How about making it a
>>> sub-module/sub-project of Hama for now? If/When it gains enough
>>> community support to make it a top level project, you can fork it
>>> as a separate project.
>>> I am not completely aware of the procedures and requirements for
>>> getting an external project as a sub-project.
>>> We can look into it if you are ready to take this route.
>>>
>>>> Could you please send me a link for setting up an open-source
>>>> Apache project?
>>>
>>> If I am right this is what you are looking for -
>>> http://incubator.apache.org/guides/proposal.html
>>> http://incubator.apache.org/sitemap.html
>>>
>>> Good luck,
>>> Suraj
>>>
>>> On Fri, Sep 7, 2012 at 11:40 AM, Thomas Jungblut
>>> <[email protected]> wrote:
>>>
>>>> Although I think this is a great project, I think that you will
>>>> not meet the requirements.
>>>> You need a community and a charter to get it into the incubation.
>>>>
>>>> What about hosting it on Github?
>>>>
>>>> 2012/9/7 Leonidas Fegaras <[email protected]>
>>>>
>>>>> Yes, this is a great idea. I have used GIT on my own server but
>>>>> I don't know how to do this for ASF. Could you please send me a
>>>>> link for setting up an open-source Apache project?
>>>>>
>>>>> On 09/05/2012 10:51 AM, Edward J. Yoon wrote:
>>>>>
>>>>>> If you can open source this then I'm sure the ASF community can
>>>>>> help you and make this software better.
>>>>>>
>>>>>> Pls feel free to ask us if you need any assistance donating
>>>>>> source code to the ASF or contributing to the Hama project in
>>>>>> the future.
>>>>>>
>>>>>> On Thu, Aug 30, 2012 at 11:40 PM, Leonidas Fegaras
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Yes sure. I have fixed the bug with the repeat stopping
>>>>>>> condition but I have only tested pagerank on my small cluster.
>>>>>>> I still need to fix the k-means clustering (it's a special
>>>>>>> case because you improve a fixed number of points).
>>>>>>> Leonidas
>>>>>>>
>>>>>>> On Aug 30, 2012, at 9:02 AM, Edward J. Yoon wrote:
>>>>>>>
>>>>>>>> Shall we work together?
>>>>>>>>
>>>>>>>> On Fri, Aug 24, 2012 at 9:01 PM, Leonidas Fegaras
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thank you very much for your interest and for testing my
>>>>>>>>> system. It seems that my release was premature: It worked
>>>>>>>>> for some random data but didn't for some others. It's a
>>>>>>>>> minor logical error that I will try to fix in the next few
>>>>>>>>> days. The problem is with the stopping condition of the
>>>>>>>>> repeat expression that calculates the new pagerank from the
>>>>>>>>> old. It must stop if ALL peers reach the specified
>>>>>>>>> precision. This is done by having those peers that need to
>>>>>>>>> continue send a message to others to continue. It seems
>>>>>>>>> that now when all peers agree at the same time, the program
>>>>>>>>> works fine. But if one finishes sooner, instead of
>>>>>>>>> continuing the repeat loop, it runs away to the next BSP
>>>>>>>>> step that follows the repeat, then exits prematurely and
>>>>>>>>> the system hangs. The casting errors are due to the
>>>>>>>>> run-away peers executing the wrong BSP steps reading wrong
>>>>>>>>> messages. Queries without repeat though are OK.
>>>>>>>>> By the way, I had a problem exchanging a large amount of
>>>>>>>>> data during sync (I discussed this with Thomas). My
>>>>>>>>> solution was to break a BSP superstep into multiple
>>>>>>>>> substeps so that each substep can handle a max number of
>>>>>>>>> messages. Of course my program has to collect all messages
>>>>>>>>> in a vector in memory. When the vector is too big, it is
>>>>>>>>> spilled in a local file. This moved the problem from the
>>>>>>>>> Hama side to my side and allowed me to handle larger data,
>>>>>>>>> especially in joins. I think this problem of exchanging a
>>>>>>>>> large amount of data during a superstep is currently a
>>>>>>>>> weakness of Hama.
>>>>>>>>> Leonidas
>>>>>>>>>
>>>>>>>>> On 08/24/2012 04:15 AM, Thomas Jungblut wrote:
>>>>>>>>>
>>>>>>>>>> BTW, should we feature this on our website?
>>>>>>>>>>
>>>>>>>>>> 2012/8/24 Thomas Jungblut <[email protected]>
>>>>>>>>>>
>>>>>>>>>>> Hi Leonidas!
>>>>>>>>>>>
>>>>>>>>>>> I have to admit that I have known what is going on (and
>>>>>>>>>>> had to keep silent), but I have to say: Thank you very
>>>>>>>>>>> much!
>>>>>>>>>>> This will help many people writing BSPs in an easier way.
>>>>>>>>>>>
>>>>>>>>>>> Of course this is not as fast as the native BSP code,
>>>>>>>>>>> Hive and Pig suffer from the same problems in MR.
>>>>>>>>>>> But it gives people the opportunity to develop faster and
>>>>>>>>>>> get their code in production with just a minor time
>>>>>>>>>>> expense.
>>>>>>>>>>>
>>>>>>>>>>> And I think that we will gladly help you improve the BSP
>>>>>>>>>>> part of your framework. At least I would ;)
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>>
>>>>>>>>>>> 2012/8/24 Edward J. Yoon <[email protected]>
>>>>>>>>>>>
>>>>>>>>>>>> Here are my few test results on Oracle BDA (40G/s
>>>>>>>>>>>> infiniband network).
>>>>>>>>>>>>
>>>>>>>>>>>> It seems slower than our PageRank example.
>>>>>>>>>>>>
>>>>>>>>>>>> P.S., There are some errors so I couldn't test
>>>>>>>>>>>> large-scale.
>>>>>>>>>>>> (java.lang.ClassCastException: hadoop.mrql.MR_int cannot
>>>>>>>>>>>> be cast to hadoop.mrql.Inv and java.lang.Error: Cannot
>>>>>>>>>>>> clear a non-materialized sequence ..., etc.)
>>>>>>>>>>>>
>>>>>>>>>>>> == 100K nodes and 1M edges ==
>>>>>>>>>>>>
>>>>>>>>>>>> *** Using 10 BSP tasks (out of a max 10). Each task will
>>>>>>>>>>>> handle about 2383611 bytes of input data.
>>>>>>>>>>>>
>>>>>>>>>>>> Run time: 30.384 secs
>>>>>>>>>>>>
>>>>>>>>>>>> *** Using 20 BSP tasks (out of a max 20). Each task will
>>>>>>>>>>>> handle about 1191805 bytes of input data.
>>>>>>>>>>>>
>>>>>>>>>>>> Run time: 24.412 secs
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Aug 24, 2012 at 9:36 AM, Edward J. Yoon
>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Wow, very interesting. I'm going to install and test on
>>>>>>>>>>>>> my large cluster.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Aug 24, 2012 at 4:41 AM, Leonidas Fegaras
>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dear Hama users,
>>>>>>>>>>>>>> I am pleased to announce that the MRQL query
>>>>>>>>>>>>>> processing system can now evaluate SQL-like queries on
>>>>>>>>>>>>>> a Hama cluster. MRQL is available at:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://lambda.uta.edu/mrql/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> MRQL (the Map-Reduce Query Language) is an SQL-like
>>>>>>>>>>>>>> query language for large-scale, distributed data
>>>>>>>>>>>>>> analysis. MRQL is powerful enough to express most
>>>>>>>>>>>>>> common data analysis tasks over many different kinds
>>>>>>>>>>>>>> of raw data, including hierarchical data and nested
>>>>>>>>>>>>>> collections, such as XML data. MRQL can run in two
>>>>>>>>>>>>>> modes: in MR (Map-Reduce) mode using Apache Hadoop and
>>>>>>>>>>>>>> in BSP (Bulk Synchronous Parallel) mode using Apache
>>>>>>>>>>>>>> Hama. Both modes use Apache's HDFS to read and write
>>>>>>>>>>>>>> their data.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Note that the BSP mode is currently experimental (not
>>>>>>>>>>>>>> fine-tuned yet) and lacks any fault-tolerance (if an
>>>>>>>>>>>>>> error occurs, the entire job must be restarted). Due
>>>>>>>>>>>>>> to our limited resources, MRQL has only been tested on
>>>>>>>>>>>>>> a small cluster (7 nodes/28 cores). We compared the
>>>>>>>>>>>>>> BSP mode with the MR mode by evaluating a pagerank
>>>>>>>>>>>>>> query over a small graph (100K nodes, 1M edges) and
>>>>>>>>>>>>>> found that BSP mode is about 4.5 times faster than the
>>>>>>>>>>>>>> MR mode. Please let me know if you'd like to
>>>>>>>>>>>>>> contribute to this project by testing MRQL on a larger
>>>>>>>>>>>>>> cluster.
>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>> Leonidas Fegaras
>>>>>>>>>>>>>> University of Texas at Arlington
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Best Regards, Edward J. Yoon
>>>>>>>>>>>>> @eddieyoon
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Best Regards, Edward J. Yoon
>>>>>>>>>>>> @eddieyoon
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best Regards, Edward J. Yoon
>>>>>>>> @eddieyoon
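
The stopping rule Leonidas describes for the repeat loop (the loop ends only when ALL peers have reached the precision, and peers that need more iterations tell the others to continue) can be sketched roughly as below. This is a plain Python simulation, not MRQL or Hama code: `broadcast_continue_votes`, `run_repeat`, and the per-round error lists are hypothetical names and data invented for illustration, and the barrier sync is modelled by a simple list.

```python
from typing import List


def broadcast_continue_votes(needs_more: List[bool]) -> List[int]:
    """Simulate one barrier sync (assumed model, not the Hama API).

    Every peer that still needs another iteration sends one "continue"
    message to every peer, itself included. The result is the number of
    messages found in each peer's inbox after the sync.
    """
    senders = sum(needs_more)
    return [senders for _ in needs_more]


def run_repeat(errors_by_round: List[List[float]], precision: float) -> int:
    """Return how many rounds run before all peers leave the loop together.

    errors_by_round[r][p] is the (made-up) local error of peer p after
    round r. At least one peer is assumed.
    """
    completed = 0
    for errors in errors_by_round:
        completed += 1
        # Local test: has this peer reached the requested precision yet?
        needs_more = [err > precision for err in errors]
        inboxes = broadcast_continue_votes(needs_more)
        # Every peer applies the same rule to its inbox, so a peer that
        # converged early keeps looping while any other peer asks to continue.
        decisions = [count > 0 for count in inboxes]
        assert len(set(decisions)) == 1, "peers must agree on continuing"
        if not decisions[0]:
            break
    return completed


if __name__ == "__main__":
    rounds = [
        [0.5, 0.5, 0.5],
        [0.01, 0.3, 0.2],    # peer 0 converges first but must keep going
        [0.001, 0.01, 0.2],
        [0.0, 0.0, 0.001],   # now every peer is within the precision
    ]
    print(run_repeat(rounds, precision=0.05))
```

The point of the sketch is that the exit decision is taken from the inbox after the sync rather than from the local test alone, so no peer can run ahead into the BSP step that follows the repeat.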

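
The workaround for large exchanges mentioned in the thread (split one superstep into substeps with a bounded number of messages each, collect received messages in memory, and spill to a local file when the in-memory vector grows too big) might look something like the following. This is only a sketch under assumptions: `substeps`, `SpillingBuffer`, and both size limits are hypothetical, and Python's `pickle` and `tempfile` stand in for whatever serialization and local storage MRQL actually uses.

```python
import pickle
import tempfile
from typing import Any, Iterator, List


def substeps(messages: List[Any], max_per_substep: int) -> Iterator[List[Any]]:
    """Split one superstep's outgoing messages into bounded batches."""
    for start in range(0, len(messages), max_per_substep):
        yield messages[start:start + max_per_substep]


class SpillingBuffer:
    """Collect received messages, spilling to a local temp file when full."""

    def __init__(self, max_in_memory: int) -> None:
        self.max_in_memory = max_in_memory
        self.in_memory: List[Any] = []
        self.spill_file = tempfile.TemporaryFile()  # binary read/write file
        self.spilled = 0  # number of messages written to the file so far

    def add(self, message: Any) -> None:
        self.in_memory.append(message)
        if len(self.in_memory) > self.max_in_memory:
            # Always append at the end of the file, even after a read.
            self.spill_file.seek(0, 2)
            for item in self.in_memory:
                pickle.dump(item, self.spill_file)
            self.spilled += len(self.in_memory)
            self.in_memory.clear()

    def __iter__(self) -> Iterator[Any]:
        # Replay spilled messages first, then the ones still in memory,
        # which preserves arrival order.
        self.spill_file.seek(0)
        for _ in range(self.spilled):
            yield pickle.load(self.spill_file)
        yield from self.in_memory


if __name__ == "__main__":
    buffer = SpillingBuffer(max_in_memory=3)
    for batch in substeps(list(range(10)), max_per_substep=4):
        # In a real BSP job a sync would separate consecutive batches.
        for message in batch:
            buffer.add(message)
    print(list(buffer), buffer.spilled)
```

Bounding each substep caps how much data one sync has to carry, while the spill file caps the receiver's memory use; the cost is extra syncs and local disk I/O.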