Re: [ANNOUNCEMENT] A query system for BSP processing

Thomas Jungblut Wed, 12 Sep 2012 05:23:13 -0700

Let's feature this project on our site and in our wiki.

2012/9/11 Leonidas Fegaras <[email protected]>


> I created a project on Github:
> https://github.com/fegaras/mrql.git
>
> Thank you for your help
> Leonidas Fegaras
>
>
> On Sep 7, 2012, at 11:20 AM, Thomas Jungblut wrote:
>
>> Yep, a subproject would be the alternative.
>> In this case we would give you PMC and committer rights so you can
>> actively work on that.
>> However this would make the mapreduce part more or less useless, so if you
>> want to go the hybrid way, feel free to submit an incubation request.
>>
>> 2012/9/7 Suraj Menon <[email protected]>
>>
>>> I think Thomas has a point. How about making it a sub-module/sub-project of
>>> Hama for now? If/when it gains enough community support to become a
>>> top-level project, you can fork it as a separate project.
>>> I am not completely aware of the procedures and requirements for getting
>>> an external project accepted as a sub-project.
>>> We can look into it if you are ready to take this route.
>>>
>>>> Could you please send me a link for setting up an open-source Apache
>>>> project?
>>>
>>> If I am right, this is what you are looking for:
>>> http://incubator.apache.org/guides/proposal.html
>>> http://incubator.apache.org/sitemap.html
>>>
>>> Good luck,
>>> Suraj
>>>
>>> On Fri, Sep 7, 2012 at 11:40 AM, Thomas Jungblut
>>> <[email protected]> wrote:
>>>
>>>> Although I think this is a great project, I think that you will not meet
>>>> the requirements.
>>>> You need a community and a charter to get it into incubation.
>>>>
>>>> What about hosting it on GitHub?
>>>>
>>>> 2012/9/7 Leonidas Fegaras <[email protected]>
>>>>
>>>>> Yes, this is a great idea. I have used GIT on my own server but I don't
>>>>> know how to do this for ASF. Could you please send me a link for setting
>>>>> up an open-source Apache project?
>>>>>
>>>>>
>>>>> On 09/05/2012 10:51 AM, Edward J. Yoon wrote:
>>>>>
>>>>>> If you can open source this then I'm sure the ASF community can help
>>>>>> you and make this software better.
>>>>>>
>>>>>> Pls feel free to ask us if you need any assistance donating source
>>>>>> code to the ASF or contributing to the Hama project in the future.
>>>>>>
>>>>>> On Thu, Aug 30, 2012 at 11:40 PM, Leonidas Fegaras <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Yes sure. I have fixed the bug with the repeat stopping condition but I
>>>>>>> have only tested pagerank on my small cluster. I still need to fix the
>>>>>>> k-means clustering (it's a special case because you improve a fixed
>>>>>>> number of points).
>>>>>>> Leonidas
>>>>>>>
>>>>>>>
>>>>>>> On Aug 30, 2012, at 9:02 AM, Edward J. Yoon wrote:
>>>>>>>
>>>>>>>> Shall we work together?
>>>>>>>>
>>>>>>>> On Fri, Aug 24, 2012 at 9:01 PM, Leonidas Fegaras <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thank you very much for your interest and for testing my system.
>>>>>>>>> It seems that my release was premature: it worked for some random data
>>>>>>>>> but didn't for some others. It's a minor logical error that I will try
>>>>>>>>> to fix in the next few days. The problem is with the stopping condition
>>>>>>>>> of the repeat expression that calculates the new pagerank from the old.
>>>>>>>>> It must stop only if ALL peers reach the specified precision. This is
>>>>>>>>> done by having those peers that need to continue send a message to the
>>>>>>>>> others to continue. It seems that now, when all peers agree at the same
>>>>>>>>> time, the program works fine. But if one finishes sooner, instead of
>>>>>>>>> continuing the repeat loop, it runs away to the next BSP step that
>>>>>>>>> follows the repeat, then exits prematurely and the system hangs. The
>>>>>>>>> casting errors are due to the run-away peers executing the wrong BSP
>>>>>>>>> steps and reading the wrong messages. Queries without repeat, though,
>>>>>>>>> are OK.
>>>>>>>>> By the way, I had a problem exchanging large amounts of data during sync
>>>>>>>>> (I discussed this with Thomas). My solution was to break a BSP superstep
>>>>>>>>> into multiple substeps so that each substep can handle a max number of
>>>>>>>>> messages. Of course my program has to collect all messages in a vector
>>>>>>>>> in memory. When the vector is too big, it is spilled to a local file.
>>>>>>>>> This moved the problem from the Hama side to my side and allowed me to
>>>>>>>>> handle larger data, especially in joins. I think this problem of
>>>>>>>>> exchanging large amounts of data during a superstep is currently a
>>>>>>>>> weakness of Hama.
>>>>>>>>> Leonidas
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 08/24/2012 04:15 AM, Thomas Jungblut wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> BTW, should we feature this on our website?
>>>>>>>>>>
>>>>>>>>>> 2012/8/24 Thomas Jungblut <[email protected]>
>>>>>>>>>>
>>>>>>>>>>> Hi Leonidas!
>>>>>>>>>>>
>>>>>>>>>>> I have to admit that I have known what is going on (and had to keep
>>>>>>>>>>> silent), but I have to say: Thank you very much!
>>>>>>>>>>> This will help many people write BSPs in an easier way.
>>>>>>>>>>>
>>>>>>>>>>> Of course this is not as fast as native BSP code; Hive and Pig suffer
>>>>>>>>>>> from the same problems in MR.
>>>>>>>>>>> But it gives people the opportunity to develop faster and get their
>>>>>>>>>>> code into production with just a minor time expense.
>>>>>>>>>>>
>>>>>>>>>>> And I think that we will gladly help you improve the BSP part of
>>>>>>>>>>> your framework. At least I would ;)
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>>
>>>>>>>>>>> 2012/8/24 Edward J. Yoon <[email protected]>
>>>>>>>>>>>
>>>>>>>>>>>> Here are a few test results on Oracle BDA (40G/s InfiniBand network).
>>>>>>>>>>>>
>>>>>>>>>>>> It seems slower than our PageRank example.
>>>>>>>>>>>>
>>>>>>>>>>>> P.S., There are some errors so I couldn't test large-scale.
>>>>>>>>>>>> (java.lang.ClassCastException: hadoop.mrql.MR_int cannot be cast to
>>>>>>>>>>>> hadoop.mrql.Inv and java.lang.Error: Cannot clear a non-materialized
>>>>>>>>>>>> sequence ..., etc.)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> == 100K nodes and 1M edges ==
>>>>>>>>>>>>
>>>>>>>>>>>> *** Using 10 BSP tasks (out of a max 10). Each task will handle about
>>>>>>>>>>>> 2383611 bytes of input data.
>>>>>>>>>>>>
>>>>>>>>>>>> Run time: 30.384 secs
>>>>>>>>>>>>
>>>>>>>>>>>> *** Using 20 BSP tasks (out of a max 20). Each task will handle about
>>>>>>>>>>>> 1191805 bytes of input data.
>>>>>>>>>>>>
>>>>>>>>>>>> Run time: 24.412 secs
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Aug 24, 2012 at 9:36 AM, Edward J. Yoon
>>>>>>>>>>>> <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Wow, very interesting. I'm going to install and test on my large
>>>>>>>>>>>>> cluster.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Aug 24, 2012 at 4:41 AM, Leonidas Fegaras
>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dear Hama users,
>>>>>>>>>>>>>> I am pleased to announce that the MRQL query processing system can now
>>>>>>>>>>>>>> evaluate SQL-like queries on a Hama cluster. MRQL is available at:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://lambda.uta.edu/mrql/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> MRQL (the Map-Reduce Query Language) is an SQL-like query language for
>>>>>>>>>>>>>> large-scale, distributed data analysis. MRQL is powerful enough to
>>>>>>>>>>>>>> express most common data analysis tasks over many different kinds of
>>>>>>>>>>>>>> raw data, including hierarchical data and nested collections, such as
>>>>>>>>>>>>>> XML data. MRQL can run in two modes: in MR (Map-Reduce) mode using
>>>>>>>>>>>>>> Apache Hadoop and in BSP (Bulk Synchronous Parallel) mode using Apache
>>>>>>>>>>>>>> Hama. Both modes use Apache's HDFS to read and write their data.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Note that the BSP mode is currently experimental (not fine-tuned yet)
>>>>>>>>>>>>>> and lacks any fault tolerance (if an error occurs, the entire job must
>>>>>>>>>>>>>> be restarted). Due to our limited resources, MRQL has only been tested
>>>>>>>>>>>>>> on a small cluster (7 nodes/28 cores). We compared the BSP mode with
>>>>>>>>>>>>>> the MR mode by evaluating a pagerank query over a small graph (100K
>>>>>>>>>>>>>> nodes, 1M edges) and found that BSP mode is about 4.5 times faster
>>>>>>>>>>>>>> than the MR mode. Please let me know if you'd like to contribute to
>>>>>>>>>>>>>> this project by testing MRQL on a larger cluster.
>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>> Leonidas Fegaras
>>>>>>>>>>>>>> University of Texas at Arlington
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Best Regards, Edward J. Yoon
>>>>>>>>>>>>> @eddieyoon
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Best Regards, Edward J. Yoon
>>>>>>>>>>>> @eddieyoon
>>>>>>>>>>>>
>>>>>>>>>>>> .
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> --
>>>>>>>> Best Regards, Edward J. Yoon
>>>>>>>> @eddieyoon
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>
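
A note on the repeat stopping condition described in Leonidas's Aug 24
message quoted above: the rule is that the loop may end only when ALL peers
have reached the precision, and peers that need to continue tell the others.
The sketch below is a plain-Python simulation of that rule, not MRQL or Hama
code. The function name, the per-peer error lists, and the choice to count
"continue" messages in an inbox are all assumptions made for illustration.
The point it tries to show is that every peer decides from the messages it
received after the sync, not from its own local result, so a peer that is
done early cannot run ahead to the next BSP step.

```python
def iterations_until_all_converged(local_errors, precision):
    """Simulate the repeat loop; local_errors[p][i] is peer p's error at iteration i.

    Hypothetical model, not MRQL or Hama code. A peer whose error list runs
    out keeps its last value.
    """
    num_peers = len(local_errors)
    iteration = 0
    while True:
        # Local phase: each peer compares its own error with the precision.
        needs_more = [errs[min(iteration, len(errs) - 1)] > precision
                      for errs in local_errors]
        # Message phase: a peer that needs more work tells every peer,
        # itself included.
        inbox = [0] * num_peers
        for sender in range(num_peers):
            if needs_more[sender]:
                for receiver in range(num_peers):
                    inbox[receiver] += 1
        # Barrier: after the sync each peer decides from its inbox only, so a
        # peer that is locally done stays in the loop if anyone asked.
        decisions = [count > 0 for count in inbox]
        assert len(set(decisions)) == 1, "peers disagreed on leaving the loop"
        iteration += 1
        if not decisions[0]:
            return iteration


# Peer 0 reaches the precision one iteration before peer 1,
# yet both leave the loop in the same iteration.
print(iterations_until_all_converged([[0.5, 0.01], [0.5, 0.3, 0.01]], precision=0.1))
```

In a real BSP job the inbox would be filled by the framework's message
exchange during the sync; here it is just a list of counters.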

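A note on the large-message workaround from Leonidas's Aug 24 message quoted
above: a superstep is broken into substeps that each carry a bounded number
of messages, and the receiver collects messages in memory and spills them to
a local file when the collection grows too big. The sketch below is a
minimal, single-process Python illustration of that idea, not the MRQL
implementation. The names substeps and SpillBuffer, the limits, and the use
of pickle with a temporary file are assumptions chosen for the example.

```python
import pickle
import tempfile


def substeps(messages, max_per_substep):
    """Split one superstep's outgoing messages into bounded batches."""
    for start in range(0, len(messages), max_per_substep):
        yield messages[start:start + max_per_substep]


class SpillBuffer:
    """Collect messages in memory; spill to a local temp file when too many."""

    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory
        self.memory = []
        self.spill_file = None
        self.spilled = 0  # number of messages written to the spill file

    def add(self, message):
        self.memory.append(message)
        if len(self.memory) > self.max_in_memory:
            self._spill()

    def _spill(self):
        if self.spill_file is None:
            self.spill_file = tempfile.TemporaryFile()
        self.spill_file.seek(0, 2)  # always append at the end of the file
        for message in self.memory:
            pickle.dump(message, self.spill_file)
        self.spilled += len(self.memory)
        self.memory = []

    def __iter__(self):
        # Spilled messages come first (they arrived first), then the rest.
        if self.spill_file is not None:
            self.spill_file.seek(0)
            for _ in range(self.spilled):
                yield pickle.load(self.spill_file)
        yield from self.memory


# Receive ten messages in substeps of four, keeping at most three in memory.
buffer = SpillBuffer(max_in_memory=3)
for batch in substeps(list(range(10)), max_per_substep=4):
    for message in batch:  # in a BSP job, a sync would separate the batches
        buffer.add(message)
print(list(buffer), buffer.spilled)
```

The trade-off matches the one described in the message: memory use on the
receiving side stays bounded, at the cost of extra syncs and local disk I/O.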