Re: [ANNOUNCEMENT] A query system for BSP processing

Suraj Menon Fri, 07 Sep 2012 09:03:01 -0700

I think Thomas has a point. How about making it a sub-module/sub-project of
Hama for now? If/When it gains enough community support to make it a top
level project, you can fork it as a separate project.
I am not fully aware of the procedures and requirements for getting an
external project accepted as a sub-project.
We can look into it if you are ready to take this route.


> Could you please send me a link for setting up an open-source Apache
> project?
If I am right, this is what you are looking for:
http://incubator.apache.org/guides/proposal.html
http://incubator.apache.org/sitemap.html

Good luck,
Suraj

On Fri, Sep 7, 2012 at 11:40 AM, Thomas Jungblut
<[email protected]> wrote:

> Although I think this is a great project, I think that you will not meet
> the requirements.
> You need a community and a charter to get it into the incubation.
>
> What about hosting it on Github?
>
> 2012/9/7 Leonidas Fegaras <[email protected]>
>
> > Yes, this is a great idea. I have used Git on my own server but I don't
> > know how to do this for the ASF. Could you please send me a link for
> > setting up an open-source Apache project?
> >
> >
> > On 09/05/2012 10:51 AM, Edward J. Yoon wrote:
> >
> >> If you can open source this then I'm sure the ASF community can help
> >> you and make this software better.
> >>
> >> Please feel free to ask us if you need any assistance donating source
> >> code to the ASF or contributing to the Hama project in the future.
> >>
> >> On Thu, Aug 30, 2012 at 11:40 PM, Leonidas Fegaras<[email protected]>
> >>  wrote:
> >>
> >>> Yes, sure. I have fixed the bug with the repeat stopping condition but
> >>> I have only tested pagerank on my small cluster. I still need to fix
> >>> the k-means clustering (it's a special case because you improve a fixed
> >>> number of points).
> >>> Leonidas
> >>>
> >>>
> >>> On Aug 30, 2012, at 9:02 AM, Edward J. Yoon wrote:
> >>>
> >>>> Shall we work together?
> >>>>
> >>>> On Fri, Aug 24, 2012 at 9:01 PM, Leonidas Fegaras <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> Thank you very much for your interest and for testing my system.
> >>>>> It seems that my release was premature: it worked for some random
> >>>>> data but not for others. It's a minor logical error that I will try
> >>>>> to fix in the next few days. The problem is with the stopping
> >>>>> condition of the repeat expression that calculates the new pagerank
> >>>>> from the old. It must stop only when ALL peers reach the specified
> >>>>> precision. This is done by having those peers that need to continue
> >>>>> send a message to the others telling them to continue. It seems that
> >>>>> when all peers agree at the same time, the program works fine. But if
> >>>>> one finishes sooner, instead of continuing the repeat loop, it runs
> >>>>> away to the next BSP step that follows the repeat, then exits
> >>>>> prematurely and the system hangs. The casting errors are due to the
> >>>>> run-away peers executing the wrong BSP steps and reading the wrong
> >>>>> messages. Queries without repeat, though, are OK.
> >>>>> By the way, I had a problem exchanging large amounts of data during
> >>>>> sync (I discussed this with Thomas). My solution was to break a BSP
> >>>>> superstep into multiple substeps so that each substep handles at most
> >>>>> a fixed number of messages. Of course, my program has to collect all
> >>>>> messages in a vector in memory. When the vector is too big, it is
> >>>>> spilled to a local file. This moved the problem from the Hama side to
> >>>>> my side and allowed me to handle larger data, especially in joins. I
> >>>>> think this problem of exchanging large amounts of data during a
> >>>>> superstep is currently a weakness of Hama.
> >>>>> Leonidas
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 08/24/2012 04:15 AM, Thomas Jungblut wrote:
> >>>>>
> >>>>>>
> >>>>>> BTW, should we feature this on our website?
> >>>>>>
> >>>>>> 2012/8/24 Thomas Jungblut <[email protected]>
> >>>>>>
> >>>>>>> Hi Leonidas!
> >>>>>>>
> >>>>>>> I have to admit that I have known what is going on (and had to keep
> >>>>>>> silent), but I have to say: Thank you very much!
> >>>>>>> This will help many people write BSPs in an easier way.
> >>>>>>>
> >>>>>>> Of course this is not as fast as the native BSP code; Hive and Pig
> >>>>>>> suffer from the same problems in MR.
> >>>>>>> But it gives people the opportunity to develop faster and get their
> >>>>>>> code into production with just a minor time expense.
> >>>>>>>
> >>>>>>> And I think that we will gladly help you improve the BSP part of
> >>>>>>> your framework. At least I would ;)
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>> 2012/8/24 Edward J. Yoon<[email protected]>
> >>>>>>>
> >>>>>>>> Here are a few test results of mine on Oracle BDA (40G/s InfiniBand
> >>>>>>>> network).
> >>>>>>>>
> >>>>>>>> It seems slower than our PageRank example.
> >>>>>>>>
> >>>>>>>> P.S. There are some errors, so I couldn't test at large scale
> >>>>>>>> (java.lang.ClassCastException: hadoop.mrql.MR_int cannot be cast to
> >>>>>>>> hadoop.mrql.Inv and java.lang.Error: Cannot clear a non-materialized
> >>>>>>>> sequence ..., etc.).
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> == 100K nodes and 1M edges ==
> >>>>>>>>
> >>>>>>>> *** Using 10 BSP tasks (out of a max 10). Each task will handle
> >>>>>>>> about 2383611 bytes of input data.
> >>>>>>>>
> >>>>>>>> Run time: 30.384 secs
> >>>>>>>>
> >>>>>>>> *** Using 20 BSP tasks (out of a max 20). Each task will handle
> >>>>>>>> about 1191805 bytes of input data.
> >>>>>>>>
> >>>>>>>> Run time: 24.412 secs
> >>>>>>>>
> >>>>>>>> On Fri, Aug 24, 2012 at 9:36 AM, Edward J. Yoon
> >>>>>>>> <[email protected]>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Wow, very interesting. I'm going to install and test on my large
> >>>>>>>>> cluster.
> >>>>>>>>>
> >>>>>>>>> On Fri, Aug 24, 2012 at 4:41 AM, Leonidas Fegaras
> >>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> Dear Hama users,
> >>>>>>>>>> I am pleased to announce that the MRQL query processing system
> >>>>>>>>>> can now evaluate SQL-like queries on a Hama cluster. MRQL is
> >>>>>>>>>> available at:
> >>>>>>>>>>
> >>>>>>>>>> http://lambda.uta.edu/mrql/
> >>>>>>>>>>
> >>>>>>>>>> MRQL (the Map-Reduce Query Language) is an SQL-like query
> >>>>>>>>>> language for large-scale, distributed data analysis. MRQL is
> >>>>>>>>>> powerful enough to express most common data analysis tasks over
> >>>>>>>>>> many different kinds of raw data, including hierarchical data
> >>>>>>>>>> and nested collections, such as XML data. MRQL can run in two
> >>>>>>>>>> modes: in MR (Map-Reduce) mode using Apache Hadoop and in BSP
> >>>>>>>>>> (Bulk Synchronous Parallel) mode using Apache Hama. Both modes
> >>>>>>>>>> use Apache's HDFS to read and write their data.
> >>>>>>>>>>
> >>>>>>>>>> Note that the BSP mode is currently experimental (not fine-tuned
> >>>>>>>>>> yet) and lacks any fault tolerance (if an error occurs, the
> >>>>>>>>>> entire job must be restarted). Due to our limited resources,
> >>>>>>>>>> MRQL has only been tested on a small cluster (7 nodes/28 cores).
> >>>>>>>>>> We compared the BSP mode with the MR mode by evaluating a
> >>>>>>>>>> pagerank query over a small graph (100K nodes, 1M edges) and
> >>>>>>>>>> found that BSP mode is about 4.5 times faster than the MR mode.
> >>>>>>>>>> Please let me know if you'd like to contribute to this project
> >>>>>>>>>> by testing MRQL on a larger cluster.
> >>>>>>>>>> Best regards,
> >>>>>>>>>> Leonidas Fegaras
> >>>>>>>>>> University of Texas at Arlington
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Best Regards, Edward J. Yoon
> >>>>>>>>> @eddieyoon
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Best Regards, Edward J. Yoon
> >>>>>>>> @eddieyoon
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>> --
> >>>> Best Regards, Edward J. Yoon
> >>>> @eddieyoon
> >>>>
> >>>
> >>>
> >>
> >>
> >
>
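
Editorial sketch: the repeat stopping condition discussed in the quoted
message of Aug 24 (a peer may leave the loop only when ALL peers have reached
the precision) can be illustrated with a small lockstep simulation. This is
not MRQL or Hama code. The function name run_repeat, the shrink factor, and
the use of a per-peer "error" as a stand-in for the pagerank change are all
assumptions made for illustration.

```python
def run_repeat(initial_errors, precision, shrink=0.5, max_supersteps=100):
    """Simulate a BSP repeat loop that peers leave only by global agreement.

    initial_errors is a hypothetical per-peer starting error. Returns the
    number of supersteps each peer executed.
    """
    errors = list(initial_errors)
    steps = [0] * len(errors)
    for _ in range(max_supersteps):
        # Local compute phase: every peer refines its own partition.
        errors = [e * shrink for e in errors]
        steps = [s + 1 for s in steps]
        # Communication phase: a peer that has not yet converged sends a
        # "continue" vote to every peer, itself included.
        votes = [e > precision for e in errors]
        inbox = [sum(votes) for _ in errors]  # each peer receives all votes
        # After the barrier, each peer decides from its inbox rather than
        # from its own error, so every peer reaches the same decision.
        if all(count == 0 for count in inbox):
            break
    return steps
```

With run_repeat([1.0, 8.0], 0.6), the first peer converges after one
superstep but keeps looping until the second peer also converges, so both
peers execute the same number of supersteps and none runs ahead into the
step that follows the repeat.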

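
Editorial sketch: the workaround for large exchanges described in the quoted
message of Aug 24 (split one superstep into substeps with a bounded number of
messages, and spill the receiving vector to a local file when it grows too
large) might look roughly like this. It is a sketch under stated assumptions,
not the MRQL implementation: the names substeps and SpillingBuffer, the
limits, and the use of pickle for the spill format are hypothetical.

```python
import os
import pickle
import tempfile


def substeps(messages, max_per_substep):
    """Yield batches so that no substep carries more than max_per_substep."""
    for start in range(0, len(messages), max_per_substep):
        yield messages[start:start + max_per_substep]


class SpillingBuffer:
    """Collect received messages in memory and spill them to a local file."""

    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory
        self.memory = []
        self.spill = None   # temporary file, created on first spill
        self.spilled = 0    # number of messages written to the file

    def extend(self, batch):
        self.memory.extend(batch)
        if len(self.memory) > self.max_in_memory:
            if self.spill is None:
                self.spill = tempfile.TemporaryFile()
            for msg in self.memory:
                pickle.dump(msg, self.spill)
            self.spilled += len(self.memory)
            self.memory = []

    def __iter__(self):
        # Spilled messages arrived first, so replay them before the rest.
        if self.spill is not None:
            self.spill.seek(0)
            for _ in range(self.spilled):
                yield pickle.load(self.spill)
            self.spill.seek(0, os.SEEK_END)  # later spills keep appending
        yield from self.memory
```

A receiver would call extend once per substep and iterate over the buffer
after the last one. Iteration preserves arrival order, which matters if the
consumer (for example a join) assumes it.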