Hi Vinayak, thanks for putting this down in detail!
I agree that this is a very interesting direction, and I think that we should steer this way.

Till

On Feb 8, 2012, at 10:35 AM, Vinayak Borkar wrote:

> Thanks Cezar/Mike.
>
> Imagine we have a large number of relatively small XML documents stored on a
> cluster of computers. For example, documents from the EDGAR dataset
> (http://edgar.sec.gov/) which is all paperwork that public companies need to
> file every quarter with the SEC, in XML format.
>
> Initially, owing to the side-effect-free nature of the core XQuery language,
> VXQuery could target analytics-style queries against such data while
> harnessing the power (CPU and I/O) of multiple possibly multi-core processors.
>
> An example query like,
>
> count(
>     for $d in collection('EDGAR')
>     where $d/COMPANY_NAME = 'IBM'
>     return $d
> )
>
> which counts the number of documents that have IBM as the company name would
> be evaluated by essentially running the FLWOR inside the count independently
> on each machine that stores a part of the entire collection and compute local
> counts. Finally the local counts from each leaf machine would be summed up in
> one place to produce the result of the query.
>
> For the past three years Mike and I have been involved with another Apache
> Licensed project called Hyracks (http://code.google.com/p/hyracks/) that
> provides an efficient runtime for data-parallel tasks. But what is more
> interesting from the VXQuery point of view is a logical algebra abstraction
> in the Hyracks project called Algebricks.
>
> Algebricks is an extended nested-relational algebra library that can be used
> to express query semantics of a large set of declarative data processing
> languages (to which XQuery belongs). The Algebricks framework automatically
> optimizes the specified algebraic expression into a physical plan and
> parallelizes it to use the Hyracks runtime.
> In fact, the goal of the
> Algebricks platform was to provide a simple path for language implementors to
> quickly get new declarative data languages running efficiently, in parallel,
> on a shared-nothing cluster of machines.
>
> The high-level tasks that we will have to complete to get VXQuery running on
> a cluster would be:
>
> 1. Build a translator that converts the existing XQuery AST object model that
> is emitted by the parser into an Algebricks algebra expression.
> 2. Build an implementation of the Metadata interface needed by algebricks
> that help the runtime resolve things like location of base data.
> 3. Build an implementation of the runtime function call interface so the
> actual function work is done by code that already exists in VXQuery, but is
> invoked by the Hyracks runtime.
> 4. Implement serializers/deserializers for the various datamodel pieces in
> VXQuery to be able to transport data across machines.
>
> If someone is looking for a project in the context of GSoC, I can see this
> task of building a parallel XQuery engine could be an interesting one.
>
>
> Thoughts?
>
>
> Thanks,
> Vinayak
>
>
> On 02/08/2012 08:18 AM, Cezar Andrei wrote:
>> I like the idea, sounds very interesting.
>> Vinayak, will you put your thoughts in more detail and maybe make a list of
>> features that we can use for the GSOC list?
>>
>> Cezar
>>
>> On Wed, Feb 8, 2012 at 8:52 AM, Michael Carey<[email protected]> wrote:
>>
>>> This sounds like a great direction, and one that would be very interesting
>>> to the community! (Except maybe Marklogic? :-))
>>>
>>> Cheers,
>>> Mike
>>>
>>>
>>>
>>> On 2/8/12 1:44 AM, Vinayak Borkar wrote:
>>>
>>>> Guys,
>>>>
>>>>
>>>> Given that we are at a juncture where either we try to build out this
>>>> project and build a community OR remove the project from the incubator, I
>>>> have a proposal that I feel will help us get community interest in the
>>>> project.
>>>>
>>>> My proposal is to slightly change the focus of the project to cater to
>>>> different use cases than originally proposed while still continuing to
>>>> build an XQuery processor.
>>>>
>>>> In the original proposal, we proposed to build an XQuery processor to
>>>> target multiple input formats. In the beginning there was a lot of interest
>>>> in this direction from the mentors which has seemed to quiet down recently.
>>>> In the meantime, people have been increasingly interested in processing
>>>> large amounts of data. To this end, I propose that we switch the focus of
>>>> the VXQuery project to target big XML data use cases.
>>>>
>>>> In terms of work done, we get to reuse a majority of the code that has
>>>> already been built. In terms of the tasks to be done, the immediate focus
>>>> will be on parallelizing the existing codebase to be able to handle large
>>>> amounts of XML data.
>>>>
>>>> I feel that this slight change in focus will be a fun challenge from a
>>>> development standpoint and also will help us gain a community given the
>>>> growing interest in Big data processing.
>>>>
>>>> Looking forward to your thoughts.
>>>>
>>>> Thanks,
>>>> Vinayak
>>>>
>>>
>>
>
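
For reference, the evaluation strategy described in the quoted message for the count(...) query (run the filter independently on each machine, compute local counts, then sum them in one place) can be sketched roughly as below. This is a plain-Python simulation only, not VXQuery, Algebricks, or Hyracks code. The partition layout, the FILING wrapper element, the sample documents, and the names local_count and parallel_count are all hypothetical, and threads stand in for the separate machines.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a collection spread over three machines:
# each inner list holds the documents stored on one machine.
PARTITIONS = [
    ["<FILING><COMPANY_NAME>IBM</COMPANY_NAME></FILING>",
     "<FILING><COMPANY_NAME>Oracle</COMPANY_NAME></FILING>"],
    ["<FILING><COMPANY_NAME>IBM</COMPANY_NAME></FILING>"],
    ["<FILING><COMPANY_NAME>Intel</COMPANY_NAME></FILING>",
     "<FILING><COMPANY_NAME>IBM</COMPANY_NAME></FILING>"],
]


def local_count(documents, company):
    """Run the filter over one partition and return its local count."""
    matches = 0
    for text in documents:
        root = ET.fromstring(text)
        # Loosely follows the where clause of the example query: a document
        # matches if any COMPANY_NAME child equals the requested name.
        if any(e.text == company for e in root.findall("COMPANY_NAME")):
            matches += 1
    return matches


def parallel_count(partitions, company):
    """Compute local counts independently, then sum them in one place."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        local_counts = list(
            pool.map(lambda docs: local_count(docs, company), partitions)
        )
    return sum(local_counts)


if __name__ == "__main__":
    print(parallel_count(PARTITIONS, "IBM"))  # prints 3
```

Only the per-partition integers cross the "machine" boundary here, which is the point of the local-count step: the documents themselves never have to be moved to the place where the final sum is computed.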
