Hi Vinayak,

thanks for putting this down in detail!

I agree that this is a very interesting direction, and I think that we should 
steer this way.

Till

On Feb 8, 2012, at 10:35 AM, Vinayak Borkar wrote:

> Thanks Cezar/Mike.
> 
> Imagine we have a large number of relatively small XML documents stored on a 
> cluster of computers. One example is the EDGAR dataset 
> (http://edgar.sec.gov/), which contains the paperwork that public companies 
> need to file every quarter with the SEC, in XML format.
> 
> Initially, owing to the side-effect-free nature of the core XQuery language, 
> VXQuery could target analytics-style queries against such data while 
> harnessing the power (CPU and I/O) of multiple possibly multi-core processors.
> 
> An example query like,
> 
> count(
> for $d in collection('EDGAR')
> where $d/COMPANY_NAME = 'IBM'
> return $d
> )
> 
> which counts the number of documents that have IBM as the company name, would 
> be evaluated by running the FLWOR inside the count independently on each 
> machine that stores a part of the collection, computing a local count there. 
> Finally, the local counts from each leaf machine would be summed up in one 
> place to produce the result of the query.
> 
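
The evaluation strategy described above can be sketched roughly as follows. This is only an illustration in Python, not VXQuery or Hyracks code: the PARTITIONS data, the local_count and parallel_count helpers, and the use of a thread pool in place of separate machines are all assumptions made for the example. It also treats COMPANY_NAME as a single string, which simplifies XQuery's general comparison semantics.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for collection('EDGAR'): each inner list plays the
# role of the documents stored on one machine of the cluster.
PARTITIONS = [
    [{"COMPANY_NAME": "IBM"}, {"COMPANY_NAME": "Oracle"}],
    [{"COMPANY_NAME": "IBM"}, {"COMPANY_NAME": "IBM"}],
    [{"COMPANY_NAME": "Intel"}],
]


def local_count(documents, company):
    """Apply the FLWOR's where clause to one partition and count the matches."""
    return sum(1 for d in documents if d.get("COMPANY_NAME") == company)


def parallel_count(partitions, company):
    """Compute one local count per partition, then sum them in one place."""
    # Threads stand in for the leaf machines; each works on its own partition.
    with ThreadPoolExecutor() as pool:
        local_counts = list(
            pool.map(lambda part: local_count(part, company), partitions)
        )
    # Final aggregation step: combine the local counts into the query result.
    return sum(local_counts)


if __name__ == "__main__":
    print(parallel_count(PARTITIONS, "IBM"))
```

The split works because count is decomposable: no partition needs to see another partition's documents, so only the small local counts have to travel across the network.
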
> For the past three years Mike and I have been involved with another 
> Apache-licensed project called Hyracks (http://code.google.com/p/hyracks/) 
> that provides an efficient runtime for data-parallel tasks. What is more 
> interesting from the VXQuery point of view, however, is a logical algebra 
> abstraction in the Hyracks project called Algebricks.
> 
> Algebricks is an extended nested-relational algebra library that can be used 
> to express query semantics of a large set of declarative data processing 
> languages (to which XQuery belongs). The Algebricks framework automatically 
> optimizes the specified algebraic expression into a physical plan and 
> parallelizes it to use the Hyracks runtime. In fact, the goal of the 
> Algebricks platform was to provide a simple path for language implementors to 
> quickly get new declarative data languages running efficiently, in parallel, 
> on a shared-nothing cluster of machines.
> 
> The high-level tasks that we will have to complete to get VXQuery running on 
> a cluster would be:
> 
> 1. Build a translator that converts the existing XQuery AST object model that 
> is emitted by the parser into an Algebricks algebra expression.
> 2. Build an implementation of the Metadata interface needed by Algebricks 
> that helps the runtime resolve things like the location of base data.
> 3. Build an implementation of the runtime function call interface so the 
> actual function work is done by code that already exists in VXQuery, but is 
> invoked by the Hyracks runtime.
> 4. Implement serializers/deserializers for the various datamodel pieces in 
> VXQuery to be able to transport data across machines.
> 
> If someone is looking for a project in the context of GSoC, I think this 
> task of building a parallel XQuery engine could be an interesting one.
> 
> 
> Thoughts?
> 
> 
> Thanks,
> Vinayak
> 
> 
> On 02/08/2012 08:18 AM, Cezar Andrei wrote:
>> I like the idea, sounds very interesting.
>> Vinayak, will you put your thoughts in more detail and maybe make a list of
>> features that we can use for the GSoC list?
>> 
>> Cezar
>> 
>> On Wed, Feb 8, 2012 at 8:52 AM, Michael Carey<[email protected]>  wrote:
>> 
>>> This sounds like a great direction, and one that would be very interesting
>>> to the community!  (Except maybe Marklogic? :-))
>>> 
>>> Cheers,
>>> Mike
>>> 
>>> 
>>> 
>>> On 2/8/12 1:44 AM, Vinayak Borkar wrote:
>>> 
>>>> Guys,
>>>> 
>>>> 
>>>> Given that we are at a juncture where either we try to build out this
>>>> project and build a community OR remove the project from the incubator, I
>>>> have a proposal that I feel will help us get community interest in the
>>>> project.
>>>> 
>>>> My proposal is to slightly change the focus of the project to cater to
>>>> different use cases than originally proposed while still continuing to
>>>> build an XQuery processor.
>>>> 
>>>> In the original proposal, we proposed to build an XQuery processor to
>>>> target multiple input formats. In the beginning there was a lot of interest
>>>> in this direction from the mentors, but it seems to have quieted down recently.
>>>> In the meantime, people have been increasingly interested in processing
>>>> large amounts of data. To this end, I propose that we switch the focus of
>>>> the VXQuery project to target big XML data use cases.
>>>> 
>>>> In terms of work done, we get to reuse a majority of the code that has
>>>> already been built. In terms of the tasks to be done, the immediate focus
>>>> will be on parallelizing the existing codebase to be able to handle large
>>>> amounts of XML data.
>>>> 
>>>> I feel that this slight change in focus will be a fun challenge from a
>>>> development standpoint and also will help us gain a community given the
>>>> growing interest in big data processing.
>>>> 
>>>> Looking forward to your thoughts.
>>>> 
>>>> Thanks,
>>>> Vinayak
>>>> 
>>> 
>> 
> 
