[ 
https://issues.apache.org/jira/browse/MRQL-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736407#comment-13736407
 ] 

Leonidas Fegaras commented on MRQL-12:
--------------------------------------

I think the Spark extension is ready to be incorporated into MRQL as a patch. I 
am attaching the latest patch, MRQL-12.2.patch, and the Spark evaluator next. 
Please let me know by tomorrow (Monday, August 12) if you approve this patch.

I have tested it on a laptop with 8 cores and on a 16-node cluster with 64 
cores, using the MRQL pagerank and kmeans queries. I also ran the same queries 
on Hama 0.5.0. I still need to do more experiments. Here are the run-times in 
secs (e.g., 500K/2M means 500000 nodes and 2000000 edges):

On laptop with 8 cores:
                      Hama   Spark
Pagerank 500K/2M:      211     341
KMeans   1M:            31      22
KMeans   2M:            41      40
KMeans   4M:           165      77

On cluster with 64 cores:
                      Hama   Spark
Pagerank 1M/10M:      3590     428
KMeans   10M:           87      82
KMeans   20M:          129     134

On cluster with 32 cores:
                      Hama   Spark
Pagerank 1M/10M:      4419     434
KMeans   10M:           98      74
KMeans   20M:          273      74

To compile it, first install Scala and Spark (see the Spark web site), edit 
conf/mrql-env.sh to match your installation, and then run 'make spark' in the 
mrql directory. To validate it, run 'make validate_spark'.
To run it on a Spark cluster, list the worker nodes in spark-*/conf/slaves and 
launch a standalone cluster using spark-*/bin/start-all.sh.
To run the pagerank query on 32 workers, use: bin/mrql.spark -dist -nodes 32 
queries/pagerank.mrql
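The steps above can be sketched as a shell session (the spark-* directory name 
and the variable names in conf/mrql-env.sh are assumptions; adjust them to 
match your installation):

```shell
# Sketch of the build-and-run steps above; paths are assumptions,
# adjust to your installation.

# 1. Point MRQL at the local Scala/Spark installation
#    (edit e.g. SPARK_HOME and the Scala paths in conf/mrql-env.sh)

# 2. Build and validate the Spark evaluator
cd mrql
make spark
make validate_spark

# 3. Start a standalone Spark cluster
#    (worker hostnames listed one per line in spark-*/conf/slaves)
spark-*/bin/start-all.sh

# 4. Run the pagerank query on 32 workers in distributed mode
bin/mrql.spark -dist -nodes 32 queries/pagerank.mrql
```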

                
> Support query evaluation in Spark mode
> --------------------------------------
>
>                 Key: MRQL-12
>                 URL: https://issues.apache.org/jira/browse/MRQL-12
>             Project: MRQL
>          Issue Type: Improvement
>          Components: Run-Time Data
>    Affects Versions: 0.9.0
>         Environment: Apache Spark http://spark-project.org/
>            Reporter: Leonidas Fegaras
>            Assignee: Leonidas Fegaras
>         Attachments: Evaluator.gen, MRQL-12.patch
>
>   Original Estimate: 240h
>  Remaining Estimate: 240h
>
> Spark provides primitives for in-memory cluster computing 
> (http://spark-project.org/). It was developed at UC Berkeley and has 
> recently been accepted as an ASF incubating project. It has already 
> attracted many developers, and I think it will play a major role in the 
> Hadoop ecosystem. So I thought it would be nice to be able to evaluate MRQL 
> queries on a Spark cluster. Spark already supports Hive (called Shark). Like 
> Hama, Spark can evaluate queries in memory, but unlike Hama, it supports 
> full fault tolerance. I have already written all the code, but I have only 
> tested it in local mode (on a single multi-core node). This task turned out 
> to be easier than I expected because MRQL plans are similar to Spark 
> operations. The only annoyance was that I had to make all data structures 
> Serializable. I also had to include the Gen source code (the Java 
> preprocessor), under the ASF license, which will make the transition to 
> Maven easier.
> I am attaching the patch below. The actual code that contains the Spark 
> evaluator is in the file Evaluator.gen, which is attached separately. 
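
The Serializable requirement mentioned in the description can be illustrated 
with a minimal Java sketch (the Node class and its fields are hypothetical, 
not MRQL's actual data structures): Spark ships records and closures between 
nodes via Java serialization, so every class that crosses the wire must 
implement java.io.Serializable.

```java
import java.io.*;

// Hypothetical record type; any class Spark shuffles between workers
// must implement Serializable, or the job fails at runtime with a
// NotSerializableException.
public class Node implements Serializable {
    private static final long serialVersionUID = 1L;
    final long id;
    final double rank;

    Node(long id, double rank) { this.id = id; this.rank = rank; }

    public static void main(String[] args) throws Exception {
        // Round-trip a Node through Java serialization, as Spark would
        // when moving data between workers.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(new Node(42L, 0.15));
        out.flush();
        Node copy = (Node) new ObjectInputStream(
            new ByteArrayInputStream(bytes.toByteArray())).readObject();
        System.out.println(copy.id + " " + copy.rank);
    }
}
```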

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
