[
https://issues.apache.org/jira/browse/MRQL-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736407#comment-13736407
]
Leonidas Fegaras commented on MRQL-12:
--------------------------------------
I think the Spark extension is ready to be incorporated into MRQL as a patch. I
am attaching the latest patch, MRQL-12.2.patch, and the Spark evaluator next.
Please let me know by tomorrow (Monday, August 12) if you approve this patch.
I have tested it on a laptop with 8 cores and on a 16-node cluster with 64
cores, using the MRQL pagerank and k-means queries. I also ran the same queries
on Hama 0.5.0. I still need to do more experiments. Here are the run-times in
seconds (e.g., 500K/2M means 500,000 nodes and 2,000,000 edges):
On a laptop with 8 cores:
                       Hama   Spark
  Pagerank 500K/2M:     211     341
  KMeans 1M:             31      22
  KMeans 2M:             41      40
  KMeans 4M:            165      77
On a cluster with 64 cores:
                       Hama   Spark
  Pagerank 1M/10M:     3590     428
  KMeans 10M:            87      82
  KMeans 20M:           129     134
On a cluster with 32 cores:
                       Hama   Spark
  Pagerank 1M/10M:     4419     434
  KMeans 10M:            98      74
  KMeans 20M:           273      74
To compile it, first install Scala and Spark (see the Spark web site), edit
conf/mrql-env.sh to match your installation, and then run 'make spark' in the
mrql directory. To validate it, run 'make validate_spark'.
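For reference, the Spark-related part of conf/mrql-env.sh might look like the
sketch below. The variable names and paths here are illustrative assumptions,
not the exact contents of the shipped file; check the comments in your copy of
conf/mrql-env.sh for the names it actually uses.

```shell
# conf/mrql-env.sh -- illustrative sketch only; variable names and paths
# are assumptions. Adjust everything to match your installation.

# Installation directories for Java, Scala, and Spark
JAVA_HOME=/usr/lib/jvm/java-6-openjdk
SCALA_HOME=/opt/scala-2.9.3
SPARK_HOME=/opt/spark-0.8.0

# Master URL of the standalone Spark cluster, used in distributed mode
SPARK_MASTER=spark://master-node:7077
```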
To run it on a Spark cluster, list the worker nodes in spark-*/conf/slaves and
launch a standalone cluster using spark-*/bin/start-all.sh.
To run the pagerank query on 32 workers, use:
bin/mrql.spark -dist -nodes 32 queries/pagerank.mrql
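Putting the steps above together, a typical session might look like the
following. The host names and the Spark directory are illustrative assumptions
(spark-* stands for whatever Spark version you installed); only the make
targets and the bin/mrql.spark invocation come from the instructions above.

```shell
# Build and validate the Spark back-end of MRQL
cd mrql
make spark
make validate_spark

# List the worker nodes, one per line (host names are examples)
cat > /opt/spark-0.8.0/conf/slaves <<'EOF'
worker01
worker02
EOF

# Launch the standalone Spark cluster, then run the pagerank query
/opt/spark-0.8.0/bin/start-all.sh
bin/mrql.spark -dist -nodes 32 queries/pagerank.mrql
```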
> Support query evaluation in Spark mode
> --------------------------------------
>
> Key: MRQL-12
> URL: https://issues.apache.org/jira/browse/MRQL-12
> Project: MRQL
> Issue Type: Improvement
> Components: Run-Time Data
> Affects Versions: 0.9.0
> Environment: Apache Spark http://spark-project.org/
> Reporter: Leonidas Fegaras
> Assignee: Leonidas Fegaras
> Attachments: Evaluator.gen, MRQL-12.patch
>
> Original Estimate: 240h
> Remaining Estimate: 240h
>
> Spark provides primitives for in-memory cluster computing
> (http://spark-project.org/). It was developed at UC Berkeley and has recently
> been accepted as an ASF incubating project. It has already attracted many
> developers, and I think it will play a major role in the Hadoop ecosystem. So
> I thought it would be nice to be able to evaluate MRQL queries on a Spark
> cluster. Spark already supports Hive (the Shark project). Like Hama, Spark
> can evaluate queries in memory, but unlike Hama, it supports full
> fault tolerance.
> I have already written all the code, but I have only tested it in local mode
> (on a single multi-core node). This task turned out to be easier than I
> thought because MRQL plans are similar to Spark operations. The only
> annoyance was that I had to make all data structures Serializable. I also had
> to include the Gen source code (the Java preprocessor), under the ASF
> license, which will make the transition to Maven easier.
> I am attaching the patch below. The actual code that contains the Spark
> evaluator is in the file Evaluator.gen, which is attached separately.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira