Happy New Year!

I see MRQL occupying the same space as the optimizer inside a traditional RDBMS: orchestrating, optimizing and redirecting user requests to the most appropriate layer of abstraction, without the end user having to worry about whether a query kicks off a Spark query that runs MR in the background with Hama. The fact is this: the open source community will continue to add new point solutions to address deficiencies in processing data that is (i) high volume, (ii) high variety, (iii) high velocity, or some combination of these. Hadoop succeeds at (i) but fails at (iii) wherever access to a continuous 'in pipe' processing stream is required, for instance in online learning algorithms and clickstream processing. This is where Storm and Spark mini-batch processing can potentially fill the gap, but there is no single overarching project that integrates the lot.

To increase usage of MRQL, one option would be to integrate with Hive/Impala/Presto, and perhaps even Hue and Solr. Users should have a choice of the API that is most appropriate for the application at hand. The real achievement of MRQL would be to provide query optimization for ML and scientific computation, plus a user interface, across all layers of the stack: Hadoop/Hama for batch and bulk processing, and Storm and Spark for continuous queries and optimizations.

Big Data democracy is all about putting the power of open source into the hands of as many end users as possible, not just the dev / ML / scientific computing community.

I vote we integrate with Spark ASAP. Cloudera has just begun supporting it, and Mike Olson announced that it is the future direction of MapReduce.

On 12 December 2013 02:15, Leonidas Fegaras <[email protected]> wrote:

> Thanks Edward,
> Our biggest concern is that there is no activity in the user@mrql
> list. Does this mean that there is no one using MRQL or that nobody
> posts any messages? Is there a way to get the number of people
> registered in this list? Can we also get the number of times MRQL has
> been downloaded from Apache mirrors after its first release? It was
> hoped that after the first release people will start downloading MRQL
> and will register at user@mrql list to ask questions, report bugs, ask
> for new features, etc. It hasn't happened yet. Maybe it's too soon.
>
> There are other query languages for big data analysis in ASF. All
> except MRQL are SQL-based data warehousing systems for Hadoop
> (e.g., Hive and Tajo). MRQL is a query system for complex data analysis,
> including machine learning and scientific computing. This is the main
> difference from others. The fact that it can run on multiple platforms
> is a big plus, but is secondary. Currently, most people use Hadoop for
> big data analysis but soon this may change. I think people will start
> using fault-tolerant in-memory distributed systems for data analysis,
> such as Spark. Hama too may play a big role. So supporting multiple
> platforms will allow users to deploy applications using MRQL very fast
> and experiment with all these platforms without having to change the
> query. The whole idea of expressing distributed applications using an
> SQL-like query system is rapid and easy prototyping, without
> sacrificing performance. So performance is a very important factor.
> If MRQL is slow, nobody will use it. I think in this area, we are doing
> an excellent job because of the very advanced optimizer that allows
> operations such as matrix multiplication to be done using very fast
> algorithms.
>
> Leonidas Fegaras
>
> On 12/10/2013 08:33 PM, Edward J. Yoon wrote:
>
>> All,
>>
>> Since there are too many similar projects, I'd like to suggest that we
>> change the future direction of MRQL to a powerful *analytics* query
>> language on top of Hadoop beyond ETL processing. In my eyes,
>> supporting multi-platforms (MapReduce, Hama, Spark, ..., etc.) also
>> seems pointless. WDYT?
