Hi all,

I am sorry if this is spamming your inbox, but I wanted to reach out. I would like to apply for the Google Summer of Code project on implementing indexes in Hive. I have some basic ideas and would like to discuss them with a mentor. I would highly appreciate any direction.

Title: Using Indexes for Improved Performance of Queries

Student: Vaibhav Shrivastava

Abstract: Hive is used as an SQL-like abstraction over MapReduce jobs. The focus of the project shall be to implement indexes on tables to speed up the operations performed on them. The proposal discusses the types of indexes that could be used, how they could be implemented, and their prospective applications. The content shall be updated as input and reviews are provided.

Content:

Indexes are a common way of speeding up row retrieval in conventional databases. The idea is to keep an auxiliary pointer for a data member that is crucial to the query. Consider, for example, retrieving the records whose id falls in some desired range of values: looking up an index means fewer rows need to be fetched. In a Hadoop environment the files are located on HDFS, so one can think of the index as pointing to a particular location in a file.

Another application could be to use an index for a particular aggregation operator. For example, when the count for a particular key is required, we can account for the rows in question without looking into the whole file, just by counting the offsets.

To build an index, one can think of a MapReduce job in which the fetched data is reduced efficiently by using a combiner at an intermediate stage, and the output is used to obtain more sorted values. One can also consider developing a k-d tree (k-dimensional tree) for multi-key queries; however, much of the simplicity of the design may be lost in that case.

Experience: I am currently a Masters student in the Computer Science Dept. at Stony Brook University, NY. I am working on textmap.com, which is a text analysis system that uses MapReduce jobs to extract sentiment information from the articles it processes.
I have experience using the Hadoop system and am interested in furthering my knowledge.

Deliverables: After a design decision on what the index structure should be (single dense index, B-tree index, or a k-d tree type index), develop a prototype. The emphasis at the mid-term evaluation would be on comparing a query that uses an index against a normal query, or on another such application, such as a join or a subquery. The emphasis at the end would be on optimizing the application built for the mid-term evaluation, or alternatively on implementing further queries whose performance may improve with indexes.

Mentor: Unknown (Can I get some assistance?)

On Tue, Apr 6, 2010 at 1:21 PM, Scott MacVicar <macvi...@facebook.com> wrote:
> You can post it here though there is also an Apache GSoC list as well since
> Hadoop is an Apache project.
>
> Scott
>
> On Apr 6, 2010, at 9:00 AM, Vaibhav wrote:
>
>> thanks for replying scott.
>> Did you mean this mailing list ..? hive-dev@hadoop.apache.org ... I
>> have joined this group but was hesitant to post it on the list as I
>> didnt want to cause trouble to the developers. Is it ok if I post the
>> proposal there? Else can you direct me to someone whom I can directly
>> contact.
>>
>> Thanking you,
>> Vaibhav.
>>
>> On Apr 6, 4:17 am, Scott MacVicar <macvi...@facebook.com> wrote:
>>> You should use the Apache Mailing list too, if we have extra slots we'll be
>>> using some of the Facebook ones.
>>>
>>> Scott
>>>
>>> On Apr 6, 2010, at 1:09 AM, Vaibhav wrote:
>>>
>>>> Hi Mentors,
>>>> I have posted an simple proposal of my ideas. I would
>>>> like to have inputs, comments and other views as to whether there
>>>> needs to be more clarifications or updates on the proposal. I hope you
>>>> could mail it to me at vaibhav.s.mnnit [at] gmail.com. I would like to
>>>> thank you in advance for the time you give to my proposal.
>>>
>>>> Vaibhav.

--
Vaibhav Shrivastava,
Graduate Student, MS Computer Science,
Stony Brook University.