Hey,

On Sat, Mar 12, 2011 at 5:32 PM, Zhijie Shen <zjshe...@gmail.com> wrote:
> Hi developers,
>
> I'm a graduate student from National University of Singapore, majoring in
> Computer Science. The enthusiasm of open source and information retrieval
> drives me to participate in GSoC'11 with your community. I first got to know
> Lucene when I was in a software engineer intern in IBM, working on Lotus
> Connections.

Awesome and welcome to Lucene :)

>
> Now I've already checked out the source code and successfully built it
> locally. Meanwhile, I begin to read through the Jira issues, and are more
> interested in Issue 2308, 2309 and 2621, which seem to be the refactoring
> tasks (Please correct me if I'm wrong). My personal feeling is that these
> tasks will be more appropriate for a beginner to get in. Moreover, I think
> to start with such a big project, it is more efficient to read through the
> discussion on Jira to understand the problem, and then dive into the related
> code with the problem kept in mind. What is your opinion? I'm looking
> forward to your guidance.

Apparently you survived the first steps of getting into Lucene and Solr!
Great! You also looked at JIRA, which is even better. So lemme say a few
words about the issues you have listed.

LUCENE-2621 - Extend Codec to handle also stored fields and term vectors

This is a very interesting and, at the same time, much needed feature. It
involves API design, refactoring and an in-depth understanding of how
IndexWriter and its internals work. The API which needs to be refactored
(the Codec API) was made to consume PostingLists once an in-memory index
segment is flushed to disk. Yet, to expose stored fields to this API we
need to prepare it to consume data for every document while we build the
in-memory segment. So there is a little paradigm mismatch here which needs
to be addressed.

LUCENE-2309 - Fully decouple IndexWriter from analyzers

This is one I have been looking forward to for quite a while. It would
pave the way for analysis capabilities other than the ones Lucene offers
today. It seems to be heavier on refactoring than the others but might
require less knowledge of the IndexWriter (IW) internals than the codec
one. Still, it is a very interesting issue / project to work on and fairly
self-contained.
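
To make the paradigm mismatch in LUCENE-2621 a bit more concrete, here is a
rough sketch in Java. Every name in it (PostingsConsumer,
StoredFieldsConsumer, InMemorySegment) is made up for illustration and is
not the actual Codec API; it only shows the two call patterns: postings
handed over once at flush time versus stored fields handed over for every
document while the segment is being built.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CodecParadigmSketch {

    // Hypothetical flush-time consumer: sees all postings of a segment at once.
    interface PostingsConsumer {
        void flushSegment(Map<String, List<Integer>> termToDocIds);
    }

    // Hypothetical per-document consumer: called while the segment is built.
    interface StoredFieldsConsumer {
        void addDocument(int docId, Map<String, String> storedFields);
    }

    // Toy stand-in for an in-memory segment; not how IndexWriter works internally.
    static final class InMemorySegment {
        private final Map<String, List<Integer>> postings = new TreeMap<>();
        private final StoredFieldsConsumer stored;
        private int nextDocId = 0;

        InMemorySegment(StoredFieldsConsumer stored) {
            this.stored = stored;
        }

        void addDocument(Map<String, String> fields) {
            int docId = nextDocId++;
            // Stored fields are pushed to their consumer right away, per document.
            stored.addDocument(docId, fields);
            // Postings are only buffered here and handed over later, on flush.
            for (String value : fields.values()) {
                for (String term : value.toLowerCase().split("\\s+")) {
                    postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
                }
            }
        }

        void flush(PostingsConsumer consumer) {
            consumer.flushSegment(postings);
        }
    }

    // Records the order in which the two consumers are invoked.
    static List<String> run() {
        List<String> events = new ArrayList<>();
        InMemorySegment segment = new InMemorySegment(
                (docId, fields) -> events.add("stored doc " + docId));
        segment.addDocument(Map.of("title", "hello lucene"));
        segment.addDocument(Map.of("title", "hello codec"));
        segment.flush(terms -> events.add("flushed " + terms.size() + " terms"));
        return events;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

In this toy version the stored-fields consumer fires once per document,
while the postings consumer fires a single time at flush. An API designed
around the second pattern needs some rework before it can serve the first.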

LUCENE-2308 - Separately specify a field's type

FieldType aims on the one hand to separate field properties from the
actual value, and on the other to make Field easier to extend. Both seem
equally important, yet far from easy to achieve. Fieldable and Field are
core APIs, and changes to them need to be well thought out. Further, this
issue can easily cause drastic performance degradation if not done right.
Consider this a massive change, since fields are used almost everywhere in
Lucene and Solr.

I wrote those little summaries not to scare you away, not at all! Rather,
I tried to lay out what to expect from each issue, to make it easier for
you to pick the one you would like to work on. I will try to update the
descriptions of those issues in the next couple of days if they are not
already clear enough (LUCENE-2621 seems kind of too brief, though).

If you have any questions regarding those issues or any others, feel free
to ask here on the list or on the issue directly (you might need a JIRA
account; if you don't have one already, you should get one :)

Reading the JIRA issues might help you to understand what they are about,
but they are usually written by core devs or long-time contributors, so
please ask any question you have, and don't hesitate to ask if you have
problems with anything.

Simon

>
> Regards,
> Zhijie
>
> --
> Zhijie Shen
> School of Computing
> National University of Singapore
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org