Hi Jesus,

I have a version of the task running. I'd love it if you could take a look
and let me know if there are any changes you'd like to see. It's at
https://github.com/heatherbooker/xen-outreachy

It gets the mailboxes, analyzes them using Perceval and an implementation
of jwz's well-known threading algorithm
(https://www.jwz.org/doc/threading.html), then indexes them in
Elasticsearch. Each document in ES is a message, with its id being the
Message-ID and its type being a modified Subject line from the first
message in a thread. I hope this is what was intended for the task!

PS - Should I continue copying these messages to the whole xen-devel
mailing list, or is sending them to you sufficient?

Thanks!

Heather

On Mon, Apr 17, 2017 at 2:04 AM, Jesus M. Gonzalez-Barahona
<j...@bitergia.com> wrote:
> On Sun, 2017-04-16 at 21:26 -0700, Heather Booker wrote:
> > Hi Jesus!
> >
> > I appreciate the info on the unicode error. I might have missed it,
> > but I also asked about the general microtask specifications. Here
> > was my original inquiry:
> > > And to clarify, my understanding is that the final result of this
> > > task is an index of Xen data, with two types: commits and messages.
> > > Each commit document should contain its original information
> > > from git, plus the name of the branch it was developed in. And
> > > should only the mbox messages which appear to be associated
> > > with a specific commit exist in the final index? Is there some
> > > key information in messages that is supposed to indicate the
> > > association of a given commit with a git branch? I would be
> > > grateful if you could specify the end goal a little more. :D
> >
> > Yeah, so overall I'm not sure I understand the relationship of
> > branches to the mailing list messages. Is this to be a simple
> > string parsing task wherein I should scan the message body
> > for the word "branch"? (I am guessing not ;P)
>
> I'm sorry, I understood that text was about the project, not about the
> microtask.
> The microtask is about either:
>
> * Producing an ES index with messages labeled by thread (by applying a
>   threading algorithm to messages retrieved from archives), or
>
> * Producing an ES index with commits labeled by branch (by following
>   refs and parents information in the output produced by Perceval).
>
> In the complete project, both will be used to produce the final indexes
> that power the code review dashboard.
>
> > I will be happy to get back on developing once I better grasp
> > the goal! :)
>
> More clear now?
>
> If you want, let's schedule some IRC slot for clarifying whatever is
> not clear.
>
> Jesus.
>
> > Thanks!
> >
> > Heather
> >
> > On Sun, Apr 16, 2017 at 4:23 PM, Jesus M. Gonzalez-Barahona
> > <jgb@bitergia.com> wrote:
> > > On Thu, 2017-04-13 at 00:47 -0700, Heather Booker wrote:
> > > > Hi,
> > > >
> > > > I submitted an application for this code review dashboard and
> > > > would love to keep working on the microtask once I get some
> > > > more info. :)
> > >
> > > Great! I answered your message, could you progress with the task?
> > >
> > > > I also came up with a general idea of how the project might be
> > > > split up - any feedback on this would be welcome! I wrote:
> > > >
> > > > "As said by Jesus, the big picture of this project will be porting
> > > > everything behind the current code review dashboard to use
> > > > Grimoire Lab tools, from the current state of using
> > > > MetricsGrimoire and custom scripts. I expect this would involve
> > > > Perceval for analyzing data, and Grimoire Elk may be useful in
> > > > further stages, or may be too general - this is something I would
> > > > wish to explore.
> > > > This project will also involve a migration from SQL to Elasticsearch
> > > > - because I believe the relevant data is mostly / all available in
> > > > places online, I am unsure whether this would need to be a direct
> > > > migration.
> > > > However, looking at the current SQL setup would be
> > > > beneficial to understanding the desired format of the Elasticsearch
> > > > indexes.
> > > > I would love to dive into this project and have 3 main parts -
> > > > getting data into ES, turning it into dashboard displays, and then
> > > > fine tuning and perhaps augmenting the dashboard to improve its
> > > > usefulness.
> > > > Getting data into ES may seem simple but I believe that once it
> > > > needs to be used for the dashboard, many realizations will pop up
> > > > - thus I’d like to leave maybe 2-3 weeks for that first step, 6-7
> > > > weeks for the visualizations (which will include querying the data),
> > > > and the final 3 weeks for touch ups and improvements."
> > >
> > > The plan could be sound, but would need some tweaks, once your skills
> > > in Python are clear, which could be the main blocker for the first
> > > stages.
> > >
> > > > Does this sound like an accurate summary and reasonable timeline?
> > > > And I am guessing that from Jesus's involvement with the threads
> > > > that Jesus would be the mentor, is that correct? :)
> > >
> > > Yes, I would be ;-)
> > >
> > > Jesus.
> > >
> > > > Thanks!
> > > >
> > > > Heather
> > > >
> > > >
> > > > On Sun, Apr 9, 2017 at 9:50 PM, Heather Booker
> > > > <heather.j.booker@gmail.com> wrote:
> > > > > Hi Jesus,
> > > > >
> > > > > While using the Elasticsearch python library
> > > > > (https://elasticsearch-py.readthedocs.io/en/master/) to add mbox
> > > > > messages to an index, I would get a UnicodeEncodeError:
> > > > > "'utf-8' codec can't encode character '\udca0' in position 767:
> > > > > surrogates not allowed".
> > > > >
> > > > > Investigating in Grimoire elk
> > > > > https://github.com/grimoirelab/GrimoireELK/blob/96b00bc682485976104a6825ca63ae08639deacc/grimoire_elk/elk/mbox.py#L200
> > > > > seems to show that perhaps that tool instead uses Latin-1 encoding,
> > > > > but I found that to then produce a serialization error (their custom
> > > > > error message: "Unable to serialize %r (type: %s)"). I suppose this
> > > > > is because now it's bytes; of course, converting back to string
> > > > > after encoding just cycles back to the first error.
> > > > >
> > > > > As somewhat of a Python newbie I don't really know how to tackle
> > > > > this! My thought atm is to splice the offending character out
> > > > > of the message.
> > > > >
> > > > > And to clarify, my understanding is that the final result of this
> > > > > task is an index of Xen data, with two types: commits and messages.
> > > > > Each commit document should contain its original information
> > > > > from git, plus the name of the branch it was developed in. And
> > > > > should only the mbox messages which appear to be associated
> > > > > with a specific commit exist in the final index? Is there some
> > > > > key information in messages that is supposed to indicate the
> > > > > association of a given commit with a git branch? I would be
> > > > > grateful if you could specify the end goal a little more. :D
> > > > >
> > > > > Thanks so much!
> > > > >
> > > > > Heather
> > > > >
> > > > >
> > > > >
> > > > > On Sat, Apr 8, 2017 at 10:02 AM, Jesus M. Gonzalez-Barahona
> > > > > <jgb@bitergia.com> wrote:
> > > > > > On Fri, 2017-04-07 at 15:49 -0700, Heather Booker wrote:
> > > > > > > Hi Jesus,
> > > > > > >
> > > > > > > Thanks for your reply!
> > > > > > >
> > > > > > > So about the task, instructions say after analyzing mboxes with
> > > > > > > Perceval to "store the resulting raw index in ElasticSearch" -
> > > > > > > what does raw index mean?
> > > > > >
> > > > > > In this context, I mean "storing the JSON documents produced by
> > > > > > Perceval in an ElasticSearch index, as such". ElasticSearch
> > > > > > stores JSON documents, so it is just uploading the output of
> > > > > > Perceval to it.
> > > > > >
> > > > > > > In terms of figuring out the elasticsearch structure, do I want
> > > > > > > an index (xen-devel mbox) with a type (message) and each object
> > > > > > > from the perceval output to be one document? Or should it be
> > > > > > > more fine-grained?
> > > > > >
> > > > > > Exactly.
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Jesus.
> > > > > >
> > > > > > > Cheers,
> > > > > > >
> > > > > > > Heather
> > > > > > >
> > > > > > > On Thu, Apr 6, 2017 at 7:05 AM, Jesus M. Gonzalez-Barahona
> > > > > > > <jgb@bitergia.com> wrote:
> > > > > > > > On Wed, 2017-04-05 at 16:43 -0700, Heather Booker wrote:
> > > > > > > > > Hi!
> > > > > > > > >
> > > > > > > > > I'd love to work on the Code Review Dashboard project for this
> > > > > > > > > round of Outreachy.
> > > > > > > >
> > > > > > > > Great!!
> > > > > > > >
> > > > > > > > > Are the steps outlined here
> > > > > > > > > http://markmail.org/message/7adkmords3imkswd still the first
> > > > > > > > > contribution you'd like to see?
> > > > > > > >
> > > > > > > > Yes.
> > > > > > > >
> > > > > > > > > So is this a project that has been worked on in previous rounds
> > > > > > > > > of GSOC/Outreachy also?
> > > > > > > > > If so is there a place to find links to the previous
> > > > > > > > > participants' blogs? :)
> > > > > > > >
> > > > > > > > No. We had one participant at some point, who couldn't even
> > > > > > > > start for personal reasons. There are some people considering
> > > > > > > > working on this for this next round of Outreachy, however.
> > > > > > > > You'll see their messages in this mailing list.
> > > > > > > >
> > > > > > > > > Should questions about the specifications/completion of the
> > > > > > > > > microtask be addressed to IRC or this list? If IRC, which
> > > > > > > > > channel - #xen-opw or #metrics-grimoire? On that note, I'm
> > > > > > > > > curious why #metrics-grimoire is the listed channel on the
> > > > > > > > > project page - are main contributors involved in both
> > > > > > > > > projects? Or is it just because the Xen dashboard doesn't
> > > > > > > > > have a channel?
> > > > > > > >
> > > > > > > > The code review is for the Xen project, but it is done with (I
> > > > > > > > mean, the software used for it is) GrimoireLab, which for
> > > > > > > > historical reasons uses the #metrics-grimoire channel. That's
> > > > > > > > why it is likely that you find somebody from the project there.
> > > > > > > >
> > > > > > > > If you have questions, and find me around in IRC, please ping
> > > > > > > > me. If I'm not available, please send an email message.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > >
> > > > > > > > Jesus.
> > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > >
> > > > > > > > > Heather
> > > > > > > > > _______________________________________________
> > > > > > > > > Xen-devel mailing list
> > > > > > > > > Xen-devel@lists.xen.org
> > > > > > > > > https://lists.xen.org/xen-devel
> > > > > > > > --
> > > > > > > > Bitergia: http://bitergia.com
> > > > > > > > /me at Twitter: https://twitter.com/jgbarah
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel
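
For the UnicodeEncodeError quoted in this thread, one possible way to "splice the offending character out" is sketched below. The code point '\udca0' is a lone surrogate, which is how Python's "surrogateescape" error handler represents a byte it could not decode, so the mailbox probably contained a raw 0xA0 byte. The helper name scrub_surrogates and the sample message are made up for illustration. This is not how Perceval or GrimoireELK handle the problem, only a minimal sketch that assumes replacing such characters with "?" is acceptable.

```python
import json


def scrub_surrogates(value):
    """Return a copy of `value` whose strings can be encoded as UTF-8.

    Hypothetical helper for illustration; not part of Perceval,
    GrimoireELK or elasticsearch-py.
    """
    if isinstance(value, str):
        # Lone surrogates such as '\udca0' cannot be encoded as UTF-8.
        # errors="replace" swaps each of them for "?" instead of raising.
        return value.encode("utf-8", errors="replace").decode("utf-8")
    if isinstance(value, dict):
        # Clean keys and values, since mbox header names are strings too.
        return {scrub_surrogates(k): scrub_surrogates(v)
                for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [scrub_surrogates(item) for item in value]
    # Numbers, booleans and None pass through unchanged.
    return value


# Made-up message shaped loosely like a parsed mbox item (an assumption).
message = {
    "Message-ID": "<example@lists.xen.org>",
    "body": {"plain": "non-breaking\udca0space"},
}

clean = scrub_surrogates(message)

# The cleaned document now serializes and encodes without an error.
payload = json.dumps(clean, ensure_ascii=False).encode("utf-8")
```

The cleaned dictionary can then be indexed in place of the original item. If the original bytes matter, encoding the string with errors="surrogateescape" recovers them instead of discarding them.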