Hi Jesus,

I have a version of the task running. I'd love it if you could take a
look and let me know if there are any changes you'd like to see.

It's at https://github.com/heatherbooker/xen-outreachy

It gets the mailboxes, analyzes them using Perceval and an
implementation of jwz's well-known threading algorithm
(https://www.jwz.org/doc/threading.html), then indexes them
in Elasticsearch.
Each document in ES is a message, with its id being the
Message-ID and type being a modified Subject line from the
first message in a thread.
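
As a rough illustration of the threading step, here is a simplified
sketch. It is not the code in the repository above and it is not the
full jwz algorithm (no empty containers, no subject-based grouping); it
only follows References / In-Reply-To links up to a thread root. The
function name `thread_roots` and the plain-dict message layout are
assumptions made for the example.

```python
import re

# A Message-ID as it appears in headers: <something@host>
MSG_ID = re.compile(r"<[^<>\s]+>")


def thread_roots(messages):
    """Map each Message-ID to the Message-ID at the root of its thread.

    `messages` is an iterable of dicts of header values. The keys
    'Message-ID', 'In-Reply-To' and 'References' are assumptions about
    how the parsed messages are laid out.
    """
    parent = {}    # child Message-ID -> parent Message-ID
    own_ids = []   # Message-IDs of the messages we actually have
    for msg in messages:
        found = MSG_ID.findall(msg["Message-ID"])
        msg_id = found[0] if found else msg["Message-ID"].strip()
        own_ids.append(msg_id)

        chain = MSG_ID.findall(msg.get("References") or "")
        reply_to = MSG_ID.findall(msg.get("In-Reply-To") or "")
        # In-Reply-To names the direct parent; use it if References lacks it.
        if reply_to and reply_to[0] not in chain:
            chain.append(reply_to[0])
        chain.append(msg_id)

        # Each entry in the chain is the parent of the one after it.
        for older, newer in zip(chain, chain[1:]):
            if older != newer:
                parent.setdefault(newer, older)

    def root_of(msg_id):
        seen = set()  # guards against reference loops in broken headers
        while msg_id in parent and msg_id not in seen:
            seen.add(msg_id)
            msg_id = parent[msg_id]
        return msg_id

    return {msg_id: root_of(msg_id) for msg_id in own_ids}
```

The root returned for a message may be a Message-ID that is referenced
but missing from the archive; a fuller implementation would handle that
case the way the jwz write-up describes.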

I hope this is what was intended for the task!

PS - Should I continue copying these messages to the
whole xen-devel mailing list, or is sending them to you
sufficient?

Thanks!

Heather

On Mon, Apr 17, 2017 at 2:04 AM, Jesus M. Gonzalez-Barahona
<j...@bitergia.com> wrote:

> On Sun, 2017-04-16 at 21:26 -0700, Heather Booker wrote:
> > Hi Jesus!
> >
> > I appreciate the info on the unicode error. I might have missed it,
> > but I also asked about the general microtask specifications. Here
> > was my original inquiry:
> > > And to clarify, my understanding is that the final result of this
> > > task is an index of Xen data, with two types: commits and messages.
> > > Each commit document should contain its original information
> > > from git, plus the name of the branch it was developed in. And
> > > should only the mbox messages which appear to be associated
> > > with a specific commit exist in the final index? Is there some
> > > key information in messages that is supposed to indicate the
> > > association of a given commit with a git branch? I would be
> > > grateful if you could specify the end goal a little more. :D
> >
> > Yeah, so overall I'm not sure I understand the relationship of
> > branches to the mailing list messages. Is this to be a simple
> > string parsing task wherein I should scan the message body
> > for the word "branch"? (I am guessing not ;P)
>
> I'm sorry, I understood that text was about the project, not about the
> microtask. The microtask is about either:
>
> * Producing an ES index with messages labeled by thread (by applying a
> threading algorithm to messages retrieved from archives), or
>
> * Producing an ES index with commits labeled by branch (by following
> refs and parents information in the output produced by Perceval).
>
> In the complete project, both will be used to produce the final indexes
> that power the code review dashboard.
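
A minimal sketch of the second option, labeling commits by branch by
walking parent links from each branch tip. The input shapes (a dict of
commit hash to parent hashes, and a dict of branch name to tip hash) and
the name `label_commits_by_branch` are assumptions for illustration;
they would have to be built from whatever refs and parents fields the
Perceval output actually provides.

```python
from collections import defaultdict


def label_commits_by_branch(parents, branch_tips):
    """Return {commit: set of branch names the commit is reachable from}.

    parents:     {commit_hash: [parent_hash, ...]}   (assumed shape)
    branch_tips: {branch_name: tip_commit_hash}      (assumed shape)
    """
    labels = defaultdict(set)
    for branch, tip in branch_tips.items():
        stack = [tip]
        seen = set()
        # Iterative depth-first walk over the parent links.
        while stack:
            commit = stack.pop()
            if commit in seen:
                continue
            seen.add(commit)
            labels[commit].add(branch)
            stack.extend(parents.get(commit, []))
    return dict(labels)
```

Note that this labels a commit with every branch it is reachable from,
which is not necessarily the branch it was developed in; deciding the
latter would need an extra rule on top of this walk.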
>
> > I will be happy to get back on developing once I better grasp
> > the goal! :)
>
> More clear now?
>
> If you want, let's schedule some IRC slot for clarifying whatever is
> not clear.
>
>         Jesus.
>
> > Thanks!
> >
> > Heather
> >
> > On Sun, Apr 16, 2017 at 4:23 PM, Jesus M. Gonzalez-Barahona
> > <jgb@bitergia.com> wrote:
> > > On Thu, 2017-04-13 at 00:47 -0700, Heather Booker wrote:
> > > > Hi,
> > > >
> > > > I submitted an application for this code review dashboard and
> > > > would love to keep working on the microtask once I get some
> > > > more info. :)
> > >
> > > Great! I answered your message. Were you able to progress with the task?
> > >
> > > > I also came up with a general idea of how the project might be
> > > > split up - any feedback on this would be welcome! I wrote:
> > > >
> > > > "As said by Jesus, the big picture of this project will be porting
> > > > everything behind the current code review dashboard to use
> > > > Grimoire Lab tools, from the current state of using
> > > > MetricsGrimoire and custom scripts. I expect this would involve
> > > > Perceval for analyzing data, and Grimoire Elk may be useful in
> > > > further stages, or may be too general - this is something I would
> > > > wish to explore.
> > > > This project will also involve a migration from SQL to Elasticsearch
> > > > - because I believe the relevant data is mostly / all available in
> > > > places online, I am unsure whether this would need to be a direct
> > > > migration. However, looking at the current SQL setup would be
> > > > beneficial to understanding the desired format of the Elasticsearch
> > > > indexes.
> > > > I would love to dive into this project and have 3 main parts -
> > > > getting data into ES, turning it into dashboard displays, and then
> > > > fine tuning and perhaps augmenting the dashboard to improve its
> > > > usefulness.
> > > > Getting data into ES may seem simple but I believe that once it
> > > > needs to be used for the dashboard, many realizations will pop up
> > > > - thus I’d like to leave maybe 2-3 weeks for that first step, 6-7
> > > > weeks for the visualizations (which will include querying the data),
> > > > and the final 3 weeks for touch ups and improvements."
> > >
> > > The plan could be sound, but would need some tweaks once your skills
> > > in Python are clear, which could be the main blocker for the first
> > > stages.
> > >
> > > > Does this sound like an accurate summary and reasonable timeline?
> > > > And I am guessing from Jesus's involvement with the threads
> > > > that Jesus would be the mentor, is that correct? :)
> > >
> > > Yes, I would be ;-)
> > >
> > >         Jesus.
> > >
> > > > Thanks!
> > > >
> > > > Heather
> > > >
> > > >
> > > > On Sun, Apr 9, 2017 at 9:50 PM, Heather Booker
> > > > <heather.j.booker@gmail.com> wrote:
> > > > > Hi Jesus,
> > > > >
> > > > > While using the Elasticsearch Python library
> > > > > (https://elasticsearch-py.readthedocs.io/en/master/) to add mbox
> > > > > messages to an index, I would get a UnicodeEncodeError:
> > > > > "'utf-8' codec can't encode character '\udca0' in position 767:
> > > > > surrogates not allowed".
> > > > >
> > > > > Investigating in Grimoire elk
> > > > > https://github.com/grimoirelab/GrimoireELK/blob/96b00bc682485976104a6825ca63ae08639deacc/grimoire_elk/elk/mbox.py#L200
> > > > > seems to show that perhaps that tool instead uses Latin-1
> > > > > encoding, but I found that to then produce a serialization error
> > > > > (their custom error message: "Unable to serialize %r (type: %s)").
> > > > > I suppose this is because now it's bytes; of course, converting
> > > > > back to string after encoding just cycles back to the first error.
> > > > >
> > > > > As somewhat of a Python newbie I don't really know how to tackle
> > > > > this! My thought atm is to splice the offending character out
> > > > > of the message.
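
One possible way to handle the surrogate error described above, offered
as a sketch under an assumption: that '\udca0' is a byte (0xA0) that
could not be decoded and was kept as a lone surrogate, as Python's
`surrogateescape` error handler does. The helper name
`scrub_surrogates` is hypothetical. It maps such escaped bytes to the
Latin-1 character with the same value and replaces any other lone
surrogate with U+FFFD, so the result can be encoded as UTF-8.

```python
import re


def scrub_surrogates(text):
    """Return `text` with lone surrogates removed so it encodes as UTF-8."""

    def fix(match):
        code = ord(match.group())
        if 0xDC80 <= code <= 0xDCFF:
            # Escaped byte 0x80..0xFF: reinterpret it as Latin-1 (a guess
            # about the original encoding, not a certainty).
            return chr(code - 0xDC00)
        # Any other lone surrogate: use the Unicode replacement character.
        return "\ufffd"

    return re.sub(r"[\ud800-\udfff]", fix, text)
```

Dropping the character entirely, as suggested in the message, would be
the same function with `fix` returning an empty string.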
> > > > >
> > > > > And to clarify, my understanding is that the final result of this
> > > > > task is an index of Xen data, with two types: commits and messages.
> > > > > Each commit document should contain its original information
> > > > > from git, plus the name of the branch it was developed in. And
> > > > > should only the mbox messages which appear to be associated
> > > > > with a specific commit exist in the final index? Is there some
> > > > > key information in messages that is supposed to indicate the
> > > > > association of a given commit with a git branch? I would be
> > > > > grateful if you could specify the end goal a little more. :D
> > > > >
> > > > > Thanks so much!
> > > > >
> > > > > Heather
> > > > >
> > > > >
> > > > >
> > > > > On Sat, Apr 8, 2017 at 10:02 AM, Jesus M. Gonzalez-Barahona
> > > > > <jgb@bitergia.com> wrote:
> > > > > > On Fri, 2017-04-07 at 15:49 -0700, Heather Booker wrote:
> > > > > > > Hi Jesus,
> > > > > > >
> > > > > > > Thanks for your reply!
> > > > > > >
> > > > > > > So about the task, instructions say after analyzing mboxes with
> > > > > > > Perceval to "store the resulting raw index in ElasticSearch" -
> > > > > > > what does raw index mean?
> > > > > >
> > > > > > In this context, I mean "storing the JSON documents produced by
> > > > > > Perceval in an ElasticSearch index, as such". ElasticSearch
> > > > > > stores JSON documents, so it is just uploading the output of
> > > > > > Perceval to it.
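
A small sketch of what "uploading the output as such" could look like,
assuming each Perceval item is a dict whose `data` entry holds the
message headers (including 'Message-ID'). That layout, the index name
`xen-devel-mbox`, and the function name `bulk_actions` are assumptions
for the example. The generator only builds the per-document actions;
nothing here talks to a server.

```python
def bulk_actions(items, index="xen-devel-mbox"):
    """Yield one bulk-index action per item, keyed by its Message-ID.

    items: iterable of dicts shaped like {"data": {"Message-ID": ...}, ...}
           (assumed shape of the Perceval output)
    """
    for item in items:
        yield {
            "_index": index,                    # hypothetical index name
            "_id": item["data"]["Message-ID"],  # one document per message
            "_source": item,                    # the JSON document, unchanged
        }
```

With the Elasticsearch Python library installed, these actions could be
passed to `elasticsearch.helpers.bulk(client, bulk_actions(items))`,
where `client` is an `Elasticsearch` instance.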
> > > > > >
> > > > > > > In terms of figuring out the elasticsearch structure, do I want
> > > > > > > an index (xen-devel mbox) with a type (message) and each object
> > > > > > > from the perceval output to be one document? Or should it be
> > > > > > > more fine-grained?
> > > > > >
> > > > > > Exactly.
> > > > > >
> > > > > > Saludos,
> > > > > >
> > > > > >         Jesus.
> > > > > >
> > > > > > > Cheers,
> > > > > > >
> > > > > > > Heather
> > > > > > >
> > > > > > > On Thu, Apr 6, 2017 at 7:05 AM, Jesus M. Gonzalez-Barahona
> > > > > > > <jgb@bitergia.com> wrote:
> > > > > > > > On Wed, 2017-04-05 at 16:43 -0700, Heather Booker wrote:
> > > > > > > > > Hi!
> > > > > > > > >
> > > > > > > > > I'd love to work on the Code Review Dashboard project for
> > > > > > > > > this round of Outreachy.
> > > > > > > >
> > > > > > > > Great!!
> > > > > > > >
> > > > > > > > > Are the steps outlined here
> > > > > > > > > http://markmail.org/message/7adkmords3imkswd still the first
> > > > > > > > > contribution you'd like to see?
> > > > > > > >
> > > > > > > > Yes.
> > > > > > > >
> > > > > > > > > So is this a project that has been worked on in previous
> > > > > > > > > rounds of GSOC/Outreachy also?
> > > > > > > > > If so is there a place to find links to the previous
> > > > > > > > > participants' blogs? :)
> > > > > > > >
> > > > > > > > No. We had one participant at some point, but they couldn't
> > > > > > > > even start, for personal reasons. There are some people
> > > > > > > > considering working on this for this next round of Outreachy,
> > > > > > > > however. You'll see their messages in this mailing list.
> > > > > > > >
> > > > > > > > > Should questions about the specifications/completion of the
> > > > > > > > > microtask be addressed to IRC or this list? If IRC, which
> > > > > > > > > channel - #xen-opw or #metrics-grimoire? On that note, I'm
> > > > > > > > > curious why #metrics-grimoire is the listed channel on the
> > > > > > > > > project page - are main contributors involved in both
> > > > > > > > > projects? Or is it just because the Xen dashboard doesn't
> > > > > > > > > have a channel?
> > > > > > > >
> > > > > > > > The code review is for the Xen project, but it is done with (I
> > > > > > > > mean, the software used for it is) GrimoireLab, which for
> > > > > > > > historical reasons uses the #metrics-grimoire channel. That's
> > > > > > > > why it is likely that you find somebody from the project there.
> > > > > > > >
> > > > > > > > If you have questions, and find me around in IRC, please ping
> > > > > > > > me. If I'm not available, please send an email message.
> > > > > > > >
> > > > > > > > Saludos,
> > > > > > > >
> > > > > > > >         Jesus.
> > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > >
> > > > > > > > > Heather
> > > > > > > > > _______________________________________________
> > > > > > > > > Xen-devel mailing list
> > > > > > > > > Xen-devel@lists.xen.org
> > > > > > > > > https://lists.xen.org/xen-devel
> > > > > > > > --
> > > > > > > > Bitergia: http://bitergia.com
> > > > > > > > /me at Twitter: https://twitter.com/jgbarah