Re: Patch stack analysis

2019-08-07 Thread Lukas.Bulwahn
Hi Daniel,

> 
> > we (Ralf Ramsauer, Lukas Bulwahn and me) are currently working on
> > extending the capabilities of Patchwork by combining it with a tool
> > called PaStA [1] (Patch Stack Analysis). PaStA is the outcome of a
> > research project [2] by the Technical University of Applied Sciences
> > Regensburg. It analyses and compares all mails in a mailing list to
> > find related ones (e.g. former versions of the patch, see [3]). Ralf
> > compared PaStA's results for the Linux kernel mailing list with a
> > manually created ground truth and achieved an accuracy of 91%. This
> > motivated us to integrate PaStA into Patchwork.
> 
> Cool, always interesting to see what people build on top of Patchwork!
> 

I hope we can nicely integrate that into what is already there.

> 
> One bit of relevant Patchwork history: there's a long-running fork run by
> the freedesktop.org people: patchwork.freedesktop.org,
> https://gitlab.freedesktop.org/patchwork-fdo/patchwork-fdo/ . They took a
> different approach to series than we did: we focused on patches as the key
> 'unit' of patchworking, they focused on series as the key unit. They already
> have some support for multiple revisions of a series. I don't know how
> they've implemented their feature for detecting multiple revisions, but I'm
> guessing it's not based on analysis of (commit message, diff) tuples. There's
> an example here:
> https://patchwork.freedesktop.org/series/49692/
> 

Yes, we will certainly have a look at what they implemented and consider 
incorporating the good ideas they had.

> > Showing related patches (besides the ones in the current series) allows
> > developers to understand the patch's evolution better. We have
> > adjusted the patch details view and renamed the series patch links
> > from "related" to "series". Our new related row shows the patches
> > related to each other by PaStA [3][4]. The relations between the
> > patches in the screenshot were made manually and the next steps will
> > be to automate this procedure with PaStA.
> 
> I'm really wary about incorporating something with so many dependencies
> (and with presumably higher resource usage) into the core of patchwork.
> 

Agreed. That is also our main concern: we would like to set this up so that the
use of PaStA is optional and has little impact on the main application, other
than exporting some REST API. We also want Patchwork and PaStA to be able to
run on two different machines, with a clear, loosely coupled interface between
them. How to achieve this step by step is what we are currently discussing.

> I'd want to know a few things:
> 
>  - what is the accuracy of the FDO Patchwork approach (which I assume is
>    100% metadata based)? Does it require that patch submitters do
>    particular things (e.g. use the same cover letter title)? Sometimes
>    we can train users to be helpful in how they submit things to the
>    lists in order to have them work properly in more simple systems.
> 
>  - one key use case is the Linux kernel, where we have stable trees, and
>    patches getting picked up for those trees. Sometimes those patches
>    are identical and sometimes they need backporting. Some care would
>    need to be taken around this.
> 
>    An example would be:
>     - I send this patch to the mailing list:
>       http://patchwork.ozlabs.org/patch/1099934/
>     - It is merged into mainline
>     - It is proposed for stable trees. This involves multiple threads of
>       over 100 emails each, including:
>       * https://lkml.org/lkml/2019/5/29/1655
>       * https://lkml.org/lkml/2019/5/30/361
>       * (plus 3 others)
> 
>    In this case, the original patch is related to the stable patches,
>    (despite being sent by someone different), and it is interesting and
>    useful to know what stable series a patch landed in. However, the
>    patch is not really related to the entire stable patch _series_, and
>    if you include all the hundreds of patches in your 'related' view in
>    [3], you will drown out all the potentially useful signal in a bunch
>    of noise.
> 
>    It does get more complicated than this too, for example when there is
>    a need to backport a patch for stable. (See
>    e.g. http://patchwork.ozlabs.org/patch/1109024/ and friends)
> 

I agree. We will need to identify stable patches.
We already have multiple good indicators that we will investigate:

- the date of upstream inclusion, i.e., when the patch was finally included
  in the main repository (the date of Linus' merge commit for the patch)
- the date of the stable patch email
- the sender of the stable patch email; in Linux kernel development this is
  usually Greg KH or Sasha Levin
- whether a particular email address is CC-ed
- whether the commit message contains a specific string

All these points are mostly project-specific, so we will need to check how to
make them configurable so that they fit all projects and reach good
precision and recall.
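
To illustrate how such indicators could be made configurable per project, here
is a minimal sketch. It is not how PaStA works today; the names `StableConfig`,
`Mail` and `stable_indicators`, as well as all addresses and marker strings, are
hypothetical and only serve as placeholders.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class StableConfig:
    """Per-project settings; every value is supplied by the project (hypothetical)."""
    stable_senders: frozenset = frozenset()   # lower-case sender addresses
    stable_cc: frozenset = frozenset()        # lower-case CC addresses
    commit_markers: tuple = ()                # substrings looked up in the commit message


@dataclass
class Mail:
    sender: str
    cc: list
    date: datetime
    commit_message: str


def stable_indicators(mail, upstream_date, cfg):
    """Return the names of all indicators that fire for this mail.

    upstream_date is the date of upstream inclusion, or None if unknown.
    """
    hits = []
    if upstream_date is not None and mail.date > upstream_date:
        hits.append("sent_after_upstream_inclusion")
    if mail.sender.lower() in cfg.stable_senders:
        hits.append("stable_sender")
    if any(addr.lower() in cfg.stable_cc for addr in mail.cc):
        hits.append("stable_cc")
    if any(marker in mail.commit_message for marker in cfg.commit_markers):
        hits.append("commit_marker")
    return hits
```

Returning the list of indicators instead of a single yes/no keeps the decision
rule (e.g. "at least two indicators") separate from the checks, so each project
could tune it against its own ground truth.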

>  - what's the resource 

Re: Patch stack analysis

2019-06-03 Thread Daniel Axtens
Hi,

> we (Ralf Ramsauer, Lukas Bulwahn and me) are currently working on
> extending the capabilities of Patchwork by combining it with a tool
> called PaStA [1] (Patch Stack Analysis). PaStA is the outcome of a
> research project [2] by the Technical University of Applied Sciences
> Regensburg. It analyses and compares all mails in a mailing list to
> find related ones (e.g. former versions of the patch, see [3]). Ralf
> compared PaStA's results for the Linux kernel mailing list with a
> manually created ground truth and achieved an accuracy of 91%. This
> motivated us to integrate PaStA into Patchwork.

Cool, always interesting to see what people build on top of Patchwork!

We did consider having a feature like this and from memory we might even
have some infrastructure for it. (I get it confused with the feature
allowing a patch to belong to multiple series which we ripped out a
while ago.)

One bit of relevant Patchwork history: there's a long-running fork
run by the freedesktop.org people: patchwork.freedesktop.org,
https://gitlab.freedesktop.org/patchwork-fdo/patchwork-fdo/ . They took
a different approach to series than we did: we focused on patches as the
key 'unit' of patchworking, they focused on series as the key unit. They
already have some support for multiple revisions of a series. I don't
know how they've implemented their feature for detecting multiple
revisions, but I'm guessing it's not based on analysis of (commit
message, diff) tuples. There's an example here:
https://patchwork.freedesktop.org/series/49692/

> Showing related patches (besides the ones in the current series) allows
> developers to understand the patch's evolution better. We have
> adjusted the patch details view and renamed the series patch links
> from "related" to "series". Our new related row shows the patches
> related to each other by PaStA [3][4]. The relations between the
> patches in the screenshot were made manually and the next steps will
> be to automate this procedure with PaStA.

I'm really wary about incorporating something with so many dependencies
(and with presumably higher resource usage) into the core of
patchwork.

I'd want to know a few things:

 - what is the accuracy of the FDO Patchwork approach (which I assume is
   100% metadata based)? Does it require that patch submitters do
   particular things (e.g. use the same cover letter title)? Sometimes
   we can train users to be helpful in how they submit things to the
   lists in order to have them work properly in more simple systems.

 - one key use case is the Linux kernel, where we have stable trees, and
   patches getting picked up for those trees. Sometimes those patches
   are identical and sometimes they need backporting. Some care would
   need to be taken around this.

   An example would be:
    - I send this patch to the mailing list:
      http://patchwork.ozlabs.org/patch/1099934/
    - It is merged into mainline
    - It is proposed for stable trees. This involves multiple threads of
      over 100 emails each, including:
      * https://lkml.org/lkml/2019/5/29/1655
      * https://lkml.org/lkml/2019/5/30/361
      * (plus 3 others)

   In this case, the original patch is related to the stable patches,
   (despite being sent by someone different), and it is interesting and
   useful to know what stable series a patch landed in. However, the
   patch is not really related to the entire stable patch _series_, and
   if you include all the hundreds of patches in your 'related' view in
   [3], you will drown out all the potentially useful signal in a bunch
   of noise.

   It does get more complicated than this too, for example when there is
   a need to backport a patch for stable. (See
   e.g. http://patchwork.ozlabs.org/patch/1109024/ and friends)

 - what's the resource usage, and how long does matching take?
   kernel.org has a patchwork instance that is hooked up to LKML, so
   this is a deeply practical concern for them!
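
To make the stable-series concern concrete, here is a minimal sketch of a
'related' view that lists only the individual patches matched to a given
patch, labelled with their series, rather than every patch of those series.
It assumes the matcher emits clusters of patch IDs; `related_view`,
`clusters` and `series_of` are hypothetical names, not part of Patchwork or
PaStA.

```python
from collections import defaultdict


def related_view(patch_id, clusters, series_of):
    """Group the patches matched to patch_id by the series they belong to.

    clusters:  iterable of sets of patch IDs considered the same patch
    series_of: mapping from patch ID to a series label
    Only members of the matching cluster are returned, never the other
    patches that happen to share a series with them.
    """
    for cluster in clusters:
        if patch_id in cluster:
            break
    else:
        return {}  # the patch has no known relations

    view = defaultdict(list)
    for other in sorted(cluster):
        if other != patch_id:
            view[series_of.get(other, "(no series)")].append(other)
    return dict(view)
```

With this shape, a patch that was picked up for two stable series shows two
entries, no matter how many hundreds of other patches those series contain.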

I think a really good place to start would be to hook PaStA
up as an API consumer like Snowpatch. It wouldn't be able to report the
results back to patchwork just yet, but you'd be able to try it with
live data and demonstrate its value.
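
As a rough sketch of what such a read-only consumer could look like: the code
below assumes the instance exposes a patch listing at `/api/patches/` that
accepts `project` and `page` query parameters and returns a JSON list whose
entries carry `id`, `name` and `mbox` fields. Please check those assumptions
against the API documentation of the instance you target; the function names
are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def patches_url(base_url, project, page=1):
    """Build the listing URL (endpoint and parameters are assumptions)."""
    query = urlencode({"project": project, "page": page})
    return f"{base_url.rstrip('/')}/api/patches/?{query}"


def parse_patches(payload):
    """Extract (id, name, mbox URL) from a JSON list of patch objects."""
    return [(p["id"], p["name"], p["mbox"]) for p in json.loads(payload)]


def fetch_patches(base_url, project, page=1):
    """Fetch one page of patches; the mbox URLs can then be fed to the matcher."""
    with urlopen(patches_url(base_url, project, page)) as resp:
        return parse_patches(resp.read().decode("utf-8"))
```

Keeping the parsing separate from the network call means the matching side
can be tested against recorded payloads without touching a live instance.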

Thanks for letting us know about your research!

Regards,
Daniel

>
> Stephen, what's your opinion about this?
>
> Greetings,
>
> Mete Polat
>
> [1] https://github.com/lfd/pasta
> [2] https://arxiv.org/pdf/1902.03147.pdf
> [3] https://drive.google.com/drive/folders/18s9FzJUKnIUBp7FBL7dV8dqlGPXqTemq?usp=sharing
> [4] https://github.com/Honeybyte/patchwork/tree/pasta
> 
> ___
> Patchwork mailing list
> Patchwork@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/patchwork