Re: Re: Re: Re: Big data is needed for unit test

2014-12-05 Thread Mattia Rizzolo
On Fri, Dec 05, 2014 at 10:37:39AM +0800, Paul Wise wrote:
> On Thu, Dec 4, 2014 at 11:16 PM, Corentin Desfarges wrote:
> 
> > The file is just used by one of the unit tests. But there are more than 20
> > unit tests which need their own specific data to work.
> 
> Personally I'm now leaning towards you doing the unit tests on your
> own hardware against the installed package, rather than doing the unit
> tests at build time on Debian hosts. You could also place the test
> files on your website and have a README file or test-me script that
> people could use to download the test files and run the tests against
> the installed package.

This, and write a DEP8 test to be run on ci.debian.net.
See http://ci.debian.net/doc/ for more info.
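
For reference, a minimal DEP8 setup for this case could look like the sketch below. The package name, script name, download URL and checksum are placeholders, and the `needs-internet` restriction assumes the test runner permits network access:

```
# debian/tests/control
Tests: run-unit-tests
Depends: fw4spl
Restrictions: needs-internet

# debian/tests/run-unit-tests (shell sketch; URL and entry point hypothetical)
#!/bin/sh
set -e
wget -O md_1.jsonz https://example.org/fw4spl-testdata/md_1.jsonz
echo "<expected-sha256>  md_1.jsonz" | sha256sum -c -
fw4spl-run-tests md_1.jsonz
```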

-- 
regards,
Mattia Rizzolo

GPG Key: 4096R/B9444540 http://goo.gl/I8TMB
more about me:  http://mapreri.org
Launchpad User: https://launchpad.net/~mapreri
Ubuntu Wiki page:   https://wiki.ubuntu.com/MattiaRizzolo




Re: Re: Re: Re: Big data is needed for unit test

2014-12-04 Thread Paul Wise
On Thu, Dec 4, 2014 at 11:16 PM, Corentin Desfarges wrote:

> The file is just used by one of the unit tests. But there are more than 20
> unit tests which need their own specific data to work.

Ok, that makes this a bit more complicated, especially since the
amount of data could grow over time.

Personally I'm now leaning towards you doing the unit tests on your
own hardware against the installed package, rather than doing the unit
tests at build time on Debian hosts. You could also place the test
files on your website and have a README file or test-me script that
people could use to download the test files and run the tests against
the installed package.

-- 
bye,
pabs

https://wiki.debian.org/PaulWise


-- 
To UNSUBSCRIBE, email to debian-mentors-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org
Archive: 
https://lists.debian.org/CAKTje6Hwwt=huqy_bjafmrztqg0hhxq-9ay9fnkcvlsbsi0...@mail.gmail.com



Re: Re: Re: Re: Big data is needed for unit test

2014-12-04 Thread Corentin Desfarges

>> I guess that this new orig.tar.gz would be created by using uscan (if the
>> link is added in d/watch)?

> Uscan requires the file to be on a webserver somewhere. I think you
> would just create it manually using this:
>
> tar Jcf fw4spl_0.1.orig-testdata.tar.xz md_1.jsonz
>
> Then add a debian/README.source file explaining where the file came
> from, how it was produced, the format and the command used to create
> the orig tarball. Copyright and license information should go in
> debian/copyright as usual.

OK, I think I will do that. ;)

>> So I have to upload my data (4GB) somewhere where uscan could find it. But
>> I've no idea where to upload it, given that GitHub doesn't accept files
>> bigger than 100MB. Do you have any idea?

> The test file is only 200MB, where is this 4GB coming from?

>> The file is just used by one of the unit tests. But there are more than 20
>> unit tests which need their own specific data to work.

> For the 200MB file it would be fine to upload the whole package including
> the test file to mentors.debian.net. For larger things I think we would
> need to implement data.debian.org, a service that has been wanted for
> many years.

I'll check with my colleagues to see whether we have an internal solution for
hosting the data.

Once again, thank you very much for your help and your clarifications.

Best regards,

Corentin Desfarges





Re: Re: Re: Big data is needed for unit test

2014-12-04 Thread Corentin Desfarges

Hi


>>> Can you link to the file we are talking about?
>> With the authorization of those responsible for the project, I published
>> the file here [2]
>> [2] http://goo.gl/53sAzM

> This looks a bit weird. I guess this Google thing allows you to inspect the
> content of zip files?

Yes, indeed. I simply uploaded it to Google Drive.

> The 178 MiB file in question, named md_1.jsonz, is not a gzipped JSON file
> as the name suggests but a zip archive which contains a 1.4 MiB JSON file
> with metadata and 470 MiB of what seems to be binary data.
>
> If you want to add more complexity to compress this further and save space,
> you could add the md_1.jsonz file to a Debian source package in its
> *unzipped* version. The xz compressor will then be able to compress the
> data down to 120 MiB (with both -9 and -9e). At build time you would then
> zip the content again (with --compression-method=store for speed, because
> size doesn't matter at that point), since I guess your software expects the
> data in this zipped format and cannot handle it in unpacked form. This
> method would allow you to save another 58 MiB compared to just adding the
> original file. I guess it is up to you whether you want to do it like that
> to reduce the file size, because as pabs already pointed out, there are
> already source packages in Debian that are larger than 200 MiB. So this is
> just an idea :)

It seems like a good idea to me, but for now I think that using a new
orig.tar.gz is less complicated for what I want.

But thank you very much for your suggestion. I'll take note of it.


Best regards,


Corentin





Re: Re: Re: Big data is needed for unit test

2014-12-04 Thread Paul Wise
On Thu, Dec 4, 2014 at 10:17 PM, Corentin Desfarges wrote:

> It's not about a real patient...
> So I don't think that there is any problem of confidentiality in this case.

Fair enough.

> I guess that this new orig.tar.gz would be created by using uscan (if the
> link is added in d/watch) ?

Uscan requires the file to be on a webserver somewhere. I think you
would just create it manually using this:

tar Jcf fw4spl_0.1.orig-testdata.tar.xz md_1.jsonz

Then add a debian/README.source file explaining where the file came
from, how it was produced, the format and the command used to create
the orig tarball. Copyright and license information should go in
debian/copyright as usual.
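
A debian/README.source along those lines might read roughly as follows; the details about origin and format are placeholders to adapt:

```
The test data tarball fw4spl_0.1.orig-testdata.tar.xz contains md_1.jsonz,
a zip archive holding a JSON metadata file plus raw image data, produced by
upstream specifically for testing (not a real patient). It was downloaded
from <upstream URL> and repacked with:

    tar Jcf fw4spl_0.1.orig-testdata.tar.xz md_1.jsonz

Copyright and license information is recorded in debian/copyright.
```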

> So I have to upload my data (4GB) somewhere where uscan could find it. But
> I've no idea about where upload it, given Github doesn't accept files bigger
> than 100MB. Have you any idea ?

The test file is only 200MB, where is this 4GB coming from? For the
200MB file it would be fine to upload the whole package including the
test file to mentors.debian.net. For larger things I think we would
need to implement data.debian.org, a service that has been wanted for
many years.

https://dsa.debian.org/hardware-wishlist/
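
As an aside: later uscan versions (watch file format 4, which postdates this thread) grew support for fetching component tarballs directly, so once the data is hosted somewhere stable, a debian/watch along these lines could track both tarballs. The URLs and filename patterns here are hypothetical:

```
version=4
https://example.org/releases/ fw4spl-(\d[\d.]*)\.tar\.gz
opts=component=testdata \
  https://example.org/releases/ fw4spl-testdata-(\d[\d.]*)\.tar\.xz
```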

-- 
bye,
pabs

https://wiki.debian.org/PaulWise





Re: Re: Re: Big data is needed for unit test

2014-12-04 Thread Corentin Desfarges

Hi


>> With the authorization of those responsible for the project, I published
>> the file here [2]

> It contains the name of one patient and his birth date, so that
> probably wasn't a good idea. This file appears to contain CT scan
> results in a custom format? I can't view the scan itself as the
> software isn't packaged yet :) I was able to view the metadata though.

It's not about a real patient. It's an acquisition done especially for the
tests of the software. The "patient" is in fact a developer of the framework,
and you can find the information that you saw in the .json in the source
code of the software, in the Google Code repository [1].
So I don't think that there is any confidentiality problem in this case.

> Back to the original question of reducing the size of the data:
> You could unzip the file, remove all of the large .raw files and leave
> some small ones, modify root.json to remove the entries for .raw files
> you removed and then zip the file up again. I'm not sure if this would
> result in a valid file or not.

No, I can't do that, because some unit tests could use the .raw files, and I
can't delete one of them without breaking the data file's integrity.

> You could also do another scan at a much lower resolution if that is
> possible with the equipment you have.

Unfortunately, I don't have the required equipment to do that, and I'm not in
charge of creating new unit tests (based on potential new data). But it's
true that it could be a great solution... except that my problem for this
unit test is the same for all the other unit tests: I have more than 4GB of
data, so in any case I will have a large quantity of data.

> Anyway, I don't consider the size to be a big issue as long as you put
> the data in a second orig.tar.gz.

I guess that this new orig.tar.gz would be created by using uscan (if the
link is added in d/watch)?

> Google Drive is very unfriendly to people who turn off JS, cookies, etc.;
> next time please upload the file somewhere else and link directly to the
> file download URL instead of indirect ways to find the file.

So I have to upload my data (4GB) somewhere where uscan could find it. But
I've no idea where to upload it, given that GitHub doesn't accept files
bigger than 100MB. Do you have any idea?



Thank you for your help


Best regards,

Corentin


[1] 
https://code.google.com/p/fw4spl/source/browse/Bundles/LeafPatch/patchMedicalData/test/tu/src/PatchTest.cpp





Re: Re: Big data is needed for unit test

2014-12-02 Thread Paul Wise
On Wed, Dec 3, 2014 at 12:29 AM, Corentin Desfarges wrote:

> With the authorization of those responsible for the project, I published the
> file here [2]

It contains the name of one patient and his birth date, so that
probably wasn't a good idea. This file appears to contain CT scan
results in a custom format? I can't view the scan itself as the
software isn't packaged yet :) I was able to view the metadata though.

Back to the original question of reducing the size of the data:

You could unzip the file, remove all of the large .raw files and leave
some small ones, modify root.json to remove the entries for .raw files
you removed and then zip the file up again. I'm not sure if this would
result in a valid file or not.
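
As a concrete sketch of that pruning idea (with the caveat the mail itself raises: the layout of root.json inside md_1.jsonz is a guess here, assumed to be a "files" list keyed by "name", and the result may well not be a valid fw4spl data file):

```python
import json
import zipfile

def shrink(src, dst, keep=1):
    """Copy the zip at src to dst, dropping all but `keep` .raw members
    and pruning the corresponding entries from root.json."""
    with zipfile.ZipFile(src) as zin:
        raws = sorted(n for n in zin.namelist() if n.endswith(".raw"))
        dropped = set(raws[keep:])          # keep only the first few .raw files
        meta = json.loads(zin.read("root.json"))
        # Assumed metadata layout: {"files": [{"name": "..."}, ...]}
        meta["files"] = [f for f in meta.get("files", [])
                         if f.get("name") not in dropped]
        with zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
            for name in zin.namelist():
                if name in dropped:
                    continue
                data = json.dumps(meta).encode() if name == "root.json" \
                    else zin.read(name)
                zout.writestr(name, data)
    return dropped
```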

You could also do another scan at a much lower resolution if that is
possible with the equipment you have.

Anyway, I don't consider the size to be a big issue as long as you put
the data in a second orig.tar.gz. An example of this can be seen here:

http://snapshot.debian.org/package/megaglest-data/3.7.1-1/
http://snapshot.debian.org/archive/debian/20130918T21Z/pool/main/m/megaglest-data/megaglest-data_3.7.1-1.dsc

> [2] http://goo.gl/...

Google Drive is very unfriendly to people who turn off JS, cookies, etc.;
next time please upload the file somewhere else and link directly to the
file download URL instead of indirect ways to find the file.

-- 
bye,
pabs

https://wiki.debian.org/PaulWise





Re: Re: Big data is needed for unit test

2014-12-02 Thread Johannes Schauer
Hi,

Quoting Corentin Desfarges (2014-12-02 17:29:12)
> > Can you link to the file we are talking about?
> With the authorization of those responsible for the project, I published the
> file here [2]
> [2] http://goo.gl/53sAzM

This looks a bit weird. I guess this Google thing allows you to inspect the
content of zip files?

The 178 MiB file in question, named md_1.jsonz, is not a gzipped JSON file as
the name suggests but a zip archive which contains a 1.4 MiB JSON file with
metadata and 470 MiB of what seems to be binary data.

If you want to add more complexity to compress this further and save space,
you could add the md_1.jsonz file to a Debian source package in its *unzipped*
version. The xz compressor will then be able to compress the data down to 120
MiB (with both -9 and -9e). At build time you would then zip the content again
(with --compression-method=store for speed, because size doesn't matter at
that point), since I guess your software expects the data in this zipped
format and cannot handle it in unpacked form. This method would allow you to
save another 58 MiB compared to just adding the original file. I guess it is
up to you whether you want to do it like that to reduce the file size, because
as pabs already pointed out, there are already source packages in Debian that
are larger than 200 MiB. So this is just an idea :)
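
To make the mechanics concrete, here is a small Python sketch of that round trip. The payload and file names are synthetic, so the sizes only illustrate the idea, not the 120 MiB figure from the mail:

```python
import io
import lzma
import zipfile

# Stand-in for the unzipped contents of md_1.jsonz.
payload = b'{"key": "value"}\n' * 4096

# In the source package the *unzipped* data gets xz-compressed (as
# dpkg-source does when building a .orig-*.tar.xz); xz works far better
# on raw data than on an already-deflated zip archive.
xz_size = len(lzma.compress(payload, preset=9))

# At build time, recreate the zip container the software expects, using
# "store" (no compression), since a second compression pass buys nothing.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as z:
    z.writestr("md_1.json", payload)

print("raw:", len(payload), "xz:", xz_size, "stored zip:", len(buf.getvalue()))
```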

cheers, josch





Re: Re: Big data is needed for unit test

2014-12-02 Thread Corentin Desfarges

> I don't know enough about the software and the file we are talking
> about to answer that.

The "software" (fw4spl) is a framework focused on the processing and
visualization of medical images.
Here is the project repository: [1]

> Can you link to the file we are talking about?

With the authorization of those responsible for the project, I published the
file here [2].

Best regards,

Corentin Desfarges


[1] https://code.google.com/p/fw4spl/
[2] http://goo.gl/53sAzM





Re: Big data is needed for unit test

2014-12-02 Thread Paul Wise
On Tue, Dec 2, 2014 at 6:12 PM, Corentin Desfarges wrote:

> It is a .jsonz file.

I assume that is a gzip-compressed JSON file.

> Actually the file isn't into the source package. I've the choice, and it's
> why I would use the best practice.

Please keep it separate then; it is better for Debian to do that.

> I don't understand. I've just one single data file for the test. How can I
> include a "smaller" test-case ?

I don't know enough about the software and the file we are talking
about to answer that.

Can you link to the file we are talking about?

-- 
bye,
pabs

https://wiki.debian.org/PaulWise





Re: Big data is needed for unit test

2014-12-02 Thread Corentin Desfarges

> Then its license is the default "all rights reserved": you can't
> redistribute it, you can't change it, you can't do anything, and you can't
> use it without explicit permission from its copyright holder.
> Of course you can't upload it to the Debian archive (see the DFSG).
>
> Please check more carefully: given that it is the test case for free
> software, maybe the data are free as well.

In fact, I'm not sure that the file has no license; I meant that I haven't
found any indication about the license of this file. But I work for the
company that develops the software, so I've asked those responsible, and now
I'm waiting for their answer.

Thank you !

Best regards,

Corentin Desfarges





Re: Big data is needed for unit test

2014-12-02 Thread Mattia Rizzolo
On Tue, Dec 02, 2014 at 11:12:35AM +0100, Corentin Desfarges wrote:
> > What license is the file under?
> 
> It doesn't have any license.

Then its license is the default "all rights reserved": you can't redistribute
it, you can't change it, you can't do anything, and you can't use it without
explicit permission from its copyright holder.
Of course you can't upload it to the Debian archive (see the DFSG).

Please check more carefully: given that it is the test case for free
software, maybe the data are free as well.

-- 
regards,
Mattia Rizzolo

GPG Key: 4096R/B9444540 http://goo.gl/I8TMB
more about me:  http://mapreri.org
Launchpad User: https://launchpad.net/~mapreri
Ubuntu Wiki page:   https://wiki.ubuntu.com/MattiaRizzolo




Re: Big data is needed for unit test

2014-12-02 Thread Corentin Desfarges

Hi and thank you for your answer


>> I'm working on the packaging of fw4spl (a medical software), and I'm faced
>> with a new problem: one of the unit tests needs to load an important
>> data file, which is quite large (~200 Mo).

> I assume you mean 200MB here.

Yes, I do.

>> Should I simply remove this test, or can I include the data file in the
>> package?

> Can you include more details about this data file?
> What data format is the file in?

It is a .jsonz file.

> What license is the file under?

It doesn't have any license.

> Does upstream ship the file in the same source package?

Actually, the file isn't in the source package. I have the choice, and that's
why I would like to use the best practice.

> If you can get upstream to include a smaller test-case or generate one
> at runtime, that would probably be a good idea.

I don't understand. I have just one single data file for the test. How can I
include a "smaller" test-case?

> If upstream ships the file separately to the source, please use the
> multi-orig.tar.gz support in 3.0 source packages so that updates to
> the source but not the test data don't add 200MB to snapshot.d.o for
> every new upstream release. If upstream doesn't ship the file
> separately it might be useful to do so for this reason.

It seems to be the best solution in my case...


Thank you for your advice!

Best regards,

Corentin Desfarges



Re: Big data is needed for unit test

2014-12-01 Thread Jérémy Lal
On Monday, 1 December 2014 at 17:28 +0100, Johannes Schauer wrote:
> Hi,
> 
> Quoting Paul Wise (2014-12-01 17:03:39)
> > > Should I simply remove this test, or can I include the data file in the
> > > package ?
> > 
> > Can you include more details about this data file?
> > 
> > What data format is the file in?
> 
> depending on the answer to this question it might be very simple to compress
> the file to 5-10% of its original size (usually possible for XML based
> formats).

If that's the case, then you have nothing special to do, since the
package tarball is compressed... you just have to check the actual
size of the compressed upstream tarball.

Jérémy.






Re: Big data is needed for unit test

2014-12-01 Thread Johannes Schauer
Hi,

Quoting Paul Wise (2014-12-01 17:03:39)
> > Should I simply remove this test, or can I include the data file in the
> > package ?
> 
> Can you include more details about this data file?
> 
> What data format is the file in?

depending on the answer to this question it might be very simple to compress
the file to 5-10% of its original size (usually possible for XML based
formats).

cheers, josch





Re: Big data is needed for unit test

2014-12-01 Thread Paul Wise
On Mon, Dec 1, 2014 at 10:01 PM, Corentin Desfarges wrote:

> I'm working on the packaging of fw4spl (a medical software), and I'm faced
> with a new problem: one of the unit tests needs to load an important
> data file, which is quite large (~200 Mo).

I assume you mean 200MB here.

> Should I simply remove this test, or can I include the data file in the
> package ?

Can you include more details about this data file?

What data format is the file in?

What license is the file under?

Does upstream ship the file in the same source package?

If you can get upstream to include a smaller test-case or generate one
at runtime, that would probably be a good idea.

If upstream ships the file separately to the source, please use the
multi-orig.tar.gz support in 3.0 source packages so that updates to
the source but not the test data don't add 200MB to snapshot.d.o for
every new upstream release. If upstream doesn't ship the file
separately it might be useful to do so for this reason.
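
For what a multi-tarball 3.0 (quilt) upload looks like on disk, the set of files would be roughly this (version numbers are just examples; dpkg-source unpacks each orig-<component> tarball into a directory named after the component):

```
fw4spl_0.1.orig.tar.xz             # upstream source
fw4spl_0.1.orig-testdata.tar.xz    # component tarball with the test data
fw4spl_0.1-1.debian.tar.xz         # Debian packaging
fw4spl_0.1-1.dsc
```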

That said, we have larger source packages in Debian so it should be fine.

-- 
bye,
pabs

https://wiki.debian.org/PaulWise





Big data is needed for unit test

2014-12-01 Thread Corentin Desfarges
Dear Mentors,


I'm working on the packaging of fw4spl (a medical software), and I'm faced
with a new problem: one of the unit tests needs to load an important
data file, which is quite large (~200 Mo).

Should I simply remove this test, or can I include the data file in the
package ?


Thank you for your help,

Best regards,


Corentin Desfarges