Re: [VOTE] Release Apache Commons CSV 1.8 based on RC1

2020-01-20 Thread Alex Herbert


> On 21 Jan 2020, at 00:00, Gary Gregory  wrote:
> 
> On Mon, Jan 20, 2020 at 6:43 PM Rob Tompkins  > wrote:
> 
>> 
>> 
>>> On Jan 20, 2020, at 6:41 PM, Gary Gregory 
>> wrote:
>>> 
>>> On Mon, Jan 20, 2020 at 6:28 PM Gary Gregory 
>> wrote:
>>> 
> On Sun, Jan 19, 2020 at 7:39 AM Alex Herbert >> 
> wrote:
> 
> Hi Gary,
> 
> I raised a few niggles a while back with CSV and the discussion did not
> receive a response on how to proceed.
> 
> There is the major bug CSV-248 where the CSVRecord is not Serializable
> [1]. This requires a decision on what to do to fix it. This bug is
>> still
> present in 1.8 RC1 as found by FindBugs [2].
> 
> From what I can see the CSVRecord maintains a reference to the
>> CSVParser.
> This chain of objects maintained in memory is not serializable and
>> leads
> back to the original input Reader.
> 
> I can see from the JApiCmp report that the serial version id was
>> changed
> for CSVRecord this release so there is still an intention to support
> serialization. So this should be a blocker.
> 
> I could not find a serialisation test in the unit tests for CSVRecord.
> This quick test added to CSVRecordTest fails:
> 
> 
> @Test
> public void testSerialization() throws IOException {
>   CSVRecord shortRec;
>   try (final CSVParser parser = CSVParser.parse("a,b",
> CSVFormat.newFormat(','))) {
>   shortRec = parser.iterator().next();
>   }
>   final ByteArrayOutputStream out = new ByteArrayOutputStream();
>   try (ObjectOutputStream oos = new ObjectOutputStream(out)) {
>   oos.writeObject(shortRec);
>   }
> }
> 
> mvn test -Dtest=CSVRecordTest
> 
> [ERROR] testSerialization  Time elapsed: 0.032 s  <<< ERROR!
> java.io.NotSerializableException: org.apache.commons.csv.CSVParser
>   at
> 
>> org.apache.commons.csv.CSVRecordTest.testSerialization(CSVRecordTest.java:235)
> 
> If I mark the field csvParser as transient it passes. So this is a
> problem as raised by FindBugs.
> 
 
 Making the field transient would indeed fix this test but... some of the
 record APIs would then fail with NPEs... so we would be kicking the can
 down the road basically. Making the parser serializable is going in the
 wrong direction feature-wise IMO, so let's not go there. A serialization
 proxy would be less worse but should we provide such a thing? I would
>> say
 no. I am OK with making the field transient despite the can kicking, so
 let's do that.
 
>>> 
>>> I suppose we should bump the serialVersionUID...
>> 
>> +1 to that
>> 
> 
> The pickle if we do that is that we will be binary incompatible as JApiCmp
> notes:
> MODIFIED (Serializable incompatible(!): serialVersionUID modified) final
> public class org.apache.commons.csv.CSVRecord

If the field is transient then no need to change the serial version ID. That 
changes when the classes serialised from an old version cannot be deserialised 
with the new version. Since the ‘new’ parser field is transient it should be 
ignored.

No one will have a class serialised using version 1.7. It would have thrown an 
exception as the parser is not serialisable. So we go back to 1.6 and serialise 
a CSVRecord and check it can be deserialised by 1.8.

Alex


> 
> Gary
> 
>> 
>> 
>>> Gary
>>> 
>>> 
 
 Gary
 
 
> 
> 
> I also raised [3] the strange implementation of the CSVParser
> getHeaderNames() which ignores null headers as they cannot be used as
>> a key
> into the map. However the list of column names could contain the null
> values. This test currently fails:
> 
> @Test
> public void testHeaderNamesWithNull() throws IOException {
>   final Reader in = new
> StringReader("header1,null,header3\n1,2,3\n4,5,6");
>   final Iterator records = CSVFormat.DEFAULT.withHeader()
> 
> .withNullString("null")
> 
> .withAllowMissingColumnNames()
> 
> .parse(in).iterator();
>   final CSVRecord record = records.next();
>   assertEquals(Arrays.asList("header1", null, "header3"),
> record.getParser().getHeaderNames());
> }
> 
> I am not saying it should pass but at least the documentation should
> state the behaviour in this edge case. That is the list of header
>> names may
> be shorter than the number of columns when the parser is configured to
> allow null headers. I’ve not raised a bug ticket for this as it is
>> open to
> opinion if this is by design or actually a bug. This issue is still
>> present
> in 1.8 RC1.
> 
> Previously I suggested documentation changes for this and another edge
> case using the header map to be added to the javadoc for
>> getHeaderNames()
> and getHeaderMap():
> 
> - Documentation:
> 
> The mapping is only guaranteed to be a 

Re: [VOTE] Release Apache Commons CSV 1.8 based on RC1

2020-01-20 Thread Gary Gregory
On Mon, Jan 20, 2020 at 6:43 PM Rob Tompkins  wrote:

>
>
> > On Jan 20, 2020, at 6:41 PM, Gary Gregory 
> wrote:
> >
> > On Mon, Jan 20, 2020 at 6:28 PM Gary Gregory 
> wrote:
> >
> >>> On Sun, Jan 19, 2020 at 7:39 AM Alex Herbert  >
> >>> wrote:
> >>>
> >>> Hi Gary,
> >>>
> >>> I raised a few niggles a while back with CSV and the discussion did not
> >>> receive a response on how to proceed.
> >>>
> >>> There is the major bug CSV-248 where the CSVRecord is not Serializable
> >>> [1]. This requires a decision on what to do to fix it. This bug is
> still
> >>> present in 1.8 RC1 as found by FindBugs [2].
> >>>
> >>> From what I can see the CSVRecord maintains a reference to the
> CSVParser.
> >>> This chain of objects maintained in memory is not serializable and
> leads
> >>> back to the original input Reader.
> >>>
> >>> I can see from the JApiCmp report that the serial version id was
> changed
> >>> for CSVRecord this release so there is still an intention to support
> >>> serialization. So this should be a blocker.
> >>>
> >>> I could not find a serialisation test in the unit tests for CSVRecord.
> >>> This quick test added to CSVRecordTest fails:
> >>>
> >>>
> >>> @Test
> >>> public void testSerialization() throws IOException {
> >>>CSVRecord shortRec;
> >>>try (final CSVParser parser = CSVParser.parse("a,b",
> >>> CSVFormat.newFormat(','))) {
> >>>shortRec = parser.iterator().next();
> >>>}
> >>>final ByteArrayOutputStream out = new ByteArrayOutputStream();
> >>>try (ObjectOutputStream oos = new ObjectOutputStream(out)) {
> >>>oos.writeObject(shortRec);
> >>>}
> >>> }
> >>>
> >>> mvn test -Dtest=CSVRecordTest
> >>>
> >>> [ERROR] testSerialization  Time elapsed: 0.032 s  <<< ERROR!
> >>> java.io.NotSerializableException: org.apache.commons.csv.CSVParser
> >>>at
> >>>
> org.apache.commons.csv.CSVRecordTest.testSerialization(CSVRecordTest.java:235)
> >>>
> >>> If I mark the field csvParser as transient it passes. So this is a
> >>> problem as raised by FindBugs.
> >>>
> >>
> >> Making the field transient would indeed fix this test but... some of the
> >> record APIs would then fail with NPEs... so we would be kicking the can
> >> down the road basically. Making the parser serializable is going in the
> >> wrong direction feature-wise IMO, so let's not go there. A serialization
> >> proxy would be less worse but should we provide such a thing? I would
> say
> >> no. I am OK with making the field transient despite the can kicking, so
> >> let's do that.
> >>
> >
> > I suppose we should bump the serialVersionUID...
>
> +1 to that
>

The pickle if we do that is that we will be binary incompatible as JApiCmp
notes:
MODIFIED (Serializable incompatible(!): serialVersionUID modified) final
public class org.apache.commons.csv.CSVRecord

Gary

>
>
> > Gary
> >
> >
> >>
> >> Gary
> >>
> >>
> >>>
> >>>
> >>> I also raised [3] the strange implementation of the CSVParser
> >>> getHeaderNames() which ignores null headers as they cannot be used as
> a key
> >>> into the map. However the list of column names could contain the null
> >>> values. This test currently fails:
> >>>
> >>> @Test
> >>> public void testHeaderNamesWithNull() throws IOException {
> >>>final Reader in = new
> >>> StringReader("header1,null,header3\n1,2,3\n4,5,6");
> >>>final Iterator records = CSVFormat.DEFAULT.withHeader()
> >>>
> >>> .withNullString("null")
> >>>
> >>> .withAllowMissingColumnNames()
> >>>
> >>> .parse(in).iterator();
> >>>final CSVRecord record = records.next();
> >>>assertEquals(Arrays.asList("header1", null, "header3"),
> >>> record.getParser().getHeaderNames());
> >>> }
> >>>
> >>> I am not saying it should pass but at least the documentation should
> >>> state the behaviour in this edge case. That is the list of header
> names may
> >>> be shorter than the number of columns when the parser is configured to
> >>> allow null headers. I’ve not raised a bug ticket for this as it is
> open to
> >>> opinion if this is by design or actually a bug. This issue is still
> present
> >>> in 1.8 RC1.
> >>>
> >>> Previously I suggested documentation changes for this and another edge
> >>> case using the header map to be added to the javadoc for
> getHeaderNames()
> >>> and getHeaderMap():
> >>>
> >>> - Documentation:
> >>>
> >>> The mapping is only guaranteed to be a one-to-one mapping if the record
> >>> was created with a format that does not allow duplicate or null header
> >>> names. Null headers are excluded from the map and duplicates can only
> map
> >>> to 1 column.
> >>>
> >>>
> >>> - Bug / Documentation
> >>>
> >>> The CSVParser only stores headers names in a list of header names if
> they
> >>> are not null. So the list can be shorter than the number of columns if
> you
> >>> use a format that allows empty headers and contains null column names.
> >>>
> >>>
> >>> The ultimate result is that we should document that the purpose of the
> >>> header names is to provide 

Re: [codec] Base32/Base64 trailing bit validation is incomplete

2020-01-20 Thread Alex Herbert


> On 20 Jan 2020, at 23:54, Gary Gregory  wrote:
> 
> On Mon, Jan 20, 2020 at 6:45 PM Alex Herbert  >
> wrote:
> 
>> 
>> 
>>> On 20 Jan 2020, at 23:20, Gary Gregory  wrote:
>>> 
>>> I hope this would be at the level of BaseNCodec, not just Base64.
>>> 
>>> I am not sure I like BaseNCodec#setAllowTrailingBits(boolean), maybe
>>> someone more general like BaseNCodec.setStrictDecoding(boolean) where the
>>> default is false for backward compatibility.
>> 
>> I prefer that name.
>> 
>>> 
>>> I do not think we want to go as far as creating an enum for various
>>> enforcement features.
>> 
>> At the moment the implementation is half way between strict and relaxed.
>> Setting it back to relaxed by default removes changes attempted in
>> CODEC-134. This perhaps needs some input from the originator of CODEC-134
>> for their use case. If it is for an uncommon edge case with a secure
>> context then having a non-strict default seems more sensible. The default
>> just throws away stuff that it cannot decode at the end. In most cases this
>> is fine. User reports on this contain statements along the lines of "other
>> libraries can do the decoding, why does codec error?"
>> 
>> So set this to strict=false by default and then allow a strict=true via a
>> BaseNCodec property.
>> 
>> I don’t think we want to add more static overrides with this flag so the
>> default for the static methods would go back to non-strict.
>> 
>> If no objections I’ll raise a Jira ticket and PR to implement this.
>> 
> 
> Sure, and let's use the term 'lenient' instead of 'non-strict’.

OK. I’ll work on this in the next few days.

> 
> Gary
> 
> 
>>> 
>>> Gary
>>> 
>>> On Mon, Jan 20, 2020 at 10:25 AM Alex Herbert 
>>> wrote:
>>> 
 In release 1.14 I fixed a patch made in 1.13 to reject decoding bytes
 that cannot be re-encoded to the same bytes (Codec-134 [1]). My fix was
 to correct the use of a mask to check the trailing bytes.
 
 The implementation checks trailing bits that are to be discarded are all
 zero.
 
 However the check does not go so far as to throw an exception when there
 are trailing bits below the count of 8. In this case there cannot be any
 more bytes and the bits are discarded.
 
 I assume this is because before the trailing bit validation it was not
 necessary to do anything when the number of trailing bits was less than
 a single byte.
 
 However this means it is still possible to decode some bytes and then
 encode them to create different bytes as shown here:
 
 @Test
 public void testTrailing6bits() {
final Base64 codec = new Base64();
// A 4 byte block plus an extra one
byte[] bytes = new byte[] { 'A', 'B', 'C', 'D' };
byte[] decoded = codec.decode(bytes);
byte[] encoded = codec.encode(decoded);
Assert.assertTrue("The round trip is possible",
 Arrays.equals(bytes, encoded));
// A 4 byte block plus an extra one
bytes = new byte[] { 'A', 'B', 'C', 'D', 'E' };
decoded = codec.decode(bytes);
encoded = codec.encode(decoded);
Assert.assertFalse("The round trip is not possible",
 Arrays.equals(bytes, encoded));
 }
 
 @Test
 public void testTrailing5bits() {
final Base32 codec = new Base32();
// An 8 byte block plus an extra one
byte[] bytes = new byte[] { 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'
>> };
byte[] decoded = codec.decode(bytes);
byte[] encoded = codec.encode(decoded);
Assert.assertTrue("The round trip is possible",
 Arrays.equals(bytes, encoded));
// An 8 byte block plus an extra one
bytes = new byte[] { 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I' };
decoded = codec.decode(bytes);
encoded = codec.encode(decoded);
Assert.assertFalse("The round trip is not possible",
 Arrays.equals(bytes, encoded));
 }
 
 Previously when fixing the trailing bit validation I suggested
 validating all trailing bits, even when they cannot be converted into
 any bytes. However this was not done and there are still some invalid
 inputs that do not trigger an exception.
 
 I propose to update the fix by throwing an IllegalArgumentException if
 there are left-over bits that cannot be decoded. That seems to be the
 intention of CODEC-134 to prevent security exploitation.
 
 However note that stricter validation is causing users to complain about
 exceptions when they should be questioning their own data. Recently a
 user raised Codec-279 which stated that exceptions were being thrown
 [2]. I believe the code is functioning as expected. So adding extra
 validation may prove to be more of a headache to users who have somehow
 obtained invalid encodings.
 
 One workaround is to fix the implementation to reject anything that
 cannot be re-encoded to the 

Re: [VOTE] Release Apache Commons CSV 1.8 based on RC1

2020-01-20 Thread Alex Herbert


> On 20 Jan 2020, at 23:28, Gary Gregory  wrote:
> 
> On Sun, Jan 19, 2020 at 7:39 AM Alex Herbert  >
> wrote:
> 
>> Hi Gary,
>> 
>> I raised a few niggles a while back with CSV and the discussion did not
>> receive a response on how to proceed.
>> 
>> There is the major bug CSV-248 where the CSVRecord is not Serializable
>> [1]. This requires a decision on what to do to fix it. This bug is still
>> present in 1.8 RC1 as found by FindBugs [2].
>> 
>> From what I can see the CSVRecord maintains a reference to the CSVParser.
>> This chain of objects maintained in memory is not serializable and leads
>> back to the original input Reader.
>> 
>> I can see from the JApiCmp report that the serial version id was changed
>> for CSVRecord this release so there is still an intention to support
>> serialization. So this should be a blocker.
>> 
>> I could not find a serialisation test in the unit tests for CSVRecord.
>> This quick test added to CSVRecordTest fails:
>> 
>> 
>> @Test
>> public void testSerialization() throws IOException {
>>CSVRecord shortRec;
>>try (final CSVParser parser = CSVParser.parse("a,b",
>> CSVFormat.newFormat(','))) {
>>shortRec = parser.iterator().next();
>>}
>>final ByteArrayOutputStream out = new ByteArrayOutputStream();
>>try (ObjectOutputStream oos = new ObjectOutputStream(out)) {
>>oos.writeObject(shortRec);
>>}
>> }
>> 
>> mvn test -Dtest=CSVRecordTest
>> 
>> [ERROR] testSerialization  Time elapsed: 0.032 s  <<< ERROR!
>> java.io.NotSerializableException: org.apache.commons.csv.CSVParser
>>at
>> org.apache.commons.csv.CSVRecordTest.testSerialization(CSVRecordTest.java:235)
>> 
>> If I mark the field csvParser as transient it passes. So this is a problem
>> as raised by FindBugs.
>> 
> 
> Making the field transient would indeed fix this test but... some of the
> record APIs would then fail with NPEs... so we would be kicking the can
> down the road basically. Making the parser serializable is going in the
> wrong direction feature-wise IMO, so let's not go there. A serialization
> proxy would be less worse but should we provide such a thing? I would say
> no. I am OK with making the field transient despite the can kicking, so
> let's do that.

This is all for a hypothetical problem at the moment. No one has reported being 
unable to (de)serialise when their app depends on this. So minimal effort to 
fix this makes sense.

Looking at the fix you pushed to master I think that the javadoc on getParser() 
should state that the parser is not part of the serialised state of the record.

> 
> Gary
> 
> 
>> 
>> 
>> I also raised [3] the strange implementation of the CSVParser
>> getHeaderNames() which ignores null headers as they cannot be used as a key
>> into the map. However the list of column names could contain the null
>> values. This test currently fails:
>> 
>> @Test
>> public void testHeaderNamesWithNull() throws IOException {
>>final Reader in = new
>> StringReader("header1,null,header3\n1,2,3\n4,5,6");
>>final Iterator records = CSVFormat.DEFAULT.withHeader()
>> 
>> .withNullString("null")
>> 
>> .withAllowMissingColumnNames()
>> 
>> .parse(in).iterator();
>>final CSVRecord record = records.next();
>>assertEquals(Arrays.asList("header1", null, "header3"),
>> record.getParser().getHeaderNames());
>> }
>> 
>> I am not saying it should pass but at least the documentation should state
>> the behaviour in this edge case. That is the list of header names may be
>> shorter than the number of columns when the parser is configured to allow
>> null headers. I’ve not raised a bug ticket for this as it is open to
>> opinion if this is by design or actually a bug. This issue is still present
>> in 1.8 RC1.
>> 
>> Previously I suggested documentation changes for this and another edge
>> case using the header map to be added to the javadoc for getHeaderNames()
>> and getHeaderMap():
>> 
>> - Documentation:
>> 
>> The mapping is only guaranteed to be a one-to-one mapping if the record
>> was created with a format that does not allow duplicate or null header
>> names. Null headers are excluded from the map and duplicates can only map
>> to 1 column.
>> 
>> 
>> - Bug / Documentation
>> 
>> The CSVParser only stores headers names in a list of header names if they
>> are not null. So the list can be shorter than the number of columns if you
>> use a format that allows empty headers and contains null column names.
>> 
>> 
>> The ultimate result is that we should document that the purpose of the
>> header names is to provide a list of non-null header names in the order
>> they occur in the header and thus represent keys that can be used in the
>> header map. In certain circumstances there may be more columns in the data
>> than there are header names.
>> 
>> 
>> Alex
>> 
>> 
>> [1] https://issues.apache.org/jira/browse/CSV-248 <
>> https://issues.apache.org/jira/browse/CSV-248 
>> 

Re: [codec] Base32/Base64 trailing bit validation is incomplete

2020-01-20 Thread Gary Gregory
On Mon, Jan 20, 2020 at 6:45 PM Alex Herbert 
wrote:

>
>
> > On 20 Jan 2020, at 23:20, Gary Gregory  wrote:
> >
> > I hope this would be at the level of BaseNCodec, not just Base64.
> >
> > I am not sure I like BaseNCodec#setAllowTrailingBits(boolean), maybe
> > someone more general like BaseNCodec.setStrictDecoding(boolean) where the
> > default is false for backward compatibility.
>
> I prefer that name.
>
> >
> > I do not think we want to go as far as creating an enum for various
> > enforcement features.
>
> At the moment the implementation is half way between strict and relaxed.
> Setting it back to relaxed by default removes changes attempted in
> CODEC-134. This perhaps needs some input from the originator of CODEC-134
> for their use case. If it is for an uncommon edge case with a secure
> context then having a non-strict default seems more sensible. The default
> just throws away stuff that it cannot decode at the end. In most cases this
> is fine. User reports on this contain statements along the lines of "other
> libraries can do the decoding, why does codec error?"
>
> So set this to strict=false by default and then allow a strict=true via a
> BaseNCodec property.
>
> I don’t think we want to add more static overrides with this flag so the
> default for the static methods would go back to non-strict.
>
> If no objections I’ll raise a Jira ticket and PR to implement this.
>

Sure, and let's use the term 'lenient' instead of 'non-strict'.

Gary


> >
> > Gary
> >
> > On Mon, Jan 20, 2020 at 10:25 AM Alex Herbert 
> > wrote:
> >
> >> In release 1.14 I fixed a patch made in 1.13 to reject decoding bytes
> >> that cannot be re-encoded to the same bytes (Codec-134 [1]). My fix was
> >> to correct the use of a mask to check the trailing bytes.
> >>
> >> The implementation checks trailing bits that are to be discarded are all
> >> zero.
> >>
> >> However the check does not go so far as to throw an exception when there
> >> are trailing bits below the count of 8. In this case there cannot be any
> >> more bytes and the bits are discarded.
> >>
> >> I assume this is because before the trailing bit validation it was not
> >> necessary to do anything when the number of trailing bits was less than
> >> a single byte.
> >>
> >> However this means it is still possible to decode some bytes and then
> >> encode them to create different bytes as shown here:
> >>
> >> @Test
> >> public void testTrailing6bits() {
> >> final Base64 codec = new Base64();
> >> // A 4 byte block plus an extra one
> >> byte[] bytes = new byte[] { 'A', 'B', 'C', 'D' };
> >> byte[] decoded = codec.decode(bytes);
> >> byte[] encoded = codec.encode(decoded);
> >> Assert.assertTrue("The round trip is possible",
> >> Arrays.equals(bytes, encoded));
> >> // A 4 byte block plus an extra one
> >> bytes = new byte[] { 'A', 'B', 'C', 'D', 'E' };
> >> decoded = codec.decode(bytes);
> >> encoded = codec.encode(decoded);
> >> Assert.assertFalse("The round trip is not possible",
> >> Arrays.equals(bytes, encoded));
> >> }
> >>
> >> @Test
> >> public void testTrailing5bits() {
> >> final Base32 codec = new Base32();
> >> // An 8 byte block plus an extra one
> >> byte[] bytes = new byte[] { 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'
> };
> >> byte[] decoded = codec.decode(bytes);
> >> byte[] encoded = codec.encode(decoded);
> >> Assert.assertTrue("The round trip is possible",
> >> Arrays.equals(bytes, encoded));
> >> // An 8 byte block plus an extra one
> >> bytes = new byte[] { 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I' };
> >> decoded = codec.decode(bytes);
> >> encoded = codec.encode(decoded);
> >> Assert.assertFalse("The round trip is not possible",
> >> Arrays.equals(bytes, encoded));
> >> }
> >>
> >> Previously when fixing the trailing bit validation I suggested
> >> validating all trailing bits, even when they cannot be converted into
> >> any bytes. However this was not done and there are still some invalid
> >> inputs that do not trigger an exception.
> >>
> >> I propose to update the fix by throwing an IllegalArgumentException if
> >> there are left-over bits that cannot be decoded. That seems to be the
> >> intention of CODEC-134 to prevent security exploitation.
> >>
> >> However note that stricter validation is causing users to complain about
> >> exceptions when they should be questioning their own data. Recently a
> >> user raised Codec-279 which stated that exceptions were being thrown
> >> [2]. I believe the code is functioning as expected. So adding extra
> >> validation may prove to be more of a headache to users who have somehow
> >> obtained invalid encodings.
> >>
> >> One workaround is to fix the implementation to reject anything that
> >> cannot be re-encoded to the same bytes. Then add a property that allows
> >> this behaviour to be suppressed allowing for example:
> >>
> >> Base64 codec = new Base64();
> >> byte[] bytes = new byte[] { 'A', 

Re: [VOTE] Release Apache Commons CSV 1.8 based on RC1

2020-01-20 Thread Gary Gregory
On Sun, Jan 19, 2020 at 7:39 AM Alex Herbert 
wrote:

> Hi Gary,
>
> I raised a few niggles a while back with CSV and the discussion did not
> receive a response on how to proceed.
>
> There is the major bug CSV-248 where the CSVRecord is not Serializable
> [1]. This requires a decision on what to do to fix it. This bug is still
> present in 1.8 RC1 as found by FindBugs [2].
>
> From what I can see the CSVRecord maintains a reference to the CSVParser.
> This chain of objects maintained in memory is not serializable and leads
> back to the original input Reader.
>
> I can see from the JApiCmp report that the serial version id was changed
> for CSVRecord this release so there is still an intention to support
> serialization. So this should be a blocker.
>
> I could not find a serialisation test in the unit tests for CSVRecord.
> This quick test added to CSVRecordTest fails:
>
>
> @Test
> public void testSerialization() throws IOException {
> CSVRecord shortRec;
> try (final CSVParser parser = CSVParser.parse("a,b",
> CSVFormat.newFormat(','))) {
> shortRec = parser.iterator().next();
> }
> final ByteArrayOutputStream out = new ByteArrayOutputStream();
> try (ObjectOutputStream oos = new ObjectOutputStream(out)) {
> oos.writeObject(shortRec);
> }
> }
>
> mvn test -Dtest=CSVRecordTest
>
> [ERROR] testSerialization  Time elapsed: 0.032 s  <<< ERROR!
> java.io.NotSerializableException: org.apache.commons.csv.CSVParser
> at
> org.apache.commons.csv.CSVRecordTest.testSerialization(CSVRecordTest.java:235)
>
> If I mark the field csvParser as transient it passes. So this is a problem
> as raised by FindBugs.
>
>
>
> I also raised [3] the strange implementation of the CSVParser
> getHeaderNames() which ignores null headers as they cannot be used as a key
> into the map. However the list of column names could contain the null
> values. This test currently fails:
>
> @Test
> public void testHeaderNamesWithNull() throws IOException {
> final Reader in = new
> StringReader("header1,null,header3\n1,2,3\n4,5,6");
> final Iterator records = CSVFormat.DEFAULT.withHeader()
>
>  .withNullString("null")
>
>  .withAllowMissingColumnNames()
>
>  .parse(in).iterator();
> final CSVRecord record = records.next();
> assertEquals(Arrays.asList("header1", null, "header3"),
> record.getParser().getHeaderNames());
> }
>
>
It seems obvious to me that withNullString applies to the header line like
any other but maybe this should be pointed out specifically in the Javadoc.

I am not saying it should pass but at least the documentation should state
> the behaviour in this edge case. That is the list of header names may be
> shorter than the number of columns when the parser is configured to allow
> null headers. I’ve not raised a bug ticket for this as it is open to
> opinion if this is by design or actually a bug. This issue is still present
> in 1.8 RC1.


design vs bug: If this falls into the bug bucket, how would you fix it?  We
should do what is least surprising here IMO.


>
> Previously I suggested documentation changes for this and another edge
> case using the header map to be added to the javadoc for getHeaderNames()
> and getHeaderMap():
>
> - Documentation:
>
> The mapping is only guaranteed to be a one-to-one mapping if the record
> was created with a format that does not allow duplicate or null header
> names. Null headers are excluded from the map and duplicates can only map
> to 1 column.
>

I think I saw a feature request a while back to allow for duplicate column
names. That would obviously not work for querying a record, but you could
imagine that one should be able to write back out what was read in. I'm not
sure we have a test for this case.


>
> - Bug / Documentation
>
> The CSVParser only stores headers names in a list of header names if they
> are not null. So the list can be shorter than the number of columns if you
> use a format that allows empty headers and contains null column names.
>
>
> The ultimate result is that we should document that the purpose of the
> header names is to provide a list of non-null header names in the order
> they occur in the header and thus represent keys that can be used in the
> header map. In certain circumstances there may be more columns in the data
> than there are header names.
>

More docs is fine with me.

Gary


>
>
> Alex
>
>
> [1] https://issues.apache.org/jira/browse/CSV-248 <
> https://issues.apache.org/jira/browse/CSV-248>
>
> [2]
> https://dist.apache.org/repos/dist/dev/commons/csv/1.8-RC1/site/findbugs.html
> <
> https://dist.apache.org/repos/dist/dev/commons/csv/1.8-RC1/site/findbugs.html
> >
>
> [3] https://markmail.org/message/woti2iymecosihx6 <
> https://markmail.org/message/woti2iymecosihx6>
>
>
>
> > On 18 Jan 2020, at 17:52, Gary Gregory  wrote:
> >
> > We have fixed quite a few bugs and added some significant enhancements
> > since Apache Commons CSV 1.7 was released, so I would like to release
> 

Re: [codec] Base32/Base64 trailing bit validation is incomplete

2020-01-20 Thread Alex Herbert



> On 20 Jan 2020, at 23:20, Gary Gregory  wrote:
> 
> I hope this would be at the level of BaseNCodec, not just Base64.
> 
> I am not sure I like BaseNCodec#setAllowTrailingBits(boolean), maybe
> someone more general like BaseNCodec.setStrictDecoding(boolean) where the
> default is false for backward compatibility.

I prefer that name.

> 
> I do not think we want to go as far as creating an enum for various
> enforcement features.

At the moment the implementation is half way between strict and relaxed. 
Setting it back to relaxed by default removes changes attempted in CODEC-134. 
This perhaps needs some input from the originator of CODEC-134 for their use 
case. If it is for an uncommon edge case with a secure context then having a 
non-strict default seems more sensible. The default just throws away stuff that 
it cannot decode at the end. In most cases this is fine. User reports on this 
contain statements along the lines of "other libraries can do the decoding, why 
does codec error?"

So set this to strict=false by default and then allow a strict=true via a 
BaseNCodec property.

I don’t think we want to add more static overrides with this flag so the 
default for the static methods would go back to non-strict.

If no objections I’ll raise a Jira ticket and PR to implement this.

> 
> Gary
> 
> On Mon, Jan 20, 2020 at 10:25 AM Alex Herbert 
> wrote:
> 
>> In release 1.14 I fixed a patch made in 1.13 to reject decoding bytes
>> that cannot be re-encoded to the same bytes (Codec-134 [1]). My fix was
>> to correct the use of a mask to check the trailing bytes.
>> 
>> The implementation checks trailing bits that are to be discarded are all
>> zero.
>> 
>> However the check does not go so far as to throw an exception when there
>> are trailing bits below the count of 8. In this case there cannot be any
>> more bytes and the bits are discarded.
>> 
>> I assume this is because before the trailing bit validation it was not
>> necessary to do anything when the number of trailing bits was less than
>> a single byte.
>> 
>> However this means it is still possible to decode some bytes and then
>> encode them to create different bytes as shown here:
>> 
>> @Test
>> public void testTrailing6bits() {
>> final Base64 codec = new Base64();
>> // A 4 byte block plus an extra one
>> byte[] bytes = new byte[] { 'A', 'B', 'C', 'D' };
>> byte[] decoded = codec.decode(bytes);
>> byte[] encoded = codec.encode(decoded);
>> Assert.assertTrue("The round trip is possible",
>> Arrays.equals(bytes, encoded));
>> // A 4 byte block plus an extra one
>> bytes = new byte[] { 'A', 'B', 'C', 'D', 'E' };
>> decoded = codec.decode(bytes);
>> encoded = codec.encode(decoded);
>> Assert.assertFalse("The round trip is not possible",
>> Arrays.equals(bytes, encoded));
>> }
>> 
>> @Test
>> public void testTrailing5bits() {
>> final Base32 codec = new Base32();
>> // An 8 byte block plus an extra one
>> byte[] bytes = new byte[] { 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H' };
>> byte[] decoded = codec.decode(bytes);
>> byte[] encoded = codec.encode(decoded);
>> Assert.assertTrue("The round trip is possible",
>> Arrays.equals(bytes, encoded));
>> // An 8 byte block plus an extra one
>> bytes = new byte[] { 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I' };
>> decoded = codec.decode(bytes);
>> encoded = codec.encode(decoded);
>> Assert.assertFalse("The round trip is not possible",
>> Arrays.equals(bytes, encoded));
>> }
>> 
>> Previously when fixing the trailing bit validation I suggested
>> validating all trailing bits, even when they cannot be converted into
>> any bytes. However this was not done and there are still some invalid
>> inputs that do not trigger an exception.
>> 
>> I propose to update the fix by throwing an IllegalArgumentException if
>> there are left-over bits that cannot be decoded. That seems to be the
>> intention of CODEC-134 to prevent security exploitation.
>> 
>> However note that stricter validation is causing users to complain about
>> exceptions when they should be questioning their own data. Recently a
>> user raised Codec-279 which stated that exceptions were being thrown
>> [2]. I believe the code is functioning as expected. So adding extra
>> validation may prove to be more of a headache to users who have somehow
>> obtained invalid encodings.
>> 
>> One workaround is to fix the implementation to reject anything that
>> cannot be re-encoded to the same bytes. Then add a property that allows
>> this behaviour to be suppressed allowing for example:
>> 
>> Base64 codec = new Base64();
>> byte[] bytes = new byte[] { 'A', 'B', 'C', 'D', 'E' };
>> byte[] decoded;
>> try {
>> decoded = codec.decode(bytes);
>> } catch (IllegalArgumentException ignore) {
>> codec.setAllowTrailingBits(true);
>> // OK
>> decoded = codec.decode(bytes);
>> }
>> 
>> WDYT?
>> 
>> 1. Fully fix CODEC-134
>> 2. Optionally allow CODEC-134 behaviour 

Re: [VOTE] Release Apache Commons CSV 1.8 based on RC1

2020-01-20 Thread Rob Tompkins



> On Jan 20, 2020, at 6:41 PM, Gary Gregory  wrote:
> 
> On Mon, Jan 20, 2020 at 6:28 PM Gary Gregory  wrote:
> 
>>> On Sun, Jan 19, 2020 at 7:39 AM Alex Herbert 
>>> wrote:
>>> 
>>> Hi Gary,
>>> 
>>> I raised a few niggles a while back with CSV and the discussion did not
>>> receive a response on how to proceed.
>>> 
>>> There is the major bug CSV-248 where the CSVRecord is not Serializable
>>> [1]. This requires a decision on what to do to fix it. This bug is still
>>> present in 1.8 RC1 as found by FindBugs [2].
>>> 
>>> From what I can see the CSVRecord maintains a reference to the CSVParser.
>>> This chain of objects maintained in memory is not serializable and leads
>>> back to the original input Reader.
>>> 
>>> I can see from the JApiCmp report that the serial version id was changed
>>> for CSVRecord this release so there is still an intention to support
>>> serialization. So this should be a blocker.
>>> 
>>> I could not find a serialisation test in the unit tests for CSVRecord.
>>> This quick test added to CSVRecordTest fails:
>>> 
>>> 
>>> @Test
>>> public void testSerialization() throws IOException {
>>>CSVRecord shortRec;
>>>try (final CSVParser parser = CSVParser.parse("a,b",
>>> CSVFormat.newFormat(','))) {
>>>shortRec = parser.iterator().next();
>>>}
>>>final ByteArrayOutputStream out = new ByteArrayOutputStream();
>>>try (ObjectOutputStream oos = new ObjectOutputStream(out)) {
>>>oos.writeObject(shortRec);
>>>}
>>> }
>>> 
>>> mvn test -Dtest=CSVRecordTest
>>> 
>>> [ERROR] testSerialization  Time elapsed: 0.032 s  <<< ERROR!
>>> java.io.NotSerializableException: org.apache.commons.csv.CSVParser
>>>at
>>> org.apache.commons.csv.CSVRecordTest.testSerialization(CSVRecordTest.java:235)
>>> 
>>> If I mark the field csvParser as transient it passes. So this is a
>>> problem as raised by FindBugs.
>>> 
>> 
>> Making the field transient would indeed fix this test but... some of the
>> record APIs would then fail with NPEs... so we would be kicking the can
>> down the road basically. Making the parser serializable is going in the
>> wrong direction feature-wise IMO, so let's not go there. A serialization
>> proxy would be less worse but should we provide such a thing? I would say
>> no. I am OK with making the field transient despite the can kicking, so
>> let's do that.
>> 
> 
> I suppose we should bump the serialVersionUID...

+1 to that


> Gary
> 
> 
>> 
>> Gary
>> 
>> 
>>> 
>>> 
>>> I also raised [3] the strange implementation of the CSVParser
>>> getHeaderNames() which ignores null headers as they cannot be used as a key
>>> into the map. However the list of column names could contain the null
>>> values. This test currently fails:
>>> 
>>> @Test
>>> public void testHeaderNamesWithNull() throws IOException {
>>>final Reader in = new
>>> StringReader("header1,null,header3\n1,2,3\n4,5,6");
>>>final Iterator records = CSVFormat.DEFAULT.withHeader()
>>> 
>>> .withNullString("null")
>>> 
>>> .withAllowMissingColumnNames()
>>> 
>>> .parse(in).iterator();
>>>final CSVRecord record = records.next();
>>>assertEquals(Arrays.asList("header1", null, "header3"),
>>> record.getParser().getHeaderNames());
>>> }
>>> 
>>> I am not saying it should pass but at least the documentation should
>>> state the behaviour in this edge case. That is the list of header names may
>>> be shorter than the number of columns when the parser is configured to
>>> allow null headers. I’ve not raised a bug ticket for this as it is open to
>>> opinion if this is by design or actually a bug. This issue is still present
>>> in 1.8 RC1.
>>> 
>>> Previously I suggested documentation changes for this and another edge
>>> case using the header map to be added to the javadoc for getHeaderNames()
>>> and getHeaderMap():
>>> 
>>> - Documentation:
>>> 
>>> The mapping is only guaranteed to be a one-to-one mapping if the record
>>> was created with a format that does not allow duplicate or null header
>>> names. Null headers are excluded from the map and duplicates can only map
>>> to 1 column.
>>> 
>>> 
>>> - Bug / Documentation
>>> 
>>> The CSVParser only stores headers names in a list of header names if they
>>> are not null. So the list can be shorter than the number of columns if you
>>> use a format that allows empty headers and contains null column names.
>>> 
>>> 
>>> The ultimate result is that we should document that the purpose of the
>>> header names is to provide a list of non-null header names in the order
>>> they occur in the header and thus represent keys that can be used in the
>>> header map. In certain circumstances there may be more columns in the data
>>> than there are header names.
>>> 
>>> 
>>> Alex
>>> 
>>> 
>>> [1] https://issues.apache.org/jira/browse/CSV-248 <
>>> https://issues.apache.org/jira/browse/CSV-248>
>>> 
>>> [2]
>>> https://dist.apache.org/repos/dist/dev/commons/csv/1.8-RC1/site/findbugs.html
>>> <
>>> 

Re: [VOTE] Release Apache Commons CSV 1.8 based on RC1

2020-01-20 Thread Gary Gregory
On Mon, Jan 20, 2020 at 6:28 PM Gary Gregory  wrote:

> On Sun, Jan 19, 2020 at 7:39 AM Alex Herbert 
> wrote:
>
>> Hi Gary,
>>
>> I raised a few niggles a while back with CSV and the discussion did not
>> receive a response on how to proceed.
>>
>> There is the major bug CSV-248 where the CSVRecord is not Serializable
>> [1]. This requires a decision on what to do to fix it. This bug is still
>> present in 1.8 RC1 as found by FindBugs [2].
>>
>> From what I can see the CSVRecord maintains a reference to the CSVParser.
>> This chain of objects maintained in memory is not serializable and leads
>> back to the original input Reader.
>>
>> I can see from the JApiCmp report that the serial version id was changed
>> for CSVRecord this release so there is still an intention to support
>> serialization. So this should be a blocker.
>>
>> I could not find a serialisation test in the unit tests for CSVRecord.
>> This quick test added to CSVRecordTest fails:
>>
>>
>> @Test
>> public void testSerialization() throws IOException {
>> CSVRecord shortRec;
>> try (final CSVParser parser = CSVParser.parse("a,b",
>> CSVFormat.newFormat(','))) {
>> shortRec = parser.iterator().next();
>> }
>> final ByteArrayOutputStream out = new ByteArrayOutputStream();
>> try (ObjectOutputStream oos = new ObjectOutputStream(out)) {
>> oos.writeObject(shortRec);
>> }
>> }
>>
>> mvn test -Dtest=CSVRecordTest
>>
>> [ERROR] testSerialization  Time elapsed: 0.032 s  <<< ERROR!
>> java.io.NotSerializableException: org.apache.commons.csv.CSVParser
>> at
>> org.apache.commons.csv.CSVRecordTest.testSerialization(CSVRecordTest.java:235)
>>
>> If I mark the field csvParser as transient it passes. So this is a
>> problem as raised by FindBugs.
>>
>
> Making the field transient would indeed fix this test but... some of the
> record APIs would then fail with NPEs... so we would be kicking the can
> down the road basically. Making the parser serializable is going in the
> wrong direction feature-wise IMO, so let's not go there. A serialization
> proxy would be less worse but should we provide such a thing? I would say
> no. I am OK with making the field transient despite the can kicking, so
> let's do that.
>

I suppose we should bump the serialVersionUID...

Gary


>
> Gary
>
>
>>
>>
>> I also raised [3] the strange implementation of the CSVParser
>> getHeaderNames() which ignores null headers as they cannot be used as a key
>> into the map. However the list of column names could contain the null
>> values. This test currently fails:
>>
>> @Test
>> public void testHeaderNamesWithNull() throws IOException {
>> final Reader in = new
>> StringReader("header1,null,header3\n1,2,3\n4,5,6");
>> final Iterator records = CSVFormat.DEFAULT.withHeader()
>>
>>  .withNullString("null")
>>
>>  .withAllowMissingColumnNames()
>>
>>  .parse(in).iterator();
>> final CSVRecord record = records.next();
>> assertEquals(Arrays.asList("header1", null, "header3"),
>> record.getParser().getHeaderNames());
>> }
>>
>> I am not saying it should pass but at least the documentation should
>> state the behaviour in this edge case. That is the list of header names may
>> be shorter than the number of columns when the parser is configured to
>> allow null headers. I’ve not raised a bug ticket for this as it is open to
>> opinion if this is by design or actually a bug. This issue is still present
>> in 1.8 RC1.
>>
>> Previously I suggested documentation changes for this and another edge
>> case using the header map to be added to the javadoc for getHeaderNames()
>> and getHeaderMap():
>>
>> - Documentation:
>>
>> The mapping is only guaranteed to be a one-to-one mapping if the record
>> was created with a format that does not allow duplicate or null header
>> names. Null headers are excluded from the map and duplicates can only map
>> to 1 column.
>>
>>
>> - Bug / Documentation
>>
>> The CSVParser only stores headers names in a list of header names if they
>> are not null. So the list can be shorter than the number of columns if you
>> use a format that allows empty headers and contains null column names.
>>
>>
>> The ultimate result is that we should document that the purpose of the
>> header names is to provide a list of non-null header names in the order
>> they occur in the header and thus represent keys that can be used in the
>> header map. In certain circumstances there may be more columns in the data
>> than there are header names.
>>
>>
>> Alex
>>
>>
>> [1] https://issues.apache.org/jira/browse/CSV-248 <
>> https://issues.apache.org/jira/browse/CSV-248>
>>
>> [2]
>> https://dist.apache.org/repos/dist/dev/commons/csv/1.8-RC1/site/findbugs.html
>> <
>> https://dist.apache.org/repos/dist/dev/commons/csv/1.8-RC1/site/findbugs.html
>> >
>>
>> [3] https://markmail.org/message/woti2iymecosihx6 <
>> https://markmail.org/message/woti2iymecosihx6>
>>
>>
>>
>> > On 18 Jan 2020, at 17:52, Gary Gregory  wrote:
>> >
>> > We 

Re: [VOTE] Release Apache Commons CSV 1.8 based on RC1

2020-01-20 Thread Gary Gregory
On Sun, Jan 19, 2020 at 7:39 AM Alex Herbert 
wrote:

> Hi Gary,
>
> I raised a few niggles a while back with CSV and the discussion did not
> receive a response on how to proceed.
>
> There is the major bug CSV-248 where the CSVRecord is not Serializable
> [1]. This requires a decision on what to do to fix it. This bug is still
> present in 1.8 RC1 as found by FindBugs [2].
>
> From what I can see the CSVRecord maintains a reference to the CSVParser.
> This chain of objects maintained in memory is not serializable and leads
> back to the original input Reader.
>
> I can see from the JApiCmp report that the serial version id was changed
> for CSVRecord this release so there is still an intention to support
> serialization. So this should be a blocker.
>
> I could not find a serialisation test in the unit tests for CSVRecord.
> This quick test added to CSVRecordTest fails:
>
>
> @Test
> public void testSerialization() throws IOException {
> CSVRecord shortRec;
> try (final CSVParser parser = CSVParser.parse("a,b",
> CSVFormat.newFormat(','))) {
> shortRec = parser.iterator().next();
> }
> final ByteArrayOutputStream out = new ByteArrayOutputStream();
> try (ObjectOutputStream oos = new ObjectOutputStream(out)) {
> oos.writeObject(shortRec);
> }
> }
>
> mvn test -Dtest=CSVRecordTest
>
> [ERROR] testSerialization  Time elapsed: 0.032 s  <<< ERROR!
> java.io.NotSerializableException: org.apache.commons.csv.CSVParser
> at
> org.apache.commons.csv.CSVRecordTest.testSerialization(CSVRecordTest.java:235)
>
> If I mark the field csvParser as transient it passes. So this is a problem
> as raised by FindBugs.
>

Making the field transient would indeed fix this test but... some of the
record APIs would then fail with NPEs... so we would be kicking the can
down the road basically. Making the parser serializable is going in the
wrong direction feature-wise IMO, so let's not go there. A serialization
proxy would be less worse but should we provide such a thing? I would say
no. I am OK with making the field transient despite the can kicking, so
let's do that.

Gary


>
>
> I also raised [3] the strange implementation of the CSVParser
> getHeaderNames() which ignores null headers as they cannot be used as a key
> into the map. However the list of column names could contain the null
> values. This test currently fails:
>
> @Test
> public void testHeaderNamesWithNull() throws IOException {
> final Reader in = new
> StringReader("header1,null,header3\n1,2,3\n4,5,6");
> final Iterator records = CSVFormat.DEFAULT.withHeader()
>
>  .withNullString("null")
>
>  .withAllowMissingColumnNames()
>
>  .parse(in).iterator();
> final CSVRecord record = records.next();
> assertEquals(Arrays.asList("header1", null, "header3"),
> record.getParser().getHeaderNames());
> }
>
> I am not saying it should pass but at least the documentation should state
> the behaviour in this edge case. That is the list of header names may be
> shorter than the number of columns when the parser is configured to allow
> null headers. I’ve not raised a bug ticket for this as it is open to
> opinion if this is by design or actually a bug. This issue is still present
> in 1.8 RC1.
>
> Previously I suggested documentation changes for this and another edge
> case using the header map to be added to the javadoc for getHeaderNames()
> and getHeaderMap():
>
> - Documentation:
>
> The mapping is only guaranteed to be a one-to-one mapping if the record
> was created with a format that does not allow duplicate or null header
> names. Null headers are excluded from the map and duplicates can only map
> to 1 column.
>
>
> - Bug / Documentation
>
> The CSVParser only stores headers names in a list of header names if they
> are not null. So the list can be shorter than the number of columns if you
> use a format that allows empty headers and contains null column names.
>
>
> The ultimate result is that we should document that the purpose of the
> header names is to provide a list of non-null header names in the order
> they occur in the header and thus represent keys that can be used in the
> header map. In certain circumstances there may be more columns in the data
> than there are header names.
>
>
> Alex
>
>
> [1] https://issues.apache.org/jira/browse/CSV-248 <
> https://issues.apache.org/jira/browse/CSV-248>
>
> [2]
> https://dist.apache.org/repos/dist/dev/commons/csv/1.8-RC1/site/findbugs.html
> <
> https://dist.apache.org/repos/dist/dev/commons/csv/1.8-RC1/site/findbugs.html
> >
>
> [3] https://markmail.org/message/woti2iymecosihx6 <
> https://markmail.org/message/woti2iymecosihx6>
>
>
>
> > On 18 Jan 2020, at 17:52, Gary Gregory  wrote:
> >
> > We have fixed quite a few bugs and added some significant enhancements
> > since Apache Commons CSV 1.7 was released, so I would like to release
> > Apache Commons CSV 1.8.
> >
> > Apache Commons CSV 1.8 RC1 is available for review here:
> >

Re: [codec] Base32/Base64 trailing bit validation is incomplete

2020-01-20 Thread Gary Gregory
I hope this would be at the level of BaseNCodec, not just Base64.

I am not sure I like BaseNCodec#setAllowTrailingBits(boolean), maybe
someone more general like BaseNCodec.setStrictDecoding(boolean) where the
default is false for backward compatibility.

I do not think we want to go as far as creating an enum for various
enforcement features.

Gary

On Mon, Jan 20, 2020 at 10:25 AM Alex Herbert 
wrote:

> In release 1.14 I fixed a patch made in 1.13 to reject decoding bytes
> that cannot be re-encoded to the same bytes (Codec-134 [1]). My fix was
> to correct the use of a mask to check the trailing bytes.
>
> The implementation checks trailing bits that are to be discarded are all
> zero.
>
> However the check does not go so far as to throw an exception when there
> are trailing bits below the count of 8. In this case there cannot be any
> more bytes and the bits are discarded.
>
> I assume this is because before the trailing bit validation it was not
> necessary to do anything when the number of trailing bits was less than
> a single byte.
>
> However this means it is still possible to decode some bytes and then
> encode them to create different bytes as shown here:
>
> @Test
> public void testTrailing6bits() {
>  final Base64 codec = new Base64();
>  // A 4 byte block plus an extra one
>  byte[] bytes = new byte[] { 'A', 'B', 'C', 'D' };
>  byte[] decoded = codec.decode(bytes);
>  byte[] encoded = codec.encode(decoded);
>  Assert.assertTrue("The round trip is possible",
> Arrays.equals(bytes, encoded));
>  // A 4 byte block plus an extra one
>  bytes = new byte[] { 'A', 'B', 'C', 'D', 'E' };
>  decoded = codec.decode(bytes);
>  encoded = codec.encode(decoded);
>  Assert.assertFalse("The round trip is not possible",
> Arrays.equals(bytes, encoded));
> }
>
> @Test
> public void testTrailing5bits() {
>  final Base32 codec = new Base32();
>  // An 8 byte block plus an extra one
>  byte[] bytes = new byte[] { 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H' };
>  byte[] decoded = codec.decode(bytes);
>  byte[] encoded = codec.encode(decoded);
>  Assert.assertTrue("The round trip is possible",
> Arrays.equals(bytes, encoded));
>  // An 8 byte block plus an extra one
>  bytes = new byte[] { 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I' };
>  decoded = codec.decode(bytes);
>  encoded = codec.encode(decoded);
>  Assert.assertFalse("The round trip is not possible",
> Arrays.equals(bytes, encoded));
> }
>
> Previously when fixing the trailing bit validation I suggested
> validating all trailing bits, even when they cannot be converted into
> any bytes. However this was not done and there are still some invalid
> inputs that do not trigger an exception.
>
> I propose to update the fix by throwing an IllegalArgumentException if
> there are left-over bits that cannot be decoded. That seems to be the
> intention of CODEC-134 to prevent security exploitation.
>
> However note that stricter validation is causing users to complain about
> exceptions when they should be questioning their own data. Recently a
> user raised Codec-279 which stated that exceptions were being thrown
> [2]. I believe the code is functioning as expected. So adding extra
> validation may prove to be more of a headache to users who have somehow
> obtained invalid encodings.
>
> One workaround is to fix the implementation to reject anything that
> cannot be re-encoded to the same bytes. Then add a property that allows
> this behaviour to be suppressed allowing for example:
>
> Base64 codec = new Base64();
> byte[] bytes = new byte[] { 'A', 'B', 'C', 'D', 'E' };
> byte[] decoded;
> try {
>  decoded = codec.decode(bytes);
> } catch (IllegalArgumentException ignore) {
>  codec.setAllowTrailingBits(true);
>  // OK
>  decoded = codec.decode(bytes);
> }
>
> WDYT?
>
> 1. Fully fix CODEC-134
> 2. Optionally allow CODEC-134 behaviour to be suppressed
>
> I would vote for option 1. It requires 1 line change in the code for
> Base32 and Base64.
>
> I am open to opinions for option 2. It would allow users to upgrade and
> then opt-in to previous functionality.
>
> Alex
>
>
> [1] https://issues.apache.org/jira/browse/CODEC-134
>
> [2] https://issues.apache.org/jira/browse/CODEC-279
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
>
>


Re: [VOTE] Release Apache Commons CSV 1.8 based on RC1

2020-01-20 Thread Gary Gregory
On Sun, Jan 19, 2020 at 7:55 PM sebb  wrote:

> What is the use case for needing serialisation?
> It's a lot of effort to maintain a serialisable class, and it opens
> the class to deserialisation attacks.
>

I think the larger context is whether we can effectively remove (or leave
broken) serialization outside of a major version.

Gary


>
> On Sun, 19 Jan 2020 at 12:39, Alex Herbert 
> wrote:
> >
> > Hi Gary,
> >
> > I raised a few niggles a while back with CSV and the discussion did not
> receive a response on how to proceed.
> >
> > There is the major bug CSV-248 where the CSVRecord is not Serializable
> [1]. This requires a decision on what to do to fix it. This bug is still
> present in 1.8 RC1 as found by FindBugs [2].
> >
> > From what I can see the CSVRecord maintains a reference to the
> CSVParser. This chain of objects maintained in memory is not serializable
> and leads back to the original input Reader.
> >
> > I can see from the JApiCmp report that the serial version id was changed
> for CSVRecord this release so there is still an intention to support
> serialization. So this should be a blocker.
> >
> > I could not find a serialisation test in the unit tests for CSVRecord.
> This quick test added to CSVRecordTest fails:
> >
> >
> > @Test
> > public void testSerialization() throws IOException {
> > CSVRecord shortRec;
> > try (final CSVParser parser = CSVParser.parse("a,b",
> CSVFormat.newFormat(','))) {
> > shortRec = parser.iterator().next();
> > }
> > final ByteArrayOutputStream out = new ByteArrayOutputStream();
> > try (ObjectOutputStream oos = new ObjectOutputStream(out)) {
> > oos.writeObject(shortRec);
> > }
> > }
> >
> > mvn test -Dtest=CSVRecordTest
> >
> > [ERROR] testSerialization  Time elapsed: 0.032 s  <<< ERROR!
> > java.io.NotSerializableException: org.apache.commons.csv.CSVParser
> > at
> org.apache.commons.csv.CSVRecordTest.testSerialization(CSVRecordTest.java:235)
> >
> > If I mark the field csvParser as transient it passes. So this is a
> problem as raised by FindBugs.
> >
> >
> >
> > I also raised [3] the strange implementation of the CSVParser
> getHeaderNames() which ignores null headers as they cannot be used as a key
> into the map. However the list of column names could contain the null
> values. This test currently fails:
> >
> > @Test
> > public void testHeaderNamesWithNull() throws IOException {
> > final Reader in = new
> StringReader("header1,null,header3\n1,2,3\n4,5,6");
> > final Iterator records = CSVFormat.DEFAULT.withHeader()
> >
> .withNullString("null")
> >
> .withAllowMissingColumnNames()
> >
> .parse(in).iterator();
> > final CSVRecord record = records.next();
> > assertEquals(Arrays.asList("header1", null, "header3"),
> record.getParser().getHeaderNames());
> > }
> >
> > I am not saying it should pass but at least the documentation should
> state the behaviour in this edge case. That is the list of header names may
> be shorter than the number of columns when the parser is configured to
> allow null headers. I’ve not raised a bug ticket for this as it is open to
> opinion if this is by design or actually a bug. This issue is still present
> in 1.8 RC1.
> >
> > Previously I suggested documentation changes for this and another edge
> case using the header map to be added to the javadoc for getHeaderNames()
> and getHeaderMap():
> >
> > - Documentation:
> >
> > The mapping is only guaranteed to be a one-to-one mapping if the record
> was created with a format that does not allow duplicate or null header
> names. Null headers are excluded from the map and duplicates can only map
> to 1 column.
> >
> >
> > - Bug / Documentation
> >
> > The CSVParser only stores headers names in a list of header names if
> they are not null. So the list can be shorter than the number of columns if
> you use a format that allows empty headers and contains null column names.
> >
> >
> > The ultimate result is that we should document that the purpose of the
> header names is to provide a list of non-null header names in the order
> they occur in the header and thus represent keys that can be used in the
> header map. In certain circumstances there may be more columns in the data
> than there are header names.
> >
> >
> > Alex
> >
> >
> > [1] https://issues.apache.org/jira/browse/CSV-248 <
> https://issues.apache.org/jira/browse/CSV-248>
> >
> > [2]
> https://dist.apache.org/repos/dist/dev/commons/csv/1.8-RC1/site/findbugs.html
> <
> https://dist.apache.org/repos/dist/dev/commons/csv/1.8-RC1/site/findbugs.html
> >
> >
> > [3] https://markmail.org/message/woti2iymecosihx6 <
> https://markmail.org/message/woti2iymecosihx6>
> >
> >
> >
> > > On 18 Jan 2020, at 17:52, Gary Gregory  wrote:
> > >
> > > We have fixed quite a few bugs and added some significant enhancements
> > > since Apache Commons CSV 1.7 was released, so I would like to release
> > > Apache Commons CSV 1.8.
> > >
> > > Apache Commons CSV 1.8 RC1 is 

Re: [geometry] Rename Transform to AffineTransform

2020-01-20 Thread Matt Juntunen
Gilles,

> I was not indicating that the name "EuclideanTransform" would be
> better than "AffineTransform", I was wondering about whether the
> class itself is redundant.

Oh, I misunderstood. The "EuclideanTransform" interface is important because it 
adds the "applyVector(Vector)" method, which has different behavior than the 
standard "apply" method. All transforms in the euclidean packages have this 
method but it is not present in the core Transform because not all spaces have 
associated Vector types (eg, spherical). I had renamed it AffineTransform in 
the previous PR not because it exposed new functionality or behavior that made 
it affine, but because it was located in a module defining an affine space. 
What do you suggest for the name here?

> My understanding is that "Transform" can be documented as:
> ---CUT---
> In Euclidean space, this must be an affine transform.
> ---CUT---

That's part of what the documentation now says.

-Matt

From: Gilles Sadowski 
Sent: Monday, January 20, 2020 2:28 PM
To: Commons Developers List 
Subject: Re: [geometry] Rename Transform to AffineTransform

Hello.

Le lun. 20 janv. 2020 à 16:57, Matt Juntunen
 a écrit :
>
> Gilles,
>
> > From a design POV, I still think that "AffineTransform" is redundant:
>
> The "AffineTransform" name change has been reverted. It is now the original 
> "EuclideanTransform".

I was not indicating that the name "EuclideanTransform" would be
better than "AffineTransform", I was wondering about whether the
class itself is redundant.

> I've closed PR #58 and created PR #59 with the latest commits squashed.

I've not looked yet.  But answering below, to hopefully clarify
the misunderstanding.

> > IIUC, the required (not just "desired") properties should stand out.
> > And, for the mathematically-inclined, the relationship to affine
> > transforms would illustrate it (for Euclidean spaces).
>
> I'm not sure what you're saying here.

My understanding is that "Transform" can be documented as:
---CUT---
In Euclidean space, this must be an affine transform.
---CUT---

Gilles

> The current documentation is the most complete and mathematically accurate.
>
> -Matt
>>> [...]

-
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org



Re: [geometry] Rename Transform to AffineTransform

2020-01-20 Thread Gilles Sadowski
Hello.

Le lun. 20 janv. 2020 à 16:57, Matt Juntunen
 a écrit :
>
> Gilles,
>
> > From a design POV, I still think that "AffineTransform" is redundant:
>
> The "AffineTransform" name change has been reverted. It is now the original 
> "EuclideanTransform".

I was not indicating that the name "EuclideanTransform" would be
better than "AffineTransform", I was wondering about whether the
class itself is redundant.

> I've closed PR #58 and created PR #59 with the latest commits squashed.

I've not looked yet.  But answering below, to hopefully clarify
the misunderstanding.

> > IIUC, the required (not just "desired") properties should stand out.
> > And, for the mathematically-inclined, the relationship to affine
> > transforms would illustrate it (for Euclidean spaces).
>
> I'm not sure what you're saying here.

My understanding is that "Transform" can be documented as:
---CUT---
In Euclidean space, this must be an affine transform.
---CUT---

Gilles

> The current documentation is the most complete and mathematically accurate.
>
> -Matt
>>> [...]

-
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org



Re: [geometry] Rename Transform to AffineTransform

2020-01-20 Thread Matt Juntunen
Gilles,

> From a design POV, I still think that "AffineTransform" is redundant:

The "AffineTransform" name change has been reverted. It is now the original 
"EuclideanTransform". I've closed PR #58 and created PR #59 with the latest 
commits squashed.

> IIUC, the required (not just "desired") properties should stand out.
> And, for the mathematically-inclined, the relationship to affine
> transforms would illustrate it (for Euclidean spaces).

I'm not sure what you're saying here. The current documentation is the most 
complete and mathematically accurate.

-Matt

From: Gilles Sadowski 
Sent: Monday, January 20, 2020 8:52 AM
To: Commons Developers List 
Subject: Re: [geometry] Rename Transform to AffineTransform

Hello.

2020-01-20 14:28 UTC+01:00, Matt Juntunen :
> Gilles,
>
>> I had a (quick) look; is it necessary to split functionality among
>> "Transform"
>> (in "core") and its subinterfaces/classes in other modules?  IOW, if
>> "Transform"
>> can only be affine, it looks strange to have "AffineTransform"
>> (re)defined.
>
> This is a documentation issue.
> The name "affine transform" only applies to
> affine spaces such as Euclidean space. Spherical space is not an affine
> space. The "Transform" interface is intended to represent transforms with
> the desired properties regardless of whether the space is affine or not.

>From a design POV, I still think that "AffineTransform" is redundant:
 * If "Transform" has the "desired properties"
 * Then, in an affine space, "Transform" is an affine transform.

> This was not clear in the docs since the word "affine" is listed as an
> implementation requirement on the "Transform" interface. I've updated the
> docs and userguide to clarify this.

IIUC, the required (not just "desired") properties should stand out.
And, for the mathematically-inclined, the relationship to affine
transforms would illustrate it (for Euclidean spaces).

>
>
>> I'm also a bit puzzled by the "AbstractAffineTransformMatrix" that seems
>> to
>> only contain convenience methods for internal use (whereas having them
>> "protected" put them in the public API).
>
> That class also contains other matrix-specific methods (eg, "determinant")
> and the overridden "preservesOrientation". Good point on the protected
> methods, though. I've moved them into the internal "Matrices" utility
> class.

Thanks.

Gilles

>
> -Matt
> 
> From: Gilles Sadowski 
> Sent: Sunday, January 19, 2020 9:06 AM
> To: Commons Developers List 
> Subject: Re: [geometry] Rename Transform to AffineTransform
>
> Hi.
>
> Le sam. 18 janv. 2020 à 23:14, Matt Juntunen
>  a écrit :
>>
>> Gilles,
>>
>> >> There, we can simply sample the user-defined function
>> > I'm not sure I understand.
>>
>> Just an implementation detail. We need to pass some sample points through
>> the user-defined function in order to construct an equivalent matrix.
>>
>> > Throwing an exception if the transform does not abide by
>> > the requirements?
>>
>> Yes.
>>
>> I just submitted a PR on Github with these changes. I also realized that
>> the EuclideanTransform class as discussed exactly matches the definition
>> of an affine transform so I renamed it to AffineTransform. No other names
>> were changed.
>
> I had a (quick) look; is it necessary to split functionality among
> "Transform"
> (in "core") and its subinterfaces/classes in other modules?  IOW, if
> "Transform"
> can only be affine, it looks strange to have "AffineTransform" (re)defined.
>
> I'm also a bit puzzled by the "AbstractAffineTransformMatrix" that seems to
> only contain convenience methods for internal use (whereas having them
> "protected" put them in the public API).
>
> Regards,
> Gilles
>
>>
>> -Matt
>> 
>> From: Gilles Sadowski 
>> Sent: Saturday, January 18, 2020 1:40 PM
>> To: Commons Developers List 
>> Subject: Re: [geometry] Rename Transform to AffineTransform
>>
>> Hello.
>>
>> 2020-01-18 15:40 UTC+01:00, Matt Juntunen :
>> > Gilles,
>> >
>> >> If the "Transform" is intimately related to the "core" and there is a
>> >> single
>> >> set of properties that make it "affine" (and work correctly), I'd tend
>> >> to
>> >> keep the name "Transform".
>> >
>> > So, if I'm understanding you correctly, you're saying that since the
>> > partitioning code in the library only works with these types of
>> > parallelism-preserving transforms, it can be safely assumed that
>> > o.a.c.geometry.core.Transform represents such a transform. Is this
>> > correct?
>>
>> Indeed.
>>
>> > One thing that's causing some issues with the implementation here is
>> > that
>> > the Euclidean TransformXD interfaces have static
>> > "from(UnaryOperator)"
>> > methods that allow users to wrap their own, arbitrary vector operations
>> > as
>> > Transform instances. We don't (and really can't) do any validation on
>> > these
>> > user-defined functions to ensure that they meet the library
>> > requirements. It

[codec] Base32/Base64 trailing bit validation is incomplete

2020-01-20 Thread Alex Herbert
In release 1.14 I fixed a patch made in 1.13 to reject decoding bytes 
that cannot be re-encoded to the same bytes (Codec-134 [1]). My fix was 
to correct the use of a mask to check the trailing bytes.


The implementation checks trailing bits that are to be discarded are all 
zero.


However the check does not go so far as to throw an exception when there 
are trailing bits below the count of 8. In this case there cannot be any 
more bytes and the bits are discarded.


I assume this is because before the trailing bit validation it was not 
necessary to do anything when the number of trailing bits was less than 
a single byte.


However this means it is still possible to decode some bytes and then 
encode them to create different bytes as shown here:


@Test
public void testTrailing6bits() {
    final Base64 codec = new Base64();
    // A 4 byte block plus an extra one
    byte[] bytes = new byte[] { 'A', 'B', 'C', 'D' };
    byte[] decoded = codec.decode(bytes);
    byte[] encoded = codec.encode(decoded);
    Assert.assertTrue("The round trip is possible", 
Arrays.equals(bytes, encoded));

    // A 4 byte block plus an extra one
    bytes = new byte[] { 'A', 'B', 'C', 'D', 'E' };
    decoded = codec.decode(bytes);
    encoded = codec.encode(decoded);
    Assert.assertFalse("The round trip is not possible", 
Arrays.equals(bytes, encoded));

}

@Test
public void testTrailing5bits() {
    final Base32 codec = new Base32();
    // An 8 byte block plus an extra one
    byte[] bytes = new byte[] { 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H' };
    byte[] decoded = codec.decode(bytes);
    byte[] encoded = codec.encode(decoded);
    Assert.assertTrue("The round trip is possible", 
Arrays.equals(bytes, encoded));

    // An 8 byte block plus an extra one
    bytes = new byte[] { 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I' };
    decoded = codec.decode(bytes);
    encoded = codec.encode(decoded);
    Assert.assertFalse("The round trip is not possible", 
Arrays.equals(bytes, encoded));

}

Previously when fixing the trailing bit validation I suggested 
validating all trailing bits, even when they cannot be converted into 
any bytes. However this was not done and there are still some invalid 
inputs that do not trigger an exception.


I propose to update the fix by throwing an IllegalArgumentException if 
there are left-over bits that cannot be decoded. That seems to be the 
intention of CODEC-134 to prevent security exploitation.


However note that stricter validation is causing users to complain about 
exceptions when they should be questioning their own data. Recently a 
user raised Codec-279 which stated that exceptions were being thrown 
[2]. I believe the code is functioning as expected. So adding extra 
validation may prove to be more of a headache to users who have somehow 
obtained invalid encodings.


One workaround is to fix the implementation to reject anything that 
cannot be re-encoded to the same bytes. Then add a property that allows 
this behaviour to be suppressed allowing for example:


Base64 codec = new Base64();
byte[] bytes = new byte[] { 'A', 'B', 'C', 'D', 'E' };
byte[] decoded;
try {
    decoded = codec.decode(bytes);
} catch (IllegalArgumentException ignore) {
    codec.setAllowTrailingBits(true);
    // OK
    decoded = codec.decode(bytes);
}

WDYT?

1. Fully fix CODEC-134
2. Optionally allow CODEC-134 behaviour to be suppressed

I would vote for option 1. It requires 1 line change in the code for 
Base32 and Base64.


I am open to opinions for option 2. It would allow users to upgrade and 
then opt-in to previous functionality.


Alex


[1] https://issues.apache.org/jira/browse/CODEC-134

[2] https://issues.apache.org/jira/browse/CODEC-279


-
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org



Re: [geometry] Rename Transform to AffineTransform

2020-01-20 Thread Gilles Sadowski
Hello.

2020-01-20 14:28 UTC+01:00, Matt Juntunen :
> Gilles,
>
>> I had a (quick) look; is it necessary to split functionality among
>> "Transform"
>> (in "core") and its subinterfaces/classes in other modules?  IOW, if
>> "Transform"
>> can only be affine, it looks strange to have "AffineTransform"
>> (re)defined.
>
> This is a documentation issue.
> The name "affine transform" only applies to
> affine spaces such as Euclidean space. Spherical space is not an affine
> space. The "Transform" interface is intended to represent transforms with
> the desired properties regardless of whether the space is affine or not.

>From a design POV, I still think that "AffineTransform" is redundant:
 * If "Transform" has the "desired properties"
 * Then, in an affine space, "Transform" is an affine transform.

> This was not clear in the docs since the word "affine" is listed as an
> implementation requirement on the "Transform" interface. I've updated the
> docs and userguide to clarify this.

IIUC, the required (not just "desired") properties should stand out.
And, for the mathematically-inclined, the relationship to affine
transforms would illustrate it (for Euclidean spaces).

>
>
>> I'm also a bit puzzled by the "AbstractAffineTransformMatrix" that seems
>> to
>> only contain convenience methods for internal use (whereas having them
>> "protected" put them in the public API).
>
> That class also contains other matrix-specific methods (eg, "determinant")
> and the overridden "preservesOrientation". Good point on the protected
> methods, though. I've moved them into the internal "Matrices" utility
> class.

Thanks.

Gilles

>
> -Matt
> 
> From: Gilles Sadowski 
> Sent: Sunday, January 19, 2020 9:06 AM
> To: Commons Developers List 
> Subject: Re: [geometry] Rename Transform to AffineTransform
>
> Hi.
>
> Le sam. 18 janv. 2020 à 23:14, Matt Juntunen
>  a écrit :
>>
>> Gilles,
>>
>> >> There, we can simply sample the user-defined function
>> > I'm not sure I understand.
>>
>> Just an implementation detail. We need to pass some sample points through
>> the user-defined function in order to construct an equivalent matrix.
>>
>> > Throwing an exception if the transform does not abide by
>> > the requirements?
>>
>> Yes.
>>
>> I just submitted a PR on Github with these changes. I also realized that
>> the EuclideanTransform class as discussed exactly matches the definition
>> of an affine transform so I renamed it to AffineTransform. No other names
>> were changed.
>
> I had a (quick) look; is it necessary to split functionality among
> "Transform"
> (in "core") and its subinterfaces/classes in other modules?  IOW, if
> "Transform"
> can only be affine, it looks strange to have "AffineTransform" (re)defined.
>
> I'm also a bit puzzled by the "AbstractAffineTransformMatrix" that seems to
> only contain convenience methods for internal use (whereas having them
> "protected" put them in the public API).
>
> Regards,
> Gilles
>
>>
>> -Matt
>> 
>> From: Gilles Sadowski 
>> Sent: Saturday, January 18, 2020 1:40 PM
>> To: Commons Developers List 
>> Subject: Re: [geometry] Rename Transform to AffineTransform
>>
>> Hello.
>>
>> 2020-01-18 15:40 UTC+01:00, Matt Juntunen :
>> > Gilles,
>> >
>> >> If the "Transform" is intimately related to the "core" and there is a
>> >> single
>> >> set of properties that make it "affine" (and work correctly), I'd tend
>> >> to
>> >> keep the name "Transform".
>> >
>> > So, if I'm understanding you correctly, you're saying that since the
>> > partitioning code in the library only works with these types of
>> > parallelism-preserving transforms, it can be safely assumed that
>> > o.a.c.geometry.core.Transform represents such a transform. Is this
>> > correct?
>>
>> Indeed.
>>
>> > One thing that's causing some issues with the implementation here is
>> > that
>> > the Euclidean TransformXD interfaces have static
>> > "from(UnaryOperator)"
>> > methods that allow users to wrap their own, arbitrary vector operations
>> > as
>> > Transform instances. We don't (and really can't) do any validation on
>> > these
>> > user-defined functions to ensure that they meet the library
>> > requirements. It
>> > is therefore easy for users to pass in invalid operators. To avoid this,
>> > I'm
>> > thinking of removing the TransformXD interfaces completely and moving
>> > the
>> > "from(UnaryOperator)" methods into the AffineTransformMatrixXD
>> > classes.
>>
>> +1
>> It is generally good to prevent the creation of invalid objects.
>>
>> > There, we can simply sample the user-defined function
>>
>> I'm not sure I understand.
>>
>> > as needed and produce
>> > matrices that are guaranteed to be affine.
>>
>> Throwing an exception if the transform does not abide by
>> the requirements?
>>
>> > Following the above, the class hierarchy would then be as below, which
>> > is
>> > basically what it was before I added the TransformXD interfaces.
>> >
>> > 

Re: [geometry] Rename Transform to AffineTransform

2020-01-20 Thread Matt Juntunen
Gilles,

> I had a (quick) look; is it necessary to split functionality among "Transform"
> (in "core") and its subinterfaces/classes in other modules?  IOW, if 
> "Transform"
> can only be affine, it looks strange to have "AffineTransform" (re)defined.

This is a documentation issue. The name "affine transform" only applies to 
affine spaces such as Euclidean space. Spherical space is not an affine space. 
The "Transform" interface is intended to represent transforms with the desired 
properties regardless of whether the space is affine or not. This was not clear 
in the docs since the word "affine" is listed as an implementation requirement 
on the "Transform" interface. I've updated the docs and userguide to clarify 
this.


> I'm also a bit puzzled by the "AbstractAffineTransformMatrix" that seems to
> only contain convenience methods for internal use (whereas having them
> "protected" put them in the public API).

That class also contains other matrix-specific methods (eg, "determinant") and 
the overridden "preservesOrientation". Good point on the protected methods, 
though. I've moved them into the internal "Matrices" utility class.

-Matt

From: Gilles Sadowski 
Sent: Sunday, January 19, 2020 9:06 AM
To: Commons Developers List 
Subject: Re: [geometry] Rename Transform to AffineTransform

Hi.

Le sam. 18 janv. 2020 à 23:14, Matt Juntunen
 a écrit :
>
> Gilles,
>
> >> There, we can simply sample the user-defined function
> > I'm not sure I understand.
>
> Just an implementation detail. We need to pass some sample points through the 
> user-defined function in order to construct an equivalent matrix.
>
> > Throwing an exception if the transform does not abide by
> > the requirements?
>
> Yes.
>
> I just submitted a PR on Github with these changes. I also realized that the 
> EuclideanTransform class as discussed exactly matches the definition of an 
> affine transform so I renamed it to AffineTransform. No other names were 
> changed.

I had a (quick) look; is it necessary to split functionality among "Transform"
(in "core") and its subinterfaces/classes in other modules?  IOW, if "Transform"
can only be affine, it looks strange to have "AffineTransform" (re)defined.

I'm also a bit puzzled by the "AbstractAffineTransformMatrix" that seems to
only contain convenience methods for internal use (whereas having them
"protected" put them in the public API).

Regards,
Gilles

>
> -Matt
> 
> From: Gilles Sadowski 
> Sent: Saturday, January 18, 2020 1:40 PM
> To: Commons Developers List 
> Subject: Re: [geometry] Rename Transform to AffineTransform
>
> Hello.
>
> 2020-01-18 15:40 UTC+01:00, Matt Juntunen :
> > Gilles,
> >
> >> If the "Transform" is intimately related to the "core" and there is a
> >> single
> >> set of properties that make it "affine" (and work correctly), I'd tend to
> >> keep the name "Transform".
> >
> > So, if I'm understanding you correctly, you're saying that since the
> > partitioning code in the library only works with these types of
> > parallelism-preserving transforms, it can be safely assumed that
> > o.a.c.geometry.core.Transform represents such a transform. Is this correct?
>
> Indeed.
>
> > One thing that's causing some issues with the implementation here is that
> > the Euclidean TransformXD interfaces have static "from(UnaryOperator)"
> > methods that allow users to wrap their own, arbitrary vector operations as
> > Transform instances. We don't (and really can't) do any validation on these
> > user-defined functions to ensure that they meet the library requirements. It
> > is therefore easy for users to pass in invalid operators. To avoid this, I'm
> > thinking of removing the TransformXD interfaces completely and moving the
> > "from(UnaryOperator)" methods into the AffineTransformMatrixXD classes.
>
> +1
> It is generally good to prevent the creation of invalid objects.
>
> > There, we can simply sample the user-defined function
>
> I'm not sure I understand.
>
> > as needed and produce
> > matrices that are guaranteed to be affine.
>
> Throwing an exception if the transform does not abide by
> the requirements?
>
> > Following the above, the class hierarchy would then be as below, which is
> > basically what it was before I added the TransformXD interfaces.
> >
> > commons-geometry-core
> >Transform
> >
> > commons-geometry-euclidean
> > EuclideanTransform extends Transform
> > AffineTransformMatrixXD implements EuclideanTransform
> > Rotation3D extends EuclideanTransform
> > QuaternionRotation implements Rotation3D
> >
> > commons-geometry-spherical
> > Transform1S implements Transform
> > Transform2S implements Transform
> >
> > WDYT?
>
> +1
>
> Best,
> Gilles
>
> >
> > -Matt
> >
> >
> > 
> > From: Gilles Sadowski 
> > Sent: Monday, January 13, 2020 8:03 PM
> > To: Commons Developers List 
> > Subject: Re: [geometry] Rename Transform to 

Re: [math]New feature MiniBatchKMeansClusterer

2020-01-20 Thread Gilles Sadowski
Hi.

2020-01-20 3:08 UTC+01:00, CT :
> Hi, In my picture search project, I need a cluster algorithm to narrow
> the dataset, for accelerate the search on millions of pictures.
>  First we use python+pytorch+kmean, with the growing data from
> thousands to millions, the KMeans clustering became slower and
> slower(seconds to minutes), then we find MiniBatchKMeans could amazing
> finish the clustering in 1~2 seconds on millions of data.

Impressive improvement.

>  Meanwhile we still faced the insufficient concurrent capacity of
> python, so we switch to kotlin on jvm.
>  But there did not a MinibatchKMeans algorithm in jvm yet, so I wrote
> one in kotlin, refer to the (python)sklearn MinibatchKMeans and Apache
> Commons Math(Deeplearning4j was also considered, but it is too slow because
> of ND4j's design).
>
>
>  I'd like to contribute it to Apache Commons Math,

Thanks!

> and I wrote a java
> version:
> https://github.com/chentao106/commons-math/tree/feature-MiniBatchKMeans

Some remarks:

* I didn't get why the "KMeansPlusPlusCentroidInitializer" class
does not call the existing "KMeansPlusPlusClusterer".
Code seems duplicated: As there is a case for reuse, the currently
"private" centroid initialization code should be factored out.

* In "CentroidInitializer", I'd rename "chooseCentroids" to emphasize
that a computation is performed (as opposed to selecting from an
existing data structure).

* Not convinced that there should be so many constructors (in most
cases, it makes no sense to pick default values that are likely to
be heavily application-dependent.

* Input should generally be validated: e.g. the maximum number of
iterations should not be changed unwittingly; rather, an exception
should be raised if the user passed a negative value.

>
>  From my test(Kotlin version), it is very fast, but gives slightly
> different results withKMeans++ in most case, but sometimes has big
> different(May be affected by the randomness of the mini batch):

Could be nice to illustrate (not just with a picture, but in a table
with entries average over several runs) the differences in result
between the implementations, using various setups (number of
clusters, stopping criterion, etc.).

>
>
>
> Some bad case:
>
> It even worse when I use RandomSource.create(RandomSource.MT_64, 0)for
> the random generator ┐(´-`)┌.

"MT_64" is probably not the best default.  And this is one of the
parameters from which there should not be a default IMO.

[Note: there are spurious characters in your message (see e.g. the
paragraph quoted just above) that make it difficult to read.]

Best regards,
Gilles

>
>
> My brief understanding of MiniBatchKMeans:
> Use a partial points in initialize cluster centers, and random mini batch in
> training iterations.
> It can finish in few seconds when clustering millions of data, and has few
> differences between KMeans.
>
>
> More information about MiniBatchKMeans
>  https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf
> 
> https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html

-
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org