Re: [CSV] Strategies to handle duplicate headers

2023-06-21 Thread David Dellsperger
I've always had a big concern with this kind of behavior: what happens if
the "new column" name already exists later in the header? It seems that
python/pandas deals with this by incrementing AGAIN, so it reads the whole
header and THEN decides what to do with the names of the duplicates (makes
sense).  The following CSV
A, A, A.1, C, C, C.1
1, 2, 3, 4, 5, 6

would lead to the headers of A, A.2, A.1, C, C.2, C.1 in python/pandas.

I assume appending '.1' has fewer clashes than just appending '1' at the
end, which might be why pandas chose that path.  The idea is that you want a
strategy that produces as few clashes as possible when extending the
names.
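
A minimal sketch of the renaming behavior described above, written in plain
Python with no dependencies. It is not the pandas implementation and not a
proposed Commons CSV API; the function name `dedupe_headers` and its `sep`
parameter are hypothetical and exist only for illustration. It assumes the
whole header row is read first, so that a generated name can be checked
against every name already present.

```python
from collections import Counter


def dedupe_headers(headers, sep="."):
    """Rename repeated header names, skipping suffixes already in use.

    Hypothetical helper for illustration only.
    """
    taken = set(headers)  # every name present in the original header row
    seen = Counter()      # next suffix to try for each original name
    result = []
    for name in headers:
        if seen[name] == 0:
            # First occurrence keeps its name unchanged.
            seen[name] = 1
            result.append(name)
            continue
        # Later occurrence: increment the suffix until the name is unused.
        n = seen[name]
        candidate = f"{name}{sep}{n}"
        while candidate in taken:
            n += 1
            candidate = f"{name}{sep}{n}"
        seen[name] = n + 1
        taken.add(candidate)
        result.append(candidate)
    return result


# The header row from the CSV above.
print(dedupe_headers(["A", "A", "A.1", "C", "C", "C.1"]))
# ['A', 'A.2', 'A.1', 'C', 'C.2', 'C.1']

# With '.' the first candidate is free; with a bare digit it collides
# with the existing 'A1' and has to be incremented again.
print(dedupe_headers(["A", "A", "A1"]))          # ['A', 'A.1', 'A1']
print(dedupe_headers(["A", "A", "A1"], sep=""))  # ['A', 'A2', 'A1']
```

Passing the separator as a parameter is one way a caller could avoid dots
when a downstream system rejects them.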

David

On Tue, Jun 20, 2023 at 11:24 PM Bruno Kinoshita wrote:

> Hi,
>
>
> > However, I could imagine situations where we define
> > DuplicateHeaderMode.DEDUPLICATE, and a user isn't satisfied with our
> > normalization strategy. For example, dots in the headers breaks ingesting
> > the data in a third-party system. An interface could resolve this, but I
> > guess in such a scenario, they can also just opt for another mode and
> > normalize it themselves to bypass ours.
>
>
>  Good point. I think the only advantage of using dots is following the same
> pattern used in Python+Pandas, and also in the R base functions.
>
> # This is in R
>
> > read.csv('/tmp/1.csv')
>   A A.1 B B.1
> 1 1   2 3   4
> 2 a   b c   d
> >
>
> However, there are other R libraries that use underscore too (I think
> tidyverse does so). So users may have to normalize it themselves already
> when using different libraries in R.
>
> So I think we can use underscore or any other strategy to deduplicate
> column names, and allowing users to customize how names are repaired sounds
> good too, as long as we can find a good API for that.
>
> > With that in mind, appending the enum does make sense. I'd still be wary
> > about making it default behavior anytime soon, unless there's evidence
> > that deduplication is really what users expect.
> >
> +1
>
> > Something to consider though. We allow configuring the delimiter. I think
> > parsing would be fine, but it might introduce edge-cases for printing if
> > the delimiter and normalization strategy overlap. For example, "A,A"
> > becomes "A.1,A.2" but the delimiter is ".", effectively making it
> > "A.1.A.2". We'll need test cases for that.
> >
>
> I don't know if wrapping the column names with quotes would help in this
> case (i.e. "A.1"."A.2"), but definitely a good scenario for a test case,
> +1.
>
> -Bruno
>
> On Wed, 21 Jun 2023 at 02:12, Seth Falco wrote:
>
> > I don't have a strong enough opinion to conclude what's best.
> >
> > Giving it more thought, I think the interface approach I proposed is
> > overcomplicated tbh. I can't imagine needing another duplicate header
> > mode after this.
> >
> > However, I could imagine situations where we define
> > DuplicateHeaderMode.DEDUPLICATE, and a user isn't satisfied with our
> > normalization strategy. For example, dots in the headers breaks ingesting
> > the data in a third-party system. An interface could resolve this, but I
> > guess in such a scenario, they can also just opt for another mode and
> > normalize it themselves to bypass ours.
> >
> > With that in mind, appending the enum does make sense. I'd still be wary
> > about making it default behavior anytime soon, unless there's evidence
> > that deduplication is really what users expect.
> >
> > Something to consider though. We allow configuring the delimiter. I think
> > parsing would be fine, but it might introduce edge-cases for printing if
> > the delimiter and normalization strategy overlap. For example, "A,A"
> > becomes "A.1,A.2" but the delimiter is ".", effectively making it
> > "A.1.A.2". We'll need test cases for that.
> >
> > PS: Sorry if this message goes through twice. Looked to me that the email
> > didn't go through the first time.
> >
> > On 2023/06/20 21:28:16 Gary Gregory wrote:
> > > That's clever. So we could implement a new enum value
> > > DuplicateHeaderMode.DEDUPLICATE...
> > >
> > > Gary
> > >
> > > On Tue, Jun 20, 2023, 14:09 Bruno Kinoshita wrote:
> > >
> > > > Hi,
> > > >
> > > > Bruno says:
> > > > > "With Pandas it automatically deduplicates the column names. Maybe
> > > > > that's a feature that we could have in Commons CSV too?"
> > > > >
> > > > > What does that mean and actually do? Say I have column A with row 1
> > > > > value of "X" and 2nd column A with row 1 value of 2. What do I get
> > > > > when I ask for column A row 1?
> > > > >
> > > >
> > > > When you ask for column A, you get the first column A with row 1
> > > > value of "X". Then Pandas renames the other A column as "A.1". If you
> > > > want to access rows in the second A column, then you will use "A.1"
> > > > as index.
> > > >
> > > > This is useful when you work with CSVs with many headers so that you
> > > > still have a valid name to use as index to access data, instead of
> > > > having to rely on th

Re: [VFS] Help wanted for VFS-838

2023-05-25 Thread David Dellsperger
The other option would be to migrate to the new fork of jsch,
https://github.com/mwiede/jsch. There are a few issues with connection
algorithms, etc., but there is now at least support for adding algorithms to
the list of existing ones in 3 ways (though I'm not sure how VFS might
support this currently).

David

On Thu, May 25, 2023 at 6:58 AM Gary Gregory wrote:

> Hi all,
>
> It seems that Jsch is unmaintained.
>
> I propose we port to Apache Mina, which has an active and experienced
> community.
>
> Is this easy or hard? A large or small task? Thoughts?
>
> Help wanted ;-) PR welcome :-)
>
> https://issues.apache.org/jira/browse/VFS-838
>
> TY!
> Gary
>


Re: [VOTE] Release Apache Commons CSV 1.10.0 based on RC1

2022-10-20 Thread David Dellsperger
I had just started to look into this and was going to call out the same
thing.  I'm concerned about those changes, especially the ones regarding
allowDuplicates. I made a note in my work ticket to make sure we have
appropriate test cases on our end. With the RC, we didn't see any
compatibility issues between 1.9.0 and the 1.10.0 RC.

David

On Thu, Oct 20, 2022 at 5:56 PM Alex Herbert wrote:

> On Thu, 20 Oct 2022 at 23:43, Alex Herbert wrote:
> >
> > I did not have time to track through whether this behaviour changed
> > after the initial implementation of the flag. I would think not as the
> > original behaviour is from 1.0. This would map to:
> >
> > true -> ALLOW_ALL
> > false -> ALLOW_EMPTY
> > new -> DISALLOW
> >
> > Which is what we currently have in 1.10.0 RC1. Thus the PR #276 [7] to
> > change the use of the flag to 'false -> DISALLOW' is not maintaining
> > behavioural compatibility (to 1.7, or back to 1.0).
>
> PS. I just verified that PR 276 changes the DuplicateHeaderMode value
> for allowDuplicates=false and does not change any tests.
>
> So the test suite is currently not enforcing behavioural
> compatibility. This seems like a glaring hole in the tests and should
> be addressed to prevent regressions.
>


[csv] Planned release for 1.10.0?

2022-10-04 Thread David Dellsperger
There are a few fixes in commons-csv 1.10.0, especially the String
delimiter fixes, that we could really use in a released version. Is there a
plan to release this soon?

Happy to help where I can to get it released, if that's needed.

David