another example: if i have a Dataset[(K, V)] and want to repartition it by the key K, repartition requires a Column, which means i am suddenly back to worrying about duplicate field names. i would like to be able to say:
dataset.repartition(dataset(0))

On Thu, Feb 2, 2017 at 10:39 AM, Koert Kuipers <ko...@tresata.com> wrote:

> since a dataset is a typed object you ideally don't have to think about
> field names.
>
> however there are operations on Dataset that require you to provide a
> Column, like for example joinWith (and joinWith returns a strongly typed
> Dataset, not DataFrame). once you have to provide a Column you are back to
> thinking in field names, and worrying about duplicate field names, which is
> something that can easily happen in a Dataset without you realizing it.
>
> so under the hood Dataset has unique identifiers for every column, as in
> dataset.queryExecution.logical.output, but these are expressions
> (attributes) that i cannot turn back into columns since the constructors
> for this are private in spark.
>
> so.... how about having Dataset.apply(i: Int): Column to allow me to pick
> columns by position without having to worry about (duplicate) field names?
> then i could do something like:
>
> dataset.joinWith(otherDataset, dataset(0) === otherDataset(0), joinType)
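
For illustration, here is a rough sketch of how positional access can be approximated with the public API today. The helper name colAt and the object PositionalColumns are hypothetical and not part of Spark. The helper looks the column up through its field name (Dataset.columns and Dataset.col), so it only works when field names are unique. With duplicate names the lookup fails as ambiguous, which is exactly the case the proposed Dataset.apply(i: Int) is meant to cover.

```scala
import org.apache.spark.sql.{Column, Dataset, SparkSession}

object PositionalColumns {
  // Hypothetical helper, not part of Spark: pick a column by position.
  // It resolves the column by field name, so it is NOT a real substitute
  // for the proposed Dataset.apply(i: Int) when names are duplicated.
  implicit class ByPosition[T](val ds: Dataset[T]) extends AnyVal {
    def colAt(i: Int): Column = ds.col(ds.columns(i))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("colAt-sketch")
      .getOrCreate()
    import spark.implicits._

    // Two tuple Datasets; both expose the fields _1 and _2.
    val left  = Seq((1, "a"), (2, "b")).toDS()
    val right = Seq((1, 10.0), (3, 30.0)).toDS()

    // Repartition by the key without spelling out the field name.
    val byKey = left.repartition(left.colAt(0))

    // Typed join on the first column of each side.
    val joined = byKey.joinWith(right, left.colAt(0) === right.colAt(0), "inner")
    joined.show()

    spark.stop()
  }
}
```

The name colAt is used instead of apply only to avoid confusion with the existing Dataset.apply(colName: String).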