Hi Kasper,

Sorry for not voting on this. I am not yet up to speed on the rules regarding voting.

Hans

-----Original Message-----
From: Kasper Sørensen [mailto:[email protected]]
Sent: Friday, August 23, 2013 9:37 AM
To: [email protected]
Subject: Re: [DISCUSS] use folder name as schema name for file based DataContexts

OK I'm going to commit this on the basis of lazy consensus. But as a
small side note, I'd like to also invite more people to vote :-)

2013/8/21 Ankit Kumar <[email protected]>:
> +1
>
> Regards
> Ankit
>
>
> On Tue, Aug 20, 2013 at 4:26 PM, Kasper Sørensen <[email protected]> wrote:
>
>> I've updated my gist/patch [1] to also support quotes in the
>> table/column paths. Let's have a vote on this patch, to see if we
>> can get this in.
>>
>> [1] https://gist.github.com/kaspersorensen/6210970
>>
>> 2013/8/20 Kasper Sørensen <[email protected]>:
>> > Agreed on all. Except why should dots in column names be any
>> > different than schema and table names?
>> >
>> > 2013/8/16 Hans Drexler <[email protected]>:
>> >> I believe that probably, *every* convention will have its
>> >> drawbacks. Using a factory can help on one hand, but it can also
>> >> cause great confusion if things get mixed. It also makes things
>> >> more complex. If we clearly document the choice made, I will live
>> >> with that.
>> >>
>> >> My main point is that we should try to write and document the
>> >> software in such a way that MetaModel users will not get confused.
>> >> I like the quotes idea, since that will allow the user to
>> >> explicitly express what is intended. But then, let's extend it to
>> >> something like this:
>> >>
>> >> "schema_name"."table_name"."column_name"
>> >>
>> >> Where schema_name and table_name can contain dots ("."). (I guess
>> >> column names cannot...)
>> >>
>> >> I hope you don't mind me rambling about this...
>> >>
>> >> kind regards,
>> >>
>> >> Hans
>> >>
>> >> -----Original Message-----
>> >> From: Kasper Sørensen [mailto:[email protected]]
>> >> Sent: Wednesday, August 14, 2013 2:59 PM
>> >> To: [email protected]
>> >> Subject: Re: [DISCUSS] use folder name as schema name for file based DataContexts
>> >>
>> >> With those different preferences, we could even consider making
>> >> something like a "TableNameFactory" which converts filenames into
>> >> table names. But I guess the crucial point is which default
>> >> convention to use.
>> >>
>> >> Underscoring makes it a bit cleaner to look at the column or table
>> >> paths, but it also makes the representation less direct. A user
>> >> could start wondering if there are other characters than dots that
>> >> will be replaced by underscores etc.
>> >>
>> >> It should be noted that MM's parser does support dots in both
>> >> table and schema names, so this is probably mostly a question of
>> >> aesthetics.
>> >>
>> >> The ambiguity that you point out is also interesting. So far I
>> >> haven't seen it appear in real life, but technically it could
>> >> occur that you had two pairs of schemas and tables that would
>> >> generate an ambiguous table path. For instance:
>> >>
>> >> Schema: foo.bar
>> >> Table: baz
>> >>
>> >> and
>> >>
>> >> Schema: foo
>> >> Table: bar.baz
>> >>
>> >> The parser would currently favor the second schema ("foo") since
>> >> it incrementally tries for schema/table/column matches with every
>> >> dot-separated token. An improvement to the parser would be to
>> >> allow quote characters, so that you could express your table path
>> >> like this then:
>> >>
>> >> "foo.bar".baz
>> >>
>> >> Also I want to note that some databases do support dots in
>> >> schema/table/column names, so this ambiguity can (although rarely)
>> >> also occur in an RDBMS or other data sources. It would also be
>> >> quite common with some separator (not necessarily a dot) in NoSQL
>> >> database column names, to indicate a nested field.
>> >> In HBase for instance they are referred to using a colon, like
>> >> this: "columnFamily:column".
>> >>
>> >> All in all I am mostly feeling like preserving the dots from the
>> >> filenames, but am also very curious what other people think!
>> >>
>> >> 2013/8/14 Hans Drexler <[email protected]>:
>> >>> Hi,
>> >>>
>> >>> First I agree with bumping this issue. When at the customer, this
>> >>> thing caused a lot of time spent in figuring out what was going
>> >>> on. I am not sure if I like the extension as part of the table
>> >>> name, because:
>> >>> - I would never create a table in a relational database with a
>> >>>   dot in the name
>> >>> - It creates an ambiguity. If you have a "full" path name to a
>> >>>   column, like "documents.people.csv.name", then it is not clear
>> >>>   if the schema name is "documents.people" and the table name is
>> >>>   "csv", or if the schema name is "documents" and the table name
>> >>>   is "people.csv". It seems natural to me that schema names
>> >>>   contain dots, but not table names.
>> >>>
>> >>> Alternatives:
>> >>> - Leave the extension out of the name (probably not acceptable,
>> >>>   because then you can no longer have two "tables" differing only
>> >>>   in extension). Although I must say that personally I think this
>> >>>   would be the best solution.
>> >>>
>> >>> - Use a conventional name, like:
>> >>>   Schema name: Folder name
>> >>>   Table name: The filename, including extension (all dots
>> >>>   replaced by underscores).
>> >>>   Resulting in e.g. a column path like this:
>> >>>   documents.people_csv.name
>> >>>
>> >>> At the customer site, the file I needed to use was actually
>> >>> named following this pattern: "bar/FOO.PEOPLE.IN.FILE". Using the
>> >>> convention, this would become:
>> >>> bar.FOO_PEOPLE_IN_FILE
>> >>>
>> >>> IMHO this is preferable to "bar.foo.people.in.file"
>> >>>
>> >>> The problem is of course that it would now be impossible to have
>> >>> another file "bar/FOO_PEOPLE_IN_FILE" :-(
>> >>>
>> >>> I am happy to hear other people's thoughts.
>> >>>
>> >>>
>> >>> Hans
>> >>>
>> >>>
>> >>> -----Original Message-----
>> >>> From: Kasper Sørensen [mailto:[email protected]]
>> >>> Sent: Wednesday, August 14, 2013 10:18 AM
>> >>> To: [email protected]
>> >>> Subject: Re: [DISCUSS] use folder name as schema name for file based DataContexts
>> >>>
>> >>> Rats, made a mistake in that diff. The Gist has been updated [1]
>> >>> and now contains the ResourceUtils class which was missing before.
>> >>> [1] https://gist.github.com/kaspersorensen/6210970
>> >>>
>> >>> 2013/8/12 Kasper Sørensen <[email protected]>:
>> >>>> Here's a proposed patch (implemented for CSV and fixedwidth
>> >>>> files which are the modules that implemented the old schema
>> >>>> naming pattern):
>> >>>> https://gist.github.com/kaspersorensen/6210970
>> >>>>
>> >>>> 2013/8/10 Kasper Sørensen <[email protected]>:
>> >>>>> https://issues.apache.org/jira/browse/METAMODEL-4
>> >>>>>
>> >>>>> 2013/8/10 Henry Saputra <[email protected]>:
>> >>>>>> What is the JIRA for this one?
>> >>>>>>
>> >>>>>>
>> >>>>>> On Fri, Aug 9, 2013 at 2:26 AM, Manuel van den Berg <[email protected]> wrote:
>> >>>>>>
>> >>>>>>> +1
>> >>>>>>>
>> >>>>>>> (shouldn't I just vote on the Jira for this?)
>> >>>>>>>
>> >>>>>>> manuel
>> >>>>>>>
>> >>>>>>> > -----Original Message-----
>> >>>>>>> > From: Kasper Sørensen [mailto:[email protected]]
>> >>>>>>> > Sent: Friday, August 09, 2013 9:03
>> >>>>>>> > To: [email protected]
>> >>>>>>> > Subject: Re: [DISCUSS] use folder name as schema name for file based DataContexts
>> >>>>>>> >
>> >>>>>>> > Allow me to bump this issue (it's my impression that more
>> >>>>>>> > people have joined in a bit late, after this topic was
>> >>>>>>> > posted).
>> >>>>>>> >
>> >>>>>>> > I think this is one of the more important issues that I
>> >>>>>>> > would want to fix before we make our first release at Apache.
>> >>>>>>> >
>> >>>>>>> > 2013/7/24 Kasper Sørensen <[email protected]>:
>> >>>>>>> > > Right now we have this slightly odd naming convention for
>> >>>>>>> > > schema and table names when building metadata for e.g. a
>> >>>>>>> > > CSV file or a fixed width value file.
>> >>>>>>> > >
>> >>>>>>> > > Schema name: The filename, including file extension.
>> >>>>>>> > > Table name: The filename without extension.
>> >>>>>>> > > Resulting in e.g. a column path like this:
>> >>>>>>> > > people.csv.people.name
>> >>>>>>> > >
>> >>>>>>> > > I suggest we change it to this convention:
>> >>>>>>> > >
>> >>>>>>> > > Schema name: Folder name
>> >>>>>>> > > Table name: The filename, including file extension.
>> >>>>>>> > > Resulting in e.g. a column path like this:
>> >>>>>>> > > documents.people.csv.name
>> >>>>>>> > >
>> >>>>>>> > > Why do I think this would be an improvement?
>> >>>>>>> > >
>> >>>>>>> > > 1) Because it would first of all make sense to the user
>> >>>>>>> > > to see the file system's hierarchy reflected in the
>> >>>>>>> > > schema model.
>> >>>>>>> > > 2) Because it allows us to make these DataContexts
>> >>>>>>> > > operate not on a single file, but on a directory of
>> >>>>>>> > > files. I have seen quite a number of times by now that
>> >>>>>>> > > users of MetaModel, or users of e.g. DataCleaner, which
>> >>>>>>> > > uses MetaModel quite heavily, want to do this sort of
>> >>>>>>> > > stuff.
>> >>>>>>> > > 3) The removing of the file extension stuff is kind of
>> >>>>>>> > > broken and a strange convention in the first place.
>> >>>>>>> > >
>> >>>>>>> > > While this doesn't really break backwards compatibility
>> >>>>>>> > > in terms of Java code, it would break configuration files
>> >>>>>>> > > and other stuff of applications that use MetaModel.
>> >>>>>>> > > But I do believe that can be communicated and handled
>> >>>>>>> > > through carefully explaining the new convention on the
>> >>>>>>> > > migration page (that I recently started writing [1]).
>> >>>>>>> > >
>> >>>>>>> > > What do you think?
>> >>>>>>> > >
>> >>>>>>> > > [1] http://wiki.apache.org/metamodel/MigratingFromEobjectsMetaModel
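
For anyone reading this thread later: the "foo.bar" / "bar.baz" ambiguity and the quoting idea discussed above can be illustrated with a small Python sketch. This is not MetaModel's actual parser (which is Java). The names split_path and resolve_table are hypothetical, and the shortest-schema-first matching order is an assumption chosen only because it reproduces the behaviour described in the thread, where the unquoted path favours schema "foo".

```python
def split_path(path):
    """Split a path on dots, keeping dots that appear inside double quotes."""
    tokens, current, in_quotes = [], [], False
    for ch in path:
        if ch == '"':
            in_quotes = not in_quotes  # toggle quoting; the quote char is dropped
        elif ch == '.' and not in_quotes:
            tokens.append(''.join(current))
            current = []
        else:
            current.append(ch)
    if in_quotes:
        raise ValueError("unbalanced quote in path: " + path)
    tokens.append(''.join(current))
    return tokens


def resolve_table(path, schemas):
    """Resolve a table path against {schema_name: set_of_table_names}.

    Assumption: candidates are tried with the shortest schema name first,
    growing it one token at a time. Returns (schema, table) or None.
    """
    tokens = split_path(path)
    for i in range(1, len(tokens)):
        schema = '.'.join(tokens[:i])
        table = '.'.join(tokens[i:])
        if table in schemas.get(schema, ()):
            return schema, table
    return None


# The two schema/table pairs from the example in the thread.
schemas = {"foo.bar": {"baz"}, "foo": {"bar.baz"}}

print(resolve_table("foo.bar.baz", schemas))    # ('foo', 'bar.baz')
print(resolve_table('"foo.bar".baz', schemas))  # ('foo.bar', 'baz')
```

With quotes, "foo.bar" survives as a single token, so only one schema/table split is possible and the ambiguity disappears. Without quotes, the result depends entirely on the order in which the candidates are tried.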
