+1 Hans
-----Original Message-----
From: Kasper Sørensen [mailto:[email protected]]
Sent: Tuesday, August 20, 2013 4:27 PM
To: [email protected]
Subject: Re: [DISCUSS] use folder name as schema name for file based DataContexts

I've updated my gist/patch [1] to also support using quotes in the
table/column paths. Let's have a vote on this patch, to see if we can
get this in.

[1] https://gist.github.com/kaspersorensen/6210970

2013/8/20 Kasper Sørensen <[email protected]>:
> Agreed on all. Except why should dots in column names be any different
> than schema and table names?
>
> 2013/8/16 Hans Drexler <[email protected]>:
>> I believe that probably, *every* convention will have its drawbacks. Using a
>> factory can help on one hand, but it can also cause great confusion if
>> things get mixed. It also makes things more complex. If we clearly document
>> the choice made, I will live with that.
>>
>> My main point is that we should try to write and document the software in
>> such a way that MetaModel users will not get confused. I like the quotes idea,
>> since that will allow the user to explicitly express what is intended. But
>> then, let's extend it to something like this:
>>
>> "schema_name"."table_name"."column_name"
>>
>> Where schema_name and table_name can contain dots ("."). (I guess column
>> names cannot...)
>>
>> I hope you don't mind me rambling about this...
>>
>> kind regards,
>>
>> Hans
>>
>> -----Original Message-----
>> From: Kasper Sørensen [mailto:[email protected]]
>> Sent: Wednesday, August 14, 2013 2:59 PM
>> To: [email protected]
>> Subject: Re: [DISCUSS] use folder name as schema name for file based
>> DataContexts
>>
>> With those different preferences, we could even consider making something
>> like a "TableNameFactory" which converts filenames into table names. But I
>> guess the crucial point is which default convention to use.
>>
>> Underscoring makes it a bit cleaner to look at the column or table paths,
>> but it also makes the representation less direct. A user could start
>> wondering if there are other characters than dots that will be replaced by
>> underscores etc.
>>
>> It should be noted that MM's parser does support dots in both table and
>> schema names, so this is probably mostly a question of aesthetics.
>>
>> The ambiguity that you point out is also interesting. So far I haven't seen
>> it appear in real life, but technically it could occur that you had two
>> pairs of schemas and tables that would generate an ambiguous table path. For
>> instance:
>>
>> Schema: foo.bar
>> Table: baz
>>
>> and
>>
>> Schema: foo
>> Table: bar.baz
>>
>> The parser would currently favor the second schema ("foo") since it
>> incrementally tries for schema/table/column matches with every
>> dot-separated token. An improvement to the parser would be to allow
>> quote characters, so that you could express your table path like this
>> then:
>>
>> "foo.bar".baz
>>
>> Also I want to note that some databases do support dots in
>> schema/table/column names, so this ambiguity can (although rarely) also
>> occur in an RDBMS or other data sources. It would also be quite common to have
>> some separator (not necessarily a dot) in NoSQL database column names, to
>> indicate a nested field. In HBase for instance they are referred to using a
>> colon, like this: "columnFamily:column".
>>
>> All in all I am mostly feeling like preserving the dots from the filenames,
>> but am also very curious what other people think!
>>
>> 2013/8/14 Hans Drexler <[email protected]>:
>>> Hi,
>>>
>>> First I agree with bumping this issue. When at the customer, this thing
>>> caused a lot of time spent in figuring out what was going on. I am not sure
>>> if I like the extension as part of the table name, because:
>>> - I would never create a table in a relational database with a dot
>>> in the name
>>> - It creates an ambiguity.
>>> If you have a "full" path name to a column, like
>>> "documents.people.csv.name", then it is not clear if the schema name is
>>> "documents.people" and the table name is "csv", or if the schema name is
>>> "documents" and the table name is "people.csv". It seems natural to me that
>>> schema names contain dots, but not table names.
>>>
>>> Alternatives:
>>> - Leave the extension out of the name (probably not acceptable, because
>>> then you can no longer have two "tables" differing only in extension).
>>> Although I must say that personally I think this would be the best solution.
>>>
>>> - Use a conventional name, like:
>>> Schema name: Folder name
>>> Table name: The filename, including extension (all dots replaced by
>>> underscores).
>>> Resulting in e.g. a column path like this:
>>> documents.people_csv.name
>>>
>>> At the customer site, the file I needed to use was actually named with
>>> this pattern: "bar/FOO.PEOPLE.IN.FILE". Using the convention, this would
>>> become:
>>> bar.FOO_PEOPLE_IN_FILE
>>>
>>> IMHO this is preferable to "bar.foo.people.in.file"
>>>
>>> The problem is of course that it would now be impossible to have
>>> another file "bar/FOO_PEOPLE_IN_FILE" :-(
>>>
>>> I am happy to hear other people's thoughts.
>>>
>>>
>>> Hans
>>>
>>>
>>> -----Original Message-----
>>> From: Kasper Sørensen [mailto:[email protected]]
>>> Sent: Wednesday, August 14, 2013 10:18 AM
>>> To: [email protected]
>>> Subject: Re: [DISCUSS] use folder name as schema name for file based
>>> DataContexts
>>>
>>> Rats, made a mistake in that diff. The Gist has been updated [1] and now
>>> contains the ResourceUtils class which was missing before.
>>> [1] https://gist.github.com/kaspersorensen/6210970
>>>
>>> 2013/8/12 Kasper Sørensen <[email protected]>:
>>>> Here's a proposed patch (implemented for CSV and fixedwidth files
>>>> which are the modules that implemented the old schema naming pattern):
>>>> https://gist.github.com/kaspersorensen/6210970
>>>>
>>>> 2013/8/10 Kasper Sørensen <[email protected]>:
>>>>> https://issues.apache.org/jira/browse/METAMODEL-4
>>>>>
>>>>> 2013/8/10 Henry Saputra <[email protected]>:
>>>>>> What is the JIRA for this one?
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 9, 2013 at 2:26 AM, Manuel van den Berg <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> (shouldn't I just vote on the Jira for this?)
>>>>>>>
>>>>>>> manuel
>>>>>>>
>>>>>>> > -----Original Message-----
>>>>>>> > From: Kasper Sørensen [mailto:[email protected]]
>>>>>>> > Sent: Friday, August 09, 2013 9:03
>>>>>>> > To: [email protected]
>>>>>>> > Subject: Re: [DISCUSS] use folder name as schema name for file
>>>>>>> > based DataContexts
>>>>>>> >
>>>>>>> > Allow me to bump this issue (it's my impression that more
>>>>>>> > people have joined in a bit late, after this topic was posted).
>>>>>>> >
>>>>>>> > I think this is one of the more important issues that I would
>>>>>>> > want to fix before we make our first release at Apache.
>>>>>>> >
>>>>>>> > 2013/7/24 Kasper Sørensen <[email protected]>:
>>>>>>> > > Right now we have this slightly odd naming convention for
>>>>>>> > > schema and table names when building metadata for e.g. a CSV
>>>>>>> > > file or a fixed width value file.
>>>>>>> > >
>>>>>>> > > Schema name: The filename, including file extension.
>>>>>>> > > Table name: The filename without extension.
>>>>>>> > > Resulting in e.g. a column path like this:
>>>>>>> > > people.csv.people.name
>>>>>>> > >
>>>>>>> > > I suggest we change it to this convention:
>>>>>>> > >
>>>>>>> > > Schema name: Folder name
>>>>>>> > > Table name: The filename, including file extension.
>>>>>>> > > Resulting in e.g. a column path like this:
>>>>>>> > > documents.people.csv.name
>>>>>>> > >
>>>>>>> > > Why do I think this would be an improvement?
>>>>>>> > >
>>>>>>> > > 1) Because it would first of all make sense to the user to
>>>>>>> > > see the file system's hierarchy reflected in the schema
>>>>>>> > > model.
>>>>>>> > > 2) Because it allows us to make these DataContexts operate
>>>>>>> > > not on a single file, but on a directory of files. I have
>>>>>>> > > seen quite a number of times by now that users of MetaModel,
>>>>>>> > > or users of e.g. DataCleaner, which uses MetaModel quite
>>>>>>> > > heavily, want to do this sort of stuff.
>>>>>>> > > 3) The removing of the file extension stuff is kind of
>>>>>>> > > broken and a strange convention in the first place.
>>>>>>> > >
>>>>>>> > > While this doesn't really break backwards compatibility in
>>>>>>> > > terms of Java code, it would break configuration files and
>>>>>>> > > other stuff of applications that use MetaModel. But I do
>>>>>>> > > believe that can be communicated and handled through
>>>>>>> > > carefully explaining the new convention on the migration page (that
>>>>>>> > > I recently started writing [1]).
>>>>>>> > >
>>>>>>> > > What do you think?
>>>>>>> > >
>>>>>>> > > [1]
>>>>>>> > > http://wiki.apache.org/metamodel/MigratingFromEobjectsMetaModel
>>>>>>>
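
For readers following the thread, here is a minimal sketch of the two points
discussed above: why an unquoted path such as foo.bar.baz is ambiguous, and how
quoting a token removes the ambiguity. This is not MetaModel's parser and not
the code in the gist; the function names split_path and candidate_splits are
hypothetical, and Python is used only to keep the illustration short.

```python
def split_path(path):
    """Split a dot-separated path; dots inside double quotes do not split.

    Hypothetical helper for illustration only, not MetaModel's API.
    """
    tokens, current, in_quotes = [], [], False
    for ch in path:
        if ch == '"':
            in_quotes = not in_quotes  # toggle state; the quote char is dropped
        elif ch == '.' and not in_quotes:
            tokens.append(''.join(current))
            current = []
        else:
            current.append(ch)
    if in_quotes:
        raise ValueError('unbalanced quote in path: ' + path)
    tokens.append(''.join(current))
    return tokens


def candidate_splits(path):
    """List every (schema, table) pair an unquoted table path could denote."""
    tokens = path.split('.')
    return [('.'.join(tokens[:i]), '.'.join(tokens[i:]))
            for i in range(1, len(tokens))]


# Unquoted: two readings are possible, which is the ambiguity from the thread.
print(candidate_splits('foo.bar.baz'))

# Quoted: the schema name keeps its dot, so only one reading remains.
print(split_path('"foo.bar".baz'))
```

The sketch only lists the possible readings; which one a real parser picks
depends on the order in which it tries schema and table matches, as described
in the thread.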
