+1

Hans

-----Original Message-----
From: Kasper Sørensen [mailto:[email protected]] 
Sent: Tuesday, August 20, 2013 4:27 PM
To: [email protected]
Subject: Re: [DISCUSS] use folder name as schema name for file based 
DataContexts

I've updated my gist/patch [1] to also support quotes in the
table/column paths. Let's have a vote on this patch to see if we can get it
in.

[1] https://gist.github.com/kaspersorensen/6210970
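
As an illustration of the quoting idea discussed further down in this thread
("foo.bar".baz versus foo."bar.baz"), here is a minimal sketch of a
quote-aware path splitter. It is not the code from the gist and not
MetaModel's parser; the class and method names are hypothetical, and it
assumes double quotes are the only quote character and that names never
contain a quote themselves.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of MetaModel.
class QuotedPathSplitter {

    /** Splits a path on dots, ignoring dots that appear inside double quotes. */
    static List<String> split(String path) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            if (c == '"') {
                // Quotes only delimit a name; they are not part of it.
                inQuotes = !inQuotes;
            } else if (c == '.' && !inQuotes) {
                // An unquoted dot ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (inQuotes) {
            throw new IllegalArgumentException("Unbalanced quote in path: " + path);
        }
        tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("\"foo.bar\".baz"));                 // [foo.bar, baz]
        System.out.println(split("foo.\"bar.baz\""));                 // [foo, bar.baz]
        System.out.println(split("documents.\"people.csv\".name"));   // [documents, people.csv, name]
    }
}
```

With quotes, the two schema/table pairs from the ambiguity example become
distinguishable, and an unquoted path such as foo.bar.baz still splits into
three plain tokens.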

2013/8/20 Kasper Sørensen <[email protected]>:
> Agreed on all. Except why should dots in column names be any different 
> than schema and table names?
>
> 2013/8/16 Hans Drexler <[email protected]>:
>> I believe that probably *every* convention will have its drawbacks. Using a
>> factory can help on one hand, but it can also cause great confusion if
>> things get mixed. It also makes things more complex. If we clearly document 
>> the choice made, I will live with that.
>>
>> My main point is that we should try to write and document the software in
>> such a way that MetaModel users will not get confused. I like the quotes idea,
>> since that will allow the user to explicitly express what is intended. But
>> then, let's extend it to something like this:
>>
>> "schema_name"."table_name"."column_name"
>>
>> Where schema_name and table_name can contain dots ("."). (I guess column
>> names cannot...)
>>
>> I hope you don't mind me rambling about this...
>>
>> kind regards,
>>
>> Hans
>>
>> -----Original Message-----
>> From: Kasper Sørensen [mailto:[email protected]]
>> Sent: Wednesday, August 14, 2013 2:59 PM
>> To: [email protected]
>> Subject: Re: [DISCUSS] use folder name as schema name for file based 
>> DataContexts
>>
>> With those different preferences, we could even consider making something 
>> like a "TableNameFactory" which converts filenames into table names. But I 
>> guess the crucial point is which default convention to use.
>>
>> Underscoring makes it a bit cleaner to look at the column or table paths, 
>> but it also makes the representation less direct. A user could start 
>> wondering if there are other characters than dots that will be replaced by 
>> underscores etc.
>>
>> It should be noted that MM's parser does support dots in both table and 
>> schema names, so this is probably mostly a question of aesthetics.
>>
>> The ambiguity that you point out is also interesting. So far I haven't seen 
>> it appear in real life, but technically it could occur that you had two 
>> pairs of schemas and tables that would generate an ambiguous table path. For
>> instance:
>>
>> Schema: foo.bar
>> Table: baz
>>
>> and
>>
>> Schema: foo
>> Table: bar.baz
>>
>> The parser would currently favor the second schema ("foo") since it 
>> incrementally tries for schema/table/column matches with every 
>> dot-separated token. An improvement to the parser would be to allow
>> quote characters, so that you could then express your table path like
>> this:
>>
>> "foo.bar".baz
>>
>> Also I want to note that some databases do support dots in 
>> schema/table/column names, so this ambiguity can (although rarely) also 
>> occur in an RDBMS or other data sources. It is also quite common to use
>> some separator (not necessarily a dot) in NoSQL database column names to
>> indicate a nested field. In HBase, for instance, columns are referred to
>> using a colon, like this: "columnFamily:column".
>>
>> All in all I am mostly feeling like preserving the dots from the filenames, 
>> but am also very curious what other people think!
>>
>> 2013/8/14 Hans Drexler <[email protected]>:
>>> Hi,
>>>
>>> First I agree with bumping this issue. When at the customer, this thing 
>>> caused a lot of time spent in figuring out what was going on. I am not sure 
>>> if I like the extension as part of the table name, because:
>>> - I would never create a table in a relational database with a dot 
>>> in the name
>>> - It creates an ambiguity. If you have a "full" path name to a column, like
>>> "documents.people.csv.name", then it is not clear if the schema name is
>>> "documents.people" and the table name is "csv", or that the schema name is 
>>> "documents" and the table name is "people.csv". It seems natural to me that 
>>> schema names contain dots, but not table names.
>>>
>>> Alternatives:
>>> - Leave the extension out of the name (probably not acceptable, because 
>>> then you can no longer have two "tables" differing only in extension). 
>>> Although I must say that personally I think this would be the best solution.
>>>
>>> - Use a conventional name, like:
>>> Schema name: Folder name
>>> Table name: The filename, including extension (all dots replaced by 
>>> underscores).
>>> Resulting in e.g. a column path like this:
>>> documents.people_csv.name
>>>
>>> At the customer site, the file I needed to use was actually called like 
>>> this pattern: "bar/FOO.PEOPLE.IN.FILE". Using the convention, this would 
>>> become:
>>> bar.FOO_PEOPLE_IN_FILE
>>>
>>> IMHO this is preferable to "bar.foo.people.in.file"
>>>
>>> The problem is of course that it would now be impossible to have 
>>> another file "bar/FOO_PEOPLE_IN_FILE" :-(
>>>
>>> I am happy to hear other people's thoughts.
>>>
>>>
>>> Hans
>>>
>>>
>>> -----Original Message-----
>>> From: Kasper Sørensen [mailto:[email protected]]
>>> Sent: Wednesday, August 14, 2013 10:18 AM
>>> To: [email protected]
>>> Subject: Re: [DISCUSS] use folder name as schema name for file based 
>>> DataContexts
>>>
>>> Rats, made a mistake in that diff. The Gist has been updated [1] and now 
>>> contains the ResourceUtils class which was missing before.
>>> [1] https://gist.github.com/kaspersorensen/6210970
>>>
>>> 2013/8/12 Kasper Sørensen <[email protected]>:
>>>> Here's a proposed patch (implemented for CSV and fixedwidth files 
>>>> which are the modules that implemented the old schema naming pattern):
>>>> https://gist.github.com/kaspersorensen/6210970
>>>>
>>>> 2013/8/10 Kasper Sørensen <[email protected]>:
>>>>> https://issues.apache.org/jira/browse/METAMODEL-4
>>>>>
>>>>> 2013/8/10 Henry Saputra <[email protected]>:
>>>>>> What is the JIRA for this one?
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 9, 2013 at 2:26 AM, Manuel van den Berg < 
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> (shouldn't I just vote on the Jira for this?)
>>>>>>>
>>>>>>> manuel
>>>>>>>
>>>>>>> > -----Original Message-----
>>>>>>> > From: Kasper Sørensen [mailto:[email protected]]
>>>>>>> > Sent: Friday, August 09, 2013 9:03
>>>>>>> > To: [email protected]
>>>>>>> > Subject: Re: [DISCUSS] use folder name as schema name for file 
>>>>>>> > based DataContexts
>>>>>>> >
>>>>>>> > Allow me to bump this issue (it's my impression that more
>>>>>>> > people have joined in a bit late, after this topic was posted).
>>>>>>> >
>>>>>>> > I think this is one of the more important issues that I would 
>>>>>>> > want to fix before we make our first release at Apache.
>>>>>>> >
>>>>>>> > 2013/7/24 Kasper Sørensen <[email protected]>:
>>>>>>> > > Right now we have this slightly odd naming convention for 
>>>>>>> > > schema and table names when building metadata for e.g. a CSV 
>>>>>>> > > file or a fixed width value file.
>>>>>>> > >
>>>>>>> > > Schema name: The filename, including file extension.
>>>>>>> > > Table name: The filename without extension.
>>>>>>> > > Resulting in e.g. a column path like this:
>>>>>>> > > people.csv.people.name
>>>>>>> > >
>>>>>>> > > I suggest we change it to this convention:
>>>>>>> > >
>>>>>>> > > Schema name: Folder name
>>>>>>> > > Table name: The filename, including file extension.
>>>>>>> > > Resulting in e.g. a column path like this:
>>>>>>> > > documents.people.csv.name
>>>>>>> > >
>>>>>>> > > Why do I think this would be an improvement?
>>>>>>> > >
>>>>>>> > > 1) Because it would make sense to the user to see the file
>>>>>>> > > system's hierarchy reflected in the schema model.
>>>>>>> > > 2) Because it allows us to make these DataContexts operate
>>>>>>> > > not on a single file, but on a directory of files. I have
>>>>>>> > > seen quite a number of times by now that users of MetaModel,
>>>>>>> > > or users of e.g. DataCleaner, which uses MetaModel quite
>>>>>>> > > heavily, want to do this sort of stuff.
>>>>>>> > > 3) The removal of the file extension is kind of broken and a
>>>>>>> > > strange convention in the first place.
>>>>>>> > >
>>>>>>> > > While this doesn't really break backwards compatibility in 
>>>>>>> > > terms of Java code, it would break configuration files and 
>>>>>>> > > other stuff of applications that use MetaModel. But I do 
>>>>>>> > > believe that can be communicated and handled through 
>>>>>>> > > carefully explaining the new convention on the migration page (that 
>>>>>>> > > I recently started writing [1]).
>>>>>>> > >
>>>>>>> > > What do you think?
>>>>>>> > >
>>>>>>> > > [1]
>>>>>>> > > http://wiki.apache.org/metamodel/MigratingFromEobjectsMetaModel
>>>>>>>
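
For reference, the naming conventions discussed in this thread (folder name
as schema name and file name as table name, plus the variant that replaces
dots with underscores) can be sketched as below. This is an illustrative
sketch only: the class and method names are hypothetical and are not taken
from MetaModel or from the proposed patch.

```java
import java.io.File;

// Hypothetical helper for illustration only; not part of MetaModel.
class FileNamingSketch {

    /** Proposed convention: the schema name is the name of the containing folder. */
    static String schemaName(File file) {
        File parent = file.getAbsoluteFile().getParentFile();
        return parent == null ? "" : parent.getName();
    }

    /** Proposed convention: the table name is the file name, including extension. */
    static String tableName(File file) {
        return file.getName();
    }

    /** Alternative from the thread: replace every dot in the file name by an underscore. */
    static String underscoredTableName(File file) {
        return file.getName().replace('.', '_');
    }

    public static void main(String[] args) {
        File people = new File("documents/people.csv");
        System.out.println(schemaName(people) + "." + tableName(people) + ".name");
        // documents.people.csv.name
        System.out.println(schemaName(people) + "." + underscoredTableName(people) + ".name");
        // documents.people_csv.name

        // The underscore variant maps two different files to the same table name.
        File dotted = new File("bar/FOO.PEOPLE.IN.FILE");
        File underscored = new File("bar/FOO_PEOPLE_IN_FILE");
        System.out.println(underscoredTableName(dotted).equals(underscoredTableName(underscored)));
        // true
    }
}
```

The last line shows the collision Hans points out: once dots are replaced,
"bar/FOO.PEOPLE.IN.FILE" and "bar/FOO_PEOPLE_IN_FILE" can no longer be told
apart by table name.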
