[ https://issues.apache.org/jira/browse/HIVE-11023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590806#comment-14590806 ]
Ashutosh Chauhan commented on HIVE-11023: ----------------------------------------- +1 Longer term I liked your first suggestion of converting data in RDBMS from DN2 to DN1. > Disable directSQL if datanucleus.identifierFactory = datanucleus2 > ----------------------------------------------------------------- > > Key: HIVE-11023 > URL: https://issues.apache.org/jira/browse/HIVE-11023 > Project: Hive > Issue Type: Bug > Components: Metastore > Affects Versions: 1.3.0, 1.2.1, 2.0.0 > Reporter: Sushanth Sowmyan > Assignee: Sushanth Sowmyan > Priority: Critical > Attachments: HIVE-11023.patch > > > We hit an interesting bug in a case where datanucleus.identifierFactory = > datanucleus2 . > The problem is that directSql handgenerates SQL strings assuming > "datanucleus1" naming scheme. If a user has their metastore JDO managed by > datanucleus.identifierFactory = datanucleus2 , the SQL strings we generate > are incorrect. > One simple example of what this results in is the following: whenever DN > persists a field which is held as a List<T>, it winds up storing each T as a > separate line in the appropriate mapping table, and has a column called > INTEGER_IDX, which holds the position in the list. Then, upon reading, it > automatically reads all relevant lines with an ORDER BY INTEGER_IDX, which > results in the list retaining its order. In DN2 naming scheme, the column is > called IDX, instead of INTEGER_IDX. If the user has run appropriate metatool > upgrade scripts, it is highly likely that they have both columns, INTEGER_IDX > and IDX. > Whenever they use JDO, such as with all writes, it will then use the IDX > field, and when they do any sort of optimized reads, such as through > directSQL, it will ORDER BY INTEGER_IDX. > An immediate danger is seen when we consider that the schema of a table is > stored as a List<FieldSchema> , and while IDX has 0,1,2,3,... , INTEGER_IDX > will contain 0,0,0,0,... and thus, any attempt to describe the table or fetch > schema for the table can come up mixed up in the table's native hashing > order, rather than sorted by the index. > This can then result in schema ordering being different from the actual > table. For eg:, if a user has a (a:int,b:string,c:string), a describe on this > may return (c:string, a:int, b: string), and thus, queries which are > inserting after selecting from another table can have ClassCastExceptions > when trying to insert data in the wong order - this is how we discovered this > bug. This problem, however, can be far worse, if there are no type problems - > it is possible, for eg., that if a,b&c were all strings, that that insert > query would succeed but mix up the order, which then results in user table > data being mixed up. This has the potential to be very bad. > We should write a tool to help convert metastores that use "datanucleus2" to > "datanucleus1"(more difficult, needs more one-time testing) or change > directSql to support both(easier to code, but increases test-coverage matrix > significantly and we should really then be testing against both schemes). But > in the short term, we should disable directSql if we see that the > identifierfactory is "datanucleus2" -- This message was sent by Atlassian JIRA (v6.3.4#6332)