Zheng,
In LazySimpleSerDe.initSerdeParams:

    String useJsonSerialize = tbl
        .getProperty(Constants.SERIALIZATION_USE_JSON_OBJECTS);
    serdeParams.jsonSerialize = (useJsonSerialize != null && useJsonSerialize
        .equalsIgnoreCase("true"));
SERIALIZATION_USE_JSON_OBJECTS is set to true in PlanUtils.getTableDesc:
    // It is not a very clean way, and should be modified later - due to
    // compatibility reasons, user sees the results as json for custom
    // scripts and has no way for specifying that.
    // Right now, it is hard-coded in the code
    if (useJSONForLazy) {
      properties.setProperty(Constants.SERIALIZATION_USE_JSON_OBJECTS, "true");
    }
useJSONForLazy is true in the following 2 calls to PlanUtils.getTableDesc:

    SemanticAnalyzer.genScriptPlan -> PlanUtils.getTableDesc
    SemanticAnalyzer.genScriptPlan -> SemanticAnalyzer.getTableDescFromSerDe ->
        PlanUtils.getTableDesc
What is it all about and how should we untangle it (ideally get rid of
SERIALIZATION_USE_JSON_OBJECTS)?
Thanks.
Steven
-----Original Message-----
From: Zheng Shao [mailto:[email protected]]
Sent: Wednesday, September 01, 2010 6:45 PM
To: Steven Wong; [email protected]; John Sichi
Cc: Jerome Boulon
Subject: RE: Deserializing map column via JDBC (HIVE-1378)
Hi Steven,
As far as I remember, the only use case of JSON logic in LazySimpleSerDe is the
FetchTask. Even if there are other cases, we should be able to catch them in
unit tests.
The potential risk is small enough, and the benefit of cleaning it up is pretty
big - it makes the code much easier to understand.
Thanks for getting to it Steven! I am very happy to see that this finally gets
cleaned up!
Zheng
-----Original Message-----
From: Steven Wong [mailto:[email protected]]
Sent: Thursday, September 02, 2010 7:45 AM
To: Zheng Shao; [email protected]; John Sichi
Cc: Jerome Boulon
Subject: RE: Deserializing map column via JDBC (HIVE-1378)
Your suggestion is in line with my earlier proposal of fixing FetchTask. The
only major difference is the moving of the JSON-related logic from
LazySimpleSerDe to a new serde called DelimitedJSONSerDe.
Is it safe to get rid of the JSON-related logic in LazySimpleSerDe? Sounds like
you're implying that it is safe, but I'd like to confirm with you. I don't
really know whether there are components other than FetchTask that rely on
LazySimpleSerDe and its JSON capability (the useJSONSerialize flag doesn't have
to be true for LazySimpleSerDe to use JSON).
If it is safe, I am totally fine with introducing DelimitedJSONSerDe.
Combining your suggestion and my proposal would look like:
0. Move JSON serialization logic from LazySimpleSerDe to a new serde called
DelimitedJSONSerDe.
1. By default, hive.fetch.output.serde = DelimitedJSONSerDe.
2. When JDBC driver connects to Hive server, execute "set
hive.fetch.output.serde = LazySimpleSerDe".
3. In Hive server:
(a) If hive.fetch.output.serde == DelimitedJSONSerDe, FetchTask uses
DelimitedJSONSerDe to maintain today's serialization behavior (tab as the field
delimiter, "NULL" as the null sequence, JSON for non-primitives).
(b) If hive.fetch.output.serde == LazySimpleSerDe, FetchTask uses
LazySimpleSerDe with a schema to ctrl-delimit everything.
4. JDBC driver deserializes with LazySimpleSerDe instead of DynamicSerDe.
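To make step 3 concrete, here is a minimal sketch of the two output formats for the same row (a map column, a bigint, and a string). The class and method names are hypothetical, and the row model is simplified - Hive's real serdes work through ObjectInspectors - but the delimiters shown (^A/^B/^C for ctrl-delimited output) are the standard Hive defaults:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the two FetchTask output modes described above.
public class FetchFormats {
    // (a) DelimitedJSONSerDe-style: tab between fields, JSON for the map.
    static String jsonStyle(Map<String, String> m, long l, String s) {
        StringBuilder json = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : m.entrySet()) {
            if (!first) json.append(",");
            json.append("\"").append(e.getKey()).append("\":\"")
                .append(e.getValue()).append("\"");
            first = false;
        }
        json.append("}");
        return json + "\t" + l + "\t" + s;
    }

    // (b) LazySimpleSerDe-style: ctrl-delimited everything
    // (^A = \u0001 between fields, ^B = \u0002 between map entries,
    //  ^C = \u0003 between a map key and its value).
    static String ctrlStyle(Map<String, String> m, long l, String s) {
        StringBuilder sb = new StringBuilder();
        boolean first = true;
        for (Map.Entry<String, String> e : m.entrySet()) {
            if (!first) sb.append('\u0002');
            sb.append(e.getKey()).append('\u0003').append(e.getValue());
            first = false;
        }
        return sb + "\u0001" + l + "\u0001" + s;
    }

    public static void main(String[] args) {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("a", "b");
        m.put("x", "y");
        // Same row, two wire formats: JSON map + tabs vs. all-ctrl.
        System.out.println(jsonStyle(m, 123, "abc"));
        System.out.println(ctrlStyle(m, 123, "abc"));
    }
}
```

Format (a) is what a human sees today; format (b) is what a client-side LazySimpleSerDe can round-trip.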
Steven
-----Original Message-----
From: Zheng Shao [mailto:[email protected]]
Sent: Wednesday, September 01, 2010 3:22 AM
To: Steven Wong; [email protected]; John Sichi
Cc: Jerome Boulon
Subject: RE: Deserializing map column via JDBC (HIVE-1378)
Hi Steven,
Sorry for the late reply. The email slipped past me...
This issue was brought up multiple times. In my opinion, using JSON in
LazySimpleSerDe (inherited from ColumnsetSerDe, MetadataColumnsetSerDe,
DynamicSerDe) was a long-time legacy problem that never got fixed.
LazySimpleSerDe was supposed to do delimited format only.
The cleanest way to do that is to:
1. Get rid of the JSON-related logic in LazySimpleSerDe;
2. Introduce another "DelimitedJSONSerDe" (without deserialization capability)
that does JSON serialization for complex fields. (We have never had, or needed,
JSON deserialization.)
3. Configure the FetchTask to use the new SerDe by default, and LazySimpleSerDe
in case it's JDBC. This is for serialization only. We might need to have 2
SerDe fields in FetchTask - one for deserializing the data from the file, one
for serializing the data to stdout/JDBC, etc.
I can help review the code (please ping me) if you decide to go down this route.
Zheng
-----Original Message-----
From: Steven Wong [mailto:[email protected]]
Sent: Monday, August 30, 2010 3:46 PM
To: [email protected]; John Sichi
Cc: Zheng Shao; Jerome Boulon
Subject: RE: Deserializing map column via JDBC (HIVE-1378)
Any guidance on how I/we should proceed on HIVE-1378 and HIVE-1606?
-----Original Message-----
From: Steven Wong
Sent: Friday, August 27, 2010 2:24 PM
To: [email protected]; 'John Sichi'
Cc: Zheng Shao; Jerome Boulon
Subject: RE: Deserializing map column via JDBC (HIVE-1378)
A related jira is HIVE-1606 (For a null value in a string column, JDBC driver
returns the string "NULL"). What happens is the server-side serde already turns
the null into "NULL". Both null and "NULL" are serialized as "NULL"; the
client-side serde has no hope. I bring this jira up to point out that JDBC's
server side uses a serialization format that appears intended for display
(human consumption) instead of deserialization. The mixing of non-JSON and JSON
serializations is perhaps another manifestation.
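To make the HIVE-1606 ambiguity concrete, here is a minimal sketch (the method name is a hypothetical stand-in, not actual Hive code) of why the client-side serde has no hope:

```java
// Sketch of the HIVE-1606 ambiguity: a real null and the literal string
// "NULL" serialize to the same bytes, so the distinction is lost before
// the client ever sees the row.
public class NullAmbiguity {
    // Stand-in for what the server-side serde effectively does today.
    static String serializeCell(String value) {
        return value == null ? "NULL" : value;
    }

    public static void main(String[] args) {
        String fromNull = serializeCell(null);
        String fromLiteral = serializeCell("NULL");
        // Byte-identical output: no deserializer can recover which was which.
        System.out.println(fromNull.equals(fromLiteral)); // prints "true"
    }
}
```

Any real fix has to change what the server emits (e.g. a distinct null sequence), which is why this is necessarily a server-side change.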
Also, fixing HIVE-1606 will obviously require a server-side change. Both
HIVE-1606 and HIVE-1378 (the jira at hand) can share some server-side change,
if HIVE-1378 ends up changing the server side too.
Steven
-----Original Message-----
From: John Sichi [mailto:[email protected]]
Sent: Friday, August 27, 2010 11:29 AM
To: Steven Wong
Cc: Zheng Shao; [email protected]; Jerome Boulon
Subject: Re: Deserializing map column via JDBC (HIVE-1378)
I don't know enough about the serdes to say whether that's a problem...maybe
someone else does? It seems like as long as the JSON form doesn't include the
delimiter unescaped, it might work?
JVS
On Aug 26, 2010, at 6:29 PM, Steven Wong wrote:
That sounds like it'll work, at least conceptually. But if the row contains
primitive and non-primitive columns, the row serialization will be a mix of
non-JSON and JSON serializations, right? Is that a good thing?
From: John Sichi [mailto:[email protected]]
Sent: Thursday, August 26, 2010 12:11 PM
To: Steven Wong
Cc: Zheng Shao; [email protected]<mailto:[email protected]>;
Jerome Boulon
Subject: Re: Deserializing map column via JDBC (HIVE-1378)
If you replace DynamicSerDe with LazySimpleSerDe on the JDBC client side, can't
you then tell it to expect JSON serialization for the maps? That way you can
leave the FetchTask server side as is.
JVS
On Aug 24, 2010, at 2:50 PM, Steven Wong wrote:
I got sidetracked for a while....
Looking at client.fetchOne, it is a call to the Hive server, which shows the
following call stack:
    SerDeUtils.getJSONString(Object, ObjectInspector) line: 205
    LazySimpleSerDe.serialize(Object, ObjectInspector) line: 420
    FetchTask.fetch(ArrayList<String>) line: 130
    Driver.getResults(ArrayList<String>) line: 660
    HiveServer$HiveServerHandler.fetchOne() line: 238
In other words, FetchTask.mSerde (an instance of LazySimpleSerDe) serializes
the map column into JSON strings. It's because FetchTask.mSerde has been
initialized by FetchTask.initialize to do it that way.
It appears that the fix is to initialize FetchTask.mSerde differently to do
ctrl-serialization instead - presumably for the JDBC use case only and not for
other use cases of FetchTask. Further, it appears that FetchTask.mSerde will do
ctrl-serialization if it is initialized (via the properties "columns" and
"columns.types") with the proper schema.
Are these right? Pointers on how to get the proper schema? (From
FetchTask.work?) And on how to restrict the change to JDBC only? (I have no
idea.)
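For reference, here is a minimal sketch of the schema properties mentioned above, filled in for the "select mapcol, bigintcol, stringcol" example. The property keys "columns" and "columns.types" are the ones from this thread; the value formats (comma-separated names, colon-separated types so the comma inside map<string,string> stays unambiguous) are my assumption about the convention, and the actual serde initialization call is omitted:

```java
import java.util.Properties;

// Sketch of the "columns"/"columns.types" schema properties that a serde
// like LazySimpleSerDe is initialized with. In FetchTask these values
// would come from the query plan, not be hard-coded.
public class SchemaProps {
    static Properties buildSchema() {
        Properties props = new Properties();
        // Comma-separated column names.
        props.setProperty("columns", "mapcol,bigintcol,stringcol");
        // Colon-separated type names (assumed), so the comma inside
        // map<string,string> does not collide with the separator.
        props.setProperty("columns.types", "map<string,string>:bigint:string");
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildSchema();
        System.out.println(props.getProperty("columns"));
        System.out.println(props.getProperty("columns.types"));
    }
}
```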
For symmetry, LazySimpleSerDe should be used to do ctrl-deserialization on the
client side, per Zheng's suggestion.
Steven
From: Zheng Shao [mailto:[email protected]]
Sent: Monday, August 16, 2010 3:57 PM
To: Steven Wong; [email protected]<mailto:[email protected]>
Cc: Jerome Boulon
Subject: RE: Deserializing map column via JDBC (HIVE-1378)
I think the call to client.fetchOne should use delimited format, so that
DynamicSerDe can deserialize it.
This should be a good short-term fix.
Also on a higher level, DynamicSerDe is deprecated. It will be great to use
LazySimpleSerDe to handle all serialization/deserializations instead.
Zheng
From: Steven Wong [mailto:[email protected]]
Sent: Friday, August 13, 2010 7:02 PM
To: Zheng Shao; [email protected]<mailto:[email protected]>
Cc: Jerome Boulon
Subject: Deserializing map column via JDBC (HIVE-1378)
Trying to work on HIVE-1378. My first step is to get the Hive JDBC driver to
return actual values for mapcol in the result set of "select mapcol, bigintcol,
stringcol from foo", where mapcol is a map<string,string> column, instead of
the current behavior of complaining that mapcol's column type is not recognized.
I changed HiveResultSetMetaData.{getColumnType,getColumnTypeName} to recognize
the map type, but then the returned value for mapcol is always {}, even though
mapcol does contain some key-value entries. Turns out this is happening in
HiveQueryResultSet.next:
1. The call to client.fetchOne returns the string
"{"a":"b","x":"y"}\t123\tabc" (where \t is a tab).
2. The serde (DynamicSerDe ds) deserializes the string to the list
[{},123,"abc"].
The serde cannot correctly deserialize the map because apparently the map is
not in the serde's expected serialization format. The serde has been
initialized with TCTLSeparatedProtocol.
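A small sketch of the mismatch, using plain string handling rather than the actual serdes: a protocol like TCTLSeparatedProtocol splits fields on ^A (\u0001), map entries on ^B (\u0002), and key from value on ^C (\u0003), but the row it receives is tab-delimited with a JSON-encoded map:

```java
// Sketch of why the deserializer gets nothing useful out of the fetched
// row: its field separator (^A) never appears in tab-delimited output.
public class DelimiterMismatch {
    // Split a row into fields on ^A, keeping trailing empties.
    static String[] splitFields(String row) {
        return row.split("\u0001", -1);
    }

    public static void main(String[] args) {
        // What client.fetchOne actually returns today (tab-delimited, JSON map):
        String fetched = "{\"a\":\"b\",\"x\":\"y\"}\t123\tabc";
        System.out.println(splitFields(fetched).length); // prints "1"

        // What a ctrl-separated serde could decode - same row, ctrl-delimited:
        String ctrl = "a\u0003b\u0002x\u0003y\u0001123\u0001abc";
        System.out.println(splitFields(ctrl).length); // prints "3"
    }
}
```

The fetched row collapses into a single undecodable "field", which is consistent with the map coming back as {}.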
Should we make client.fetchOne return a ctrl-separated string? Or should we use
a different serde/format in HiveQueryResultSet? It seems the first way is
right; correct me if that's wrong. And how do we do that?
Thanks.
Steven