[ https://issues.apache.org/jira/browse/HIVE-21240?focusedWorklogId=230698&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-230698 ]

ASF GitHub Bot logged work on HIVE-21240:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 22/Apr/19 13:46
            Start Date: 22/Apr/19 13:46
    Worklog Time Spent: 10m 
      Work Description: BELUGABEHR commented on pull request #530: HIVE-21240: JSON SerDe Deserialize Re-Write
URL: https://github.com/apache/hive/pull/530#discussion_r277294622
 
 

 ##########
 File path: serde/src/java/org/apache/hadoop/hive/serde2/JsonSerDe.java
 ##########
 @@ -63,76 +43,151 @@
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
-@SerDeSpec(schemaProps = {serdeConstants.LIST_COLUMNS,
-    serdeConstants.LIST_COLUMN_TYPES,
-    serdeConstants.TIMESTAMP_FORMATS })
-
+/**
+ * Hive SerDe for processing JSON formatted data. This is typically paired with
+ * the TextInputFormat and therefore each line provided to this SerDe must be a
+ * single, complete JSON object.<br/>
+ * <h2>Example</h2>
+ * <p>
+ * {"name="john","age"=30}<br/>
+ * {"name="sue","age"=32}
+ * </p>
+ */
+@SerDeSpec(schemaProps = { serdeConstants.LIST_COLUMNS,
+    serdeConstants.LIST_COLUMN_TYPES, serdeConstants.TIMESTAMP_FORMATS,
+    JsonSerDe.BINARY_FORMAT, JsonSerDe.IGNORE_EXTRA })
 public class JsonSerDe extends AbstractSerDe {
 
   private static final Logger LOG = LoggerFactory.getLogger(JsonSerDe.class);
+
+  public static final String BINARY_FORMAT = "json.binary.format";
+  public static final String IGNORE_EXTRA = "text.ignore.extra.fields";
+  public static final String NULL_EMPTY_LINES = "text.null.empty.line";
+
   private List<String> columnNames;
 
-  private HiveJsonStructReader structReader;
+  private BinaryEncoding binaryEncoding;
+  private boolean nullEmptyLines;
+
+  private HiveJsonReader jsonReader;
+  private HiveJsonWriter jsonWriter;
   private StructTypeInfo rowTypeInfo;
+  private StructObjectInspector soi;
 
+  /**
+   * Initialize the SerDe. By default, values produced by deserialization are
+   * wrapped in Hadoop Writable objects and values consumed by serialization
+   * are expected to be Java primitive objects.
+   */
   @Override
-  public void initialize(Configuration conf, Properties tbl)
-    throws SerDeException {
-    List<TypeInfo> columnTypes;
+  public void initialize(final Configuration conf, final Properties tbl)
+      throws SerDeException {
+    initialize(conf, tbl, true);
+  }
+
+  /**
+   * Initialize the SerDe.
+   *
+   * @param conf System properties; may be null at compile time
+   * @param tbl table properties
+   * @param writeablePrimitivesDeserialize true if deserialized primitives
+   *          should be wrapped in Hadoop Writable objects
+   */
+  public void initialize(final Configuration conf, final Properties tbl,
+      final boolean writeablePrimitivesDeserialize) {
+
     LOG.debug("Initializing JsonSerDe: {}", tbl.entrySet());
 
     // Get column names
-    String columnNameProperty = tbl.getProperty(serdeConstants.LIST_COLUMNS);
-    final String columnNameDelimiter = tbl.containsKey(serdeConstants.COLUMN_NAME_DELIMITER) ? tbl
-        .getProperty(serdeConstants.COLUMN_NAME_DELIMITER)
-      : String.valueOf(SerDeUtils.COMMA);
-    // all table column names
-    if (columnNameProperty.isEmpty()) {
-      columnNames = Collections.emptyList();
-    } else {
-      columnNames = Arrays.asList(columnNameProperty.split(columnNameDelimiter));
-    }
+    final String columnNameProperty =
+        tbl.getProperty(serdeConstants.LIST_COLUMNS);
+    final String columnNameDelimiter = tbl.getProperty(
+        serdeConstants.COLUMN_NAME_DELIMITER, String.valueOf(SerDeUtils.COMMA));
+
+    this.columnNames = columnNameProperty.isEmpty() ? Collections.emptyList()
+        : Arrays.asList(columnNameProperty.split(columnNameDelimiter));
 
     // all column types
-    String columnTypeProperty = tbl.getProperty(serdeConstants.LIST_COLUMN_TYPES);
-    if (columnTypeProperty.isEmpty()) {
-      columnTypes = Collections.emptyList();
-    } else {
-      columnTypes = TypeInfoUtils.getTypeInfosFromTypeString(columnTypeProperty);
-    }
+    final String columnTypeProperty =
+        tbl.getProperty(serdeConstants.LIST_COLUMN_TYPES);
+
+    final List<TypeInfo> columnTypes =
+        columnTypeProperty.isEmpty() ? Collections.emptyList()
+            : TypeInfoUtils.getTypeInfosFromTypeString(columnTypeProperty);
 
     LOG.debug("columns: {}, {}", columnNameProperty, columnNames);
     LOG.debug("types: {}, {} ", columnTypeProperty, columnTypes);
 
     assert (columnNames.size() == columnTypes.size());
 
-    rowTypeInfo = (StructTypeInfo) TypeInfoFactory.getStructTypeInfo(columnNames, columnTypes);
+    final String nullEmpty = tbl.getProperty(NULL_EMPTY_LINES, "false");
+    this.nullEmptyLines = Boolean.parseBoolean(nullEmpty);
+
+    this.rowTypeInfo = (StructTypeInfo) TypeInfoFactory
+        .getStructTypeInfo(columnNames, columnTypes);
+
+    this.soi = (StructObjectInspector) TypeInfoUtils
+        .getStandardWritableObjectInspectorFromTypeInfo(this.rowTypeInfo);
+
+    final TimestampParser tsParser;
+    final String parserFormats =
+        tbl.getProperty(serdeConstants.TIMESTAMP_FORMATS);
+    if (parserFormats != null) {
+      tsParser =
+          new TimestampParser(HiveStringUtils.splitAndUnEscape(parserFormats));
+    } else {
+      tsParser = new TimestampParser();
+    }
+
+    final String binaryEncodingStr = tbl.getProperty(BINARY_FORMAT, "base64");
+    this.binaryEncoding =
+        BinaryEncoding.valueOf(binaryEncodingStr.toUpperCase());
+
+    this.jsonReader = new HiveJsonReader(this.soi, tsParser);
+    this.jsonWriter = new HiveJsonWriter(this.binaryEncoding, columnNames);
+
+    this.jsonReader.setBinaryEncoding(binaryEncoding);
+    this.jsonReader.enable(HiveJsonReader.Feature.COL_INDEX_PARSING);
 
-    TimestampParser tsParser = new TimestampParser(
-        HiveStringUtils.splitAndUnEscape(tbl.getProperty(serdeConstants.TIMESTAMP_FORMATS)));
-    structReader = new HiveJsonStructReader(rowTypeInfo, tsParser);
-    structReader.setIgnoreUnknownFields(true);
-    structReader.enableHiveColIndexParsing(true);
-    structReader.setWritablesUsage(true);
+    if (writeablePrimitivesDeserialize) {
+      this.jsonReader.enable(HiveJsonReader.Feature.PRIMITIVE_TO_WRITABLE);
+    }
+
+    final String ignoreExtras = tbl.getProperty(IGNORE_EXTRA, "true");
+    if (Boolean.parseBoolean(ignoreExtras)) {
+      this.jsonReader.enable(HiveJsonReader.Feature.IGNORE_UNKNOWN_FIELDS);
+    }
+
+    LOG.debug("JSON Struct Reader: {}", jsonReader);
+    LOG.debug("JSON Struct Writer: {}", jsonWriter);
   }
 
   /**
-   * Takes JSON string in Text form, and has to return an object representation above
-   * it that's readable by the corresponding object inspector.
+   * Deserialize an object out of a Writable blob containing JSON text. The
+   * return value of this function will be constant since the function will
+   * reuse the returned object. If the client wants to keep a copy of the
+   * object, the client needs to clone the returned value by calling
+   * ObjectInspectorUtils.copyToStandardObject().
    *
-   * For this implementation, since we're using the jackson parser, we can construct
-   * our own object implementation, and we use HCatRecord for it
+   * @param blob The Writable (Text) object containing a serialized object
+   * @return A List containing all the values of the row
    */
   @Override
-  public Object deserialize(Writable blob) throws SerDeException {
+  public Object deserialize(final Writable blob) throws SerDeException {
+    final Text t = (Text) blob;
+
+    if (t.getLength() == 0) {
+      if (!this.nullEmptyLines) {
+        throw new SerDeException("Encountered an empty row in the text file");
+      }
+      final int fieldCount = soi.getAllStructFieldRefs().size();
+      return Collections.nCopies(fieldCount, null);
+    }
 
-    Object row;
-    Text t = (Text) blob;
     try {
-      row = structReader.parseStruct(new ByteArrayInputStream((t.getBytes()), 0, t.getLength()));
-      return row;
+      return jsonReader.parseStruct(
+          new ByteArrayInputStream((t.getBytes()), 0, t.getLength()));
     } catch (Exception e) {
-      LOG.warn("Error [{}] parsing json text [{}].", e, t);
+      LOG.warn("Error parsing JSON text [{}].", t, e);
 
 Review comment:
   Thanks for pointing this out.  I am going to lower this logging level to `debug` because there is also an Exception thrown here.  It should either log or throw, not both.
   
   https://community.oracle.com/docs/DOC-983543#logAndThrow
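   
   For reference, a minimal sketch of the adjusted catch block (this assumes the exception is wrapped and rethrown as a `SerDeException`, which the diff context above cuts off before showing):
   
   ```java
   try {
     return jsonReader.parseStruct(
         new ByteArrayInputStream(t.getBytes(), 0, t.getLength()));
   } catch (Exception e) {
     // Log at debug only; the rethrown exception already carries the
     // failure up the stack, so also logging at warn would report it twice.
     LOG.debug("Error parsing JSON text [{}].", t, e);
     throw new SerDeException(e);
   }
   ```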
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 230698)
    Time Spent: 1h 40m  (was: 1.5h)

> JSON SerDe Re-Write
> -------------------
>
>                 Key: HIVE-21240
>                 URL: https://issues.apache.org/jira/browse/HIVE-21240
>             Project: Hive
>          Issue Type: Improvement
>          Components: Serializers/Deserializers
>    Affects Versions: 4.0.0, 3.1.1
>            Reporter: David Mollitor
>            Assignee: David Mollitor
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>         Attachments: HIVE-21240.1.patch, HIVE-21240.1.patch, 
> HIVE-21240.10.patch, HIVE-21240.11.patch, HIVE-21240.11.patch, 
> HIVE-21240.11.patch, HIVE-21240.11.patch, HIVE-21240.2.patch, 
> HIVE-21240.3.patch, HIVE-21240.4.patch, HIVE-21240.5.patch, 
> HIVE-21240.6.patch, HIVE-21240.7.patch, HIVE-21240.9.patch, 
> HIVE-24240.8.patch, kafka_storage_handler.diff
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> The JSON SerDe has a few issues, I will link them to this JIRA.
> * Use Jackson Tree parser instead of manually parsing
> * Added support for base-64 encoded data (the expected format when using JSON)
> * Added support to skip blank lines (returns all columns as null values)
> * Current JSON parser accepts, but does not apply, custom timestamp formats 
> in most cases
> * Added some unit tests
> * Added cache for column-name to column-index searches, currently O(n) for 
> each row processed, for each column in the row
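
The new table properties named in the patch can also be exercised outside of a Hive query. Below is a minimal standalone sketch; the property names come from the diff above, while the column schema and sample rows are illustrative only:

```java
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde.serdeConstants;
import org.apache.hadoop.hive.serde2.JsonSerDe;
import org.apache.hadoop.io.Text;

public class JsonSerDeExample {
  public static void main(String[] args) throws Exception {
    final Properties tbl = new Properties();
    tbl.setProperty(serdeConstants.LIST_COLUMNS, "name,age");
    tbl.setProperty(serdeConstants.LIST_COLUMN_TYPES, "string,int");
    // Properties introduced by this patch (names taken from the diff above).
    tbl.setProperty(JsonSerDe.BINARY_FORMAT, "base64");
    tbl.setProperty(JsonSerDe.IGNORE_EXTRA, "true");
    tbl.setProperty(JsonSerDe.NULL_EMPTY_LINES, "true");

    final JsonSerDe serde = new JsonSerDe();
    serde.initialize(new Configuration(), tbl);

    // One complete JSON object per line, as the class Javadoc requires.
    System.out.println(serde.deserialize(new Text("{\"name\":\"john\",\"age\":30}")));

    // With text.null.empty.line=true, an empty line yields a row of all-null
    // columns instead of raising a SerDeException.
    System.out.println(serde.deserialize(new Text("")));
  }
}
```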



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
