[GitHub] spark pull request #22009: [SPARK-24882][SQL] improve data source v2 API

jose-torres Tue, 07 Aug 2018 17:31:05 -0700

GitHub user jose-torres commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22009#discussion_r208425199
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/streaming/StreamingReadSupport.java ---
    @@ -0,0 +1,49 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.sources.v2.reader.streaming;
    +
    +import org.apache.spark.sql.sources.v2.reader.ReadSupport;
    +
    +/**
    + * A base interface for streaming read support. This is package private and is invisible to data
    + * sources. Data sources should implement concrete streaming read support interfaces:
    + * {@link MicroBatchReadSupport} or {@link ContinuousReadSupport}.
    + */
    +interface StreamingReadSupport extends ReadSupport {
    +
    +  /**
    +   * Returns the initial offset for a streaming query to start reading from. Note that the
    +   * streaming data source should not assume that it will start reading from its
    +   * {@link #initialOffset()} value: if Spark is restarting an existing query, it will restart from
    +   * the check-pointed offset rather than the initial one.
    +   */
    +  Offset initialOffset();
    +
    +  /**
    +   * Deserialize a JSON string into an Offset of the implementation-defined offset type.
    +   *
    +   * @throws IllegalArgumentException if the JSON does not encode a valid offset for this reader
    +   */
    +  Offset deserializeOffset(String json);
    --- End diff --
    
    I think I understand what you're saying. I could get behind a proposal to simply define "arbitrary JSON string" as the one and only offset type, with each connector responsible for writing and parsing JSON however it'd like. All the existing offsets are trivial case classes anyway; it'd be a bit of a migration, but nothing architecturally difficult to handle.
    
    I don't see how a `toBytes` method would help with the problem. Neither arbitrary byte arrays nor arbitrary JSON strings let Spark know what type it's supposed to instantiate.
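
    As a rough illustration of the point above, here is a minimal sketch in plain Java with no Spark dependency. Every name in it (`SketchOffset`, `PositionOffset`, `PositionReadSupport`, `OffsetJsonSketch`) is hypothetical and stands in for the real types; the JSON layout `{"position":N}` is likewise an assumption made for the example. It shows that the serialized string carries no type information on its own, so only the connector's `deserializeOffset` can decide which concrete offset class to build.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical driver for the sketch: round-trips one offset through JSON.
class OffsetJsonSketch {
  public static void main(String[] args) {
    PositionReadSupport readSupport = new PositionReadSupport();
    String checkpointed = new PositionOffset(42L).json();
    // The engine only ever sees the string; the connector rebuilds the type.
    SketchOffset restored = readSupport.deserializeOffset(checkpointed);
    System.out.println(checkpointed + " -> " + restored.getClass().getSimpleName());
  }
}

// Hypothetical stand-in for a streaming offset; NOT the real Spark class.
abstract class SketchOffset {
  abstract String json();
}

// Hypothetical connector-defined offset holding a single long position.
final class PositionOffset extends SketchOffset {
  final long position;

  PositionOffset(long position) {
    this.position = position;
  }

  @Override
  String json() {
    // Assumed wire format for this sketch only.
    return "{\"position\":" + position + "}";
  }
}

// Hypothetical connector-side read support. It alone knows that the
// string encodes a PositionOffset, so the parsing has to live here.
final class PositionReadSupport {
  private static final Pattern FORMAT =
      Pattern.compile("\\s*\\{\\s*\"position\"\\s*:\\s*(-?\\d+)\\s*\\}\\s*");

  SketchOffset deserializeOffset(String json) {
    if (json == null) {
      throw new IllegalArgumentException("Offset JSON must not be null");
    }
    Matcher m = FORMAT.matcher(json);
    if (!m.matches()) {
      throw new IllegalArgumentException("Not a valid position offset: " + json);
    }
    // NumberFormatException (out-of-range digits) is itself an
    // IllegalArgumentException, matching the documented contract.
    return new PositionOffset(Long.parseLong(m.group(1)));
  }
}
```

    Swapping `String` for `byte[]` in this sketch would change only the encoding, not who has to pick the concrete type, which is why a `toBytes` method does not seem to address the problem by itself.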



