Mathew Wicks created SPARK-20353:
------------------------------------

             Summary: Implement Tensorflow TFRecords file format
                 Key: SPARK-20353
                 URL: https://issues.apache.org/jira/browse/SPARK-20353
             Project: Spark
          Issue Type: Improvement
          Components: Input/Output, SQL
    Affects Versions: 2.1.0
            Reporter: Mathew Wicks


Spark is a very good preprocessing engine for tools like TensorFlow. However, 
we lack native support for TensorFlow's core file format, TFRecords.

There is a project that implements this functionality as an external JAR, but 
it is not user friendly or robust enough for production use:
https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-connector

There is some discussion of that project here:
https://github.com/tensorflow/ecosystem/issues/32

If we were to implement "tfrecords" as a format that DataFrames can read and 
write, we would have to account for the various data types that can be present 
in Spark columns, and decide which ones are actually useful in TensorFlow.
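
To make the type question concrete, here is a rough sketch of one possible
mapping. It relies on the fact that a `tf.train.Feature` holds exactly one of
three value lists (`bytes_list`, `float_list`, `int64_list`). The function
name `feature_kind`, the use of Spark SQL simple type names as input, and
every mapping choice below are assumptions for illustration only, not part of
Spark or of the connector.

```python
# Sketch only: decide which tf.train.Feature value list a Spark column type
# could be written to. All names and mapping choices here are hypothetical.

INT64_TYPES = {"boolean", "tinyint", "smallint", "int", "bigint"}
FLOAT_TYPES = {"float", "double"}  # float_list is 32-bit, so double loses precision
BYTES_TYPES = {"string", "binary"}


def feature_kind(spark_type: str) -> str:
    """Return 'int64_list', 'float_list' or 'bytes_list' for a Spark type name.

    Raises ValueError for types with no obvious flat representation.
    """
    t = spark_type.strip().lower()

    # A one-level array of a supported scalar maps to the same list kind.
    if t.startswith("array<") and t.endswith(">"):
        inner = t[len("array<"):-1].strip()
        if inner.startswith("array<"):
            # A single Feature is flat; nested arrays would need another design.
            raise ValueError("nested arrays are not handled: " + spark_type)
        return feature_kind(inner)

    if t in INT64_TYPES:
        return "int64_list"
    if t in FLOAT_TYPES:
        return "float_list"
    if t in BYTES_TYPES:
        return "bytes_list"
    if t == "vector":
        # Assumption: an ML vector column is written out as dense floats.
        return "float_list"

    raise ValueError("no TFRecords mapping chosen for: " + spark_type)


if __name__ == "__main__":
    for name in ["bigint", "array<double>", "string", "vector"]:
        print(name, "->", feature_kind(name))
```

Types such as maps, structs, timestamps and decimals are deliberately left
unmapped in this sketch; deciding what to do with them is part of the open
question.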

Note: The `spark-tensorflow-connector` mentioned above does not properly 
support the vector data type.

Further discussion of whether this is within the scope of Spark SQL is strongly 
welcomed.



