Alex Amato created BEAM-618:
-------------------------------

             Summary: Python SDK writes non-RFC-compliant JSON files for BQ 
Export
                 Key: BEAM-618
                 URL: https://issues.apache.org/jira/browse/BEAM-618
             Project: Beam
          Issue Type: Bug
          Components: sdk-py
            Reporter: Alex Amato
            Assignee: Frances Perry


The Python SDK uses the built-in json.dumps to write JSON files to GCS for the 
BQ Exporter. BigQuery can fail to parse these files when it tries to load them 
into a BQ table, because json.dumps can emit JSON that does not conform 
to the IETF JSON RFC.

A few cases that are not RFC compliant are listed in the json module 
documentation:
https://docs.python.org/2/library/json.html#standard-compliance-and-interoperability

The main issue we run into is with the NAN, INF and -INF values.
These fail with a confusing error (and we delete the GCS files, making it hard 
to debug):
JSON table encountered too many errors, giving up. Rows JSON parsing error in 
row starting at position
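
For illustration, a minimal sketch of the default behavior described above. The 
row and its field names are made up; only json.dumps and its sort_keys argument 
come from the standard library.

```python
import json

# Hypothetical row; the field names are invented for this example.
row = {"name": "a", "score": float("nan"), "limit": float("inf")}

# With the default allow_nan=True, json.dumps emits the JavaScript-style
# tokens NaN / Infinity / -Infinity, which are not valid JSON per the RFC.
encoded = json.dumps(row, sort_keys=True)
print(encoded)  # {"limit": Infinity, "name": "a", "score": NaN}
```

A strict JSON parser rejects the NaN and Infinity tokens in this output.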

We can set the allow_nan argument of json.dumps to False to address these 
issues, so that when a user tries to write a file with INF, -INF or NAN, 
json.dumps raises an error instead of writing invalid JSON.

With this argument set, json.dumps raises the error below when it is called 
with NAN/INF values. We may want to catch this error and report explicitly that 
INF and NAN are not allowed.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/json/__init__.py", line 250, in dumps
    sort_keys=sort_keys, **kw).encode(obj)
  File "/usr/lib/python2.7/json/encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python2.7/json/encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
ValueError: Out of range float values are not JSON compliant
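
One possible way to catch and re-raise this error is sketched below. The helper 
name dumps_strict and the wording of its message are hypothetical and not part 
of the SDK; only json.dumps and its allow_nan argument are real.

```python
import json


def dumps_strict(row):
    """Hypothetical helper: encode a row as strict JSON or fail with a hint."""
    try:
        # allow_nan=False makes json.dumps raise ValueError for NaN,
        # Infinity and -Infinity instead of writing them out.
        return json.dumps(row, allow_nan=False)
    except ValueError as e:
        # json.dumps may raise ValueError for other reasons as well (for
        # example circular references), so the original message is kept.
        raise ValueError(
            "Failed to encode row %r as RFC-compliant JSON; note that NaN, "
            "Infinity and -Infinity are not allowed (%s)" % (row, e))
```

Calling dumps_strict({"x": float("nan")}) then fails at write time with a 
message that names the offending row, rather than later during the BigQuery load.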




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
