[jira] [Commented] (BEAM-1874) Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: gzip header

2018-03-12 Thread Deepyaman Datta (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16396465#comment-16396465 ]

Deepyaman Datta commented on BEAM-1874:
---

Hi,

I'm getting the same error using both `DirectRunner` and `DataflowRunner`. If I 
run my pipeline on a subset of files in GCS without `Content-Encoding` set, it 
works; if `Content-Encoding` is `gzip`, it fails. I have a mixture of files 
with and without `Content-Encoding` set, and I cannot touch the files.

[~smphhh] How are you removing the headers before reading the files with Beam?

Thanks!

Deepyaman

> Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: 
> gzip header
> -
>
> Key: BEAM-1874
> URL: https://issues.apache.org/jira/browse/BEAM-1874
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: 0.6.0
>Reporter: Samuli Holopainen
>Assignee: Chamikara Jayalath
>Priority: Major
>
> We have gzipped text files in Google Cloud Storage that have the following 
> metadata headers set:
> Content-Encoding: gzip
> Content-Type: application/octet-stream
> Trying to read these with apache_beam.io.ReadFromText yields the following 
> error:
> ERROR:root:Exception while fetching 341565 bytes from position 0 of 
> gs://...-c72fa25a-5d8a-4801-a0b4-54b58c4723ce.gz: Cannot have start index 
> greater than total size
> Traceback (most recent call last):
>   File 
> "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py",
>  line 585, in _fetch_to_queue
> value = func(*args)
>   File 
> "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py",
>  line 610, in _get_segment
> downloader.GetRange(start, end)
>   File 
> "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apitools/base/py/transfer.py",
>  line 477, in GetRange
> progress, end_byte = self.__NormalizeStartEnd(start, end)
>   File 
> "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apitools/base/py/transfer.py",
>  line 340, in __NormalizeStartEnd
> 'Cannot have start index greater than total size')
> TransferInvalidError: Cannot have start index greater than total size
> WARNING:root:Task failed: Traceback (most recent call last):
>   File 
> "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/runners/direct/executor.py",
>  line 300, in __call__
> result = evaluator.finish_bundle()
>   File 
> "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/runners/direct/transform_evaluator.py",
>  line 206, in finish_bundle
> bundles = _read_values_to_bundles(reader)
>   File 
> "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/runners/direct/transform_evaluator.py",
>  line 196, in _read_values_to_bundles
> read_result = [GlobalWindows.windowed_value(e) for e in reader]
>   File 
> "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/concat_source.py",
>  line 79, in read
> range_tracker.sub_range_tracker(source_ix)):
>   File 
> "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/textio.py",
>  line 155, in read_records
> read_buffer)
>   File 
> "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/textio.py",
>  line 245, in _read_record
> sep_bounds = self._find_separator_bounds(file_to_read, read_buffer)
>   File 
> "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/textio.py",
>  line 190, in _find_separator_bounds
> file_to_read, read_buffer, current_pos + 1):
>   File 
> "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/textio.py",
>  line 212, in _try_to_ensure_num_bytes_in_buffer
> read_data = file_to_read.read(self._buffer_size)
>   File 
> "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/fileio.py",
>  line 460, in read
> self._fetch_to_internal_buffer(num_bytes)
>   File 
> "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/fileio.py",
>  line 420, in _fetch_to_internal_buffer
> buf = self._file.read(self._read_size)
>   File 
> "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py",
>  line 472, in read
> return self._read_inner(size=size, readline=False)
>   File 
> "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py",
>  line 516, in _read_inner
> self._fetch_next_if_buffer_exhausted()
>   File 
> "/Users/samuli.holopainen/miniconda2/envs/python-dataflow/lib/python2.7/site-packages/apache_beam/io/gcp/gcsio.py",
>  line 577, in _fetch_next_if_buffer_exhausted
> raise exn
> TransferInvalidError: Cannot have start index greater than total size

[jira] [Commented] (BEAM-1874) Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: gzip header

2017-08-03 Thread Samuli Holopainen (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16112590#comment-16112590 ]

Samuli Holopainen commented on BEAM-1874:
-

Well, I think the issue still persists, i.e. the Python Beam SDK can't read 
files from GCS if they have the Content-Encoding header set. We are currently 
just working around the issue by removing the headers before reading the files 
with Beam.
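
For anyone else hitting this, a rough sketch of that workaround (not taken from 
the thread): clear the Content-Encoding metadata on the affected objects before 
running the Beam job. Bucket and prefix names are placeholders, it assumes the 
google-cloud-storage client, and clearing the field via patch() should be 
verified against your setup.

    from google.cloud import storage

    # Strip the Content-Encoding: gzip header from matching objects.
    # The stored bytes stay gzip-compressed; only the metadata changes,
    # so GCS stops transcoding and Beam can read the .gz files as usual.
    # (Roughly equivalent shell:
    #  gsutil setmeta -h "Content-Encoding:" gs://example-bucket/data/*.gz)
    client = storage.Client()
    bucket = client.bucket('example-bucket')
    for blob in bucket.list_blobs(prefix='data/'):
        if blob.content_encoding == 'gzip':
            blob.content_encoding = None
            blob.patch()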

[jira] [Commented] (BEAM-1874) Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: gzip header

2017-08-02 Thread Ahmet Altay (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16112160#comment-16112160 ]

Ahmet Altay commented on BEAM-1874:
---

Can we close this issue?

[jira] [Commented] (BEAM-1874) Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: gzip header

2017-04-07 Thread Samuli Holopainen (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960484#comment-15960484 ]

Samuli Holopainen commented on BEAM-1874:
-

I was using the DataflowRunner.

[jira] [Commented] (BEAM-1874) Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: gzip header

2017-04-07 Thread Chamikara Jayalath (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960439#comment-15960439 ]

Chamikara Jayalath commented on BEAM-1874:
--

It seems that when a Content-Type is set along with a gzip Content-Encoding, GCS 
automatically decompresses the files on download [1]. I think our GCS reader is 
running into issues reading such files.

I was talking about a secondary issue that could arise when a runner tries to 
split such files. But it seems you are using the direct runner, which does not 
perform splitting.

Assigning to [~charleschen] to look into any gcsio-related issues here.

[1] https://cloud.google.com/storage/docs/transcoding 
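
For reference, a small sketch (placeholder names, using the google-cloud-storage 
client rather than Beam) of how to check whether an object is subject to this 
transcoding:

    from google.cloud import storage

    # With Content-Encoding: gzip set, GCS serves decompressed bytes by
    # default, so the length of the data served no longer matches the
    # stored (compressed) object size that the reader uses for its byte
    # ranges; hence "Cannot have start index greater than total size".
    client = storage.Client()
    blob = client.bucket('example-bucket').get_blob('data/example.gz')
    print(blob.content_encoding)  # 'gzip' -> decompressive transcoding applies
    print(blob.content_type)      # e.g. 'application/octet-stream'
    print(blob.size)              # stored (compressed) size in bytes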

[jira] [Commented] (BEAM-1874) Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: gzip header

2017-04-06 Thread Samuli Holopainen (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960332#comment-15960332 ]

Samuli Holopainen commented on BEAM-1874:
-

What exactly isn't supported? Like I said, reading the gzipped files seems to 
work just fine if we remove the Content-Encoding header.

[jira] [Commented] (BEAM-1874) Google Cloud Storage TextIO read fails with gz-files having Content-Encoding: gzip header

2017-04-06 Thread Chamikara Jayalath (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959683#comment-15959683 ]

Chamikara Jayalath commented on BEAM-1874:
--

The Python SDK currently doesn't support these files, but I believe the Java SDK 
does, by disabling splitting through the isReadSeekEfficient property of the 
file-system abstraction.

https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/fs/MatchResult.java#L98

[~sb2nov] shall we add this property to the Python file-system abstraction as well?
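
Purely as a hypothetical sketch of what that could look like on the Python side 
(illustrative names only; this is not an existing Beam API): file metadata would 
carry a flag that sources consult before splitting, mirroring Java's 
MatchResult.Metadata#isReadSeekEfficient().

    from collections import namedtuple

    # Hypothetical metadata record; field names are illustrative.
    FileMetadata = namedtuple(
        'FileMetadata', ['path', 'size_in_bytes', 'is_read_seek_efficient'])

    def allow_splitting(metadata):
        # A source would refuse to produce sub-ranges for objects that are
        # decompressed on the fly, e.g. GCS objects with Content-Encoding: gzip.
        return metadata.is_read_seek_efficient

    m = FileMetadata('gs://example-bucket/data/example.gz', 341565, False)
    print(allow_splitting(m))  # False: read the whole file in a single range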
