Ben Roling created CRUNCH-690:
---------------------------------
Summary: materialize() directly from Source skips compression and
glob handling
Key: CRUNCH-690
URL: https://issues.apache.org/jira/browse/CRUNCH-690
Project: Crunch
Issue Type: Bug
Components: Core
Affects Versions: 0.14.0
Reporter: Ben Roling
Assignee: Josh Wills
I recently came to notice some oddities that can occur with PCollections
materialized without any DoFns performed.
For example:
{code}
PCollection<String> data = pipeline.read(From.textFile("/data/*.txt"));
Iterable<String> materializedData = data.materialize();
{code}
Results in:
{noformat}
org.apache.crunch.CrunchRuntimeException: java.io.IOException: No files found
to materialize at: /data/*.txt
{noformat}
The globbing in the input path is not evaluated.
Another issue is that decompression is not handled. For example, suppose I do:
{noformat}
PCollection<String> data = pipeline.read(From.textFile("/data/file1.txt.gz"));
Iterable<String> materializedData = data.materialize();
{noformat}
This will succeed, but the Strings in {{materializedData}} will be garbled junk
as the data was never decompressed.
If I run {{parallelDo(IdentityFn.getInstance())}} on the data before
materializing, both problems go away. Globbing and decompression are both
handled.
The code path to read data materialized directly is different than the code
path for data that passes through a DoFn. When the data passes through a DoFn,
all the magic in FileInputFormat happens. If it doesn't pass through a DoFn,
that is skipped.
The latter problem is probably the worse of the two. Potentially there could
be a pipeline that succeeds but silently processes data incorrectly.
I only bumped into this with some perhaps weird stuff I was trying in a test
and I have the workaround of running IdentityFn so a fix is not urgent for me
but I wonder about possibly patching in something to avoid this silent problem.
I'm not going to jump on it immediately but rather at least leave this issue
out here for visibility.
--
This message was sent by Atlassian Jira
(v8.3.2#803003)