[ https://issues.apache.org/jira/browse/CRUNCH-601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429533#comment-15429533 ]
Mikael Goldmann commented on CRUNCH-601:
----------------------------------------
If I understand correctly:
* It is important that p.getSize() > 0 whenever p is not empty; otherwise
processing might be skipped incorrectly.
* Unless p.getSize() == 0 at least sometimes, the branches that skip
computation are never taken and could be removed.
So assume that p is empty and p.getSize() == 0.
Form q = p.parallelDo(dofn), where dofn's process(x, emitter) simply calls
emitter.emit(x) and its cleanup(emitter) calls emitter.emit(something).
Now q is not empty, since it consists of 'something'.
It would seem to be a bug if q.getSize() == 0. However, it looks like the
current implementation, even with this patch applied, would give
q.getSize() == 0.
Am I missing something in my assumptions?
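
To make the scenario concrete, here is a minimal sketch in plain Java. It does
not use Crunch at all: SketchCollection, SketchDoFn and buildQ are hypothetical
stand-ins, and the rule that the output's size estimate is simply copied from
the parent's estimate is my assumption about how the estimate propagates, not a
statement about the actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

final class CleanupSizeSketch {

    /** Toy stand-in for a DoFn: process() runs per element, cleanup() runs once at the end. */
    interface SketchDoFn<S, T> {
        void process(S input, List<T> emitter);
        void cleanup(List<T> emitter);
    }

    /** Toy collection that keeps its real contents and its size estimate separately. */
    static final class SketchCollection<T> {
        final List<T> contents;
        final long estimatedSize;

        SketchCollection(List<T> contents, long estimatedSize) {
            this.contents = contents;
            this.estimatedSize = estimatedSize;
        }

        /** Returns the estimate, which may disagree with the real contents. */
        long getSize() {
            return estimatedSize;
        }

        <U> SketchCollection<U> parallelDo(SketchDoFn<T, U> fn) {
            List<U> out = new ArrayList<>();
            for (T x : contents) {
                fn.process(x, out);
            }
            // cleanup() is called even when there were no input elements.
            fn.cleanup(out);
            // ASSUMPTION: the child's estimate is derived only from the parent's
            // estimate (scale factor 1 here) and ignores what cleanup() emitted.
            return new SketchCollection<>(out, estimatedSize);
        }
    }

    /** Builds q = p.parallelDo(dofn) for an empty p whose estimate is 0. */
    static SketchCollection<String> buildQ() {
        SketchCollection<String> p = new SketchCollection<>(new ArrayList<String>(), 0L);
        SketchDoFn<String, String> dofn = new SketchDoFn<String, String>() {
            @Override
            public void process(String x, List<String> emitter) {
                emitter.add(x); // pass-through
            }

            @Override
            public void cleanup(List<String> emitter) {
                emitter.add("something"); // emitted regardless of input
            }
        };
        return p.parallelDo(dofn);
    }

    public static void main(String[] args) {
        SketchCollection<String> q = buildQ();
        System.out.println("q contents = " + q.contents + ", q.getSize() = " + q.getSize());
    }
}
```

Under that assumption q holds one element while q.getSize() still reports 0,
so any branch that treats getSize() == 0 as "definitely empty" would skip q.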
> Short PCollections in SparkPipeline get length null.
> ----------------------------------------------------
>
> Key: CRUNCH-601
> URL: https://issues.apache.org/jira/browse/CRUNCH-601
> Project: Crunch
> Issue Type: Bug
> Components: Spark
> Affects Versions: 0.13.0
> Environment: Running in local mode on Mac as well as in an Ubuntu
> 14.04 Docker container
> Reporter: Mikael Goldmann
> Assignee: Micah Whitacre
> Priority: Minor
> Attachments: CRUNCH-601.patch, CRUNCH-601b.patch, CRUNCH-601c.patch,
> SmallCollectionLengthTest.java
>
>
> I'll attach a file with a test that I would expect to pass but which fails.
> It creates five PCollection<String> of lengths 0, 1, 2, 3, 4, gets the
> lengths, runs the pipeline, and prints the lengths. Finally it asserts that
> all lengths are non-null.
> I would expect it to print lengths 0, 1, 2, 3, 4 and pass.
> What it does is print lengths null, null, null, 3, 4 and fail.
> I think the underlying reason is that getSize() is called on an unmaterialized
> object, and the code assumes that whenever the estimate it returns is 0 the
> PCollection is guaranteed to be empty, which is false in some cases.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)