[ https://issues.apache.org/jira/browse/PIG-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13582696#comment-13582696 ]
Cheolsoo Park commented on PIG-3169:
------------------------------------
{quote}
they are going back to read intermediate data from the first job and failing
because it has been deleted after the previous store
{quote}
Please correct me if I am wrong, but I have a slightly different understanding.
If you print out the path that causes the failure in the test, it is not the
intermediate output file from a previous job but the input file. Here is what I
did to verify it:
{code}
String fn = Util.generateURI(tmpFile.toString(), pig.getPigContext());
System.out.println("blah: "+ fn);
pig.registerQuery("A = LOAD '" + fn + "';");
pig.registerQuery("Split A into A1 if $0<=10, A2 if $0>10;");
pig.registerQuery("Store A1 into '" +
FileLocalizer.getTemporaryPath(pigContext) + "';");
pig.openIterator("A2");
{code}
Now here is the test log:
{code}
blah: hdfs://localhost.localdomain:45150/tmp/temp2039910329/tmp-2001898725
...
ERROR 2118: Input path does not exist:
hdfs://localhost.localdomain:45150/tmp/temp2039910329/tmp-2001898725
{code}
As the log shows, the input file (/tmp/temp2039910329/tmp-2001898725) is deleted
after the 1st job, and that makes the 2nd job fail. In fact, I can verify this
by running explain on A1 and A2 as well. For example, if I do the following in
Pig,
{code}
A = LOAD '1.txt';
Split A into A1 if $0<=10, A2 if $0>10;
Explain A1;
Explain A2;
{code}
It gives me the following output, and I can see both A1 and A2 share the same
input:
{code}
A1: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-7
|
|---A1: Filter[bag] - scope-2
| |
| Less Than or Equal[boolean] - scope-6
| |
| |---Cast[int] - scope-4
| | |
| | |---Project[bytearray][0] - scope-3
| |
| |---Constant(10) - scope-5
|
|---A: Load(file:///home/cheolsoo/workspace/pig-git-2/1.txt:org.apache.pig.builtin.PigStorage) - scope-0--------
...
A2: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-19
|
|---A2: Filter[bag] - scope-14
| |
| Greater Than[boolean] - scope-18
| |
| |---Cast[int] - scope-16
| | |
| | |---Project[bytearray][0] - scope-15
| |
| |---Constant(10) - scope-17
|
|---A: Load(file:///home/cheolsoo/workspace/pig-git-2/1.txt:org.apache.pig.builtin.PigStorage) - scope-12--------
{code}
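The failure mode can also be illustrated outside of Pig. The following is only a minimal plain-Java sketch, not Pig code: the class `SharedInputSketch` and its methods are hypothetical, and it assumes that the cleanup step removes the shared input right after the first job finishes. Both "jobs" read the same file, just as A1 and A2 share the same Load in the plans above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical illustration only; none of these names exist in Pig.
class SharedInputSketch {

    // Stand-in for one job: load the input and keep the rows that match.
    static List<Integer> runJob(Path input, Predicate<Integer> keep) throws IOException {
        return Files.readAllLines(input).stream()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .map(Integer::parseInt)
                .filter(keep)
                .collect(Collectors.toList());
    }

    // Returns true if the second job fails because its input is gone.
    static boolean secondJobFailsAfterCleanup() {
        try {
            Path input = Files.createTempFile("tmp-input", ".txt");
            Files.write(input, Arrays.asList("5", "10", "15", "20"));

            // 1st job: corresponds to storing A1 ($0 <= 10).
            runJob(input, v -> v <= 10);

            // Assumed over-eager cleanup: the input is treated as a temp file.
            Files.delete(input);

            try {
                // 2nd job: corresponds to opening an iterator on A2 ($0 > 10).
                runJob(input, v -> v > 10);
                return false;
            } catch (NoSuchFileException e) {
                // Analogous to "Input path does not exist" in the test log.
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("second job failed: " + secondJobFailsAfterCleanup());
    }
}
```

In this sketch the second read fails on the input path itself, not on any intermediate output of the first job, which matches what the test log reports.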
Sorry for the long message, but I wanted to make sure that we are on the same
page before deciding how to fix this.
> Remove temporary files that are not needed
> ------------------------------------------
>
> Key: PIG-3169
> URL: https://issues.apache.org/jira/browse/PIG-3169
> Project: Pig
> Issue Type: Improvement
> Reporter: Mark Wagner
> Assignee: Mark Wagner
> Priority: Minor
> Fix For: 0.12
>
> Attachments: PIG-3169.1.patch, PIG-3169-hotfix.patch
>
>
> When using Grunt, intermediate data and distributed cache files are left in
> 'pig.temp.dir' until the session is closed. It would be nice to clean up files
> as they are no longer needed.