[jira] Commented: (PIG-1458) aggregate files for replicated join

2010-08-30 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904346#action_12904346
 ] 

Koji Noguchi commented on PIG-1458:
---

Can we increase the replication to 10 for the aggregated file (if not already 
done)?

 aggregate files for replicated join
 ---

 Key: PIG-1458
 URL: https://issues.apache.org/jira/browse/PIG-1458
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1458.patch, PIG-1458_1.patch


 We have noticed that if the smaller input of a replicated join consists of many 
 files, this puts an unneeded burden on the name node. Pre-aggregating the files 
 can improve the situation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1458) aggregate files for replicated join

2010-08-30 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904358#action_12904358
 ] 

Thejas M Nair commented on PIG-1458:


+1




[jira] Commented: (PIG-1458) aggregate files for replicated join

2010-08-30 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904385#action_12904385
 ] 

Richard Ding commented on PIG-1458:
---

Koji,

Please open a JIRA for increasing the replication factor of the replicated 
files. Currently they use the default replication factor. 

Thanks,
-Richard 




[jira] Commented: (PIG-1458) aggregate files for replicated join

2010-08-30 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904451#action_12904451
 ] 

Richard Ding commented on PIG-1458:
---

Patch committed to trunk.




[jira] Commented: (PIG-1458) aggregate files for replicated join

2010-08-28 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903892#action_12903892
 ] 

Thejas M Nair commented on PIG-1458:


+1
Looks good. Some minor comments - 

- If the preceding op is a native MR job (for the native mapreduce operator), we 
don't know how many reducers will be run, so Pig should use the 
pig.frjoin.merge.files.optimistic property in that case. For a native MR job, the 
map plan will be empty, so currently the check for the number of roots will return 
false.

- If one input path has been found to have too many files, we can stop there 
and avoid checking the other inputs. For example, this:
{code}
  } else if (!frJoinOptimisticFileMerge) {
      // file doesn't exist yet. Treat it as having too many
      // files
      numFiles = frJoinFileMergeThreshold;
  }
{code}
could become:
{code}
  } else if (!frJoinOptimisticFileMerge) {
      // file doesn't exist yet. Treat it as having too many
      // files
      return true;
  }
{code}
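
The early exit suggested above could look roughly like the following self-contained Java sketch. The class and method names, the use of -1 to mark an input that does not exist yet, and the assumption that file counts are accumulated across inputs are all hypothetical; the actual patch may structure the check differently.

```java
// Hypothetical sketch of the early-exit check; not the code from the patch.
class FrJoinFileCountCheck {

    // fileCounts holds one entry per input of the replicated side:
    // the number of files found, or -1 if the input does not exist yet
    // (the -1 sentinel is an assumption of this sketch).
    static boolean hasTooManyFiles(int[] fileCounts, int threshold, boolean optimistic) {
        int numFiles = 0;
        for (int count : fileCounts) {
            if (count < 0) {
                if (!optimistic) {
                    // Input doesn't exist yet: treat it as having too many
                    // files and stop without looking at the other inputs.
                    return true;
                }
                continue; // optimistic mode: ignore inputs we cannot count
            }
            numFiles += count;
            if (numFiles > threshold) {
                // Threshold already exceeded, so the remaining inputs
                // do not need to be checked.
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasTooManyFiles(new int[] {3, -1, 500}, 100, false));
        System.out.println(hasTooManyFiles(new int[] {3, -1, 5}, 100, true));
    }
}
```

Returning as soon as the answer is known avoids listing the remaining inputs, which matters here because each listing is another request to the name node.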






[jira] Commented: (PIG-1458) aggregate files for replicated join

2010-08-28 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903895#action_12903895
 ] 

Thejas M Nair commented on PIG-1458:


Another comment about the patch -
- The test testUnknownNumMaps2 is the same as testUnknownNumMaps; it should be 
removed.


A note about the 2nd case described in first comment -
bq. 2.  The right input is a map-only job and input files do not exist at the 
compile time.

When the input does not exist for the input map-only job, in most (or all?) cases 
it would be possible to determine the number of files by looking at the 
previous MR operator (or the ones before that).
Also, with the current implementation, since the checks for the number of files are 
done before the MR jobs are merged together, there will be cases where 
the final plan has only one MR job with existing input for the replicated input 
and Pig still treats it as case 2.

The example used in testUnknownNumMaps() has only one input MR job with inputs 
that exist at compile time, but if pig.frjoin.merge.files.optimistic=false, Pig 
will create an additional MR job that combines the input -
{code}
A = LOAD ' + INPUT_FILE + ' as (x:int,y:int);
B = Filter A by x  50;
C = join A by $0, B by $0 using 'repl';
{code}





[jira] Commented: (PIG-1458) aggregate files for replicated join

2010-08-28 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903898#action_12903898
 ] 

Thejas M Nair commented on PIG-1458:


What I described under 'A note about the 2nd case described in first comment -' 
in the previous comment is a change that can be done as part of a separate JIRA.





[jira] Commented: (PIG-1458) aggregate files for replicated join

2010-08-11 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897451#action_12897451
 ] 

Richard Ding commented on PIG-1458:
---

The proposal is to run another map-reduce job to merge the small files before 
the replicated join. This additional job will be added to the MR plan at 
compile time.

We consider three cases of a replicated join: 

# The right input is a map-only job and input files exist at the compile time.
# The right input is a map-only job and input files do not exist at the compile 
time.
# The right input is a map-reduce job.

For 1., if the number of files exceeds the threshold specified in the property 
file (_pig.frjoin.merge.files.threshold_), a merge job is added between the right 
input job and the FR join job.

For 3., if the number of reducers exceeds the threshold specified in the 
property file (_pig.frjoin.merge.files.threshold_), a merge job is added 
between the right input job and the FR join job.

For 2., if the flag specified in the property file 
(_pig.frjoin.merge.files.optimistic_) is false, a merge job is added between 
the right input job and the FR join job. The default value of this flag is false. 
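
As a rough illustration of the three cases above, the decision could be sketched in Java as follows. This is not the code from the patch: the class, enum, and method names are hypothetical, and the fallback threshold of 100 is an assumption made only so the sketch runs. Only the two property names and the default of false for the optimistic flag come from the proposal.

```java
import java.util.Properties;

// Hypothetical sketch of the merge-job decision; not the code from the patch.
class FrJoinMergeDecision {

    // How the right (replicated) input of the join is produced.
    enum RightInput {
        MAP_ONLY_FILES_EXIST,    // case 1
        MAP_ONLY_FILES_MISSING,  // case 2
        MAP_REDUCE               // case 3
    }

    static final String THRESHOLD_KEY = "pig.frjoin.merge.files.threshold";
    static final String OPTIMISTIC_KEY = "pig.frjoin.merge.files.optimistic";

    // count is the number of input files (case 1) or reducers (case 3);
    // it is ignored in case 2, where the files do not exist yet.
    static boolean needsMergeJob(RightInput kind, int count, Properties props) {
        // The fallback of 100 is an assumption of this sketch.
        int threshold = Integer.parseInt(props.getProperty(THRESHOLD_KEY, "100"));
        // The proposal states that this flag defaults to false.
        boolean optimistic =
                Boolean.parseBoolean(props.getProperty(OPTIMISTIC_KEY, "false"));
        switch (kind) {
            case MAP_ONLY_FILES_EXIST:
            case MAP_REDUCE:
                return count > threshold;
            case MAP_ONLY_FILES_MISSING:
                return !optimistic;
            default:
                throw new IllegalArgumentException("unknown input kind: " + kind);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(THRESHOLD_KEY, "10");
        System.out.println(needsMergeJob(RightInput.MAP_ONLY_FILES_EXIST, 25, props));
        System.out.println(needsMergeJob(RightInput.MAP_REDUCE, 5, props));
        System.out.println(needsMergeJob(RightInput.MAP_ONLY_FILES_MISSING, 0, props));
    }
}
```

Cases 1 and 3 share one branch because both compare a count against the same threshold; only the meaning of the count (files versus reducers) differs.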






[jira] Commented: (PIG-1458) aggregate files for replicated join

2010-08-11 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897484#action_12897484
 ] 

Richard Ding commented on PIG-1458:
---

For 1. and 2. above, another approach is to do nothing and rely on 
MultiFileInputFormat (PIG-1518) to merge small files. 
