[jira] Updated: (PIG-1218) Use distributed cache to store samples

2010-02-19 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-1218:


  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

Committed patch PIG-1218_2.patch since the merge join changes need to be 
re-worked and will be handled in a different patch.

Thanks Richard!

> Use distributed cache to store samples
> --
>
> Key: PIG-1218
> URL: https://issues.apache.org/jira/browse/PIG-1218
> Project: Pig
>  Issue Type: Improvement
>Reporter: Olga Natkovich
>Assignee: Richard Ding
> Fix For: 0.7.0
>
> Attachments: PIG-1218.patch, PIG-1218_2.patch, PIG-1218_3.patch
>
>
> Currently, in the case of skew join and order by, we use a sample that is just 
> written to the DFS (not the distributed cache) and, as a result, gets opened and 
> copied around more than necessary. This impacts query performance and also 
> places unnecessary load on the name node.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1218) Use distributed cache to store samples

2010-02-18 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1218:
--

Attachment: PIG-1218_3.patch

Patch 3 includes all of patch 2 plus distributed cache support for the merge 
join's index file (PIG-1079).




[jira] Updated: (PIG-1218) Use distributed cache to store samples

2010-02-18 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1218:
--

Attachment: (was: PIG-1218_2.patch)




[jira] Updated: (PIG-1218) Use distributed cache to store samples

2010-02-18 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1218:
--

Attachment: PIG-1218_2.patch

Updated the patch to address the comments of Pradeep and Ashutosh.




[jira] Updated: (PIG-1218) Use distributed cache to store samples

2010-02-16 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1218:
--

Attachment: PIG-1218_2.patch

The second patch is for the LSR branch and is ready for review.




[jira] Updated: (PIG-1218) Use distributed cache to store samples

2010-02-10 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1218:
--

Attachment: PIG-1218.patch

This patch uses Hadoop DistributedCache to cache the sample files used by order 
by and skewed join, as well as the side files used in FR join.

When an HDFS file is added to the DistributedCache, Pig generates a symlink to 
the file and, at runtime, this symlink is used to open the file from the local 
working directory of the task. To avoid symlink collisions, the symlink name is 
generated from a combination of the hashcode of the file path and the current 
timestamp, rather than from the file name.
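
As a rough illustration of the naming scheme described above (not the code in 
the patch), a symlink name could be derived along these lines. The class name, 
method names, the "pigsample" prefix and the example HDFS path are all 
hypothetical; the sketch uses only java.net.URI so it runs without Hadoop on 
the classpath. The symlink is carried in the URI fragment (the part after 
'#'), which is the form Hadoop's DistributedCache expects when symlinks are 
enabled.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch; names and prefix are illustrative, not taken from the patch.
class SymlinkNameSketch {

    // Combine the hash code of the file path with a timestamp so that two
    // cached files sharing the same base name do not get the same symlink.
    static String symlinkName(String hdfsPath, String prefix, long timestamp) {
        // toHexString treats the int as unsigned, so no minus sign can appear.
        return prefix + "_" + Integer.toHexString(hdfsPath.hashCode()) + "_" + timestamp;
    }

    // Attach the symlink name as the URI fragment: hdfs://host/path#symlink.
    static URI withSymlink(String hdfsUri, String symlink) {
        URI base = URI.create(hdfsUri);
        try {
            return new URI(base.getScheme(), base.getAuthority(),
                           base.getPath(), null, symlink);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build cache URI for " + hdfsUri, e);
        }
    }

    public static void main(String[] args) {
        String path = "hdfs://namenode:8020/tmp/pig/sample_0001"; // assumed example path
        String link = symlinkName(path, "pigsample", System.currentTimeMillis());
        URI cacheUri = withSymlink(path, link);
        // With Hadoop available, this URI would be passed to
        // DistributedCache.addCacheFile(cacheUri, conf) after calling
        // DistributedCache.createSymlink(conf); the task would then open the
        // file through the relative path "./" + link in its working directory.
        System.out.println(cacheUri);
    }
}
```

Passing the timestamp in as a parameter, rather than reading the clock inside 
symlinkName, is a choice made here only to keep the sketch easy to test.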

The replication factor for the sample file in HDFS is not changed with this 
patch. The reasons are that it is not clear what factor it should be increased 
to, and the work to implement the change in Pig is not trivial. 






[jira] Updated: (PIG-1218) Use distributed cache to store samples

2010-02-10 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1218:
--

Status: Patch Available  (was: Open)
