[jira] Created: (PIG-861) POJoinPackage lose tuple in large dataset

2009-06-23 Thread Daniel Dai (JIRA)
POJoinPackage lose tuple in large dataset
-

 Key: PIG-861
 URL: https://issues.apache.org/jira/browse/PIG-861
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.4.0


Some scripts using POJoinPackage lose records when processing large amounts of 
input data. We do not see this problem with smaller inputs. We can reproduce 
the problem, but the dataset for the test case is too big to include here. 
We suspect that POJoinPackage causes the problem.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



asking for comments on benchmark queries

2009-06-23 Thread Zheng Shao
Hi Pig team,

We'd like to get your feedback on a set of queries we implemented on Pig.

We've attached the hadoop configuration and pig queries to this email. We start 
the queries by issuing "pig xxx.pig". The queries are from the SIGMOD 2009 paper. 
More details are at https://issues.apache.org/jira/browse/HIVE-396 (Shall we 
open a JIRA on Pig for this?)


One improvement is that we are going to change hadoop to use LZO as the 
intermediate compression algorithm very soon. Previously we used gzip for all 
performance tests, including hadoop, hive, and pig.

The reason we specify the number of reducers in the query is to match the 
number of reducers that Hive automatically suggested. Please let us know the 
best way to set the number of reducers in Pig.

Are there any other improvements we can make to the Pig query and the hadoop 
configuration?

Thanks,
Zheng








	
dfs.balance.bandwidthPerSec = 10485760
    (maximum bandwidth, in bytes per second, that each datanode can use for
    balancing)
dfs.name.dir = /dfs.metadata
fs.default.name = hdfs://namenode.example.com:8020
mapred.job.tracker = jobtracker.example.com:50029
mapred.min.split.size = 65536
dfs.replication = 3
mapred.reduce.copy.backoff = 5
io.sort.factor = 100
mapred.reduce.parallel.copies = 25
io.sort.mb = 200
dfs.data.dir = /hdfs
mapred.local.dir = /mapred/local
dfs.namenode.handler.count = 40
io.file.buffer.size = 32768
dfs.datanode.du.reserved = 102400
fs.trash.root = /Trash
fs.trash.interval = 1440
mapred.linerecordreader.maxlength = 100
dfs.block.size = 134217728
mapred.tasktracker.dns.interface = eth0
dfs.datanode.dns.interface = eth0
webinterface.private.actions = true
mapred.reduce.tasks.speculative.execution = false
mapred.speculative.map.gap = 0.9
mapred.child.java.opts = -Xmx1024m -Djava.net.preferIPv4Stack=true
mapred.speculative.execution = false
dfs.safemode.threshold.pct = 1
    (percentage of blocks that should satisfy the minimal replication
    requirement defined by dfs.replication.min; values less than or equal to 0
    mean do not start in safe mode, values greater than 1 make safe mode
    permanent)
dfs.permissions = false
    (if "true", permission checking in HDFS is enabled; if "false", permission
    checking is turned off but all other behavior is unchanged; switching from
    one value to the other does not change the mode, owner, or group of files
    or directories)
mapred.output.compress = true
mapred.compress.map.output = true
    (compress the outputs of the maps before they are sent across the network;
    uses SequenceFile compression)
mapred.map.output.compression.type = BLOCK
    (if the map outputs are to be compressed, how they should be compressed:
    one of NONE, RECORD, or BLOCK)
mapred.output.compression.codec = org.apache.hadoop.io.compress.GzipCodec
mapred.output.compression.type = BLOCK
mapred.map.output.compression.codec = org.apache.hadoop.io.compress.GzipCodec
    (if the job outputs are compressed, how they should be compressed)
mapred.tasktracker.map.tasks.maximum = 5
    (maximum number of map tasks run simultaneously by a task tracker)
mapred.tasktracker.reduce.tasks.maximum = 5
    (maximum number of reduce tasks run simultaneously by a task tracker)
fs.checkpoint.dir = /dfs/namesecondary
    (where on the local filesystem the DFS secondary name node stores the
    temporary images and edits to merge)
mapred.system.dir = /mapred/system/prod
    (shared HDFS directory where MapReduce stores control files)
mapred.temp.dir = mapred/temp
    (shared HDFS directory for temporary files)
mapred.jobtracker.completeuserjobs.maximum = 10
    (maximum number of complete jobs per user to keep around before delegating
    them to the job history)
hadoop.job.history.user.location = none
    (users can specify a location to store the history files of a particular
    job; if nothing is specified, the logs are stored under "_logs/history/"
    in the output directory; the value "none" disables logging)
mapred.jobtracker.taskScheduler = org.apache.hadoop.mapred.FairScheduler

RE: asking for comments on benchmark queries

2009-06-23 Thread Zheng Shao
By the way, just for clarification, these queries are used for gathering 
performance data.

Zheng
From: Zheng Shao
Sent: Monday, June 22, 2009 10:37 PM
To: 'pig-dev@hadoop.apache.org'
Subject: asking for comments on benchmark queries




[jira] Commented: (PIG-832) Make import list configurable

2009-06-23 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723142#action_12723142
 ] 

Olga Natkovich commented on PIG-832:


Hi Daniel,

The patch looks good.

One comment - I think the Yahoo line that you commented out should be removed.

One question - the way this is implemented, the builtins will take precedence 
over user-defined functions in case of a conflict. I think this is the right 
approach - overriding builtins should be explicit via fully qualified names - 
but I wanted to see what others thought.

> Make import list configurable
> -
>
> Key: PIG-832
> URL: https://issues.apache.org/jira/browse/PIG-832
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Olga Natkovich
>Assignee: Daniel Dai
> Fix For: 0.4.0
>
> Attachments: PIG-832-1.patch, PIG-832-2.patch
>
>
> Currently, it is hardwired in PigContext.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-832) Make import list configurable

2009-06-23 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723143#action_12723143
 ] 

Olga Natkovich commented on PIG-832:


+1 once we answer/resolve issues above


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported

2009-06-23 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723194#action_12723194
 ] 

Ashutosh Chauhan commented on PIG-773:
--

Santhosh, thanks for the review.

1. Will fix this in the new patch.
2. The test passes when it should fail. There seems to be an issue with how Bag 
handles its schema; will investigate further.
3. Will include test cases that check for the existence of the constants in the plan.


> Empty complex constants (empty bag, empty tuple and empty map) should be 
> supported
> --
>
> Key: PIG-773
> URL: https://issues.apache.org/jira/browse/PIG-773
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Pradeep Kamath
>Assignee: Ashutosh Chauhan
>Priority: Minor
> Attachments: pig-773.patch, pig-773_v2.patch
>
>
> We should be able to create empty bag constant using {}, empty tuple constant 
> using (), empty map constant using [] within a pig script

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: asking for comments on benchmark queries

2009-06-23 Thread Alan Gates

Zheng,

I don't think you're subscribed to pig-dev (your emails have been  
bouncing to the moderator).  So I've cc'd you explicitly on this.


I don't think we need a Pig JIRA, it's probably easier if we all work  
on the hive one.  I'll post my comments on the various scripts to that  
bug.  I've also attached them here since pig-dev won't see the updates  
to that bug.


Alan.

grep_select.pig:

Adding types in the LOAD statement will force Pig to cast the key  
field, even though it doesn't need to (it only reads and writes the  
key field).  So I'd change the query to be:


rmf output/PIG_bench/grep_select;
a = load '/data/grep/*' using PigStorage as (key,field);
b = filter a by field matches '.*XYZ.*';
store b into 'output/PIG_bench/grep_select';

field will still be cast to a chararray for the matches, but we won't  
waste time casting key and then turning it back into bytes for the  
store.


rankings_select.pig:

Same comment, remove the casts.  pagerank will be properly cast to an  
integer.


rmf output/PIG_bench/rankings_select;
a = load '/data/rankings/*' using PigStorage('|') as  
(pagerank,pageurl,aveduration);

b = filter a by pagerank > 10;
store b into 'output/PIG_bench/rankings_select';

rankings_uservisits_join.pig:

Here you want to keep the cast of pagerank so that it is handled as  
the right type.  adRevenue will default to double in SUM when you  
don't specify a type.  You also want to project out all unneeded  
columns as soon as possible.  You should set PARALLEL on the join to  
use the number of reducers appropriate for your cluster.  Given that  
you have 10 machines, 5 reduce slots per machine, and speculative  
execution off, you probably want 50 reducers.  I notice you set  
parallel to 60 on the group by; that will give you 10 trailing  
reducers.  Unless you need the result to be split 60 ways, you should  
reduce that to 50 as well.  (I'm assuming that when you say you have a  
10 node cluster you mean 10 data nodes, not counting your name node  
and job tracker.  The reduce formula should be 5 * the number  
of data nodes.)
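As a sketch, the sizing rule above works out as follows (hypothetical helper names, not part of Pig or Hadoop):

```java
// Sketch of the reducer-count rule described above: with speculative
// execution off, use (reduce slots per node) * (number of data nodes).
public class ReducerSizing {
    static int suggestedReducers(int dataNodes, int reduceSlotsPerNode) {
        return dataNodes * reduceSlotsPerNode;
    }

    public static void main(String[] args) {
        // The cluster discussed in this thread: 10 data nodes, 5 reduce
        // slots each, which matches the PARALLEL 50 recommendation.
        System.out.println(suggestedReducers(10, 5));
    }
}
```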


A last question is how large are the uservisits and rankings data  
sets?  If either is < 80M or so you can use the fragment/replicate  
join, which is much faster than the general join.  The following  
script assumes that isn't the case; but if it is let me know and I can  
show you the syntax for it.


So the end query looks like:

rmf output/PIG_bench/html_join;
a = load '/data/uservisits/*' using PigStorage('|') as
    (sourceIP,destURL,visitDate,adRevenue,userAgent,countryCode,languageCode,searchWord,duration);
b = load '/data/rankings/*' using PigStorage('|') as  
(pagerank:int,pageurl,aveduration);

c = filter a by visitDate > '1999-01-01' AND visitDate < '2000-01-01';
c1 = foreach c generate sourceIP, destURL, adRevenue;
b1 = foreach b generate pagerank, pageurl;
d = JOIN c1 by destURL, b1 by pageurl parallel 50;
d1 = foreach d generate sourceIP, pagerank, adRevenue;
e = group d1 by sourceIP parallel 50;
f = FOREACH e GENERATE group, AVG(d1.pagerank), SUM(d1.adRevenue);
store f into 'output/PIG_bench/html_join';

uservisits_aggre.pig:

Same comments as above on projecting out as early as possible and on  
setting parallel appropriately for your cluster.


rmf output/PIG_bench/uservisits_aggre;
a = load '/data/uservisits/*' using PigStorage('|') as
    (sourceIP,destURL,visitDate,adRevenue,userAgent,countryCode,languageCode,searchWord,duration);

a1 = foreach a generate sourceIP, adRevenue;
b = group a1 by sourceIP parallel 50;
c = FOREACH b GENERATE group, SUM(a1.adRevenue);
store c into 'output/PIG_bench/uservisits_aggre';



On Jun 22, 2009, at 10:36 PM, Zheng Shao wrote:








[jira] Commented: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-23 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723204#action_12723204
 ] 

Alan Gates commented on PIG-820:


+1

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
> another loader
> -
>
> Key: PIG-820
> URL: https://issues.apache.org/jira/browse/PIG-820
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.3.0, 0.4.0
>Reporter: Alan Gates
>Assignee: Ashutosh Chauhan
> Fix For: 0.4.0
>
> Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch
>
>
> Currently a sampling job requires that data already be stored in 
> BinaryStorage format, since RandomSampleLoader extends BinaryStorage.  For 
> order by this
> has mostly been acceptable, because users tend to use order by at the end of 
> their script where other MR jobs have already operated on the data and thus it
> is already being stored in BinaryStorage.  For pig scripts that just did an 
> order by, an entire MR job is required to read the data and write it out
> in BinaryStorage format.
> As we begin work on join algorithms that will require sampling, this 
> requirement to read the entire input and write it back out will not be 
> acceptable.
> Join is often the first operation of a script, and thus is much more likely 
> to trigger this useless up front translation job.
> Instead RandomSampleLoader can be changed to subsume an existing loader, 
> using the user specified loader to read the tuples while handling the skipping
> between tuples itself.  This will require the subsumed loader to implement a 
> Samplable Interface, that will look something like:
> {code}
> public interface SamplableLoader extends LoadFunc {
> 
> /**
>  * Skip ahead in the input stream.
>  * @param n number of bytes to skip
>  * @return number of bytes actually skipped.  The return semantics are
>  * exactly the same as {@link java.io.InputStream#skip(long)}
>  */
> public long skip(long n) throws IOException;
> 
> /**
>  * Get the current position in the stream.
>  * @return position in the stream.
>  */
> public long getPosition() throws IOException;
> }
> {code}
> The MRCompiler would then check if the loader being used to load data 
> implemented the SamplableLoader interface.  If so, rather than create an 
> initial MR
> job to do the translation it would create the sampling job, having 
> RandomSampleLoader use the user specified loader.
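As an illustration of how the proposed interface could be used, here is a hypothetical sketch (not Pig code; the class name and the newline-resync policy are assumptions) of a sampler that skips a gap of bytes between samples and then resynchronizes on record boundaries:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a sampler over newline-delimited records that skips
// a fixed gap of bytes between samples, then scans to the next '\n' so each
// sample starts on a record boundary, as the proposed skip() would allow.
public class SkipSampler {

    // Skip 'gap' bytes, discard the partial record we landed in, then return
    // the following full record, or null once the stream is exhausted.
    static String nextSample(InputStream in, long gap) throws IOException {
        in.skip(gap);                       // may skip fewer bytes near EOF
        int c;
        do {                                // resync: discard partial record
            c =;
            if (c < 0) return null;
        } while (c != '\n');
        StringBuilder sb = new StringBuilder();
        while ((c = >= 0 && c != '\n') {
            sb.append((char) c);
        }
        return (sb.length() == 0 && c < 0) ? null : sb.toString();
    }

    public static void main(String[] args) throws IOException {
        StringBuilder data = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            data.append("record-").append(i).append('\n');
        }
        InputStream in = new ByteArrayInputStream(data.toString().getBytes());
        List<String> samples = new ArrayList<String>();
        String s;
        while ((s = nextSample(in, 50)) != null) {
            samples.add(s);
        }
        // Only a subset of the 100 records is ever materialized as a tuple.
        System.out.println(samples.size() + " samples, first=" + samples.get(0));
    }
}
```

In the real proposal the subsumed LoadFunc would parse the record instead of this newline scan; the point is that skip() plus getPosition() let the sampler jump through the input without reading every record.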

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-862) Pig Site - 0.3.0 updates

2009-06-23 Thread Corinne Chandel (JIRA)
Pig Site - 0.3.0 updates


 Key: PIG-862
 URL: https://issues.apache.org/jira/browse/PIG-862
 Project: Pig
  Issue Type: Task
  Components: documentation
Affects Versions: 0.3.0
Reporter: Corinne Chandel


Updates for Pig Site
> change home tab to project tab
> added search bar
> cleaned up logo image

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-862) Pig Site - 0.3.0 updates

2009-06-23 Thread Corinne Chandel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Corinne Chandel updated PIG-862:


Attachment: PIG-862.patch

Patch file.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-862) Pig Site - 0.3.0 updates

2009-06-23 Thread Corinne Chandel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Corinne Chandel updated PIG-862:


Status: Patch Available  (was: Open)

Apply patch to this branch: https://svn.apache.org/repos/asf/hadoop/pig/site

Note: No new test code required; changes to documentation only.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-23 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723240#action_12723240
 ] 

Pradeep Kamath commented on PIG-820:


Some review comments:
In SampleOptimizer.java, 
{noformat}
LoadFunc lf = (LoadFunc)PigContext.instantiateFuncFromSpec(predLoad.getLFile().getFuncName());
should be changed to
LoadFunc lf = (LoadFunc)PigContext.instantiateFuncFromSpec(predLoad.getLFile().getFuncSpec());
{noformat}
This is so that we correctly handle loaders which do not have a default 
constructor. FuncSpec encapsulates both the classname and the constructor 
arguments, and hence handles both loaders that have only a default constructor 
and those whose constructors take args.

Similarly
{noformat}  
fs = new FileSpec(predFs.getFileName(), new FuncSpec(predFs.getFuncName()));
should be changed to
  fs = new FileSpec(predFs.getFileName(), predFs.getFuncSpec());
{noformat}

Likewise, the constructor of RandomSampleLoader should take a FuncSpec object 
as its first argument to represent the loader classname and constructor args. 
This will require callers who create RandomSampleLoader to create it with the 
correct FuncSpec objects.
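To illustrate the point (with made-up names, not Pig's actual FuncSpec API): a spec that carries constructor arguments alongside the class name can instantiate loaders that lack a default constructor, which a bare classname cannot.

```java
import java.lang.reflect.Constructor;

// Hypothetical sketch of the review comment above: a spec that carries both
// the class name and its constructor arguments, so loaders without a default
// constructor can still be instantiated. Names are illustrative only.
public class FuncSpecSketch {

    // Instantiate className, passing ctorArgs to a (String, ...) constructor
    // when args are present, or the default constructor when they are not.
    static Object instantiate(String className, String... ctorArgs) throws Exception {
        Class<?> cls = Class.forName(className);
        if (ctorArgs.length == 0) {
            return cls.getDeclaredConstructor().newInstance();
        }
        Class<?>[] types = new Class<?>[ctorArgs.length];
        java.util.Arrays.fill(types, String.class);
        Constructor<?> ctor = cls.getDeclaredConstructor(types);
        return ctor.newInstance((Object[]) ctorArgs);
    }

    // A stand-in for a loader like PigStorage('|'): no default constructor.
    public static class DelimLoader {
        final String delim;
        public DelimLoader(String delim) { this.delim = delim; }
    }

    public static void main(String[] args) throws Exception {
        // With constructor args carried in the spec, this succeeds; by class
        // name alone it would throw NoSuchMethodException (no default ctor).
        DelimLoader l = (DelimLoader) instantiate(
                FuncSpecSketch.class.getName() + "$DelimLoader", "|");
        System.out.println(l.delim);
    }
}
```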






-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-856) PERFORMANCE: reduce number of replicas

2009-06-23 Thread Milind Bhandarkar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723269#action_12723269
 ] 

Milind Bhandarkar commented on PIG-856:
---

Replication of 2 is 17% faster than replication of 3 for the sort benchmark. 
But the sort benchmark does not have any computation in the mappers or reducers, 
so the percentage improvement for Pig will certainly be much smaller.
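A back-of-the-envelope model of that argument (an assumed Amdahl-style model with illustrative numbers, not measurements): if writing replicated output is only a fraction of total job time, the 17% seen on the pure-I/O sort shrinks proportionally.

```java
// Assumed model: only the replicated-write portion of the job speeds up,
// so the overall gain is the I/O fraction times the I/O speedup.
public class ReplicationEstimate {
    static double overallSpeedup(double ioFraction, double ioSpeedup) {
        return ioFraction * ioSpeedup;
    }

    public static void main(String[] args) {
        // Sort benchmark: essentially all I/O, so the full 17% shows up.
        System.out.println(overallSpeedup(1.0, 0.17));
        // A compute-heavy Pig job where writes are, say, 30% of the runtime
        // (illustrative figure) would see roughly a 5% improvement.
        System.out.println(overallSpeedup(0.3, 0.17));
    }
}
```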

> PERFORMANCE: reduce number of replicas
> --
>
> Key: PIG-856
> URL: https://issues.apache.org/jira/browse/PIG-856
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Olga Natkovich
>
> Currently Pig uses the default number of replicas between MR jobs; at the 
> moment that number is 3. Given the temp nature of the data, we should never 
> need more than 2, and should explicitly set it to improve performance and to 
> be nicer to the name node.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: requirements for Pig 1.0?

2009-06-23 Thread Alan Gates
I don't believe there's a solid list of want to haves for 1.0.  The  
big issue I see is that there are too many interfaces that are still  
shifting, such as:


1) Data input/output formats.  The way we do slicing (that is, user  
provided InputFormats) and the equivalent outputs aren't yet solid.   
They are still too tied to load and store functions.  We need to break  
those out and understand how they will be expressed in the language.  
Related to this is the semantics of how Pig interacts with non-file  
based inputs and outputs.  We have a suggestion of moving to URLs, but  
we haven't finished test driving this to see if it will really be what  
we want.


2) The memory model.  While technically the choices we make on how to  
represent things in memory are internal, the reality is that these  
changes may affect the way we read and write tuples and bags, which in  
turn may affect our load, store, eval, and filter functions.


3) SQL.  We're working on introducing SQL soon, and it will take it a  
few releases to be fully baked.


4) Much better error messages.  In 0.2 our error messages made a leap  
forward, but before we can claim to be 1.0 I think they need to make 2  
more leaps:  1) they need to be written in a way end users can  
understand them instead of in a way engineers can understand them,  
including having sufficient error documentation with suggested courses  
of action, etc.; 2) they need to be much better at tying errors back  
to where they happened in the script; right now, if one of the MR jobs  
associated with a Pig Latin script fails, there is no way to know what  
part of the script it is associated with.


There are probably others, but those are the ones I can think of off  
the top of my head.  The summary from my viewpoint is we still have  
several 0.x releases before we're ready to consider 1.0.  It would be  
nice to be 1.0 not too long after Hadoop is, which still gives us at  
least 6-9 months.


Alan.


On Jun 22, 2009, at 10:58 AM, Dmitriy Ryaboy wrote:

I know there was some discussion of making the types release (0.2) a  
"Pig 1" release, but that got nixed. There wasn't a similar discussion  
on 0.3. Has the list of want-to-haves for Pig 1.0 been discussed since?




RE: requirements for Pig 1.0?

2009-06-23 Thread Santhosh Srinivasan
To add to Alan's list:

1. Ability to handle unknown types in Pig's schema model.
2. Load/Store interfaces are not set in stone.
3. Nice to have: Make PigServer thread safe.

Thanks,
Santhosh 

-Original Message-
From: Alan Gates [mailto:ga...@yahoo-inc.com] 
Sent: Tuesday, June 23, 2009 1:40 PM
To: pig-dev@hadoop.apache.org
Subject: Re: requirements for Pig 1.0?




[jira] Commented: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-23 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723316#action_12723316
 ] 

Ashutosh Chauhan commented on PIG-820:
--

Thanks Alan and Pradeep for the review.

Will incorporate the SampleOptimizer changes.
The constructor of RandomSampleLoader can only take String args, since it is 
instantiated from a FuncSpec on the backend. So we can't change the types of 
RandomSampleLoader's constructor arguments. However, instead of a String 
holding the classname of the loader, the String version of a FuncSpec can be 
used so that a loader with the correct constructor gets instantiated.

Will upload a new patch soon.

> PERFORMANCE:  The RandomSampleLoader should be changed to allow it subsume 
> another loader
> -
>
> Key: PIG-820
> URL: https://issues.apache.org/jira/browse/PIG-820
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Affects Versions: 0.3.0, 0.4.0
>Reporter: Alan Gates
>Assignee: Ashutosh Chauhan
> Fix For: 0.4.0
>
> Attachments: pig-820.patch, pig-820_v2.patch, pig-820_v3.patch
>
>
> Currently a sampling job requires that data already be stored in 
> BinaryStorage format, since RandomSampleLoader extends BinaryStorage.  For 
> order by this
> has mostly been acceptable, because users tend to use order by at the end of 
> their script where other MR jobs have already operated on the data and thus it
> is already being stored in BinaryStorage.  For pig scripts that just did an 
> order by, an entire MR job is required to read the data and write it out
> in BinaryStorage format.
> As we begin work on join algorithms that will require sampling, this 
> requirement to read the entire input and write it back out will not be 
> acceptable.
> Join is often the first operation of a script, and thus is much more likely 
> to trigger this useless up front translation job.
> Instead RandomSampleLoader can be changed to subsume an existing loader, 
> using the user specified loader to read the tuples while handling the skipping
> between tuples itself.  This will require the subsumed loader to implement a 
> Samplable Interface, that will look something like:
> {code}
> public interface SamplableLoader extends LoadFunc {
> 
> /**
>  * Skip ahead in the input stream.
>  * @param n number of bytes to skip
>  * @return number of bytes actually skipped.  The return semantics are
>  * exactly the same as {...@link java.io.InpuStream#skip(long)}
>  */
> public long skip(long n) throws IOException;
> 
> /**
>  * Get the current position in the stream.
>  * @return position in the stream.
>  */
> public long getPosition() throws IOException;
> }
> {code}
> The MRCompiler would then check if the loader being used to load data 
> implemented the SamplableLoader interface.  If so, rather than create an 
> initial MR
> job to do the translation it would create the sampling job, having 
> RandomSampleLoader use the user specified loader.
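A minimal, hypothetical sketch of what the skip/getPosition contract could look like in practice. The `Samplable` interface here stands in for the proposed `SamplableLoader` (the `LoadFunc` parent and the tuple-reading side are omitted); only the position bookkeeping over an underlying stream is shown:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class SamplableLoaderSketch {

    // Stand-in for the proposed SamplableLoader (minus LoadFunc).
    interface Samplable {
        long skip(long n) throws IOException;   // same semantics as InputStream.skip(long)
        long getPosition() throws IOException;
    }

    static class StreamLoader implements Samplable {
        private final InputStream in;
        private long position = 0;

        StreamLoader(InputStream in) { this.in = in; }

        public long skip(long n) throws IOException {
            long skipped = in.skip(n);          // may skip fewer bytes than requested
            position += skipped;
            return skipped;
        }

        public long getPosition() { return position; }
    }

    public static void main(String[] args) throws IOException {
        StreamLoader loader = new StreamLoader(new ByteArrayInputStream(new byte[100]));
        System.out.println(loader.skip(40) + " at " + loader.getPosition());
    }
}
```

Note that `skip` is allowed to skip fewer bytes than requested, which is why the sampler must use the returned count rather than `n` when advancing.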

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-23 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-820:
-

Status: Open  (was: Patch Available)

Will be uploading a new patch.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-23 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-820:
-

Attachment: pig-820_v4.patch


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-23 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-820:
-

Status: Patch Available  (was: Open)


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-792) PERFORMANCE: Support skewed join in pig

2009-06-23 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-792:


Attachment: lojoin.patch

Patch contains code for the logical operator LOJoin

> PERFORMANCE: Support skewed join in pig
> ---
>
> Key: PIG-792
> URL: https://issues.apache.org/jira/browse/PIG-792
> Project: Pig
>  Issue Type: Improvement
>Reporter: Sriranjan Manjunath
> Attachments: lojoin.patch
>
>
> Fragmented replicated join has a few limitations:
>  - One of the tables needs to be loaded into memory
>  - Join is limited to two tables
> Skewed join partitions the table and joins the records in the reduce phase. 
> It computes a histogram of the key space to account for skewing in the input 
> records. Further, it adjusts the number of reducers depending on the key 
> distribution.
> We need to implement the skewed join in pig.
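A hypothetical sketch of the histogram idea (not the actual Pig implementation): count keys in a sample, then allocate extra reducers to any key whose sampled frequency exceeds a single reducer's even share of the sample:

```java
import java.util.HashMap;
import java.util.Map;

public class KeyHistogramSketch {

    static Map<String, Integer> reducersPerKey(String[] sampledKeys, int numReducers) {
        // Histogram of the sampled key space.
        Map<String, Integer> counts = new HashMap<>();
        for (String k : sampledKeys) {
            counts.merge(k, 1, Integer::sum);
        }
        double evenShare = (double) sampledKeys.length / numReducers;
        Map<String, Integer> alloc = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // A skewed key spanning k even shares is spread over ceil(k) reducers.
            alloc.put(e.getKey(), Math.max(1, (int) Math.ceil(e.getValue() / evenShare)));
        }
        return alloc;
    }

    public static void main(String[] args) {
        String[] sample = {"a", "a", "a", "a", "a", "a", "b", "c", "d", "e"};
        System.out.println(reducersPerKey(sample, 5));
    }
}
```

With 10 samples and 5 reducers the even share is 2, so the key "a" (6 samples) is spread over 3 reducers while the rest keep 1 each.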

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-734) Non-string keys in maps

2009-06-23 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723350#action_12723350
 ] 

Daniel Dai commented on PIG-734:


Patch looks good to me.

> Non-string keys in maps
> ---
>
> Key: PIG-734
> URL: https://issues.apache.org/jira/browse/PIG-734
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.2.0
>Reporter: Alan Gates
>Assignee: Alan Gates
>Priority: Minor
> Fix For: 0.4.0
>
> Attachments: PIG-734.patch, PIG-734_2.patch, PIG-734_3.patch
>
>
> With the addition of types to pig, maps were changed to allow any atomic type 
> to be a key.  However, in practice we do not see people using keys other than 
> strings.  And allowing multiple types is causing us issues in serializing 
> data (we have to check what every key type is) and in the design for non-java 
> UDFs (since many scripting languages include associative arrays such as 
> Perl's hash).
> So I propose we scope back maps to only have string keys.  This would be a 
> non-compatible change.  But I am not aware of anyone using non-string keys, 
> so hopefully it would have little or no impact.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-734) Non-string keys in maps

2009-06-23 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-734:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch checked in.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-820) PERFORMANCE: The RandomSampleLoader should be changed to allow it subsume another loader

2009-06-23 Thread Ashutosh Chauhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashutosh Chauhan updated PIG-820:
-

Status: Open  (was: Patch Available)

Due to change in LoadFunc interface as a part of PIG-734 commit, my patch won't 
apply cleanly on trunk anymore. Will merge with trunk and regenerate the patch 
again.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-856) PERFORMANCE: reduce number of replicas

2009-06-23 Thread Amr Awadallah (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723412#action_12723412
 ] 

Amr Awadallah commented on PIG-856:
---


Please keep in mind that when running on a loaded system (i.e. with many 
concurrent jobs), the fair scheduler has a better chance of allocating mappers 
with local data for your job if you have more replicas (not sure whether the 
capacity scheduler does that as well). So while setting replicas to fewer than 3 
might improve performance when yours is the only job running in the system, it 
will hurt performance when you are sharing the cluster with many others.

Not to mention that this also affects speculative execution, etc.

-- amr

> PERFORMANCE: reduce number of replicas
> --
>
> Key: PIG-856
> URL: https://issues.apache.org/jira/browse/PIG-856
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Olga Natkovich
>
> Currently Pig uses the default number of replicas between MR jobs, which is 3. 
> Given the temporary nature of the data, we should never need more than 2, and 
> we should explicitly set it to improve performance and to be nicer to the 
> name node.
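If implemented, the setting would presumably go through the standard HDFS replication property on the intermediate jobs' configuration; a hypothetical fragment (the property name is standard Hadoop, the value follows the suggestion above):

```xml
<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Illustrative only: replication factor for temporary
  inter-job data written between Pig's MR jobs.</description>
</property>
```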

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Hudson build is back to normal: Pig-Patch-minerva.apache.org #97

2009-06-23 Thread Apache Hudson Server
See 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/97/changes




[jira] Commented: (PIG-773) Empty complex constants (empty bag, empty tuple and empty map) should be supported

2009-06-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723431#action_12723431
 ] 

Hadoop QA commented on PIG-773:
---

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12411357/pig-773_v2.patch
  against trunk revision 787878.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/97/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/97/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/97/console

This message is automatically generated.

> Empty complex constants (empty bag, empty tuple and empty map) should be 
> supported
> --
>
> Key: PIG-773
> URL: https://issues.apache.org/jira/browse/PIG-773
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.3.0
>Reporter: Pradeep Kamath
>Assignee: Ashutosh Chauhan
>Priority: Minor
> Attachments: pig-773.patch, pig-773_v2.patch
>
>
> We should be able to create an empty bag constant using {}, an empty tuple 
> constant using (), and an empty map constant using [] within a pig script

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-832) Make import list configurable

2009-06-23 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-832:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Remove the Yahoo line

> Make import list configurable
> -
>
> Key: PIG-832
> URL: https://issues.apache.org/jira/browse/PIG-832
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Olga Natkovich
>Assignee: Daniel Dai
> Fix For: 0.4.0
>
> Attachments: PIG-832-1.patch, PIG-832-2.patch
>
>
> Currently, it is hardwired in PigContext.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-832) Make import list configurable

2009-06-23 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723445#action_12723445
 ] 

Daniel Dai commented on PIG-832:


Patch committed

> Make import list configurable
> -
>
> Key: PIG-832
> URL: https://issues.apache.org/jira/browse/PIG-832
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.2.0
>Reporter: Olga Natkovich
>Assignee: Daniel Dai
> Fix For: 0.4.0
>
> Attachments: PIG-832-1.patch, PIG-832-2.patch
>
>
> Currently, it is hardwired in PigContext.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer

2009-06-23 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-697:


Status: Patch Available  (was: In Progress)

> Proposed improvements to pig's optimizer
> 
>
> Key: PIG-697
> URL: https://issues.apache.org/jira/browse/PIG-697
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Alan Gates
>Assignee: Santhosh Srinivasan
> Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
> OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
> OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch, 
> OptimizerPhase4_part1.patch
>
>
> I propose the following changes to pig optimizer, plan, and operator 
> functionality to support more robust optimization:
> 1) Remove the required array from Rule.  This will change rules so that they 
> only match exact patterns instead of allowing missing elements in the pattern.
> This has the downside that if a given rule applies to two patterns (say 
> Load->Filter->Group, Load->Group) you have to write two rules.  But it has 
> the upside that
> the resulting rules know exactly what they are getting.  The original intent 
> of this was to reduce the number of rules that needed to be written.  But the
> resulting rules have to do a lot of work to understand the operators they are 
> working with.  With exact matches only, each rule will know exactly the 
> operators it
> is working on and can apply the logic of shifting the operators around.  All 
> four of the existing rules set all entries of required to true, so removing 
> this
> will have no effect on them.
> 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
> conversions or a certain number of iterations has been reached.  Currently the
> function is:
> {code}
> public final void optimize() throws OptimizerException {
> RuleMatcher matcher = new RuleMatcher();
> for (Rule rule : mRules) {
> if (matcher.match(rule)) {
> // It matches the pattern.  Now check if the transformer
> // approves as well.
> List<List<O>> matches = matcher.getAllMatches();
> for (List<O> match : matches)
> {
>   if (rule.transformer.check(match)) {
>   // The transformer approves.
>   rule.transformer.transform(match);
>   }
> }
> }
> }
> }
> {code}
> It would change to be:
> {code}
> public final void optimize() throws OptimizerException {
> RuleMatcher matcher = new RuleMatcher();
> boolean sawMatch;
> int numIterations = 0;
> do {
> sawMatch = false;
> for (Rule rule : mRules) {
> List<List<O>> matches = matcher.getAllMatches();
> for (List<O> match : matches) {
> // It matches the pattern.  Now check if the transformer
> // approves as well.
> if (rule.transformer.check(match)) {
> // The transformer approves.
> sawMatch = true;
> rule.transformer.transform(match);
> }
> }
> }
> // Not sure if 1000 is the right number of iterations, maybe it
> // should be configurable so that large scripts don't stop too 
> // early.
> } while (sawMatch && numIterations++ < 1000);
> }
> {code}
> The reason for limiting the number of iterations is to avoid infinite loops.  
> The reason for iterating over the rules is so that each rule can be applied 
> multiple
> times as necessary.  This allows us to write simple rules, mostly swaps 
> between neighboring operators, without worrying that we get the plan right in 
> one pass.
> For example, we might have a plan that looks like:  
> Load->Join->Filter->Foreach, and we want to optimize it to 
> Load->Foreach->Filter->Join.  With two simple
> rules (swap filter and join and swap foreach and filter), applied 
> iteratively, we can get from the initial to the final plan, without needing to 
> understand the
> big picture of the entire plan.
> 3) Add three calls to OperatorPlan:
> {code}
> /**
>  * Swap two operators in a plan.  Both of the operators must have single
>  * inputs and single outputs.
>  * @param first operator
>  * @param second operator
>  * @throws PlanException if either operator is not single input and output.
>  */
> public void swap(E first, E second) throws PlanException {
> ...
> }
> /**
>  * Push one operator in front of another.  This function is for use when
>  * the first operator has multiple inputs.  The caller can specify
>  * which input of the 
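The iterate-until-fixpoint idea in point 2, and the swap-rule example above it, can be sketched independently of Pig's classes. The rule and plan types here are hypothetical stand-ins, and a third swap rule (Join/Foreach) is added so the two-rule example actually converges to Load->Foreach->Filter->Join:

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.util.function.UnaryOperator;

public class FixpointSketch {

    // Apply each rule in turn until a full pass changes nothing, or a cap
    // is hit (the cap guards against ping-ponging rules), mirroring the
    // proposed PlanOptimizer.optimize().
    static List<String> optimize(List<String> plan, List<UnaryOperator<List<String>>> rules) {
        boolean sawMatch;
        int numIterations = 0;
        do {
            sawMatch = false;
            for (UnaryOperator<List<String>> rule : rules) {
                List<String> next = rule.apply(plan);
                if (!next.equals(plan)) {   // the rule transformed the plan
                    sawMatch = true;
                    plan = next;
                }
            }
        } while (sawMatch && ++numIterations < 1000);
        return plan;
    }

    // A "rule": swap the first adjacent occurrence of a before b,
    // e.g. Join->Filter becomes Filter->Join.
    static UnaryOperator<List<String>> swap(String a, String b) {
        return plan -> {
            List<String> out = new LinkedList<>(plan);
            for (int i = 0; i + 1 < out.size(); i++) {
                if (out.get(i).equals(a) && out.get(i + 1).equals(b)) {
                    out.set(i, b);
                    out.set(i + 1, a);
                    return out;
                }
            }
            return plan;   // no match: plan unchanged
        };
    }

    public static void main(String[] args) {
        List<String> plan = Arrays.asList("Load", "Join", "Filter", "Foreach");
        plan = optimize(plan, Arrays.asList(
                swap("Join", "Filter"), swap("Join", "Foreach"), swap("Filter", "Foreach")));
        System.out.println(plan);
    }
}
```

Each rule only looks at a pair of neighboring operators, yet iterating them reaches the globally reordered plan, which is the point of the proposal.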

[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer

2009-06-23 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-697:


Status: In Progress  (was: Patch Available)


[jira] Updated: (PIG-697) Proposed improvements to pig's optimizer

2009-06-23 Thread Santhosh Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Santhosh Srinivasan updated PIG-697:


Attachment: OptimizerPhase4_part1.patch

Attached patch, implements the optimization rule for pushing filters up.

> Proposed improvements to pig's optimizer
> 
>
> Key: PIG-697
> URL: https://issues.apache.org/jira/browse/PIG-697
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Alan Gates
>Assignee: Santhosh Srinivasan
> Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, 
> OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, 
> OptimizerPhase3_parrt1.patch, OptimizerPhase3_part2_3.patch, 
> OptimizerPhase4_part1.patch
>
>
> I propose the following changes to pig optimizer, plan, and operator 
> functionality to support more robust optimization:
> 1) Remove the required array from Rule.  This will change rules so that they 
> only match exact patterns instead of allowing missing elements in the pattern.
> This has the downside that if a given rule applies to two patterns (say 
> Load->Filter->Group, Load->Group) you have to write two rules.  But it has 
> the upside that
> the resulting rules know exactly what they are getting.  The original intent 
> of this was to reduce the number of rules that needed to be written.  But the
> resulting rules have do a lot of work to understand the operators they are 
> working with.  With exact matches only, each rule will know exactly the 
> operators it
> is working on and can apply the logic of shifting the operators around.  All 
> four of the existing rules set all entries of required to true, so removing 
> this
> will have no effect on them.
> 2) Change PlanOptimizer.optimize to iterate over the rules until there are no 
> conversions or a certain number of iterations has been reached.  Currently the
> function is:
> {code}
> public final void optimize() throws OptimizerException {
>     RuleMatcher matcher = new RuleMatcher();
>     for (Rule rule : mRules) {
>         if (matcher.match(rule)) {
>             // It matches the pattern.  Now check if the transformer
>             // approves as well.
>             List<List<O>> matches = matcher.getAllMatches();
>             for (List<O> match : matches) {
>                 if (rule.transformer.check(match)) {
>                     // The transformer approves.
>                     rule.transformer.transform(match);
>                 }
>             }
>         }
>     }
> }
> {code}
> It would change to be:
> {code}
> public final void optimize() throws OptimizerException {
>     RuleMatcher matcher = new RuleMatcher();
>     boolean sawMatch;
>     int numIterations = 0;
>     do {
>         sawMatch = false;
>         for (Rule rule : mRules) {
>             if (matcher.match(rule)) {
>                 // It matches the pattern.  Now check if the transformer
>                 // approves as well.
>                 List<List<O>> matches = matcher.getAllMatches();
>                 for (List<O> match : matches) {
>                     if (rule.transformer.check(match)) {
>                         // The transformer approves.
>                         sawMatch = true;
>                         rule.transformer.transform(match);
>                     }
>                 }
>             }
>         }
>         // Not sure if 1000 is the right number of iterations, maybe it
>         // should be configurable so that large scripts don't stop too
>         // early.
>     } while (sawMatch && numIterations++ < 1000);
> }
> {code}
> The reason for limiting the number of iterations is to avoid infinite loops.  
> The reason for iterating over the rules is so that each rule can be applied 
> multiple
> times as necessary.  This allows us to write simple rules, mostly swaps 
> between neighboring operators, without worrying about getting the plan right in 
> one pass.
> For example, we might have a plan that looks like:  
> Load->Join->Filter->Foreach, and we want to optimize it to 
> Load->Foreach->Filter->Join.  With two simple
> rules (swap filter and join and swap foreach and filter), applied 
> iteratively, we can get from the initial to final plan, without needing to 
> understand the
> big picture of the entire plan.
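The fixpoint loop proposed above can be exercised with a toy model (operator names are plain strings and the swap rules are stand-ins; none of this is Pig's real plan machinery). It shows why iterating matters: a single swap rule may need to fire on several passes before the plan stops changing.

```java
import java.util.List;

// Toy model of the iterative optimizer loop: operators are plain strings
// and rules are adjacent-pair swaps. Illustrative only -- not Pig code.
class IterativeSwapDemo {
    // Swap the first adjacent (left, right) pair found; true if a swap happened.
    static boolean swapPair(List<String> plan, String left, String right) {
        for (int i = 0; i + 1 < plan.size(); i++) {
            if (plan.get(i).equals(left) && plan.get(i + 1).equals(right)) {
                plan.set(i, right);
                plan.set(i + 1, left);
                return true;
            }
        }
        return false;
    }

    // Mirror of the proposed optimize(): apply rules until nothing changes,
    // capped at 1000 iterations to guard against infinite loops.
    static List<String> optimize(List<String> plan) {
        boolean sawMatch;
        int numIterations = 0;
        do {
            sawMatch = false;
            // Rule: push a Filter above the Join it follows.
            if (swapPair(plan, "Join", "Filter")) sawMatch = true;
            // Rule: push a Foreach above the Filter it follows.
            if (swapPair(plan, "Filter", "Foreach")) sawMatch = true;
        } while (sawMatch && numIterations++ < 1000);
        return plan;
    }
}
```

For example, pushing one Filter above two Joins in Load->Join->Join->Filter takes two trips through the do/while loop: the first pass moves the Filter past one Join, the second past the other, and a third pass confirms nothing more matches.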
> 3) Add three calls to OperatorPlan:
> {code}
> /**
>  * Swap two operators in a plan.  Both of the operators must have single
>  * inputs and single outputs.
>  * @param first operator
>  * @param second operator
>  * @throws PlanException if either operator is not single input and output.
>  */
> public void swap(E first, E second) throws PlanException {
> ...
> }
> /**
>  * Push one operator in front of another.  This function is for use when
>  * the first operator