[jira] Commented: (PIG-6) Addition of Hbase Storage Option In Load/Store Statement
[ https://issues.apache.org/jira/browse/PIG-6?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715385#action_12715385 ] Amr Awadallah commented on PIG-6: - Any progress on this? > Addition of Hbase Storage Option In Load/Store Statement > > > Key: PIG-6 > URL: https://issues.apache.org/jira/browse/PIG-6 > Project: Pig > Issue Type: New Feature > Environment: all environments >Reporter: Edward J. Yoon > Fix For: 0.2.0 > > Attachments: hbase-0.18.1-test.jar, hbase-0.18.1.jar, m34813f5.txt, > PIG-6.patch, PIG-6_V01.patch > > > Pig needs to be able to load a full table from HBase. (Maybe ... difficult? I'm not sure yet.) > Also, as described below, it needs to compose an abstract 2-d table containing only the data filtered out of the HBase array structure by an arbitrary delimited query.
> {code}
> A = LOAD table('hbase_table');
> or
> B = LOAD table('hbase_table') USING HbaseQuery('Query-delimited by attributes & timestamp') AS (f1, f2[, f3]);
> {code}
> Once testing is done on my local machines, I will clarify the grammar and give you more examples to help explain more storage options. > Any advice welcome. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-825) PIG_HADOOP_VERSION should be 18
[ https://issues.apache.org/jira/browse/PIG-825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715368#action_12715368 ] Hudson commented on PIG-825: Integrated in Pig-trunk #460 (See [http://hudson.zones.apache.org/hudson/job/Pig-trunk/460/]) : PIG_HADOOP_VERSION should be set to 18. > PIG_HADOOP_VERSION should be 18 > --- > > Key: PIG-825 > URL: https://issues.apache.org/jira/browse/PIG-825 > Project: Pig > Issue Type: Bug > Components: grunt >Reporter: Dmitriy V. Ryaboy > Fix For: 0.3.0 > > Attachments: pig-825.patch, pig-825.patch > > > PIG_HADOOP_VERSION should be set to 18, not 17, as Hadoop 0.18 is now > considered default. > Patch coming. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Build failed in Hudson: Pig-Patch-minerva.apache.org #65
See http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/65/changes Changes: [gates] PIG-825: PIG_HADOOP_VERSION should be set to 18. [olga] PIG-802: PERFORMANCE: not creating bags for ORDER BY (serakesh via olgan) [pradeepkth] PIG-816: PigStorage() does not accept Unicode characters in its constructor (pradeepkth) -- [...truncated 788 lines...]
[jira] Commented: (PIG-796) support conversion from numeric types to chararray
[ https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715325#action_12715325 ] Pradeep Kamath commented on PIG-796: A few comments:
- In TestPOCast.java, the POCast variables could be named something like "opWithInputTypeAsByteArray", since the intent is not so clear with the current names.
- In POCast.java, you can check for the realType inside the catch clause rather than before trying to cast to ByteArray. This way, if the cast to ByteArray is always successful, we will not incur the overhead of the if (realType == null) check.
- In POCast.java, you can avoid catching ExecException and checking for errorCode == 1071. Since the getNext() call in POCast already throws ExecException, you can just let ExecExceptions from the DataType.toXXX() methods bubble out.
> support conversion from numeric types to chararray > --- > > Key: PIG-796 > URL: https://issues.apache.org/jira/browse/PIG-796 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.2.0 >Reporter: Olga Natkovich > Attachments: 796.patch, pig-796.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
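[Editor's note] The second review suggestion, probing the real type only after the optimistic bytearray cast has failed, can be sketched roughly as below. All names here are illustrative stand-ins, not the actual POCast code.

```java
// Illustrative sketch (NOT the actual POCast code): probe the element's
// real runtime type only inside the catch clause, so the common case --
// the field really is a byte array -- pays nothing for the check.
public class CastSketch {
    private Class<?> realType;  // null until the optimistic cast first fails

    public Integer toInteger(Object field) {
        try {
            // Optimistic path: assume the declared bytearray type is right.
            byte[] bytes = (byte[]) field;
            return Integer.valueOf(new String(bytes));
        } catch (ClassCastException e) {
            // Only on failure do we pay for discovering the real type.
            if (realType == null) {
                realType = field.getClass();
            }
            if (realType == Integer.class) {
                return (Integer) field;
            }
            return Integer.valueOf(field.toString());
        }
    }
}
```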
[jira] Created: (PIG-829) DECLARE statement stops processing after special characters such as dot ".", "+", "%", etc.
DECLARE statement stops processing after special characters such as dot ".", "+", "%", etc. -- Key: PIG-829 URL: https://issues.apache.org/jira/browse/PIG-829 Project: Pig Issue Type: Bug Components: grunt Affects Versions: 0.3.0 Reporter: Viraj Bhat Fix For: 0.3.0 The Pig script below does not work when special characters are used in the DECLARE statement.
{code}
%DECLARE OUT foo.bar
x = LOAD 'something' as (a:chararray, b:chararray);
y = FILTER x BY ( a MATCHES '^.*yahoo.*$' );
STORE y INTO '$OUT';
{code}
When the above script is run in dry-run mode, the substituted file does not contain the special character.
{code}
java -cp pig.jar:/homes/viraj/hadoop-0.18.0-dev/conf -Dhod.server='' org.apache.pig.Main -r declaresp.pig
{code}
Resulting file, "declaresp.pig.substituted":
{code}
x = LOAD 'something' as (a:chararray, b:chararray);
y = FILTER x BY ( a MATCHES '^.*yahoo.*$' );
STORE y INTO 'foo';
{code}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
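[Editor's note] The reported truncation of foo.bar to foo is the behaviour one would expect if the preprocessor scanned the DECLARE value with a word-character-only pattern. The following is a hypothetical illustration of that failure mode, not the actual Pig parameter-substitution code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only (NOT the actual Pig preprocessor): if
// the DECLARE value is read with a word-character pattern such as \w+,
// everything from the first special character ('.', '+', '%', ...) on
// is silently dropped, matching the behaviour reported in PIG-829.
public class DeclareValueSketch {
    private static final Pattern WORD = Pattern.compile("\\w+");

    public static String readValue(String raw) {
        Matcher m = WORD.matcher(raw);
        return m.find() ? m.group() : "";  // "foo.bar" yields only "foo"
    }
}
```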
[jira] Updated: (PIG-796) support conversion from numeric types to chararray
[ https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pradeep Kamath updated PIG-796: --- Status: Patch Available (was: Open) > support conversion from numeric types to chararray > --- > > Key: PIG-796 > URL: https://issues.apache.org/jira/browse/PIG-796 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.2.0 >Reporter: Olga Natkovich > Attachments: 796.patch, pig-796.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-753) Provide support for UDFs without parameters
[ https://issues.apache.org/jira/browse/PIG-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-753: --- Status: Open (was: Patch Available) > Provide support for UDFs without parameters > --- > > Key: PIG-753 > URL: https://issues.apache.org/jira/browse/PIG-753 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Jeff Zhang > Fix For: 0.3.0 > > Attachments: Pig_753_Patch.txt > > > Pig does not support UDFs without parameters; it forces me to provide a parameter, > as in the following statement: > B = FOREACH A GENERATE bagGenerator(); -- this will generate an error. I have to > provide a parameter, as in: > B = FOREACH A GENERATE bagGenerator($0); > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-753) Provide support for UDFs without parameters
[ https://issues.apache.org/jira/browse/PIG-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715312#action_12715312 ] Alan Gates commented on PIG-753: The patch should include unit tests that check whether a Pig script with a UDF that has no parameters will parse, and whether the backend will properly execute a UDF that takes no parameters. > Provide support for UDFs without parameters > --- > > Key: PIG-753 > URL: https://issues.apache.org/jira/browse/PIG-753 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Jeff Zhang > Fix For: 0.3.0 > > Attachments: Pig_753_Patch.txt > > > Pig does not support UDFs without parameters; it forces me to provide a parameter, > as in the following statement: > B = FOREACH A GENERATE bagGenerator(); -- this will generate an error. I have to > provide a parameter, as in: > B = FOREACH A GENERATE bagGenerator($0); > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
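[Editor's note] For context, a zero-parameter UDF on the backend would simply ignore its (empty) input tuple and manufacture its output from scratch. The EvalFunc interface below is a stand-in so the example is self-contained; a real Pig UDF would extend org.apache.pig.EvalFunc<DataBag> instead.

```java
import java.util.ArrayList;
import java.util.List;

// Self-contained sketch of a zero-parameter UDF. The nested EvalFunc
// interface is a stand-in so the example compiles on its own; in real
// Pig one would extend org.apache.pig.EvalFunc<DataBag>.
public class ZeroArgUdfSketch {
    interface EvalFunc<T> {
        T exec(List<Object> input);  // input plays the role of Pig's Tuple
    }

    // A UDF taking no Pig-level arguments ignores its (empty) input
    // tuple and generates its output from scratch.
    static class BagGenerator implements EvalFunc<List<String>> {
        public List<String> exec(List<Object> input) {
            List<String> bag = new ArrayList<String>();
            bag.add("generated");
            return bag;
        }
    }
}
```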
[jira] Updated: (PIG-796) support conversion from numeric types to chararray
[ https://issues.apache.org/jira/browse/PIG-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashutosh Chauhan updated PIG-796: - Attachment: 796.patch Updated patch. This patch fixes the following issue: sometimes (e.g. for values coming out of a map lookup) Pig assumes the type of an element is ByteArray when it is actually of some other type. In such cases a request for a cast fails. This patch first finds the actual type of the element before casting it (specifically when Pig thinks it is a ByteArray) and then does the cast. It also caches the type. When the type changes, a ClassCastException is raised, which gets caught; the cast is then tried again and the cached type is updated. This ensures that the type is not determined on every cast call, while still handling casts whose input type changes from one call to the next. > support conversion from numeric types to chararray > --- > > Key: PIG-796 > URL: https://issues.apache.org/jira/browse/PIG-796 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.2.0 >Reporter: Olga Natkovich > Attachments: 796.patch, pig-796.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
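[Editor's note] The cache-and-retry scheme described in the comment can be sketched roughly as follows. Names and conversions are illustrative; this is not the actual POCast implementation.

```java
// Rough sketch of the cache-and-retry scheme described above (NOT the
// actual POCast code). The discovered runtime type is cached; when a
// later value arrives with a different type, the ClassCastException is
// caught, the cache is refreshed, and the cast is retried exactly once.
public class TypeCacheSketch {
    private Class<?> cachedType;  // the element type discovered so far

    public long asLong(Object value) {
        if (cachedType == null) {
            cachedType = value.getClass();   // first call: probe the real type
        }
        try {
            return convert(cachedType, value);   // fast path: trust the cache
        } catch (ClassCastException e) {
            cachedType = value.getClass();       // type changed: re-probe
            return convert(cachedType, value);   // retry once with fresh type
        }
    }

    private long convert(Class<?> type, Object v) {
        if (type == String.class) return Long.parseLong((String) v);
        return ((Number) v).longValue();
    }
}
```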
[jira] Updated: (PIG-828) Problem accessing a tuple within a bag
[ https://issues.apache.org/jira/browse/PIG-828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-828: --- Attachment: tupleacc.pig studenttab5 Input script and data. > Problem accessing a tuple within a bag > -- > > Key: PIG-828 > URL: https://issues.apache.org/jira/browse/PIG-828 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.3.0 >Reporter: Viraj Bhat > Fix For: 0.3.0 > > Attachments: studenttab5, tupleacc.pig > > > The Pig script below creates a tuple containing 3 columns, 2 of which are chararrays; the third column is a bag of a constant chararray. The script later projects the tuple within a bag.
> {code}
> a = load 'studenttab5' as (name, age, gpa);
> b = foreach a generate ('viraj', {('sms')}, 'pig') as document:(id,singlebag:{singleTuple:(single)}, article);
> describe b;
> c = foreach b generate document.singlebag;
> dump c;
> {code}
> When we run this script we get a run-time error in the Map phase:
> java.lang.ClassCastException: org.apache.pig.data.DefaultTuple cannot be cast to org.apache.pig.data.DataBag
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:402)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:183)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:400)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:183)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:250)
> at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:245)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:236)
> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
> at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
> -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-828) Problem accessing a tuple within a bag
Problem accessing a tuple within a bag -- Key: PIG-828 URL: https://issues.apache.org/jira/browse/PIG-828 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.3.0 Reporter: Viraj Bhat Fix For: 0.3.0 The Pig script below creates a tuple containing 3 columns, 2 of which are chararrays; the third column is a bag of a constant chararray. The script later projects the tuple within a bag.
{code}
a = load 'studenttab5' as (name, age, gpa);
b = foreach a generate ('viraj', {('sms')}, 'pig') as document:(id,singlebag:{singleTuple:(single)}, article);
describe b;
c = foreach b generate document.singlebag;
dump c;
{code}
When we run this script we get a run-time error in the Map phase:
java.lang.ClassCastException: org.apache.pig.data.DefaultTuple cannot be cast to org.apache.pig.data.DataBag
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:402)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:183)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:400)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:183)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:250)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:245)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:236)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: A proposal for changing pig's memory management
Alan Gates wrote:

On May 19, 2009, at 10:30 PM, Mridul Muralidharan wrote:

I am still not very convinced about the value of this implementation, particularly considering the advances made since 1.3 in memory allocators and garbage collection.

My fundamental concern is not with the slowness of garbage collection. I am asserting (along with the paper) that garbage collection is not an optimal choice for a large data processing system. I don't want to improve the garbage collector, I want to manage a subset of the memory without it.

I should probably have elaborated better. Most objects in Pig are in the young generation (please correct me if I am wrong), so promoting them from there (which is handled pretty optimally and blazingly fast by the VM) into slower, longer-lived memory pools should be done with some thought (management of buffers, etc.). The only (corner) cases where this is not valid, off the top of my head, are when a single tuple becomes really large, usually due to a bag with either a large number of tuples in it or tuples with larger payloads; and IMO that results in quite similar costs with this proposal too, but I could be wrong. The side effects of this proposal are many, and sometimes non-obvious: implicitly moving young-generation data into the older generation, causing much more memory pressure for GC; fragmentation of memory blocks, causing quite a bit of memory pressure; replicating quite a bit of the garbage collector's functionality; the possibility of bugs with ref counting; etc.

I don't understand your concerns regarding the load on the GC and memory fragmentation. Let's say I have 10,000 tuples, each with 10 fields. Let's also assume that these tuples live long enough to make it into the "old" memory pool, since this is the interesting case where objects live long enough to cause a problem. In the current implementation there will be 110,000 objects that the GC has to manage moving into the old pool, and check every time it cleans the old pool. In the proposed implementation there would be 10,001 objects (assuming all the data fit into one buffer) to manage. And rather than allocating 100,000 small pieces of memory, we would have allocated one large segment. My belief is that this would lighten the load on the GC.

Old-gen memory management is not trivial. For example (this should probably be commonly known by now): if an old block is freed and yet the cost of moving existing blocks around to use the 'free' block is high, the VM just leaves it around. Over time, you will end up with fragmentation in the old gen which can't be freed. (This is not a VM bug; the costs outweigh the benefits.) That being said, as I mentioned above, the cost of memory usage is not linear: the young gen is way faster (allocation, management, freeing) than objects promoted to successively older generations (compaction, reference changes, etc. in GC). In Pig's case, since it is essentially streaming in nature, most tuples/bags, except in corner cases, would fall into the young gen, where things are faster. Just a note, though: the last time I had to dabble in memory management for my server needs, it was already pretty complex and un-intuitive (not to mention environment- and implementation-specific), and that was a few years back. Unfortunately, I have not kept abreast of recent changes (and quite a few have gone into the VM for Java 6, I was told), so my comments above might not be valid anymore. Other than saying you would probably want to test extensively like we had to, and that things are not as simple as they normally appear (and IMO almost all books/articles get it wrong, so testing is the only way out), I can't really comment more authoritatively anymore :-) Any improvement to Pig memory management would be a welcome change though!

Regards, Mridul

This does replicate some of the functionality of the garbage collector. Complex systems frequently need to re-implement foundational functionality in order to optimize it for their needs. Hence many RDBMS engines have their own implementations of memory management, file I/O, thread scheduling, etc.

As for bugs in ref counting, I agree that forgetting to deallocate is one of the most pernicious problems of allowing programmers to do memory management. But in this case all that will happen is that a buffer will get left around that isn't needed. If the system needs more memory then that buffer will eventually get selected for flushing to disk, and then it will stay there, as no one will call it back into memory. So the cost of forgetting to deallocate is minor.

If the assumption is that the current working set of bags/tuples does not need to be spilled, and anything else can be, then this will pretty much deteriorate to the current implementation in the worst case.

That is not the assumption. There are two issues: 1) trying to spill bags only when we determine we need to is highly error prone, because we
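[Editor's note] The object-count argument above (110,000 GC-managed objects versus roughly 10,001) can be illustrated with a toy sketch in which each "tuple" is just an offset into one shared buffer. This is only an illustration of the idea, not the design actually proposed for Pig.

```java
import java.nio.ByteBuffer;

// Toy illustration of the buffer-backed tuple idea: instead of one
// object per field, each tuple records only an offset into one shared
// buffer, so the GC tracks a single large allocation plus the tuple
// handles. Fields are fixed-width ints for simplicity; this is NOT the
// actual design proposed for Pig.
public class BufferTupleSketch {
    private final ByteBuffer buffer = ByteBuffer.allocate(1 << 20);

    // Writes the fields contiguously and returns the tuple's offset,
    // which acts as its handle.
    public int writeTuple(int[] fields) {
        int offset = buffer.position();
        for (int f : fields) buffer.putInt(f);
        return offset;
    }

    public int readField(int tupleOffset, int fieldIndex) {
        return buffer.getInt(tupleOffset + 4 * fieldIndex);
    }
}
```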
[jira] Resolved: (PIG-825) PIG_HADOOP_VERSION should be 18
[ https://issues.apache.org/jira/browse/PIG-825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates resolved PIG-825. Resolution: Fixed Fix Version/s: 0.3.0 Patch checked in. Thanks Dmitriy. > PIG_HADOOP_VERSION should be 18 > --- > > Key: PIG-825 > URL: https://issues.apache.org/jira/browse/PIG-825 > Project: Pig > Issue Type: Bug > Components: grunt >Reporter: Dmitriy V. Ryaboy > Fix For: 0.3.0 > > Attachments: pig-825.patch, pig-825.patch > > > PIG_HADOOP_VERSION should be set to 18, not 17, as Hadoop 0.18 is now > considered default. > Patch coming. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-827) Redesign graph operations in OperatorPlan
Redesign graph operations in OperatorPlan - Key: PIG-827 URL: https://issues.apache.org/jira/browse/PIG-827 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.1 Reporter: Santhosh Srinivasan Fix For: 0.2.1 The graph operations swap, insertBetween, pushBefore, etc. have to be re-implemented in a layered fashion. The layering will facilitate the re-use of operations. In addition, the use of operator.rewire in the aforementioned operations requires transaction-like ability due to various pre-conditions. Often, the result of one of the operations leaves the graph in an inconsistent state for the rewire operation. Clear layering and assignment of the ability to rewire will remove these inconsistencies. For now, use of rewire has resulted in slightly less maintainable code, along with the necessity to use rewire with discretion. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-697) Proposed improvements to pig's optimizer
[ https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715143#action_12715143 ] Santhosh Srinivasan commented on PIG-697: - The graph operation pushAfter was added as a complementary operation to pushBefore. Currently, on the logical side, there are no concrete use cases for pushAfter. The only operator that truly supports multiple outputs is split. Our current model for split is to have a no-op split operator that has multiple successors, split outputs, each of which is the equivalent of a filter. The split output has inner plans which could have projection operators that hold references to the split's predecessor. When an operator is pushed after split, the operator will be placed between the split and the split output. As a result, when rewire on split is called, the call is dispatched to the split output. The references in the split output after the rewire will now point to split's predecessor instead of pointing to the operator that was pushed after. The intention of pushAfter in the case of a split is to push after the split output. However, the generic pushAfter operation does not distinguish between split and split output. A possible way out is to override this method in the logical plan, duplicating most of the code in OperatorPlan and adding new code to handle split. As of now, pushAfter will not be used in the logical layer. > Proposed improvements to pig's optimizer > > > Key: PIG-697 > URL: https://issues.apache.org/jira/browse/PIG-697 > Project: Pig > Issue Type: Bug > Components: impl >Reporter: Alan Gates >Assignee: Santhosh Srinivasan > Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch, > OptimizerPhase2.patch, OptimizerPhase3_parrt1-1.patch, > OptimizerPhase3_parrt1.patch > > > I propose the following changes to pig optimizer, plan, and operator > functionality to support more robust optimization: > 1) Remove the required array from Rule.
This will change rules so that they only match exact patterns instead of allowing missing elements in the pattern. This has the downside that if a given rule applies to two patterns (say Load->Filter->Group, Load->Group) you have to write two rules. But it has the upside that the resulting rules know exactly what they are getting. The original intent of this was to reduce the number of rules that needed to be written. But the resulting rules have to do a lot of work to understand the operators they are working with. With exact matches only, each rule will know exactly the operators it is working on and can apply the logic of shifting the operators around. All four of the existing rules set all entries of required to true, so removing this will have no effect on them.
> 2) Change PlanOptimizer.optimize to iterate over the rules until there are no conversions or a certain number of iterations has been reached. Currently the function is:
> {code}
> public final void optimize() throws OptimizerException {
>     RuleMatcher matcher = new RuleMatcher();
>     for (Rule rule : mRules) {
>         if (matcher.match(rule)) {
>             // It matches the pattern. Now check if the transformer
>             // approves as well.
>             List<List<O>> matches = matcher.getAllMatches();
>             for (List<O> match : matches) {
>                 if (rule.transformer.check(match)) {
>                     // The transformer approves.
>                     rule.transformer.transform(match);
>                 }
>             }
>         }
>     }
> }
> {code}
> It would change to be:
> {code}
> public final void optimize() throws OptimizerException {
>     RuleMatcher matcher = new RuleMatcher();
>     boolean sawMatch;
>     int numIterations = 0;
>     do {
>         sawMatch = false;
>         for (Rule rule : mRules) {
>             List<List<O>> matches = matcher.getAllMatches();
>             for (List<O> match : matches) {
>                 // It matches the pattern. Now check if the transformer
>                 // approves as well.
>                 if (rule.transformer.check(match)) {
>                     // The transformer approves.
>                     sawMatch = true;
>                     rule.transformer.transform(match);
>                 }
>             }
>         }
>         // Not sure if 1000 is the right number of iterations, maybe it
>         // should be configurable so that large scripts don't stop too
>         // early.
>     } while (sawMatch && numIterations++ < 1000);
> }
> {code}
> The reason for limiting the number of itera
[jira] Updated: (PIG-826) DISTINCT as "Function/Operator" rather than statement/operator - High Level Pig
[ https://issues.apache.org/jira/browse/PIG-826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Ciemiewicz updated PIG-826: - Summary: DISTINCT as "Function/Operator" rather than statement/operator - High Level Pig (was: DISTINCT as "Function" rather than statement - High Level Pig) > DISTINCT as "Function/Operator" rather than statement/operator - High Level > Pig > --- > > Key: PIG-826 > URL: https://issues.apache.org/jira/browse/PIG-826 > Project: Pig > Issue Type: New Feature >Reporter: David Ciemiewicz > > In SQL, a user would think nothing of doing something like:
> {code}
> select
>     COUNT(DISTINCT(user)) as user_count,
>     COUNT(DISTINCT(country)) as country_count,
>     COUNT(DISTINCT(url)) as url_count
> from
>     server_logs;
> {code}
> But in Pig, we'd need to do something like the following. And this is about the most compact version I could come up with.
> {code}
> Logs = load 'log' using PigStorage()
>     as ( user: chararray, country: chararray, url: chararray);
> DistinctUsers = distinct (foreach Logs generate user);
> DistinctCountries = distinct (foreach Logs generate country);
> DistinctUrls = distinct (foreach Logs generate url);
> DistinctUsersCount = foreach (group DistinctUsers all) generate
>     group, COUNT(DistinctUsers) as user_count;
> DistinctCountriesCount = foreach (group DistinctCountries all) generate
>     group, COUNT(DistinctCountries) as country_count;
> DistinctUrlCount = foreach (group DistinctUrls all) generate
>     group, COUNT(DistinctUrls) as url_count;
> AllDistinctCounts = cross
>     DistinctUsersCount, DistinctCountriesCount, DistinctUrlCount;
> Report = foreach AllDistinctCounts generate
>     DistinctUsersCount::user_count,
>     DistinctCountriesCount::country_count,
>     DistinctUrlCount::url_count;
> store Report into 'log_report' using PigStorage();
> {code}
> It would be good if there was a higher-level version of Pig that permitted code to be written as:
> {code}
> Logs = load 'log' using PigStorage()
>     as ( user: chararray, country: chararray, url: chararray);
> Report = overall Logs generate
>     COUNT(DISTINCT(user)) as user_count,
>     COUNT(DISTINCT(country)) as country_count,
>     COUNT(DISTINCT(url)) as url_count;
> store Report into 'log_report' using PigStorage();
> {code}
> I do want this in Pig and not as SQL. I'd expect High Level Pig to generate Lower Level Pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-753) Provide support for UDFs without parameters
[ https://issues.apache.org/jira/browse/PIG-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated PIG-753: --- Fix Version/s: 0.3.0 Affects Version/s: 0.3.0 Status: Patch Available (was: Open) Submitted the patch. Now we do not have to provide a parameter for a UDF; a zero-parameter UDF is OK too. > Provide support for UDFs without parameters > --- > > Key: PIG-753 > URL: https://issues.apache.org/jira/browse/PIG-753 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Jeff Zhang > Fix For: 0.3.0 > > Attachments: Pig_753_Patch.txt > > > Pig does not support UDFs without parameters; it forces me to provide a parameter, > as in the following statement: > B = FOREACH A GENERATE bagGenerator(); -- this will generate an error. I have to > provide a parameter, as in: > B = FOREACH A GENERATE bagGenerator($0); > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-753) Provide support for UDFs without parameters
[ https://issues.apache.org/jira/browse/PIG-753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated PIG-753: --- Attachment: Pig_753_Patch.txt Attached the patch. > Provide support for UDFs without parameters > --- > > Key: PIG-753 > URL: https://issues.apache.org/jira/browse/PIG-753 > Project: Pig > Issue Type: Improvement >Reporter: Jeff Zhang > Attachments: Pig_753_Patch.txt > > > Pig does not support UDFs without parameters; it forces me to provide a parameter, > as in the following statement: > B = FOREACH A GENERATE bagGenerator(); -- this will generate an error. I have to > provide a parameter, as in: > B = FOREACH A GENERATE bagGenerator($0); > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.