[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-14 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: streaming-fix.patch

Some fixes in the patch streaming-fix.patch:

   * The split operator wasn't always playing nicely with the way we run the 
pipeline one extra time in the mapper's or reducer's close function when a 
stream operator is present.
   * Moved the MR optimizer that sets stream-in-map and stream-in-reduce to 
the end of the queue.
   * PhyPlanVisitor forgot to pop some walkers it had pushed onto the stack. 
That could cause the NoopFilterRemoval stage to fail, because it was looking 
at the wrong plan.
   * Setting the job name to the script name by default came in through the 
last merge, but no longer worked.

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, fix_store_prob.patch, 
 merge_741727_HEAD__0324.patch, merge_741727_HEAD__0324_2.patch, 
 merge_trunk_to_branch.patch, multi-store-0303.patch, multi-store-0304.patch, 
 multiquery-phase2_0313.patch, multiquery-phase2_0323.patch, 
 multiquery_0223.patch, multiquery_0224.patch, multiquery_0306.patch, 
 multiquery_explain_fix.patch, non_reversible_store_load_dependencies.patch, 
 non_reversible_store_load_dependencies_2.patch, 
 noop_filter_absolute_path_flag.patch, 
 noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch


 Currently, if your Pig script contains multiple stores and some shared 
 computation, Pig will execute several independent queries. For instance:
 A = load 'data' as (a, b, c);
 B = filter A by a > 5;
 store B into 'output1';
 C = group B by b;
 store C into 'output2';
 This script will result in a map-only job that generates output1, followed by 
 a map-reduce job that generates output2. As a result, the data is read, parsed 
 and filtered twice, which is unnecessary and costly.
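 In other words (a conceptual sketch, not actual Pig plan output):

```
-- Today: two independent plans, each re-reading 'data'
--   Job 1 (map-only):    load 'data' -> filter a > 5 -> store 'output1'
--   Job 2 (map-reduce):  load 'data' -> filter a > 5 -> group by b -> store 'output2'
-- Goal of the multi-query work: share the common prefix
--   Single job: load 'data' -> filter a > 5 -> store 'output1' (map side)
--                                           -> group by b -> store 'output2' (reduce side)
```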

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-765) to implement jdiff

2009-04-14 Thread Giridharan Kesavan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giridharan Kesavan updated PIG-765:
---

Attachment: pig-765.patch

This patch implements jdiff.



 to implement jdiff
 --

 Key: PIG-765
 URL: https://issues.apache.org/jira/browse/PIG-765
 Project: Pig
  Issue Type: Improvement
  Components: build
Reporter: Giridharan Kesavan
Assignee: Giridharan Kesavan
 Attachments: pig-765.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-627) PERFORMANCE: multi-query optimization

2009-04-14 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-627:
---

Attachment: merge-041409.patch

merge-041409.patch contains the latest merge from trunk to branch.

 PERFORMANCE: multi-query optimization
 -

 Key: PIG-627
 URL: https://issues.apache.org/jira/browse/PIG-627
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
 Attachments: file_cmds-0305.patch, fix_store_prob.patch, 
 merge-041409.patch, merge_741727_HEAD__0324.patch, 
 merge_741727_HEAD__0324_2.patch, merge_trunk_to_branch.patch, 
 multi-store-0303.patch, multi-store-0304.patch, multiquery-phase2_0313.patch, 
 multiquery-phase2_0323.patch, multiquery_0223.patch, multiquery_0224.patch, 
 multiquery_0306.patch, multiquery_explain_fix.patch, 
 non_reversible_store_load_dependencies.patch, 
 non_reversible_store_load_dependencies_2.patch, 
 noop_filter_absolute_path_flag.patch, 
 noop_filter_absolute_path_flag_0401.patch, streaming-fix.patch



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-573) Changes to make Pig run with Hadoop 19

2009-04-14 Thread Kevin Weil (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12698910#action_12698910
 ] 

Kevin Weil commented on PIG-573:


What is the current status of this patch with Pig 0.2?  Since PIG-563 went 
into 0.20, all that should be necessary is applying this single patch to the 
0.20 release source, right?

 Changes to make Pig run with Hadoop 19
 --

 Key: PIG-573
 URL: https://issues.apache.org/jira/browse/PIG-573
 Project: Pig
  Issue Type: Task
Affects Versions: 0.2.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Attachments: hadoop19.jar, PIG-573-combinerflag.patch, PIG-573.patch


 This issue tracks changes to Pig code to make it work with Hadoop-0.19.x

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: Ajax library for Pig

2009-04-14 Thread Alan Gates
Would you want to contribute this to the Pig project or release it  
separately?  Either way, keep us posted on your progress.  It sounds  
interesting.


Alan.

On Apr 9, 2009, at 9:28 PM, nitesh bhatia wrote:


Hi
Thanks for the reply.
This will be the architecture:

1. Pig would be installed on some dedicated server machine (say P) with 
hadoop support.
2. In front of it will be a web server (say S).
  2.1 The web server will consist of a dedicated tomcat server (say St) for 
handling dwr servlets.
  2.2 PigScript.js - the proposed JavaScript library.
  2.3 If the user is using some server other than tomcat for the presentation 
layer (say an HTTP server for php or IIS for asp.net), that server (say Su) 
will appear in front of St.

- Connections between Su and St will be done through PigScript.js.
- Connections between St and P will be done through dwr.
- To get results from the server, this system will use Reverse-ajax calls 
(i.e. async calls from server to browser, an inbuilt feature of DWR).

DWR is under the Apache License v2.

--nitesh

On Wed, Apr 8, 2009 at 9:11 PM, Alan Gates ga...@yahoo-inc.com  
wrote:


Sorry if these are silly questions, but I'm not very familiar with some of 
these technologies.  So what you propose is that Pig would be installed on 
some dedicated server machine and a web server would be placed in front of 
it.  Then client libraries would be developed that make calls to the web 
server.  Would these client-side libraries include presentation in the 
browser, both for users submitting queries and receiving results?  Also, 
pig currently does not have a server mode, thus any web server would have to 
spin off threads that run a pig job.

If the above is what you're proposing, I think it would be great.  Opening 
up pig to more users by making it browser accessible would be nice.

Alan.


On Apr 3, 2009, at 5:36 AM, nitesh bhatia wrote:

Hi

Since pig is getting a lot of usage in industry and universities, 
how about adding front-end support for Pig? The plan is to write a 
jquery/dojo-style general JavaScript/AJAX library which can be used 
over any server technology (php, jsp, asp, etc.) to call pig 
functions over the web.

Direct Web Remoting (DWR - http://directwebremoting.org ), an open 
source project at Java.net, provides functionality that allows 
JavaScript in a browser to interact with Java on a server. Can we 
write a JavaScript library exclusively for Pig using DWR? I am not 
sure about licensing issues.

The major advantages I can point out are:
- Use of Pig over HTTP rather than SSH.
- User management will become easy, as this can be handled using any 
CMS.

--nitesh

--
Nitesh Bhatia
Dhirubhai Ambani Institute of Information & Communication Technology
Gandhinagar
Gujarat

Life is never perfect. It just depends where you draw the line.

visit:
http://www.awaaaz.com - connecting through music
http://www.volstreet.com - lets volunteer for better tomorrow
http://www.instibuzz.com - Voice opinions, Transact easily, Have fun







--
Nitesh Bhatia
Dhirubhai Ambani Institute of Information & Communication Technology
Gandhinagar
Gujarat

Life is never perfect. It just depends where you draw the line.

visit:
http://www.awaaaz.com - connecting through music
http://www.volstreet.com - lets volunteer for better tomorrow
http://www.instibuzz.com - Voice opinions, Transact easily, Have fun




Re: Ajax library for Pig

2009-04-14 Thread Ted Dunning
Each pig program submission should involve a separate Pig Latin interpreter.

On Tue, Apr 14, 2009 at 2:32 PM, nitesh bhatia niteshbhatia...@gmail.comwrote:

 Hi
 Currently I have one doubt: how can this system be designed so that 
 multiple users can run the same pig?
 The current scenario is: each user executes their own copy of pig.jar on a 
 shell and accesses hadoop.

 But under this system multiple users will log in to some domain and they 
 will have separate sessions. Now suppose user1 submits a pig script or 
 accesses pig, and then user2 also accesses the pig shell. How will this 
 system work for multiple users? I am not sure what the optimal solution 
 would be.

 --nitesh







-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)


[jira] Created: (PIG-766) ava.lang.OutOfMemoryError: Java heap space

2009-04-14 Thread Vadim Zaliva (JIRA)
ava.lang.OutOfMemoryError: Java heap space
--

 Key: PIG-766
 URL: https://issues.apache.org/jira/browse/PIG-766
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
 Environment: Hadoop-0.18.3 (cloudera RPMs).
mapred.child.java.opts=-Xmx1024m

Reporter: Vadim Zaliva


My pig script always fails with the following error:

java.lang.OutOfMemoryError: Java heap space
   at java.util.Arrays.copyOf(Arrays.java:2786)
   at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
   at java.io.DataOutputStream.write(DataOutputStream.java:90)
   at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
   at 
org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:213)
   at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:291)
   at 
org.apache.pig.data.DefaultAbstractBag.write(DefaultAbstractBag.java:233)
   at 
org.apache.pig.data.DataReaderWriter.writeDatum(DataReaderWriter.java:162)
   at org.apache.pig.data.DefaultTuple.write(DefaultTuple.java:291)
   at 
org.apache.pig.impl.io.PigNullableWritable.write(PigNullableWritable.java:83)
   at 
org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
   at 
org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
   at org.apache.hadoop.mapred.IFile$Writer.append(IFile.java:156)
   at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.spillSingleRecord(MapTask.java:857)
   at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:467)
   at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:101)
   at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:219)
   at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:208)
   at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:86)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-766) ava.lang.OutOfMemoryError: Java heap space

2009-04-14 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12698996#action_12698996
 ] 

Alan Gates commented on PIG-766:


It isn't the overall data size that matters; it is the size of a given key. So 
if you have a 2 GB data set but it has only one key (that is, every row has 
that key), then you'll hit this problem (assuming you can't fit 2 GB in memory 
on your data nodes). Pig does try to spill to avoid this, but it has a hard 
time knowing when and how much to spill, and thus often runs out of memory.

But I think you're right that this isn't in the join.  From the stack it looks 
like it's trying to write data out of the map task.  Do you have very large 
rows in this data?
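A toy illustration of this point, in plain Python with made-up data (nothing here is Pig internals): the memory needed while grouping tracks the largest single key's bag, not the total data volume.

```python
from collections import defaultdict

# Made-up data: 2,000 rows total, but one "hot" key owns half of them.
rows = [("hot", i) for i in range(1000)] + [(f"k{i}", i) for i in range(1000)]

# Grouping materializes one bag (list) per key; the hot key's bag is what
# dominates memory, regardless of the overall dataset size.
bags = defaultdict(list)
for key, value in rows:
    bags[key].append(value)

largest_bag = max(len(bag) for bag in bags.values())
print(len(bags), largest_bag)   # 1001 distinct keys; the largest bag holds 1000 rows
```

Pig's spill logic exists precisely because such a hot-key bag can exceed what the task heap can hold.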

 ava.lang.OutOfMemoryError: Java heap space
 --

 Key: PIG-766
 URL: https://issues.apache.org/jira/browse/PIG-766
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
 Environment: Hadoop-0.18.3 (cloudera RPMs).
 mapred.child.java.opts=-Xmx1024m
Reporter: Vadim Zaliva


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-766) ava.lang.OutOfMemoryError: Java heap space

2009-04-14 Thread Vadim Zaliva (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699000#action_12699000
 ] 

Vadim Zaliva commented on PIG-766:
--

I have at most 17M rows in my dataset.
At some point I am doing a GROUP BY, and the longest row is about 500,000 
tuples.



 ava.lang.OutOfMemoryError: Java heap space
 --

 Key: PIG-766
 URL: https://issues.apache.org/jira/browse/PIG-766
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
 Environment: Hadoop-0.18.3 (cloudera RPMs).
 mapred.child.java.opts=-Xmx1024m
Reporter: Vadim Zaliva


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-766) ava.lang.OutOfMemoryError: Java heap space

2009-04-14 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699017#action_12699017
 ] 

Olga Natkovich commented on PIG-766:


I asked a member of the hadoop team to take a look. A possible problem is that 
there is a single record that does not fit into the combiner buffer. Hopefully 
we will get some help with this.

 ava.lang.OutOfMemoryError: Java heap space
 --

 Key: PIG-766
 URL: https://issues.apache.org/jira/browse/PIG-766
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
 Environment: Hadoop-0.18.3 (cloudera RPMs).
 mapred.child.java.opts=-Xmx1024m
Reporter: Vadim Zaliva


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-766) ava.lang.OutOfMemoryError: Java heap space

2009-04-14 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699019#action_12699019
 ] 

Olga Natkovich commented on PIG-766:


I got confirmation from Hadoop dev that this is a case of one huge record that 
is larger than the combiner buffer, which means it is over 90 MB. Does this 
sound right for your data? Is it possible you have data corruption? Do you have 
another data set to try this query with?

 ava.lang.OutOfMemoryError: Java heap space
 --

 Key: PIG-766
 URL: https://issues.apache.org/jira/browse/PIG-766
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
 Environment: Hadoop-0.18.3 (cloudera RPMs).
 mapred.child.java.opts=-Xmx1024m
Reporter: Vadim Zaliva


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Issue Comment Edited: (PIG-766) ava.lang.OutOfMemoryError: Java heap space

2009-04-14 Thread Vadim Zaliva (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699021#action_12699021
 ] 

Vadim Zaliva edited comment on PIG-766 at 4/14/09 6:53 PM:
---

I know for sure that at some point I have a record of 500K tuples, but with 
tuples being 40-50 bytes each, that is roughly 20-25 MB, far from 90 MB.

Even if this were the case, I do not see how it could cause an OutOfMemory 
exception in java.util.Arrays.copyOf(). Even if this record is, say, 200 MB, 
given a total JVM heap size of 1 GB, that could happen only if all of this 
memory is already used and there is not 200 MB left.

How can I increase the combiner buffer size? What are the possible 
remedies/workarounds for this problem?



 ava.lang.OutOfMemoryError: Java heap space
 --

 Key: PIG-766
 URL: https://issues.apache.org/jira/browse/PIG-766
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
 Environment: Hadoop-0.18.3 (cloudera RPMs).
 mapred.child.java.opts=-Xmx1024m
Reporter: Vadim Zaliva


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-750) Use combiner when a mix of algebraic and non-algebraic functions are used

2009-04-14 Thread Amir Youssefi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699023#action_12699023
 ] 

Amir Youssefi commented on PIG-750:
---

Other use cases we need to have in unit tests:

1) foreach X generate SUM(a) * AVG(b), ...

2) foreach X generate 1 / SUM(a) 

Currently, the suggested workaround is to calculate all the algebraic 
functions in one foreach and then calculate the remaining expressions/mixes in 
a second foreach. This way the combiner is used for the first foreach and we 
get the combiner speed-up.
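The workaround above can be sketched in Pig Latin roughly as follows (relation and field names are hypothetical, not from the issue):

```
B = group X all;
-- first foreach: algebraic functions only, so the combiner applies
C = foreach B generate SUM(X.a) as sum_a, AVG(X.b) as avg_b;
-- second foreach: the non-algebraic arithmetic, computed after the combine
D = foreach C generate sum_a * avg_b, 1.0 / sum_a;
```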

 Use combiner when a mix of algebraic and non-algebraic functions are used
 -

 Key: PIG-750
 URL: https://issues.apache.org/jira/browse/PIG-750
 Project: Pig
  Issue Type: Improvement
Reporter: Amir Youssefi
Priority: Minor

 Currently Pig uses the combiner only when all of a, b, c, ... are algebraic 
 (e.g. SUM, AVG, etc.) in a foreach:
 foreach X generate a,b,c,... 
 It would be a performance improvement if the combiner were also used when a 
 mix of algebraic and non-algebraic functions is present.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-766) ava.lang.OutOfMemoryError: Java heap space

2009-04-14 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699026#action_12699026
 ] 

Santhosh Srinivasan commented on PIG-766:
-

You can specify the I/O sort buffer size on the command line as:

java -Dio.sort.mb=200 -cp pig.jar:/path_to_hadoop_site.xml

Reference: http://hadoop.apache.org/core/docs/current/hadoop-default.html
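Presumably the same value can also be set in the hadoop site configuration that pig picks up from the classpath; a sketch of the fragment (io.sort.mb is in megabytes, and 200 mirrors the command line above):

```xml
<!-- hadoop-site.xml fragment (sketch) -->
<property>
  <name>io.sort.mb</name>
  <value>200</value>
</property>
```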

 ava.lang.OutOfMemoryError: Java heap space
 --

 Key: PIG-766
 URL: https://issues.apache.org/jira/browse/PIG-766
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
 Environment: Hadoop-0.18.3 (cloudera RPMs).
 mapred.child.java.opts=-Xmx1024m
Reporter: Vadim Zaliva


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.