[jira] [Updated] (PIG-2833) org.apache.pig.pigunit.pig.PigServer does not initialize set default log level of pigContext

2012-07-20 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-2833:
---

Status: Patch Available  (was: Open)

> org.apache.pig.pigunit.pig.PigServer does not initialize set default log 
> level of pigContext
> 
>
> Key: PIG-2833
> URL: https://issues.apache.org/jira/browse/PIG-2833
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.10.0
> Environment: pig-0.10.0, Hadoop 2.0.0-cdh4.0.1 on Kubuntu 12.04 64Bit.
>Reporter: Johannes Schwenk
>Assignee: Cheolsoo Park
> Attachments: PIG-2833.patch
>
>
> The class org.apache.pig.pigunit.pig.PigServer does not set the default log 
> level of its instance of PigContext, so PigUnit tests that have 
> {code}
> set debug off;
> {code}
> in them will cause a NullPointerException at org.apache.pig.PigServer line 
> 291, because the default log level is not set.
> So I think org.apache.pig.pigunit.pig.PigServer should do something like 
> {code}
> pigContext.setDefaultLogLevel(Level.INFO);
> {code}
> in its constructors.





[jira] [Updated] (PIG-2833) org.apache.pig.pigunit.pig.PigServer does not initialize set default log level of pigContext

2012-07-20 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-2833:
---

Attachment: PIG-2833.patch

Attached is a patch that initializes the default log level of PigContext to 
Level.INFO.

I also added two test cases to TestGrunt to verify that "set debug on" and 
"set debug off" work properly.
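
For anyone skimming, here is a minimal sketch of the kind of fix described 
above (hypothetical code, not the actual PIG-2833.patch), assuming the PigUnit 
PigServer has a constructor along these lines:

{code}
import org.apache.log4j.Level;
import org.apache.pig.ExecType;
import org.apache.pig.backend.executionengine.ExecException;

public class PigServer extends org.apache.pig.PigServer {
    public PigServer(ExecType execType) throws ExecException {
        super(execType);
        // Initialize the default log level so that "set debug on/off"
        // does not hit a null Level in org.apache.pig.PigServer later.
        getPigContext().setDefaultLogLevel(Level.INFO);
    }
}
{code}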

> org.apache.pig.pigunit.pig.PigServer does not initialize set default log 
> level of pigContext
> 
>
> Key: PIG-2833
> URL: https://issues.apache.org/jira/browse/PIG-2833
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.10.0
> Environment: pig-0.10.0, Hadoop 2.0.0-cdh4.0.1 on Kubuntu 12.04 64Bit.
>Reporter: Johannes Schwenk
> Attachments: PIG-2833.patch
>
>
> The class org.apache.pig.pigunit.pig.PigServer does not set the default log 
> level of its instance of PigContext, so PigUnit tests that have 
> {code}
> set debug off;
> {code}
> in them will cause a NullPointerException at org.apache.pig.PigServer line 
> 291, because the default log level is not set.
> So I think org.apache.pig.pigunit.pig.PigServer should do something like 
> {code}
> pigContext.setDefaultLogLevel(Level.INFO);
> {code}
> in its constructors.





[jira] [Assigned] (PIG-2833) org.apache.pig.pigunit.pig.PigServer does not initialize set default log level of pigContext

2012-07-20 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park reassigned PIG-2833:
--

Assignee: Cheolsoo Park

> org.apache.pig.pigunit.pig.PigServer does not initialize set default log 
> level of pigContext
> 
>
> Key: PIG-2833
> URL: https://issues.apache.org/jira/browse/PIG-2833
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.10.0
> Environment: pig-0.10.0, Hadoop 2.0.0-cdh4.0.1 on Kubuntu 12.04 64Bit.
>Reporter: Johannes Schwenk
>Assignee: Cheolsoo Park
> Attachments: PIG-2833.patch
>
>
> The class org.apache.pig.pigunit.pig.PigServer does not set the default log 
> level of its instance of PigContext, so PigUnit tests that have 
> {code}
> set debug off;
> {code}
> in them will cause a NullPointerException at org.apache.pig.PigServer line 
> 291, because the default log level is not set.
> So I think org.apache.pig.pigunit.pig.PigServer should do something like 
> {code}
> pigContext.setDefaultLogLevel(Level.INFO);
> {code}
> in its constructors.





[jira] [Updated] (PIG-1891) Enable StoreFunc to make intelligent decision based on job success or failure

2012-07-20 Thread Eli Reisman (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Reisman updated PIG-1891:
-

Attachment: PIG-1891-1.patch

> Enable StoreFunc to make intelligent decision based on job success or failure
> -
>
> Key: PIG-1891
> URL: https://issues.apache.org/jira/browse/PIG-1891
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.10.0
>Reporter: Alex Rovner
>Priority: Minor
>  Labels: patch
> Attachments: PIG-1891-1.patch
>
>
> We are in the process of using Pig for various data processing and component 
> integration. Here is where we feel Pig storage funcs fall short:
> they are not aware whether the overall job has succeeded. This creates a 
> problem for storage funcs which need to "upload" results into another system: 
> a DB, FTP, another file system, etc.
> I looked at the DBStorage in the piggybank 
> (http://svn.apache.org/viewvc/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/DBStorage.java?view=markup)
> and what I see is essentially a mechanism which, for each task, does the 
> following:
> 1. Creates a record writer (in this case, opens a connection to the DB).
> 2. Opens a transaction.
> 3. Writes records into a batch.
> 4. Executes commit or rollback depending on whether the task was successful.
> While this approach works great on a task level, it does not work at all on a 
> job level. 
> If certain tasks succeed but the overall job fails, partial records are 
> going to get uploaded into the DB.
> Any ideas on a workaround? 
> Our current workaround is fairly ugly: we created a Java wrapper that 
> launches Pig jobs and then uploads to the DB once Pig's job is successful. 
> While the approach works, it's not really integrated into Pig.





[jira] [Updated] (PIG-1891) Enable StoreFunc to make intelligent decision based on job success or failure

2012-07-20 Thread Eli Reisman (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Reisman updated PIG-1891:
-

Attachment: (was: PIG-1891-1.patch)

> Enable StoreFunc to make intelligent decision based on job success or failure
> -
>
> Key: PIG-1891
> URL: https://issues.apache.org/jira/browse/PIG-1891
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.10.0
>Reporter: Alex Rovner
>Priority: Minor
>  Labels: patch
>
> We are in the process of using Pig for various data processing and component 
> integration. Here is where we feel Pig storage funcs fall short:
> they are not aware whether the overall job has succeeded. This creates a 
> problem for storage funcs which need to "upload" results into another system: 
> a DB, FTP, another file system, etc.
> I looked at the DBStorage in the piggybank 
> (http://svn.apache.org/viewvc/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/DBStorage.java?view=markup)
> and what I see is essentially a mechanism which, for each task, does the 
> following:
> 1. Creates a record writer (in this case, opens a connection to the DB).
> 2. Opens a transaction.
> 3. Writes records into a batch.
> 4. Executes commit or rollback depending on whether the task was successful.
> While this approach works great on a task level, it does not work at all on a 
> job level. 
> If certain tasks succeed but the overall job fails, partial records are 
> going to get uploaded into the DB.
> Any ideas on a workaround? 
> Our current workaround is fairly ugly: we created a Java wrapper that 
> launches Pig jobs and then uploads to the DB once Pig's job is successful. 
> While the approach works, it's not really integrated into Pig.





[jira] [Updated] (PIG-1891) Enable StoreFunc to make intelligent decision based on job success or failure

2012-07-20 Thread Eli Reisman (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Reisman updated PIG-1891:
-

Attachment: PIG-1891-1.patch

> Enable StoreFunc to make intelligent decision based on job success or failure
> -
>
> Key: PIG-1891
> URL: https://issues.apache.org/jira/browse/PIG-1891
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.10.0
>Reporter: Alex Rovner
>Priority: Minor
>  Labels: patch
> Attachments: PIG-1891-1.patch
>
>
> We are in the process of using Pig for various data processing and component 
> integration. Here is where we feel Pig storage funcs fall short:
> they are not aware whether the overall job has succeeded. This creates a 
> problem for storage funcs which need to "upload" results into another system: 
> a DB, FTP, another file system, etc.
> I looked at the DBStorage in the piggybank 
> (http://svn.apache.org/viewvc/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/DBStorage.java?view=markup)
> and what I see is essentially a mechanism which, for each task, does the 
> following:
> 1. Creates a record writer (in this case, opens a connection to the DB).
> 2. Opens a transaction.
> 3. Writes records into a batch.
> 4. Executes commit or rollback depending on whether the task was successful.
> While this approach works great on a task level, it does not work at all on a 
> job level. 
> If certain tasks succeed but the overall job fails, partial records are 
> going to get uploaded into the DB.
> Any ideas on a workaround? 
> Our current workaround is fairly ugly: we created a Java wrapper that 
> launches Pig jobs and then uploads to the DB once Pig's job is successful. 
> While the approach works, it's not really integrated into Pig.





[jira] [Updated] (PIG-1891) Enable StoreFunc to make intelligent decision based on job success or failure

2012-07-20 Thread Eli Reisman (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Reisman updated PIG-1891:
-

   Labels: patch  (was: )
Affects Version/s: 0.10.0
   Status: Patch Available  (was: Open)

A first attempt at the cleanupOnSuccess() solution proposed in the comment 
thread. And a first attempt at contributing to Pig ;)


> Enable StoreFunc to make intelligent decision based on job success or failure
> -
>
> Key: PIG-1891
> URL: https://issues.apache.org/jira/browse/PIG-1891
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.10.0
>Reporter: Alex Rovner
>Priority: Minor
>  Labels: patch
> Attachments: PIG-1891-1.patch
>
>
> We are in the process of using Pig for various data processing and component 
> integration. Here is where we feel Pig storage funcs fall short:
> they are not aware whether the overall job has succeeded. This creates a 
> problem for storage funcs which need to "upload" results into another system: 
> a DB, FTP, another file system, etc.
> I looked at the DBStorage in the piggybank 
> (http://svn.apache.org/viewvc/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/DBStorage.java?view=markup)
> and what I see is essentially a mechanism which, for each task, does the 
> following:
> 1. Creates a record writer (in this case, opens a connection to the DB).
> 2. Opens a transaction.
> 3. Writes records into a batch.
> 4. Executes commit or rollback depending on whether the task was successful.
> While this approach works great on a task level, it does not work at all on a 
> job level. 
> If certain tasks succeed but the overall job fails, partial records are 
> going to get uploaded into the DB.
> Any ideas on a workaround? 
> Our current workaround is fairly ugly: we created a Java wrapper that 
> launches Pig jobs and then uploads to the DB once Pig's job is successful. 
> While the approach works, it's not really integrated into Pig.





Re: How to selectively ship a class and its dependencies?

2012-07-20 Thread Julien Le Dem
You could look at the class's bytecode, recursively find the other classes it
depends on, and then ship only those classes.
That's assuming nothing uses reflection to load and instantiate classes.
Julien
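
A rough sketch of that recursive closure (hypothetical code, nothing that 
exists in Pig; directDependencies() stands in for the actual bytecode scan, 
e.g. walking the constant pool with a library such as ASM, and as noted above 
it would miss classes loaded via reflection):

{code}
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Collect the transitive set of classes reachable from rootClass.
static Set<String> transitiveDependencies(String rootClass) {
    Set<String> shipped = new HashSet<String>();
    Deque<String> todo = new ArrayDeque<String>();
    todo.push(rootClass);
    while (!todo.isEmpty()) {
        String cls = todo.pop();
        if (shipped.add(cls)) {               // true if not seen before
            // directDependencies() is a placeholder for the bytecode scan.
            todo.addAll(directDependencies(cls));
        }
    }
    return shipped;                           // the classes to ship
}
{code}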

On Fri, Jul 20, 2012 at 10:07 AM, Jonathan Coveney wrote:

> I think we already have this code, but I'm not sure.
>
> On the frontend, is there a way to say "Ship this class file, and
> everything it depends on?" I ask this because I'm considering an
> optimization using primitive collections, and most of the primitive
> collection frameworks are pretty large (because they have to cover all
> cases), but we would only need to actually ship a small subset of that. I'm
> wondering how baked our methodology to do this is.
>
> Thanks!
> Jon
>


[jira] [Commented] (PIG-2824) Pushing checking number of fields into LoadFunc

2012-07-20 Thread Jie Li (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419629#comment-13419629
 ] 

Jie Li commented on PIG-2824:
-

I also ran a comparison using TPC-H query 19:

{code}
lineitem = load '$input/lineitem' USING PigStorage('|') as (l_orderkey, 
l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, l_discount, 
l_tax, l_returnflag, l_linestatus, l_shipdate, l_commitdate, 
l_receiptdate,l_shipinstruct, l_shipmode, l_comment);

part = load '$input/part' USING PigStorage('|') as (p_partkey, p_name, p_mfgr, 
p_brand, p_type, p_size, p_container, p_retailprice, p_comment);

lpart = JOIN lineitem BY l_partkey, part by p_partkey;

fltResult = FILTER lpart BY 
  (
p_brand == 'Brand#12'
and p_container matches 'SM CASE|SM BOX|SM PACK|SM PKG'
and l_quantity >= 1 and l_quantity <= 11
and p_size >= 1 and p_size <= 5
and l_shipmode matches 'AIR|AIR REG'
and l_shipinstruct == 'DELIVER IN PERSON'
  ) 
  or 
  (
p_brand == 'Brand#23'
and p_container matches 'MED BAG|MED BOX|MED PKG|MED PACK'
and l_quantity >= 10 and l_quantity <= 20
and p_size >= 1 and p_size <= 10
and l_shipmode matches 'AIR|AIR REG'
and l_shipinstruct == 'DELIVER IN PERSON'
  )
  or
  (
p_brand == 'Brand#34'
and p_container matches 'LG CASE|LG BOX|LG PACK|LG PKG'
and l_quantity >= 20 and l_quantity <= 30
and p_size >= 1 and p_size <= 15
and l_shipmode matches 'AIR|AIR REG'
and l_shipinstruct == 'DELIVER IN PERSON'
  );
volume = FOREACH fltResult GENERATE l_extendedprice * (1 - l_discount);
grpResult = GROUP volume ALL;
revenue = FOREACH grpResult GENERATE SUM(volume);

store revenue into '$output/Q19out' USING PigStorage('|');
{code}

It consists of a join job, which dominates the running time, and a lightweight 
group job. Below is a comparison of the map-phase time for processing 10GB of 
data:

||trunk||this patch||
|7m54s|7m22s|

The improvement is less significant than in the previous mini benchmark because 
half of the fields are pruned, but we can still see a 30-second speedup (about 
6%).

> Pushing checking number of fields into LoadFunc
> ---
>
> Key: PIG-2824
> URL: https://issues.apache.org/jira/browse/PIG-2824
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.9.0, 0.10.0
>Reporter: Jie Li
> Attachments: 2824.patch, 2824.png
>
>
> As described in PIG-1188, if users define a schema (with or without types), we 
> need to check the number of fields after loading data: if there are fewer 
> fields, we need to pad with nulls, and if there are more fields, we need to 
> throw them away. 
> For a schema with types, Pig used to insert a Foreach after the loader for 
> type casting, which also checks #fields. For a schema without types there was 
> no such Foreach, so PIG-1188 inserted one just for checking #fields. 
> Unfortunately, a Foreach is too expensive for such a check, and ideally we can 
> push it into the loader.





[jira] [Commented] (PIG-2826) Training link on front page no longer points to Pig training

2012-07-20 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419589#comment-13419589
 ] 

Thejas M Nair commented on PIG-2826:


+1

> Training link on front page no longer points to Pig training
> 
>
> Key: PIG-2826
> URL: https://issues.apache.org/jira/browse/PIG-2826
> Project: Pig
>  Issue Type: Bug
>  Components: site
>Affects Versions: site
>Reporter: Alan Gates
>Assignee: Alan Gates
> Fix For: site
>
> Attachments: PIG-2826.patch
>
>
> The training link on Pig's website used to point to a Pig-specific video on 
> Cloudera's site.  It now points to a list of all their videos.  Also, at the 
> time, they were the only ones providing training videos for Hadoop.  Now other 
> vendors do as well.  This link should be replaced by a link to a wiki page 
> where vendors who wish to can list their training resources.





[jira] [Commented] (PIG-2824) Pushing checking number of fields into LoadFunc

2012-07-20 Thread Jie Li (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419516#comment-13419516
 ] 

Jie Li commented on PIG-2824:
-

Here is the script I used:

{code}
LineItems = LOAD '$input/lineitem' USING PigStorage('|') AS (orderkey, partkey, 
suppkey, linenumber, quantity, extendedprice, discount, tax, returnflag, 
linestatus, shipdate, commitdate, receiptdate, shipinstruct, shipmode, comment);

Result = filter LineItems by 1==0; 

STORE Result INTO '$output/filter';
{code}

Note again that we specified -t PushUpFilter to force the Foreach to be 
processed before the filter, so we can observe the overhead of the Foreach. 
With this patch, the Foreach is not inserted, and we get the improvement shown 
in 2824.png: about 234 seconds vs. 147 seconds for loading 10GB of data.
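
As a sketch only (hypothetical code, not the attached 2824.patch), pushing the 
#fields check into the loader could look roughly like the helper below: pad 
short tuples with nulls and drop extra fields, instead of inserting a Foreach 
just for the check.

{code}
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Hypothetical helper: force a tuple to the declared schema size by
// padding missing fields with null and discarding extra fields.
private Tuple enforceFieldCount(Tuple t, int numFields) throws ExecException {
    if (t.size() == numFields) {
        return t;                        // common case: nothing to do
    }
    Tuple out = TupleFactory.getInstance().newTuple(numFields);
    for (int i = 0; i < numFields; i++) {
        out.set(i, i < t.size() ? t.get(i) : null);   // pad with null
    }
    return out;                          // fields beyond numFields are dropped
}
{code}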

> Pushing checking number of fields into LoadFunc
> ---
>
> Key: PIG-2824
> URL: https://issues.apache.org/jira/browse/PIG-2824
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.9.0, 0.10.0
>Reporter: Jie Li
> Attachments: 2824.patch, 2824.png
>
>
> As described in PIG-1188, if users define a schema (with or without types), we 
> need to check the number of fields after loading data: if there are fewer 
> fields, we need to pad with nulls, and if there are more fields, we need to 
> throw them away. 
> For a schema with types, Pig used to insert a Foreach after the loader for 
> type casting, which also checks #fields. For a schema without types there was 
> no such Foreach, so PIG-1188 inserted one just for checking #fields. 
> Unfortunately, a Foreach is too expensive for such a check, and ideally we can 
> push it into the loader.





Re: Review Request: PIG-2492 AvroStorage should recognize globs and commas

2012-07-20 Thread Santhosh Srinivasan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/5936/#review9320
---

Ship it!


Ship It!

- Santhosh Srinivasan


On July 20, 2012, 4:36 a.m., Cheolsoo Park wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/5936/
> ---
> 
> (Updated July 20, 2012, 4:36 a.m.)
> 
> 
> Review request for pig.
> 
> 
> Description
> ---
> 
> Add glob support to AvroStorage:
> 
> https://issues.apache.org/jira/browse/PIG-2492
> 
> 
> This addresses bug PIG-2492.
> https://issues.apache.org/jira/browse/PIG-2492
> 
> 
> Diffs
> -
> 
>   
> contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java
>  0f8ef27 
>   
> contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorageUtils.java
>  c7de726 
>   
> contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
>  48b093b 
>   
> contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorageUtils.java
>  e5d0c38 
> 
> Diff: https://reviews.apache.org/r/5936/diff/
> 
> 
> Testing
> ---
> 
> 1. Added new unit tests as follows:
> 
> - testDir verifies that AvroStorage recursively loads files in a directory 
> and its sub-directories.
> - testGlob1 to 3 verify that glob patterns are expanded properly.
> 
> To run the tests, please do the following:
> 
> wget 
> https://issues.apache.org/jira/secure/attachment/12536534/avro_test_files.tar.gz
>  
> tar -xf avro_test_files.tar.gz
> ant clean compile-test piggybank -Dhadoopversion=20
> cd contrib/piggybank/java
> ant test -Dtestcase=TestAvroStorage
> 
> 2. Both TestAvroStorage and TestAvroStorageUtils pass.
> 
> 
> Thanks,
> 
> Cheolsoo Park
> 
>



[jira] [Commented] (PIG-2729) Macro expansion does not use pig.import.search.path - UnitTest borked

2012-07-20 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419343#comment-13419343
 ] 

Rohini Palaniswamy commented on PIG-2729:
-

A few comments:

1) An exception should not be thrown here, as it will break Amazon S3 
filesystem support.
{code}
File macroFile = QueryParserUtils.getFileFromSearchImportPath(fname);
+if (macroFile == null) {
+    throw new FileNotFoundException("Could not find the specified file '"
+            + fname + "' using import search path");
+}
+localFileRet = FileLocalizer.fetchFile(pigContext.getProperties(),
+        macroFile.getAbsolutePath());
{code}
It should be
{code}
File localFile = QueryParserUtils.getFileFromSearchImportPath(fname);
localFileRet = localFile == null
        ? FileLocalizer.fetchFile(pigContext.getProperties(), fname)
        : new FetchFileRet(localFile.getCanonicalFile(), false);
{code}
The reason is that the macro path could be a fully qualified S3 path or a path 
on some other supported filesystem. So if we cannot find the file in the local 
filesystem with getFileFromSearchImportPath, then FileLocalizer.fetchFile will 
take care of looking at the other filesystems, downloading the file locally, 
and returning the local file path. It will also throw the FileNotFoundException 
if the file is missing.

2) Again, for the same reason of S3 support, it is incorrect to use 
getFileFromSearchImportPath in this code. Also, getMacroFile already fetches 
the file.

{code}
FetchFileRet localFileRet = getMacroFile(fname);
+File macroFile = QueryParserUtils.getFileFromSearchImportPath(
+        localFileRet.file.getAbsolutePath());
 try {
-    in = QueryParserUtils.getImportScriptAsReader(localFileRet.file.getAbsolutePath());
+    in = new BufferedReader(new FileReader(macroFile));
{code}

should be

{code}
in = new BufferedReader(new FileReader(localFileRet.file));
{code}

3) For the tests, can you extract the common code into a method to cut down on 
the repetition? Something like:

{code}
importUsingSearchPathTest() {
    verifyImportUsingSearchPath("/tmp/mytest2.pig", "mytest2.pig", "/tmp");
}

importUsingSearchPathTest2() {
    verifyImportUsingSearchPath("/tmp/mytest2.pig", "./mytest2.pig", "/tmp");
}

importUsingSearchPathTest3() {
    verifyImportUsingSearchPath("/tmp/mytest2.pig", "../mytest2.pig", "/tmp");
}

importUsingSearchPathTest4() {
    verifyImportUsingSearchPath("/tmp/mytest2.pig", "/tmp/mytest2.pig", 
            "/foo/bar");
}

verifyImportUsingSearchPath(String macroFilePath, String importFilePath, 
        String importSearchPath) {
    .
}
{code}

4) negtiveUsingSearchPathTest2 and 3 are not very useful unless a file with the 
same name and garbage text is created in the search path location. That way we 
can ensure that the right file is picked up and not the other one.

> Macro expansion does not use pig.import.search.path - UnitTest borked
> -
>
> Key: PIG-2729
> URL: https://issues.apache.org/jira/browse/PIG-2729
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.9.2, 0.10.0
> Environment: pig-0.9.2 and pig-0.10.0, hadoop-0.20.2 from Clouderas 
> distribution cdh3u3 on Kubuntu 12.04 64Bit.
>Reporter: Johannes Schwenk
> Fix For: 0.10.0
>
> Attachments: PIG-2729.patch, PIG-2729.patch, test-macros.tar.gz, 
> use-search-path-for-imports.patch
>
>
> In org.apache.pig.test.TestMacroExpansion, in the function 
> importUsingSearchPathTest, the import statement is provided with the full path 
> to /tmp/mytest2.pig, so pig.import.search.path is never used. I changed the 
> import to 
> import 'mytest2.pig';
> and ran the unit test again. This time the test failed, as expected from my 
> experience earlier that day trying in vain to get Pig to eat my 
> pig.import.search.path property! Other properties in the same custom 
> properties file (provided via the -propertyFile command line option), like 
> udf.import.list, get read without any problem.





[jira] [Created] (PIG-2833) org.apache.pig.pigunit.pig.PigServer does not initialize set default log level of pigContext

2012-07-20 Thread Johannes Schwenk (JIRA)
Johannes Schwenk created PIG-2833:
-

 Summary: org.apache.pig.pigunit.pig.PigServer does not initialize 
set default log level of pigContext
 Key: PIG-2833
 URL: https://issues.apache.org/jira/browse/PIG-2833
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.10.0
 Environment: pig-0.10.0, Hadoop 2.0.0-cdh4.0.1 on Kubuntu 12.04 64Bit.
Reporter: Johannes Schwenk


The class org.apache.pig.pigunit.pig.PigServer does not set the default log 
level of its instance of PigContext, so PigUnit tests that have 

{code}
set debug off;
{code}

in them will cause a NullPointerException at org.apache.pig.PigServer line 291, 
because the default log level is not set.

So I think org.apache.pig.pigunit.pig.PigServer should do something like 

{code}
pigContext.setDefaultLogLevel(Level.INFO);
{code}

in its constructors.





[jira] [Created] (PIG-2832) org.apache.pig.pigunit.pig.PigServer does not initialize udf.import.list of PigContext

2012-07-20 Thread Johannes Schwenk (JIRA)
Johannes Schwenk created PIG-2832:
-

 Summary: org.apache.pig.pigunit.pig.PigServer does not initialize 
udf.import.list of PigContext
 Key: PIG-2832
 URL: https://issues.apache.org/jira/browse/PIG-2832
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.10.0
 Environment: pig-0.10.0, Hadoop 2.0.0-cdh4.0.1 on Kubuntu 12.04 64Bit.
Reporter: Johannes Schwenk


PigServer does not initialize udf.import.list.

So, if you have a Pig script that uses UDFs and want to pass the 
udf.import.list via a property file, you can do so using the -propertyFile 
command line option to pig. But you should also be able to do it using 
PigUnit's PigServer class, which already has the corresponding constructor, 
e.g. doing something similar to:

{code}
Properties props = new Properties();
props.load(new FileInputStream("./testdata/test.properties"));
pig = new PigServer(ExecType.LOCAL, props);
String[] params = {"data_dir=testdata"};
test = new PigTest("test.pig", params, pig, cluster);
test.assertSortedOutput("aggregated", new File("./testdata/expected.out"));
{code}

Here udf.import.list is defined in test.properties, and test.pig uses names of 
UDFs that should be resolved using that list.

This does not work!

I'd say the org.apache.pig.PigServer class is the problem. It should initialize 
the import list of the PigContext:

{code}
if (properties.get("udf.import.list") != null) {
    PigContext.initializeImportList(
            (String) properties.get("udf.import.list"));
}
{code}

Right now this is done only in org.apache.pig.Main.





[jira] [Updated] (PIG-2729) Macro expansion does not use pig.import.search.path - UnitTest borked

2012-07-20 Thread Johannes Schwenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johannes Schwenk updated PIG-2729:
--

Attachment: PIG-2729.patch

Hi Rohini,

I changed the patch according to your suggestion. I would post this on Review 
Board, but I currently get an Error 500 every time I try to submit. The test 
cases in TestMacroExpansion all succeed. Could you take another look, please?

Thanks,
Johannes

> Macro expansion does not use pig.import.search.path - UnitTest borked
> -
>
> Key: PIG-2729
> URL: https://issues.apache.org/jira/browse/PIG-2729
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.9.2, 0.10.0
> Environment: pig-0.9.2 and pig-0.10.0, hadoop-0.20.2 from Clouderas 
> distribution cdh3u3 on Kubuntu 12.04 64Bit.
>Reporter: Johannes Schwenk
> Fix For: 0.10.0
>
> Attachments: PIG-2729.patch, PIG-2729.patch, test-macros.tar.gz, 
> use-search-path-for-imports.patch
>
>
> In org.apache.pig.test.TestMacroExpansion, in the function 
> importUsingSearchPathTest, the import statement is provided with the full path 
> to /tmp/mytest2.pig, so pig.import.search.path is never used. I changed the 
> import to 
> import 'mytest2.pig';
> and ran the unit test again. This time the test failed, as expected from my 
> experience earlier that day trying in vain to get Pig to eat my 
> pig.import.search.path property! Other properties in the same custom 
> properties file (provided via the -propertyFile command line option), like 
> udf.import.list, get read without any problem.





[jira] [Commented] (PIG-2831) MR-Cube implementation (Distributed cubing for holistic measures)

2012-07-20 Thread Prasanth J (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419009#comment-13419009
 ] 

Prasanth J commented on PIG-2831:
-

Hello everyone,

With reference to the description of this issue, I am working on step 3, which 
involves creating a sampling job and executing the naive cube computation 
algorithm over the sample dataset. The requirement for this sampling job is 
that I should be able to select a sample size proportional to the input data 
size. The sampling job is needed to determine the size of the large groups and 
to partition them so that no single reducer gets overloaded with large groups.

One thing I am stuck with is dynamically choosing the sample size. In the 
current implementation I am using the sample operator to load a fixed-size 
sample (10% of the data). Since the sample size is not chosen dynamically, this 
fixed sampling will result in oversampling for large datasets. To choose the 
sample size dynamically, we need to know the total number of tuples in the 
input dataset, but finding that number is not trivial. One way is to find the 
total input size and the size of one tuple in memory. The problem with this 
approach is that, since a tuple is backed by a List of objects, the reported 
in-memory size of a tuple will be much larger than the actual size of a row in 
bytes. To verify this I tested with a simple dataset:

Input file size: 319 bytes
Actual number of rows: 13
Number of dimensions: 5
Schema: int, chararray, chararray, chararray, int
Actual row size: 319/13 ~= 25 bytes
In-memory tuple size reported: 264 bytes (~10x greater than the actual row size)

Since the in-memory tuple size is so much higher, we cannot make a good 
estimate of the total number of rows in the dataset, and hence of the sample 
size.
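
To make the arithmetic concrete, here is a small illustration (using the 
numbers from the test dataset above; variable names are made up) of why 
estimating the row count from the in-memory tuple size goes wrong:

{code}
// Estimating rows as inputSize / perRowSize works with the on-disk row
// size but is ~10x off with the reported in-memory tuple size.
long inputSizeBytes     = 319;  // file size on disk
long onDiskRowBytes     = 25;   // ~319 bytes / 13 rows
long inMemoryTupleBytes = 264;  // as reported for one tuple

long goodEstimate = inputSizeBytes / onDiskRowBytes;      // ~12 (actual: 13)
long badEstimate  = inputSizeBytes / inMemoryTupleBytes;  // 1 -- far too low
{code}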

Other approaches:
I looked into how PoissonSampleLoader and RandomSampleLoader work. Each takes a 
different approach to loading the sample dataset. PoissonSampleLoader uses the 
distribution of the skewed keys to generate sample rows that best represent the 
underlying data. It inserts a special marker into the last tuple, carrying the 
number of rows in the dataset. Since this loader is specifically meant for 
handling skewed keys, I cannot use it in my case for generating the sample 
dataset. 
For RandomSampleLoader, we need to specify the number of samples beforehand so 
that the loader stops after loading the specified number of tuples. Since we 
must specify the sample size before loading, we have no way to dynamically load 
samples for datasets of varying size. 
Also, to use either of these loaders we need to copy the entire dataset to a 
temp file and then load from that temp file, which costs an additional map job. 
I don't know why there is a need to copy the entire dataset to a temp file and 
read it back again; I believe the reason (from what I can understand from the 
source) is that the loader classes can only read the InterStorage format.

I have listed below a few pros and cons of the different approaches:

1) +Using the sample operator+
*Pros:*
One less map job compared to the other loaders.

*Cons:*
Reads the entire dataset to generate the sample, because the sample operator is 
implemented as a filter + RANDOM udf + less-than expression (against the sample 
size) after projecting the input columns.
May result in oversampling for larger datasets.

2) +RandomSampleLoader+
*Pros:*
Fixed sample size (the paper referenced in the description mentions that a 2M 
sample is good enough to represent 20B tuples, and 100K is good enough for 2B 
tuples; please refer to page 6 of the paper).
Stops reading after the sample size is reached (useful for large datasets) - 
NOT sure about this!! Please correct me if I am wrong.

*Cons:*
One additional map job required (including post-processing, there will be 4 MR 
jobs, with 2 map-only jobs).
Since a fixed sample size is used, this method is not scalable.

3) +PoissonSampleLoader+
*Pros:*
Dynamically determines the sample size.
Can determine the number of rows in the dataset using the special tuple.

*Cons:*
One additional map job required (including post-processing, there will be 4 MR 
jobs, with 2 map-only jobs).
Not suitable for my use case, since the sample size generated is not 
proportional to the input size.

I think what I need is a hybrid loader (combining concepts from the random and 
Poisson loaders) that dynamically loads sample tuples based on the input 
dataset size; a rough sketch of the sizing idea follows below.
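
As a very rough, hypothetical sketch of what I mean (all names are made up for 
illustration; nothing here exists in Pig): derive a skip interval from an 
estimated row count, so the number of sampled tuples grows with the input but 
is capped, per the paper's observation that ~2M samples suffice even for 20B 
tuples.

{code}
// Hypothetical sizing logic for a hybrid sample loader.
long inputSizeBytes    = 10L * 1024 * 1024 * 1024;  // e.g. 10GB of input
long estimatedRowBytes = 25;                        // rough on-disk row size
long estimatedRows     = inputSizeBytes / estimatedRowBytes;
long targetSamples     = Math.min(2000000L, estimatedRows / 10);  // ~10%, capped
long skipInterval      = Math.max(1L, estimatedRows / Math.max(1L, targetSamples));
// The loader would then emit every skipInterval-th tuple it reads.
{code}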

Any thoughts on how I can generate a sample size proportional to the input data 
size? Or is there any way to find the number of rows in a dataset? Am I missing 
any other ideas for finding or estimating the number of rows in the dataset?


> MR-Cube implementation (Distributed cubing for holistic measures)
> -

[jira] [Created] (PIG-2831) MR-Cube implementation (Distributed cubing for holistic measures)

2012-07-20 Thread Prasanth J (JIRA)
Prasanth J created PIG-2831:
---

 Summary: MR-Cube implementation (Distributed cubing for holistic 
measures)
 Key: PIG-2831
 URL: https://issues.apache.org/jira/browse/PIG-2831
 Project: Pig
  Issue Type: Sub-task
Reporter: Prasanth J


Implementing distributed cube materialization for holistic measures based on 
the MR-Cube approach described in http://arnab.org/files/mrcube.pdf. 
Primary steps involved:
1) Identify whether the measure is holistic or not
2) Determine the algebraic attribute (it can be detected automatically in a few 
cases; if automatic detection fails, the user should hint the algebraic 
attribute)
3) Modify the MR plan to insert a sampling job which executes the naive cube 
algorithm and generates an annotated cube lattice (containing large-group 
partitioning information)
4) Modify the plan to distribute the annotated cube lattice to all mappers 
using the distributed cache
5) Execute the actual cube materialization on the full dataset
6) Modify the MR plan to insert a post-processing job for combining the results 
of the actual cube materialization job
7) OOM exception handling





[jira] [Commented] (PIG-2816) piggybank.jar not getting created with the current buil.xml

2012-07-20 Thread Swathi V (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418978#comment-13418978
 ] 

Swathi V commented on PIG-2816:
---

Hi Daniel,
I did a top-level ant, then ant compile-test. But it was not able to create 
piggybank.jar, because it was pointing to pig-withouthadoop.jar and couldn't 
find the classes in it; the jar that was actually created was 
pig-0.9.2-withouthadoop.jar. 
Correct me if I have done anything wrong!
Thank you.

> piggybank.jar not getting created with the current buil.xml
> ---
>
> Key: PIG-2816
> URL: https://issues.apache.org/jira/browse/PIG-2816
> Project: Pig
>  Issue Type: Bug
>  Components: piggybank
>Affects Versions: 0.9.2
> Environment: Ubuntu 11.04
>Reporter: Swathi V
>Priority: Critical
>  Labels: newbie
> Fix For: 0.9.2
>
> Attachments: build.xml, error.txt, myPatch.patch
>
>
> The current build.xml inside contrib/piggybank/java fails and does not 
> generate piggybank.jar.
