[jira] [Updated] (PIG-2833) org.apache.pig.pigunit.pig.PigServer does not initialize set default log level of pigContext
[ https://issues.apache.org/jira/browse/PIG-2833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-2833:
-------------------------------

    Status: Patch Available  (was: Open)

> org.apache.pig.pigunit.pig.PigServer does not initialize set default log level of pigContext
>
>                 Key: PIG-2833
>                 URL: https://issues.apache.org/jira/browse/PIG-2833
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.10.0
>         Environment: pig-0.10.0, Hadoop 2.0.0-cdh4.0.1 on Kubuntu 12.04 64Bit.
>            Reporter: Johannes Schwenk
>            Assignee: Cheolsoo Park
>         Attachments: PIG-2833.patch
>
> The class org.apache.pig.pigunit.pig.PigServer does not set the default log
> level of its instance of PigContext, so pigunit tests that contain
> {code}
> set debug off;
> {code}
> will cause a NullPointerException at org.apache.pig.PigServer line 291,
> because the default log level is never set. So I think
> org.apache.pig.pigunit.pig.PigServer should do something like
> {code}
> pigContext.setDefaultLogLevel(Level.INFO);
> {code}
> in its constructors.
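For readers following along, a minimal sketch of what the fix might look like in the pigunit PigServer, assuming the setDefaultLogLevel setter named in the report exists on PigContext (the actual constructor set patched may differ):

{code}
package org.apache.pig.pigunit.pig;

import org.apache.log4j.Level;
import org.apache.pig.ExecType;
import org.apache.pig.backend.executionengine.ExecException;

// Illustrative sketch: the pigunit PigServer gives its PigContext a non-null
// default log level, so "set debug on/off" in a tested script no longer
// dereferences a null default inside org.apache.pig.PigServer.
public class PigServer extends org.apache.pig.PigServer {

    public PigServer(ExecType execType) throws ExecException {
        super(execType);
        // Proposed initialization from the ticket (assumed API):
        getPigContext().setDefaultLogLevel(Level.INFO);
    }
}
{code}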
[jira] [Updated] (PIG-2833) org.apache.pig.pigunit.pig.PigServer does not initialize set default log level of pigContext
[ https://issues.apache.org/jira/browse/PIG-2833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-2833:
-------------------------------

    Attachment: PIG-2833.patch

Attached is a patch that initializes the default log level of PigContext to Level.INFO. I also added two test cases to TestGrunt to verify that "set debug on/off" works properly.

> Key: PIG-2833
> URL: https://issues.apache.org/jira/browse/PIG-2833
> Reporter: Johannes Schwenk
> Attachments: PIG-2833.patch
[jira] [Assigned] (PIG-2833) org.apache.pig.pigunit.pig.PigServer does not initialize set default log level of pigContext
[ https://issues.apache.org/jira/browse/PIG-2833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park reassigned PIG-2833:
----------------------------------

    Assignee: Cheolsoo Park

> Key: PIG-2833
> URL: https://issues.apache.org/jira/browse/PIG-2833
> Reporter: Johannes Schwenk
> Assignee: Cheolsoo Park
[jira] [Updated] (PIG-1891) Enable StoreFunc to make intelligent decision based on job success or failure
[ https://issues.apache.org/jira/browse/PIG-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Reisman updated PIG-1891:
-----------------------------

    Attachment: PIG-1891-1.patch

> Enable StoreFunc to make intelligent decision based on job success or failure
>
>                 Key: PIG-1891
>                 URL: https://issues.apache.org/jira/browse/PIG-1891
>             Project: Pig
>          Issue Type: New Feature
>    Affects Versions: 0.10.0
>            Reporter: Alex Rovner
>            Priority: Minor
>              Labels: patch
>         Attachments: PIG-1891-1.patch
>
> We are in the process of using Pig for various data processing and component
> integration. Here is where we feel Pig's storage funcs fall short: they are
> not aware of whether the overall job has succeeded. This creates a problem
> for storage funcs that need to "upload" results into another system: a DB,
> FTP, another file system, etc.
> I looked at DBStorage in the piggybank
> (http://svn.apache.org/viewvc/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/DBStorage.java?view=markup)
> and what I see is essentially a mechanism that, for each task:
> 1. Creates a record writer (in this case, opens a connection to the DB).
> 2. Opens a transaction.
> 3. Writes records into a batch.
> 4. Executes a commit or rollback depending on whether the task was successful.
> While this approach works great at the task level, it does not work at all at
> the job level. If certain tasks succeed but the overall job fails, partial
> records get uploaded into the DB.
> Any ideas for a workaround? Our current workaround is fairly ugly: we created
> a Java wrapper that launches Pig jobs and then uploads to the DB once the Pig
> job is successful. While this approach works, it is not really integrated
> into Pig.
[jira] [Updated] (PIG-1891) Enable StoreFunc to make intelligent decision based on job success or failure
[ https://issues.apache.org/jira/browse/PIG-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Reisman updated PIG-1891:
-----------------------------

    Attachment: (was: PIG-1891-1.patch)

> Key: PIG-1891
> URL: https://issues.apache.org/jira/browse/PIG-1891
> Reporter: Alex Rovner
> Priority: Minor
[jira] [Updated] (PIG-1891) Enable StoreFunc to make intelligent decision based on job success or failure
[ https://issues.apache.org/jira/browse/PIG-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Reisman updated PIG-1891:
-----------------------------

    Attachment: PIG-1891-1.patch

> Key: PIG-1891
> URL: https://issues.apache.org/jira/browse/PIG-1891
> Reporter: Alex Rovner
> Priority: Minor
> Attachments: PIG-1891-1.patch
[jira] [Updated] (PIG-1891) Enable StoreFunc to make intelligent decision based on job success or failure
[ https://issues.apache.org/jira/browse/PIG-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Reisman updated PIG-1891:
-----------------------------

               Labels: patch  (was: )
    Affects Version/s: 0.10.0
               Status: Patch Available  (was: Open)

A first attempt at the cleanupOnSuccess() solution proposed in the comment thread, and a first attempt at contributing to Pig ;)

> Key: PIG-1891
> URL: https://issues.apache.org/jira/browse/PIG-1891
> Reporter: Alex Rovner
> Attachments: PIG-1891-1.patch
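To make the idea concrete, here is an illustrative sketch of the kind of storer a job-level hook enables. The commitStagedResults/discardStagedResults helpers are hypothetical, and cleanupOnSuccess assumes a StoreFunc patched per this ticket (its exact signature may differ in the attached patch):

{code}
import java.io.IOException;
import org.apache.hadoop.mapreduce.Job;
import org.apache.pig.StoreFunc;

// Illustrative only: a storer that defers its external "upload" until the
// whole Pig job has finished, instead of committing per task.
public abstract class TransactionalStorage extends StoreFunc {

    // Hook proposed by this ticket; with the patch applied this overrides
    // StoreFunc.cleanupOnSuccess (signature assumed).
    public void cleanupOnSuccess(String location, Job job) throws IOException {
        // Every task and the overall job succeeded: publish the staged batch,
        // e.g. move rows from a staging table into the live table.
        commitStagedResults(location);
    }

    @Override
    public void cleanupOnFailure(String location, Job job) throws IOException {
        // The overall job failed: drop partially staged records first.
        discardStagedResults(location);
        super.cleanupOnFailure(location, job);
    }

    // Hypothetical helpers a concrete storer would implement.
    protected abstract void commitStagedResults(String location) throws IOException;

    protected abstract void discardStagedResults(String location) throws IOException;
}
{code}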
Re: How to selectively ship a class and its dependencies?
You could look at the class bytecode, see what other classes it depends on recursively, and then ship only those classes. That assumes nothing uses reflection to load and instantiate classes.

Julien

On Fri, Jul 20, 2012 at 10:07 AM, Jonathan Coveney wrote:
> I think we already have this code, but I'm not sure.
>
> On the frontend, is there a way to say "ship this class file, and
> everything it depends on"? I ask because I'm considering an optimization
> using primitive collections, and most of the primitive collection
> frameworks are pretty large (because they have to cover all cases), but
> we would only need to actually ship a small subset of that. I'm wondering
> how baked our methodology to do this is.
>
> Thanks!
> Jon
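For the bytecode-walking approach Julien describes, a minimal sketch using the ASM library might look like the following (the class names are illustrative; as noted, reflection-loaded classes would still be missed, and JDK classes would need filtering before shipping):

{code}
import java.io.IOException;
import java.util.Set;
import java.util.TreeSet;
import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassWriter;
import org.objectweb.asm.commons.ClassRemapper;
import org.objectweb.asm.commons.Remapper;

// Sketch: collect every class name referenced by the bytecode of one class.
// Applying this recursively to the results yields the transitive closure of
// classes to ship.
public class ClassDependencyCollector {

    public static Set<String> referencedClasses(String className) throws IOException {
        final Set<String> seen = new TreeSet<>();
        // The Remapper is shown every internal class name in the class file;
        // we record each name without actually renaming anything.
        Remapper collector = new Remapper() {
            @Override
            public String map(String internalName) {
                seen.add(internalName.replace('/', '.'));
                return internalName;
            }
        };
        ClassReader reader = new ClassReader(className); // loads from the classpath
        reader.accept(new ClassRemapper(new ClassWriter(0), collector), 0);
        return seen;
    }

    public static void main(String[] args) throws Exception {
        referencedClasses("org.apache.pig.PigServer").forEach(System.out::println);
    }
}
{code}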
[jira] [Commented] (PIG-2824) Pushing checking number of fields into LoadFunc
[ https://issues.apache.org/jira/browse/PIG-2824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419629#comment-13419629 ]

Jie Li commented on PIG-2824:
-----------------------------

Also ran a comparison using TPC-H query 19:

{code}
lineitem = load '$input/lineitem' USING PigStorage('|') as (l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate, l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment);
part = load '$input/part' USING PigStorage('|') as (p_partkey, p_name, p_mfgr, p_brand, p_type, p_size, p_container, p_retailprice, p_comment);
lpart = JOIN lineitem BY l_partkey, part by p_partkey;
fltResult = FILTER lpart BY
    ( p_brand == 'Brand#12' and p_container matches 'SM CASE|SM BOX|SM PACK|SM PKG'
      and l_quantity >= 1 and l_quantity <= 11 and p_size >= 1 and p_size <= 5
      and l_shipmode matches 'AIR|AIR REG' and l_shipinstruct == 'DELIVER IN PERSON' )
    or
    ( p_brand == 'Brand#23' and p_container matches 'MED BAG|MED BOX|MED PKG|MED PACK'
      and l_quantity >= 10 and l_quantity <= 20 and p_size >= 1 and p_size <= 10
      and l_shipmode matches 'AIR|AIR REG' and l_shipinstruct == 'DELIVER IN PERSON' )
    or
    ( p_brand == 'Brand#34' and p_container matches 'LG CASE|LG BOX|LG PACK|LG PKG'
      and l_quantity >= 20 and l_quantity <= 30 and p_size >= 1 and p_size <= 15
      and l_shipmode matches 'AIR|AIR REG' and l_shipinstruct == 'DELIVER IN PERSON' );
volume = FOREACH fltResult GENERATE l_extendedprice * (1 - l_discount);
grpResult = GROUP volume ALL;
revenue = FOREACH grpResult GENERATE SUM(volume);
store revenue into '$output/Q19out' USING PigStorage('|');
{code}

The query consists of a join job, which dominates the running time, and a lightweight group job. Below is a comparison of the map-phase time for processing 10GB of data:

||trunk||this patch||
|7m54s|7m22s|

The improvement is less significant than in the previous mini benchmark because half of the fields are pruned, but we still see a 30-second speedup (about 6%).

> Pushing checking number of fields into LoadFunc
>
>                 Key: PIG-2824
>                 URL: https://issues.apache.org/jira/browse/PIG-2824
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.9.0, 0.10.0
>            Reporter: Jie Li
>         Attachments: 2824.patch, 2824.png
>
> As described in PIG-1188, if users define a schema (with or without types),
> we need to check the number of fields after loading data: if there are fewer
> fields we need to pad null fields, and if there are more fields we need to
> throw them away.
> For a schema with types, Pig used to insert a Foreach after the loader for
> type casting, which also checks the number of fields. For a schema without
> types there was no such Foreach, so PIG-1188 inserted one just for checking
> the number of fields. Unfortunately, a Foreach is too expensive for such a
> check, and ideally we can push it into the loader.
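The check being pushed into the loader can be sketched as follows. This is illustrative code, not the attached patch; the class name and placement are hypothetical, though the pad/trim behavior follows the issue description:

{code}
import java.io.IOException;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Sketch: pad or trim a freshly parsed tuple so it matches the declared
// schema width, instead of inserting a separate FOREACH just for this check.
public final class FieldCountEnforcer {

    private static final TupleFactory TF = TupleFactory.getInstance();

    public static Tuple enforce(Tuple parsed, int schemaWidth) throws IOException {
        int actual = parsed.size();
        if (actual == schemaWidth) {
            return parsed; // common case: no work at all
        }
        Tuple fixed = TF.newTuple(schemaWidth);
        // Copy the fields we have; extras beyond schemaWidth are dropped.
        for (int i = 0; i < Math.min(actual, schemaWidth); i++) {
            fixed.set(i, parsed.get(i));
        }
        // Missing fields stay null (newTuple(n) initializes all slots to null).
        return fixed;
    }
}
{code}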
[jira] [Commented] (PIG-2826) Training link on front page no longer points to Pig training
[ https://issues.apache.org/jira/browse/PIG-2826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419589#comment-13419589 ]

Thejas M Nair commented on PIG-2826:
------------------------------------

+1

> Training link on front page no longer points to Pig training
>
>                 Key: PIG-2826
>                 URL: https://issues.apache.org/jira/browse/PIG-2826
>             Project: Pig
>          Issue Type: Bug
>          Components: site
>    Affects Versions: site
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>             Fix For: site
>         Attachments: PIG-2826.patch
>
> The training link on Pig's website used to point to a Pig-specific video on
> Cloudera's site. It now points to a list of all their videos. Also, at the
> time, they were the only ones providing training videos for Hadoop; now other
> vendors do as well. The link should be replaced by a link to a wiki page
> where vendors who wish to can list their training resources.
[jira] [Commented] (PIG-2824) Pushing checking number of fields into LoadFunc
[ https://issues.apache.org/jira/browse/PIG-2824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419516#comment-13419516 ]

Jie Li commented on PIG-2824:
-----------------------------

Here is the script I used:

{code}
LineItems = LOAD '$input/lineitem' USING PigStorage('|') AS (orderkey, partkey, suppkey, linenumber, quantity, extendedprice, discount, tax, returnflag, linestatus, shipdate, commitdate, receiptdate, shipinstruct, shipmode, comment);
Result = filter LineItems by 1==0;
STORE Result INTO '$output/filter';
{code}

Note again that we specified -t PushUpFilter to force the Foreach to be processed before the filter, so that we can observe the Foreach's overhead. With this patch the Foreach is not inserted, and we achieve the improvement shown in 2824.png: about 234 seconds vs. 147 seconds for loading 10GB of data.

> Pushing checking number of fields into LoadFunc
>
> Key: PIG-2824
> URL: https://issues.apache.org/jira/browse/PIG-2824
> Reporter: Jie Li
> Attachments: 2824.patch, 2824.png
Re: Review Request: PIG-2492 AvroStorage should recognize globs and commas
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/5936/#review9320
-----------------------------------------------------------

Ship it!

- Santhosh Srinivasan

On July 20, 2012, 4:36 a.m., Cheolsoo Park wrote:
> Review request for pig.
>
> Description
> -----------
>
> Add glob support to AvroStorage:
> https://issues.apache.org/jira/browse/PIG-2492
>
> This addresses bug PIG-2492.
>
> Diffs
> -----
>
> contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java 0f8ef27
> contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorageUtils.java c7de726
> contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java 48b093b
> contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorageUtils.java e5d0c38
>
> Diff: https://reviews.apache.org/r/5936/diff/
>
> Testing
> -------
>
> 1. Added new unit tests as follows:
>    - testDir verifies that AvroStorage recursively loads files in a directory and its sub-directories.
>    - testGlob1 to 3 verify that glob patterns are expanded properly.
>
>    To run the tests, please do the following:
>    wget https://issues.apache.org/jira/secure/attachment/12536534/avro_test_files.tar.gz
>    tar -xf avro_test_files.tar.gz
>    ant clean compile-test piggybank -Dhadoopversion=20
>    cd contrib/piggybank/java
>    ant test -Dtestcase=TestAvroStorage
>
> 2. Both TestAvroStorage and TestAvroStorageUtils pass.
>
> Thanks,
> Cheolsoo Park
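With globs and commas recognized, load statements along these lines (the paths are purely illustrative) should expand to all matching files:

{code}
-- Hypothetical layout: braces, commas, and '*' expand to multiple input files
events = LOAD '/data/events/{2012-07-19,2012-07-20}/part-*'
         USING org.apache.pig.piggybank.storage.avro.AvroStorage();
{code}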
[jira] [Commented] (PIG-2729) Macro expansion does not use pig.import.search.path - UnitTest borked
[ https://issues.apache.org/jira/browse/PIG-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419343#comment-13419343 ]

Rohini Palaniswamy commented on PIG-2729:
-----------------------------------------

A few comments:

1) An exception should not be thrown here, as it will break Amazon S3 filesystem support.
{code}
File macroFile = QueryParserUtils.getFileFromSearchImportPath(fname);
+if (macroFile == null) {
+    throw new FileNotFoundException("Could not find the specified file '"
+        + fname + "' using import search path");
+}
+localFileRet = FileLocalizer.fetchFile(pigContext.getProperties(),
+    macroFile.getAbsolutePath());
{code}
It should be
{code}
File localFile = QueryParserUtils.getFileFromSearchImportPath(fname);
localFileRet = localFile == null
    ? FileLocalizer.fetchFile(pigContext.getProperties(), fname)
    : new FetchFileRet(localFile.getCanonicalFile(), false);
{code}
The reason is that the macro path could be a fully qualified S3 (or other supported filesystem) path. If we cannot find it in the local filesystem with getFileFromSearchImportPath, FileLocalizer.fetchFile will take care of looking at the other filesystems, downloading the file locally, and returning the local file path. It will also throw the FileNotFoundException if the file is missing.

2) Again, for the same reason of S3 support, it is incorrect to use getFileFromSearchImportPath in this code, and getMacroFile already fetches the file.
{code}
FetchFileRet localFileRet = getMacroFile(fname);
File macroFile = QueryParserUtils.getFileFromSearchImportPath(
+    localFileRet.file.getAbsolutePath());
try {
-    in = QueryParserUtils.getImportScriptAsReader(localFileRet.file.getAbsolutePath());
+    in = new BufferedReader(new FileReader(macroFile));
{code}
should be
{code}
in = new BufferedReader(new FileReader(localFileRet.file));
{code}

3) For the tests, can you extract the common code into a method to cut down on the repetition? Something like:
{code}
importUsingSearchPathTest() {
    verifyImportUsingSearchPath("/tmp/mytest2.pig", "mytest2.pig", "/tmp");
}
importUsingSearchPathTest2() {
    verifyImportUsingSearchPath("/tmp/mytest2.pig", "./mytest2.pig", "/tmp");
}
importUsingSearchPathTest3() {
    verifyImportUsingSearchPath("/tmp/mytest2.pig", "../mytest2.pig", "/tmp");
}
importUsingSearchPathTest4() {
    verifyImportUsingSearchPath("/tmp/mytest2.pig", "/tmp/mytest2.pig", "/foo/bar");
}
verifyImportUsingSearchPath(String macroFilePath, String importFilePath, String importSearchPath) {
    ...
}
{code}

4) negtiveUsingSearchPathTest2 and 3 are not very useful unless a file with the same name and garbage text is created in the search path location. That way we can ensure the right file is being picked up and not the other one.

> Macro expansion does not use pig.import.search.path - UnitTest borked
>
>                 Key: PIG-2729
>                 URL: https://issues.apache.org/jira/browse/PIG-2729
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.9.2, 0.10.0
>         Environment: pig-0.9.2 and pig-0.10.0, hadoop-0.20.2 from Cloudera's distribution cdh3u3 on Kubuntu 12.04 64Bit.
>            Reporter: Johannes Schwenk
>             Fix For: 0.10.0
>         Attachments: PIG-2729.patch, PIG-2729.patch, test-macros.tar.gz, use-search-path-for-imports.patch
>
> In org.apache.pig.test.TestMacroExpansion, the import statement in
> importUsingSearchPathTest is given the full path to /tmp/mytest2.pig, so
> pig.import.search.path is never used. I changed the import to
> import 'mytest2.pig';
> and ran the unit test again. This time the test failed, as expected from my
> experience earlier that day trying in vain to get Pig to honor my
> pig.import.search.path property! Other properties in the same custom
> properties file (provided via the -propertyFile command-line option), such
> as udf.import.list, are read without any problem.
[jira] [Created] (PIG-2833) org.apache.pig.pigunit.pig.PigServer does not initialize set default log level of pigContext
Johannes Schwenk created PIG-2833:
-------------------------------------

             Summary: org.apache.pig.pigunit.pig.PigServer does not initialize set default log level of pigContext
                 Key: PIG-2833
                 URL: https://issues.apache.org/jira/browse/PIG-2833
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.10.0
         Environment: pig-0.10.0, Hadoop 2.0.0-cdh4.0.1 on Kubuntu 12.04 64Bit.
            Reporter: Johannes Schwenk

The class org.apache.pig.pigunit.pig.PigServer does not set the default log level of its instance of PigContext, so pigunit tests that contain

{code}
set debug off;
{code}

will cause a NullPointerException at org.apache.pig.PigServer line 291, because the default log level is never set. So I think org.apache.pig.pigunit.pig.PigServer should do something like

{code}
pigContext.setDefaultLogLevel(Level.INFO);
{code}

in its constructors.
[jira] [Created] (PIG-2832) org.apache.pig.pigunit.pig.PigServer does not initialize udf.import.list of PigContext
Johannes Schwenk created PIG-2832:
-------------------------------------

             Summary: org.apache.pig.pigunit.pig.PigServer does not initialize udf.import.list of PigContext
                 Key: PIG-2832
                 URL: https://issues.apache.org/jira/browse/PIG-2832
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.10.0
         Environment: pig-0.10.0, Hadoop 2.0.0-cdh4.0.1 on Kubuntu 12.04 64Bit.
            Reporter: Johannes Schwenk

PigServer does not initialize udf.import.list. So, if you have a pig script that uses UDFs and want to pass the udf.import.list via a property file, you can do so using the -propertyFile command-line option to pig. But you should also be able to do it using pigunit's PigServer class, which already has the corresponding constructor, e.g. by doing something like:

{code}
Properties props = new Properties();
props.load(new FileInputStream("./testdata/test.properties"));
pig = new PigServer(ExecType.LOCAL, props);
String[] params = {"data_dir=testdata"};
test = new PigTest("test.pig", params, pig, cluster);
test.assertSortedOutput("aggregated", new File("./testdata/expected.out"));
{code}

Here udf.import.list is defined in test.properties, and test.pig uses names of UDFs that should be resolved using that list. This does not work! I'd say the org.apache.pig.PigServer class is the problem: it should initialize the import list of the PigContext,

{code}
if (properties.get("udf.import.list") != null) {
    PigContext.initializeImportList((String) properties.get("udf.import.list"));
}
{code}

Right now this is done only in org.apache.pig.Main.
[jira] [Updated] (PIG-2729) Macro expansion does not use pig.import.search.path - UnitTest borked
[ https://issues.apache.org/jira/browse/PIG-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johannes Schwenk updated PIG-2729:
----------------------------------

    Attachment: PIG-2729.patch

Hi Rohini,
I changed the patch according to your suggestions. I would post this on the review board, but I currently get an Error 500 every time I try to submit. The test cases in TestMacroExpansion all succeed. Could you take a look again, please?
Thanks,
Johannes

> Macro expansion does not use pig.import.search.path - UnitTest borked
>
> Key: PIG-2729
> URL: https://issues.apache.org/jira/browse/PIG-2729
> Reporter: Johannes Schwenk
> Attachments: PIG-2729.patch, PIG-2729.patch, test-macros.tar.gz, use-search-path-for-imports.patch
[jira] [Commented] (PIG-2831) MR-Cube implementation (Distributed cubing for holistic measures)
[ https://issues.apache.org/jira/browse/PIG-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419009#comment-13419009 ]

Prasanth J commented on PIG-2831:
---------------------------------

Hello everyone,

With reference to the description of this issue, I am working on step 3, which involves creating a sampling job and executing the naive cube computation algorithm over the sample dataset. The requirement for this sampling job is that I should be able to select a sample size proportional to the input data size. The sampling job is needed to determine the large groups and to partition them so that no single reducer gets overloaded.

One thing I am stuck on is dynamically choosing the sample size. The current implementation uses the sample operator with a fixed rate (10% of the data). Since the sample size is not chosen dynamically, this fixed rate will oversample large datasets.

To choose the sample size dynamically, we need to know the total number of tuples in the input dataset, which is not trivial to find. One way is to divide the total input size by the size of one tuple in memory. The problem with this approach is that, since a tuple is backed by a List, the reported in-memory size of a tuple is much larger than the actual size of the row in bytes. I verified this with a simple dataset:

Input file size: 319 bytes
Actual number of rows: 13
Number of dimensions: 5
Schema: int, chararray, chararray, chararray, int
Actual row size: 319/13 ~= 25 bytes
In-memory tuple size reported: 264 bytes (~10x the actual row size)

Since the in-memory tuple size is so much higher, we cannot make a good estimate of the total number of rows in the dataset, and hence of the sample size.

As other approaches, I looked into how PoissonSampleLoader and RandomSampleLoader work; each takes a different approach to loading a sample. PoissonSampleLoader uses the distribution of the skewed keys to generate sample rows that best represent the underlying data, and it inserts a special marker into the last tuple carrying the number of rows in the dataset. Since this loader is specifically meant for handling skewed keys, I cannot use it to generate my sample. RandomSampleLoader requires the number of samples to be specified beforehand, so that the loader stops after loading that many tuples; since the sample size must be fixed before loading, there is no way to adapt it to datasets of varying size.

Also, to use either of these two loaders, the entire dataset must first be copied to a temp file, which costs an additional map job. I believe the reason (from what I can tell from the source) for copying the dataset to a temp file and reading it back is that the loader classes can only read the InterStorage format.

A few pros and cons of the different approaches:

1) Using the sample operator
Pros: one less map job than with the loaders.
Cons: reads the entire dataset to generate the sample, because the sample operator is implemented as filter + RANDOM udf + less-than expression (sample rate) after projecting the input columns. May oversample larger datasets.

2) RandomSampleLoader
Pros: fixed sample size (the paper referenced in the description says a 2M-tuple sample is good enough to represent 20B tuples, and 100K is good enough for 2B tuples; please refer to page 6 of the paper). Stops reading after the sample size is reached (useful for large datasets) - NOT sure about this, please correct me if I am wrong.
Cons: one additional map job (including post-processing there will be 4 MR jobs, 2 of them map-only). Since a fixed sample size is used, this method is not scalable.

3) PoissonSampleLoader
Pros: dynamically determines the sample size; can determine the number of rows in the dataset via the special tuple.
Cons: one additional map job (again, 4 MR jobs including post-processing, 2 of them map-only). Not suitable for my use case, since the sample size generated is not proportional to the input size.

I think what I need is a hybrid loader (combining concepts from the random and Poisson loaders) that dynamically loads sample tuples based on the input dataset size. Any thoughts on how I can generate a sample size proportional to the input data size? Or is there any way to find the number of rows in a dataset? Am I missing any other ideas for finding or estimating the number of rows?
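One back-of-the-envelope way to pick a size-proportional sample rate is to estimate the serialized (on-disk) row width from a small prefix of the data instead of from in-memory tuple sizes. The sketch below is illustrative; the class name and all numbers are assumptions, and the 2M-row target comes from the paper cited above:

{code}
// Sketch: derive a sampling fraction so the expected sample has about
// targetSampleRows tuples, using on-disk bytes rather than in-memory sizes.
public final class SampleRateEstimator {

    public static double sampleFraction(long totalInputBytes,
                                        double avgSerializedRowBytes,
                                        long targetSampleRows) {
        // Estimated row count from file size and average serialized row width.
        double estimatedRows = totalInputBytes / avgSerializedRowBytes;
        // Fraction yielding ~targetSampleRows in expectation, capped at 1.0.
        return Math.min(1.0, targetSampleRows / estimatedRows);
    }

    public static void main(String[] args) {
        // E.g. 10GB input, ~25-byte rows, 2M-row sample target:
        double f = sampleFraction(10L * 1024 * 1024 * 1024, 25.0, 2_000_000L);
        System.out.printf("sample with fraction %.6f%n", f);
    }
}
{code}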
[jira] [Created] (PIG-2831) MR-Cube implementation (Distributed cubing for holistic measures)
Prasanth J created PIG-2831:
-------------------------------

             Summary: MR-Cube implementation (Distributed cubing for holistic measures)
                 Key: PIG-2831
                 URL: https://issues.apache.org/jira/browse/PIG-2831
             Project: Pig
          Issue Type: Sub-task
            Reporter: Prasanth J

Implementing distributed cube materialization of holistic measures based on the MR-Cube approach described in http://arnab.org/files/mrcube.pdf. Primary steps involved:
1) Identify whether the measure is holistic
2) Determine the algebraic attribute (it can be detected automatically in a few cases; if automatic detection fails, the user should hint the algebraic attribute)
3) Modify the MR plan to insert a sampling job that executes the naive cube algorithm and generates an annotated cube lattice (containing large-group partitioning information)
4) Modify the plan to distribute the annotated cube lattice to all mappers using the distributed cache
5) Execute the actual cube materialization on the full dataset
6) Modify the MR plan to insert a post-processing job that combines the results of the cube materialization job
7) Handle OOM exceptions
[jira] [Commented] (PIG-2816) piggybank.jar not getting created with the current buil.xml
[ https://issues.apache.org/jira/browse/PIG-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418978#comment-13418978 ]

Swathi V commented on PIG-2816:
-------------------------------

Hi Daniel,
I did a top-level ant and ant compile-test, but it was not able to create piggybank.jar because the build was pointing to pig-withouthadoop.jar and could not find the classes in it; the jar actually created was pig-0.9.2-withouthadoop.jar. Correct me if I have done anything wrong!
Thank you.

> piggybank.jar not getting created with the current buil.xml
>
>                 Key: PIG-2816
>                 URL: https://issues.apache.org/jira/browse/PIG-2816
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.9.2
>         Environment: Ubuntu 11.04
>            Reporter: Swathi V
>            Priority: Critical
>              Labels: newbie
>             Fix For: 0.9.2
>         Attachments: build.xml, error.txt, myPatch.patch
>
> The current build.xml inside contrib/piggybank/java fails and does not
> generate piggybank.jar.