[jira] [Created] (PIG-3078) Make a UDF that, given a string, returns just the columns prefixed by that string

2012-12-04 Thread Jonathan Coveney (JIRA)
Jonathan Coveney created PIG-3078:
-

 Summary: Make a UDF that, given a string, returns just the columns 
prefixed by that string
 Key: PIG-3078
 URL: https://issues.apache.org/jira/browse/PIG-3078
 Project: Pig
  Issue Type: Bug
Reporter: Jonathan Coveney
 Fix For: 0.12


This comes up fairly often, usually as the result of a join. Given that the 
resulting schema has the relation name prepended to each column name, a UDF of 
the following form could give just the columns from the desired relation:

Pluck('relation_name', *)
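
As an illustration only (this is not the proposed implementation), the selection logic could be sketched in Python roughly as follows. It assumes the joined schema exposes aliases of the form relation::column; the function name pluck and the sample schema and row are hypothetical.

```python
def pluck(prefix, schema, row):
    """Keep only the fields whose alias starts with '<prefix>::'.

    schema is a list of field aliases, row is a tuple of values in the
    same order. Both are stand-ins for a Pig schema and tuple.
    """
    wanted = prefix + "::"
    # Positions of the columns that came from the requested relation.
    keep = [i for i, alias in enumerate(schema) if alias.startswith(wanted)]
    return [schema[i] for i in keep], tuple(row[i] for i in keep)


# Hypothetical schema after joining relations a and b.
schema = ["a::id", "a::name", "b::id", "b::score"]
row = (1, "x", 1, 9.5)
print(pluck("a", schema, row))
```

Matching on the prefix plus the :: separator, rather than on the bare prefix, avoids picking up columns from a relation whose name merely starts with the same characters.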

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Subscription: PIG patch available

2012-12-04 Thread jira
Issue Subscription
Filter: PIG patch available (33 issues)

Subscriber: pigdaily

Key       Summary
PIG-3075  Allow AvroStorage STORE Operations To Use Schema Specified By URI
          https://issues.apache.org/jira/browse/PIG-3075
PIG-3073  POUserFunc creating log spam for large scripts
          https://issues.apache.org/jira/browse/PIG-3073
PIG-3069  Native Windows Compatibility for Pig E2E Tests and Harness
          https://issues.apache.org/jira/browse/PIG-3069
PIG-3067  HBaseStorage should be split up to become more managable
          https://issues.apache.org/jira/browse/PIG-3067
PIG-3066  Fix TestPigRunner in trunk
          https://issues.apache.org/jira/browse/PIG-3066
PIG-3057  make readField protected to be able to override it if we extend PigStorage
          https://issues.apache.org/jira/browse/PIG-3057
PIG-3051  java.lang.IndexOutOfBoundsException failure with LimitOptimizer + ColumnPruning
          https://issues.apache.org/jira/browse/PIG-3051
PIG-3033  test-patch failed with javadoc warnings
          https://issues.apache.org/jira/browse/PIG-3033
PIG-3029  TestTypeCheckingValidatorNewLP has some path reference issues for cross-platform execution
          https://issues.apache.org/jira/browse/PIG-3029
PIG-3028  testGrunt dev test needs some command filters to run correctly without cygwin
          https://issues.apache.org/jira/browse/PIG-3028
PIG-3027  pigTest unit test needs a newline filter for comparisons of golden multi-line
          https://issues.apache.org/jira/browse/PIG-3027
PIG-3026  Pig checked-in baseline comparisons need a pre-filter to address OS-specific newline differences
          https://issues.apache.org/jira/browse/PIG-3026
PIG-3025  TestPruneColumn unit test - SimpleEchoStreamingCommand perl inline script needs simplification
          https://issues.apache.org/jira/browse/PIG-3025
PIG-3024  TestEmptyInputDir unit test - hadoop version detection logic is brittle
          https://issues.apache.org/jira/browse/PIG-3024
PIG-3015  Rewrite of AvroStorage
          https://issues.apache.org/jira/browse/PIG-3015
PIG-3010  Allow UDF's to flatten themselves
          https://issues.apache.org/jira/browse/PIG-3010
PIG-2959  Add a pig.cmd for Pig to run under Windows
          https://issues.apache.org/jira/browse/PIG-2959
PIG-2957  TetsScriptUDF fail due to volume prefix in jar
          https://issues.apache.org/jira/browse/PIG-2957
PIG-2956  Invalid cache specification for some streaming statement
          https://issues.apache.org/jira/browse/PIG-2956
PIG-2955  Fix bunch of Pig e2e tests on Windows
          https://issues.apache.org/jira/browse/PIG-2955
PIG-2873  Converting bin/pig shell script to python
          https://issues.apache.org/jira/browse/PIG-2873
PIG-2834  MultiStorage requires unused constructor argument
          https://issues.apache.org/jira/browse/PIG-2834
PIG-2824  Pushing checking number of fields into LoadFunc
          https://issues.apache.org/jira/browse/PIG-2824
PIG-2661  Pig uses an extra job for loading data in Pigmix L9
          https://issues.apache.org/jira/browse/PIG-2661
PIG-2645  PigSplit does not handle the case where SerializationFactory returns null
          https://issues.apache.org/jira/browse/PIG-2645
PIG-2614  AvroStorage crashes on LOADING a single bad error
          https://issues.apache.org/jira/browse/PIG-2614
PIG-2507  Semicolon in paramenters for UDF results in parsing error
          https://issues.apache.org/jira/browse/PIG-2507
PIG-2433  Jython import module not working if module path is in classpath
          https://issues.apache.org/jira/browse/PIG-2433
PIG-2417  Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.
          https://issues.apache.org/jira/browse/PIG-2417
PIG-2362  Rework Ant build.xml to use macrodef instead of antcall
          https://issues.apache.org/jira/browse/PIG-2362
PIG-2312  NPE when relation and column share the same name and used in Nested Foreach
          https://issues.apache.org/jira/browse/PIG-2312
PIG-1942  script UDF (jython) should utilize the intended output schema to more directly convert Py objects to Pig objects
          https://issues.apache.org/jira/browse/PIG-1942
PIG-1237  Piggybank MutliStorage - specify field to write in output
          https://issues.apache.org/jira/browse/PIG-1237

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384


[jira] [Created] (PIG-3077) TestMultiQueryLocal should not write in /tmp

2012-12-04 Thread Julien Le Dem (JIRA)
Julien Le Dem created PIG-3077:
--

 Summary: TestMultiQueryLocal should not write in /tmp
 Key: PIG-3077
 URL: https://issues.apache.org/jira/browse/PIG-3077
 Project: Pig
  Issue Type: Test
Reporter: Julien Le Dem


Temporary files from tests should be under build/test so that they are cleaned 
by "ant clean".
Currently two test suites running on the same machine step on each other and 
create flaky test results.



[jira] [Updated] (PIG-3072) Pig job reporting negative progress

2012-12-04 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-3072:


Release Note:   (was: Committed to trunk. Thanks Koji.)

Committed to trunk. Thanks Koji.

> Pig job reporting negative progress
> ---
>
> Key: PIG-3072
> URL: https://issues.apache.org/jira/browse/PIG-3072
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.10.0
>Reporter: Koji Noguchi
>Assignee: Koji Noguchi
>Priority: Minor
> Fix For: 0.12
>
> Attachments: pig-3072-v01.txt, pig-3072-v02.txt, pig-3072-v03.txt, 
> pig-3072-v04.txt
>
>
> Our users pointed out that their jobs were reporting negative progress.
> 2012-11-02 21:43:11,538 [main] INFO 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - -795% complete
> ...
> (due to TFileRecordReader)



[jira] [Updated] (PIG-3072) Pig job reporting negative progress

2012-12-04 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-3072:


  Resolution: Fixed
Release Note: Committed to trunk. Thanks Koji.
  Status: Resolved  (was: Patch Available)




[jira] [Updated] (PIG-3076) make TestScalarAliases more reliable

2012-12-04 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PIG-3076:
---

Attachment: PIG-3076.patch

PIG-3076.patch modifies the test so that input/output are written to the build 
folder (and are cleaned up by "ant clean") and data is deleted up front so that 
the test does not fail after a previous run has failed.
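
For readers unfamiliar with the pattern, here is a minimal Python sketch of "delete up front, write under the build folder". It is not the content of PIG-3076.patch (the actual test is Java); the helper name fresh_output_dir and the directory layout are assumptions made for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def fresh_output_dir(build_root, test_name):
    """Return an empty directory at <build_root>/test/<test_name>.

    Any leftovers from an earlier (possibly failed) run are removed first,
    so the test never depends on cleanup having happened at the end.
    """
    out = Path(build_root) / "test" / test_name
    shutil.rmtree(out, ignore_errors=True)  # delete up front, not at the end
    out.mkdir(parents=True)
    return out


# Usage: simulate a leftover file from a failed run, then start a new run.
build_root = tempfile.mkdtemp()  # stands in for the real build folder
stale = Path(build_root) / "test" / "TestScalarAliases"
stale.mkdir(parents=True)
(stale / "leftover.txt").write_text("stale output")

out = fresh_output_dir(build_root, "TestScalarAliases")
print(sorted(p.name for p in out.iterdir()))  # the leftover is gone

shutil.rmtree(build_root)  # plays the role of "ant clean"
```

Cleaning at the start makes the test self-healing: a crash midway leaves files behind, but the next run removes them before doing anything else.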

> make TestScalarAliases more reliable
> 
>
> Key: PIG-3076
> URL: https://issues.apache.org/jira/browse/PIG-3076
> Project: Pig
>  Issue Type: Test
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
> Fix For: 0.11, 0.12
>
> Attachments: PIG-3076.patch
>
>
> Currently, this test writes in the root directory so its output is not 
> deleted by ant clean.
> Also it deletes its output at the end instead of the beginning.
> The consequence is that if the test fails once then it will keep failing until 
> the directory is manually cleaned up (not good for CI)



[jira] [Commented] (PIG-2684) :: in field name causes AvroStorage to fail

2012-12-04 Thread Will Oberman (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510045#comment-13510045
 ] 

Will Oberman commented on PIG-2684:
---

I was just bitten by this same bug.  For me it was because I'm changing from 
running Hadoop directly against Cassandra, to doing Cassandra -> Amazon EMR -> 
Cassandra (using Pig as my Hadoop language of choice, and S3 as the data 
interchange layer).  And, my output schema that is Cassandra-compatible seems 
to have autogenerated ::'s.

> :: in field name causes AvroStorage to fail
> ---
>
> Key: PIG-2684
> URL: https://issues.apache.org/jira/browse/PIG-2684
> Project: Pig
>  Issue Type: Bug
>  Components: piggybank
>Reporter: Fabian Alenius
>
> There appears to be a bug in AvroStorage which causes it to fail when there 
> are field names that contain ::
> For example, the following will fail:
>   data = load 'test.txt' as (one, two);
>   grp = GROUP data by (one, two);
>   result = foreach grp generate FLATTEN(group);
>   store result into 'test.avro' using org.apache.pig.piggybank.storage.avro.AvroStorage();
> ERROR 2999: Unexpected internal error. Illegal character in: group::one
> While the following will succeed:
>   data = load 'test.txt' as (one, two);
>   grp = GROUP data by (one, two);
>   result = foreach grp generate FLATTEN(group) as (one,two);
>   store result into 'test.avro' using org.apache.pig.piggybank.storage.avro.AvroStorage();
> Here is a minimal test case:
>   data = load 'test.txt' as (one::two, three);
>   store data into 'test.avro' using org.apache.pig.piggybank.storage.avro.AvroStorage();
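
For context, the Avro specification requires names to start with a letter or underscore and to contain only letters, digits, and underscores, which is why an alias such as group::one is rejected. The Python sketch below only illustrates that rule and one possible way to derive a legal name; it is not what AvroStorage does, and the helper names are hypothetical.

```python
import re

# Avro name rule: first character [A-Za-z_], remaining characters [A-Za-z0-9_].
AVRO_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_avro_name(name):
    """True if the whole string is a legal Avro name."""
    return AVRO_NAME.fullmatch(name) is not None


def strip_relation_prefix(alias):
    """Drop everything up to the last '::' (hypothetical workaround).

    Note: two aliases such as a::id and b::id would collide after this step,
    so a real fix would need a strategy for duplicates.
    """
    return alias.rsplit("::", 1)[-1]


for alias in ["group::one", "one", "one::two"]:
    print(alias, is_valid_avro_name(alias), strip_relation_prefix(alias))
```

Renaming the fields explicitly, as in the succeeding example above (FLATTEN(group) as (one,two)), has the same effect without any change to the storage function.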



[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2012-12-04 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13510033#comment-13510033
 ] 

Cheolsoo Park commented on PIG-3015:


Yes, it does. Thank you, sir!

> Rewrite of AvroStorage
> --
>
> Key: PIG-3015
> URL: https://issues.apache.org/jira/browse/PIG-3015
> Project: Pig
>  Issue Type: Improvement
>  Components: piggybank
>Reporter: Joseph Adler
>Assignee: Joseph Adler
> Attachments: PIG-3015.patch
>
>
> The current AvroStorage implementation has a lot of issues: it requires old 
> versions of Avro, it copies data much more than needed, and it's verbose and 
> complicated. (One pet peeve of mine is that old versions of Avro don't 
> support Snappy compression.)
> I rewrote AvroStorage from scratch to fix these issues. In early tests, the 
> new implementation is significantly faster, and the code is a lot simpler. 
> Rewriting AvroStorage also enabled me to implement support for Trevni.
> I'm opening this ticket to facilitate discussion while I figure out the best 
> way to contribute the changes back to Apache.



Re: Our release process

2012-12-04 Thread Olga Natkovich
I am OK with tests running nightly and reverting patches that cause failures. 
We used to have that. Does anybody know what happened? Is anybody volunteering 
to make it work again?

I would like to see specific criteria for what goes into the branch being 
published (rather than case-by-case). This way each team can decide whether the 
criteria are stringent enough or whether they need to run a private branch.

Olga



From: Santhosh M S
To: Julien Le Dem; "dev@pig.apache.org"
Cc: "billgra...@gmail.com"
Sent: Friday, November 30, 2012 11:46 PM
Subject: Re: Our release process
 
Hi Julien,

You are making most of the points that I did on this thread (CI for e2e, not 
requiring a clean e2e run prior to every commit for a release branch). The only 
point on which there is no clear agreement is the definition of a bug that can 
be included in a previously released branch. I am fine with a case-by-case 
inclusion.

Hi Olga,

Are you fine with Julien's proposal as it stands, where the bugs to be included 
are determined at the time of inclusion instead of now?

Santhosh



From: Julien Le Dem 
To: dev@pig.apache.org; Santhosh M S  
Cc: "billgra...@gmail.com"  
Sent: Friday, November 30, 2012 5:37 PM
Subject: Re: Our release process

Proposed criteria:
- it makes the tests fail (targets: test-commit + test + e2e tests)
- a critical bug is reported in a short time frame (definition of
critical not needed as it is rare and can be decided on a case by case
basis)

That raises another question: what are the existing CI servers running
the tests?
- the Apache CI runs test-commit and test (is it more stable now?)
and not e2e. It would be great if it did.
- we have a Jenkins build at Twitter where we run test-commit and
test, we could not run e2e easily in our environment.
- I understand there's a Yahoo/Hortonworks build (test-commit + test + e2e ???)

Whenever those builds fail we should open or reopen JIRAs and fix them.

The time it takes to run the full test suite makes it impractical to run on a
desktop/laptop.

For the release Pig-0.11.0 we need to get this list of JIRAs down to 0
and publish the jar.
https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+PIG+AND+fixVersion+%3D+%220.11%22+AND+resolution+%3D+Unresolved+ORDER+BY+updated+DESC%2C+due+ASC%2C+priority+DESC

Julien

On Thu, Nov 29, 2012 at 11:16 PM, Santhosh M S wrote:
> Looks like everyone is interested in having frequent releases - I don't see 
> anyone disagreeing with that.
>
> Regarding "If a patch makes the release branch unstable, we revert it" - what
> are the criteria? If we can't decide on the criteria on this thread (already
> pretty long) then let's get the release trains going. We can revisit the
> criteria for inclusion of bug fixes when that happens.
>
> Santhosh
>
>
> 
> From: Julien Le Dem
> To: dev@pig.apache.org; Santhosh M S
> Cc: "billgra...@gmail.com"
> Sent: Thursday, November 29, 2012 9:45 AM
> Subject: Re: Our release process
>
> The release branch receives only bug fixes. Patch level releases (3rd
> version number) are issued out of the release branch and introduce
> only bug fixes and no new features.
> Deciding whether a patch is applied to the release branch is based on
> preserving stability (as Bill said). If a patch makes the release
> branch unstable, we revert it.
> New features are added to trunk where new major and minor releases will 
> happen.
> If we need a new feature out then we make a new minor release.
> Doing frequent releases is the industry standard and will resolve
> conflicts around what should go in a release branch.
>
> Making a new release is currently painful *because* we wait so long in
> between two releases. Let's fix that.
>
> Julien
>
> On Wed, Nov 28, 2012 at 10:09 PM, Santhosh M S wrote:
>> Since releasing a major version once a month is aggressive and we have not 
>> released on a quarterly basis, we should allow commits to a released branch 
>> to facilitate dot releases.
>>
>> If we are allowing commits to a released branch, the criteria for inclusion 
>> can be created anew or we use the industry standards for severity (or 
>> priority). It could be painful for a few folks but I don't see better 
>> alternatives.
>>
>> Regarding reverting commits based on e2e tests breaking:
>>         1. Who is running the tests?
>>         2. How often are they run?
>> If we have nightly e2e runs then it's easier to catch these errors early. If 
>> not, the barrier for inclusion is pretty high and time-consuming, making it 
>> harder to develop.
>>
>> Santhosh
>>
>>
>> 
>>  From: Bill Graham 
>> To: dev@pig.apache.org
>> Sent: Wednesday, November 28, 2012 11:39 AM
>> Subject: Re: Our release process
>>
>> I agree releasing often is ideal, but releasing major versions once a month
>> would be a bit aggressive.
>>
>> +1 to Olga's initial definition of how Yahoo

[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2012-12-04 Thread Joseph Adler (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509992#comment-13509992
 ] 

Joseph Adler commented on PIG-3015:
---

I think that approach makes sense; each object in a file should be wrapped in a 
Tuple. Suppose that a file example.avro contained the data:

  {[1, 2, 3, 4, 5]}
  {[6, 7, 8, 9, 10]}

and had this schema: {"name" : "IntArray", "type" : "array", "items" : "int"}, 
and we loaded this as

  A = LOAD 'example.avro' USING AvroStorage;

The bag A would have the Pig schema A:{(IntArray:{(int)})}; it would contain 
two tuples, which would in turn each contain one bag of integers. Does that 
sound correct? If so, I'll go implement that.
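
To make the proposed mapping concrete, here is a rough Python sketch. It models a Pig tuple as a Python tuple and a Pig bag as a list of tuples, which is only an approximation; the function names are hypothetical and this is not code from the patch.

```python
def to_pig(value):
    """Convert an Avro-like value to a Pig-like value.

    An Avro array becomes a bag: a list in which every item is wrapped in a
    single-field tuple. Other values are passed through unchanged.
    """
    if isinstance(value, list):
        return [(to_pig(item),) for item in value]
    return value


def load(data):
    """Wrap each top-level (non-record) datum in its own single-field tuple."""
    return [(to_pig(datum),) for datum in data]


# Two array objects in one file -> two tuples, each holding one bag of ints.
print(load([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]))

# A file of plain ints -> one single-field tuple per value.
print(load([1, 2, 3]))
```

The second call corresponds to the behaviour of the current AvroStorage described elsewhere in this thread, where the values 1, 2, 3 are loaded as (1), (2), (3).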





[jira] [Updated] (PIG-3072) Pig job reporting negative progress

2012-12-04 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated PIG-3072:
--

Attachment: pig-3072-v04.txt

bq. Can you use HadoopShims to create the TaskAttemptContext in your test. The 
test fails to compile with H23.

Thanks Rohini.  Uploading another patch with your suggestion.  

Ran both
$ ant clean test -Dtestcase=TestTmpFileCompression
$ ant -Dhadoopversion=23 clean test -Dtestcase=TestTmpFileCompression




[jira] [Created] (PIG-3076) make TestScalarAliases more reliable

2012-12-04 Thread Julien Le Dem (JIRA)
Julien Le Dem created PIG-3076:
--

 Summary: make TestScalarAliases more reliable
 Key: PIG-3076
 URL: https://issues.apache.org/jira/browse/PIG-3076
 Project: Pig
  Issue Type: Test
Reporter: Julien Le Dem
Assignee: Julien Le Dem
 Fix For: 0.11, 0.12


Currently, this test writes in the root directory so its output is not deleted 
by ant clean.
Also it deletes its output at the end instead of the beginning.
The consequence is that if the test fails once then it will keep failing until 
the directory is manually cleaned up (not good for CI)



[jira] [Commented] (PIG-3015) Rewrite of AvroStorage

2012-12-04 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509964#comment-13509964
 ] 

Cheolsoo Park commented on PIG-3015:


Hi Joe,

Thanks for your prompt response!

To answer your questions,
{quote}
I have always assumed that AvroStorage was designed to be used with Hadoop 
sequence files that contained a series of records, so I implemented AvroStorage 
to only work with a file in this format. Are there cases where the highest 
level schema for a file will be another type? If so... what does that mean for 
pig? Is there one record per file?
{quote}
This is a good question, and I see your argument. But this will be very 
different from what the current AvroStorage does. Currently, a non-record type 
is automatically wrapped in a tuple. For example, "1" is loaded as (1) in Pig. 
If a file includes multiple values, they are loaded as multiple tuples as 
follows:
{code:title=avro}
cheolsoo@localhost:~/workspace/avro $java -jar avro-tools-1.5.4.jar getschema multiple_int.avro
"int"
cheolsoo@localhost:~/workspace/avro $java -jar avro-tools-1.5.4.jar tojson multiple_int.avro
1
2
3
{code}
{code:title=pig}
in = LOAD 'multiple_int.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
DUMP in;
(1)
(2)
(3)
{code}
Agreed that we can tell users that the top-level schema must be a record type, 
but I am afraid that people might not agree. In my experience, people tend to 
think that every valid Avro file should be able to be loaded by AvroStorage. 
Granted, there exist some restrictions (e.g. recursive records and unions), but 
even these restrictions have been loosened recently. Unless there is a 
convincing reason not to, I think that we should keep it that way.

In many cases, people already have a data pipeline in place (e.g. Flume 
produces Avro files => Pig consumes Avro files), and it is not guaranteed that 
the top-level schema is always a record type.
{quote}
Here's a specific example: suppose that we have this schema:
\{"name" : "IntArray", "type" : "array", "items" : "int"\}
Suppose that we have 3 files to load, each with this schema, each containing an 
array of 10 integers. Should we load this into pig as a single bag with 30 
integers? A bag containing three bags (each, in turn, containing 10 integers)? 
Or reject this file entirely?
{quote}
Currently, they are loaded as 3 tuples, and each tuple contains a bag of 10 
integers.
{code}
({(1),(2), ... ,(10)})
({(1),(2), ... ,(10)})
({(1),(2), ... ,(10)})
{code}
Thoughts?




[jira] [Updated] (PIG-2812) Spill InternalCachedBag into only 1 file

2012-12-04 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PIG-2812:
---

Fix Version/s: (was: 0.11)

I'm detaching this from pig-0.11 as it is not ready yet

> Spill InternalCachedBag into only 1 file
> 
>
> Key: PIG-2812
> URL: https://issues.apache.org/jira/browse/PIG-2812
> Project: Pig
>  Issue Type: Bug
>  Components: data
>Reporter: Haitao Yao
>Assignee: Haitao Yao
> Attachments: aa.jpg, spill.patch
>
>
> I encountered a reducer's OOM because of java.io.DeleteOnExitHook. And I 
> found out that the InternalCachedBag creates a separate tmp file, and the tmp 
> files are deleted on exit. So the file delete hook caused the OOM. 
> Why not just hold the tmp file handle and spill only one tmp file?
> Too many tmp files may block the tasktracker start process, if the tmp files 
> are not cleaned on time and the tasktracker restarts at this specific time.



[jira] [Commented] (PIG-3072) Pig job reporting negative progress

2012-12-04 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509920#comment-13509920
 ] 

Rohini Palaniswamy commented on PIG-3072:
-

Koji,
   Can you use HadoopShims to create the TaskAttemptContext in your test? The 
test fails to compile with H23.

{noformat}
[javac] /apache/pig/trunk/test/org/apache/pig/test/TestTmpFileCompression.java:369: org.apache.hadoop.mapreduce.TaskAttemptContext is abstract; cannot be instantiated
[javac] new TaskAttemptContext(conf, new TaskAttemptID()));
{noformat}

> Pig job reporting negative progress
> ---
>
> Key: PIG-3072
> URL: https://issues.apache.org/jira/browse/PIG-3072
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.10.0
>Reporter: Koji Noguchi
>Assignee: Koji Noguchi
>Priority: Minor
> Fix For: 0.12
>
> Attachments: pig-3072-v01.txt, pig-3072-v02.txt, pig-3072-v03.txt
>
>
> Our users pointed out that their jobs were reporting negative progress.
> 2012-11-02 21:43:11,538 [main] INFO 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - -795% complete
> ...
> (due to TFileRecordReader)
