[jira] [Updated] (PIG-2599) Mavenize Pig

2013-03-20 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2599:


Description: 
Switch Pig build system from ant to maven.

This is a candidate project for Google summer of code 2013. More information 
about the program can be found at 
https://cwiki.apache.org/confluence/display/PIG/GSoc2013

  was:
Switch Pig build system from ant to maven.




> Mavenize Pig
> 
>
> Key: PIG-2599
> URL: https://issues.apache.org/jira/browse/PIG-2599
> Project: Pig
>  Issue Type: New Feature
>  Components: build
>Reporter: Daniel Dai
>  Labels: gsoc2013
> Fix For: 0.12
>
> Attachments: maven-pig.1.zip
>
>
> Switch Pig build system from ant to maven.
> This is a candidate project for Google summer of code 2013. More information 
> about the program can be found at 
> https://cwiki.apache.org/confluence/display/PIG/GSoc2013

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Put a "Google summer of code 2013" cwiki page

2013-03-20 Thread Daniel Dai
https://cwiki.apache.org/confluence/display/PIG/GSoc2013

Feel free to add more projects which could fit in the timeline of a
student summer project.

I remember there are several projects we discussed at our last meetup:
* Allow Pig to use Hive UDFs. Alan, do we have a ticket for that?
* A general framework for Pig performance testing. Rohini, do we have a ticket?

Thanks,
Daniel


[jira] [Updated] (PIG-2599) Mavenize Pig

2013-03-20 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2599:


Labels: gsoc2013  (was: )

> Mavenize Pig
> 
>
> Key: PIG-2599
> URL: https://issues.apache.org/jira/browse/PIG-2599
> Project: Pig
>  Issue Type: New Feature
>  Components: build
>Reporter: Daniel Dai
>  Labels: gsoc2013
> Fix For: 0.12
>
> Attachments: maven-pig.1.zip
>
>
> Switch Pig build system from ant to maven.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2586) A better plan/data flow visualizer

2013-03-20 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2586:


Description: 
Pig supports a dot-graph-style plan to visualize the logical/physical/mapreduce 
plan (explain with the -dot option, see 
http://ofps.oreilly.com/titles/9781449302641/developing_and_testing.html). 
However, the dot graph takes an extra step to generate the plan graph, and the 
quality of the output is not good. It would be better if we could implement a 
better visualizer for Pig. It should:
1. show operator type and alias
2. turn on/off output schema
3. dive into foreach inner plans on demand
4. provide a way to show operator source code, e.g. a tooltip on an operator 
(plans don't currently have this information, but you can assume this is in place)
5. besides visualizing the logical/physical/mapreduce plan, visualizing the 
script itself is also useful
6. may rely on some Java graphics library such as Swing

This is a candidate project for Google summer of code 2013. More information 
about the program can be found at 
https://cwiki.apache.org/confluence/display/PIG/GSoc2013

  was:
Pig supports a dot-graph-style plan to visualize the logical/physical/mapreduce 
plan (explain with the -dot option, see 
http://ofps.oreilly.com/titles/9781449302641/developing_and_testing.html). 
However, the dot graph takes an extra step to generate the plan graph, and the 
quality of the output is not good. It would be better if we could implement a 
better visualizer for Pig. It should:
1. show operator type and alias
2. turn on/off output schema
3. dive into foreach inner plans on demand
4. provide a way to show operator source code, e.g. a tooltip on an operator 
(plans don't currently have this information, but you can assume this is in place)
5. besides visualizing the logical/physical/mapreduce plan, visualizing the 
script itself is also useful
6. may rely on some Java graphics library such as Swing

This is a candidate project for Google summer of code 2012. More information 
about the program can be found at 
https://cwiki.apache.org/confluence/display/PIG/GSoc2012


> A better plan/data flow visualizer
> --
>
> Key: PIG-2586
> URL: https://issues.apache.org/jira/browse/PIG-2586
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Daniel Dai
>  Labels: gsoc2013
>
> Pig supports a dot graph style plan to visualize the 
> logical/physical/mapreduce plan (explain with -dot option, see 
> http://ofps.oreilly.com/titles/9781449302641/developing_and_testing.html). 
> However, dot graph takes extra step to generate the plan graph and the 
> quality of the output is not good. It's better we can implement a better 
> visualizer for Pig. It should:
> 1. show operator type and alias
> 2. turn on/off output schema
> 3. dive into foreach inner plan on demand
> 4. provide a way to show operator source code, eg, tooltip of an operator 
> (plan don't currently have this information, but you can assume this is in 
> place)
> 5. besides visualize logical/physical/mapreduce plan, visualize the script 
> itself is also useful
> 6. may rely on some java graphic library such as Swing
> This is a candidate project for Google summer of code 2013. More information 
> about the program can be found at 
> https://cwiki.apache.org/confluence/display/PIG/GSoc2013

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2586) A better plan/data flow visualizer

2013-03-20 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-2586:


Labels: gsoc2013  (was: gsoc2012)

> A better plan/data flow visualizer
> --
>
> Key: PIG-2586
> URL: https://issues.apache.org/jira/browse/PIG-2586
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Daniel Dai
>  Labels: gsoc2013
>
> Pig supports a dot graph style plan to visualize the 
> logical/physical/mapreduce plan (explain with -dot option, see 
> http://ofps.oreilly.com/titles/9781449302641/developing_and_testing.html). 
> However, dot graph takes extra step to generate the plan graph and the 
> quality of the output is not good. It's better we can implement a better 
> visualizer for Pig. It should:
> 1. show operator type and alias
> 2. turn on/off output schema
> 3. dive into foreach inner plan on demand
> 4. provide a way to show operator source code, eg, tooltip of an operator 
> (plan don't currently have this information, but you can assume this is in 
> place)
> 5. besides visualize logical/physical/mapreduce plan, visualize the script 
> itself is also useful
> 6. may rely on some java graphic library such as Swing
> This is a candidate project for Google summer of code 2012. More information 
> about the program can be found at 
> https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-1271) Provide a more flexible data format to load complex field (bag/tuple/map) in PigStorage

2013-03-20 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1271:


Description: 
With [PIG-613|https://issues.apache.org/jira/browse/PIG-613], we are able to 
load text files containing complex data types (map/bag/tuple) according to a 
schema. However, the format of a complex data field is very strict. Users have 
to use pre-determined special characters to mark the beginning and end of each 
field, and those special characters cannot be used in the content. The goals 
of this issue are:

1. Provide a way for users to escape special characters
2. Make it easy for users to customize Utf8StorageConverter when they have 
their own data format

This is a candidate project for Google summer of code 2013. More information 
about the program can be found at 
https://cwiki.apache.org/confluence/display/PIG/GSoc2013

  was:
With [PIG-613|https://issues.apache.org/jira/browse/PIG-613], we are able to 
load text files containing complex data types (map/bag/tuple) according to a 
schema. However, the format of a complex data field is very strict. Users have 
to use pre-determined special characters to mark the beginning and end of each 
field, and those special characters cannot be used in the content. The goals 
of this issue are:

1. Provide a way for users to escape special characters
2. Make it easy for users to customize Utf8StorageConverter when they have 
their own data format

This is a candidate project for Google summer of code 2012. More information 
about the program can be found at 
https://cwiki.apache.org/confluence/display/PIG/GSoc2012


> Provide a more flexible data format to load complex field (bag/tuple/map) in 
> PigStorage
> ---
>
> Key: PIG-1271
> URL: https://issues.apache.org/jira/browse/PIG-1271
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
>  Labels: gsoc2013
>
> With [PIG-613|https://issues.apache.org/jira/browse/PIG-613], we are able to 
> load txt files containing complex data type (map/bag/tuple) according to 
> schema. However, the format of complex data field is very strict. User have 
> to use pre-determined special characters to mark the beginning and end of 
> each field, and those special characters can not be used in the content. The 
> goals of this issue are:
> 1. Provide a way for user to escape special characters
> 2. Make it easy for users to customize Utf8StorageConverter when they have 
> their own data format
> This is a candidate project for Google summer of code 2013. More information 
> about the program can be found at 
> https://cwiki.apache.org/confluence/display/PIG/GSoc2013

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-1271) Provide a more flexible data format to load complex field (bag/tuple/map) in PigStorage

2013-03-20 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1271:


Labels: gsoc2013  (was: gsoc2012)

> Provide a more flexible data format to load complex field (bag/tuple/map) in 
> PigStorage
> ---
>
> Key: PIG-1271
> URL: https://issues.apache.org/jira/browse/PIG-1271
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.7.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
>  Labels: gsoc2013
>
> With [PIG-613|https://issues.apache.org/jira/browse/PIG-613], we are able to 
> load txt files containing complex data type (map/bag/tuple) according to 
> schema. However, the format of complex data field is very strict. User have 
> to use pre-determined special characters to mark the beginning and end of 
> each field, and those special characters can not be used in the content. The 
> goals of this issue are:
> 1. Provide a way for user to escape special characters
> 2. Make it easy for users to customize Utf8StorageConverter when they have 
> their own data format
> This is a candidate project for Google summer of code 2012. More information 
> about the program can be found at 
> https://cwiki.apache.org/confluence/display/PIG/GSoc2012

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (PIG-3256) Upgrade jython to 2.5.3 (legal concern)

2013-03-20 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai resolved PIG-3256.
-

  Resolution: Fixed
Hadoop Flags: Reviewed

Patch committed to 0.11 branch and trunk. Thanks Rohini!

> Upgrade jython to 2.5.3 (legal concern)
> ---
>
> Key: PIG-3256
> URL: https://issues.apache.org/jira/browse/PIG-3256
> Project: Pig
>  Issue Type: Bug
>Reporter: Daniel Dai
>Assignee: Daniel Dai
>Priority: Critical
> Fix For: 0.12, 0.11.1
>
> Attachments: PIG-3256-1.patch, PIG-3256-2.patch, PIG-3256-3.patch, 
> PIG-3256-4.patch
>
>
> When we review the legal documents with Microsoft for Windows work, here is 
> the recommend from lawyer:
> Jython 2.5.2 redistributes an external LGPL component (JNA) in a manner that 
> puts Jython out of compliance with the LGPL. As such dependent components 
> like Pig are also arguably out of compliance with the LGPL.
> It appears that this has been quietly found and fixed by the Jython guys, as 
> version 2.5.3 does not include the JNA component. However the status of 2.5.2 
> with respect to the LGPL and Apache legal is still unclear.
> The easiest way to remediate this whole problem is to simply move Pig to 
> Jython 2.5.3 and remove the question.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] Subscription: PIG patch available

2013-03-20 Thread jira
Issue Subscription
Filter: PIG patch available (31 issues)

Subscriber: pigdaily

Key Summary
PIG-3247 Piggybank functions to mimic OVER clause in SQL
https://issues.apache.org/jira/browse/PIG-3247
PIG-3238 Pig current releases lack a UDF Stuff(). This UDF deletes a specified 
length of characters and inserts another set of characters at a specified 
starting point.
https://issues.apache.org/jira/browse/PIG-3238
PIG-3237 Pig current releases lack a UDF MakeSet(). This UDF returns a set 
value (a string containing substrings separated by "," characters) consisting 
of the strings that have the corresponding bit in the first argument
https://issues.apache.org/jira/browse/PIG-3237
PIG-3215 [piggybank] Add LTSVLoader to load LTSV (Labeled Tab-separated 
Values) files
https://issues.apache.org/jira/browse/PIG-3215
PIG-3210 Pig fails to start when it cannot write log to log files
https://issues.apache.org/jira/browse/PIG-3210
PIG-3198 Let users use any function from PigType -> PigType as if it were 
builtlin
https://issues.apache.org/jira/browse/PIG-3198
PIG-3193 Fix "ant docs" warnings
https://issues.apache.org/jira/browse/PIG-3193
PIG-3190 Add LuceneTokenizer and SnowballTokenizer to Pig - useful text 
tokenization
https://issues.apache.org/jira/browse/PIG-3190
PIG-3183 rm or rmf commands should respect globbing/regex of path
https://issues.apache.org/jira/browse/PIG-3183
PIG-3173 Partition filter push down does not happen partition keys condition 
include a AND and OR construct
https://issues.apache.org/jira/browse/PIG-3173
PIG-3166 Update eclipse .classpath according to ivy library.properties
https://issues.apache.org/jira/browse/PIG-3166
PIG-3164 Pig current releases lack a UDF endsWith.This UDF tests if a given 
string ends with the specified suffix.
https://issues.apache.org/jira/browse/PIG-3164
PIG-3141 Giving CSVExcelStorage an option to handle header rows
https://issues.apache.org/jira/browse/PIG-3141
PIG-3123 Simplify Logical Plans By Removing Unneccessary Identity Projections
https://issues.apache.org/jira/browse/PIG-3123
PIG-3122 Operators should not implicitly become reserved keywords
https://issues.apache.org/jira/browse/PIG-3122
PIG-3114 Duplicated macro name error when using pigunit
https://issues.apache.org/jira/browse/PIG-3114
PIG-3105 Fix TestJobSubmission unit test failure.
https://issues.apache.org/jira/browse/PIG-3105
PIG-3088 Add a builtin udf which removes prefixes
https://issues.apache.org/jira/browse/PIG-3088
PIG-3069 Native Windows Compatibility for Pig E2E Tests and Harness
https://issues.apache.org/jira/browse/PIG-3069
PIG-3028 testGrunt dev test needs some command filters to run correctly 
without cygwin
https://issues.apache.org/jira/browse/PIG-3028
PIG-3027 pigTest unit test needs a newline filter for comparisons of golden 
multi-line
https://issues.apache.org/jira/browse/PIG-3027
PIG-3026 Pig checked-in baseline comparisons need a pre-filter to address 
OS-specific newline differences
https://issues.apache.org/jira/browse/PIG-3026
PIG-3024 TestEmptyInputDir unit test - hadoop version detection logic is 
brittle
https://issues.apache.org/jira/browse/PIG-3024
PIG-3015 Rewrite of AvroStorage
https://issues.apache.org/jira/browse/PIG-3015
PIG-3010 Allow UDF's to flatten themselves
https://issues.apache.org/jira/browse/PIG-3010
PIG-2959 Add a pig.cmd for Pig to run under Windows
https://issues.apache.org/jira/browse/PIG-2959
PIG-2955 Fix bunch of Pig e2e tests on Windows
https://issues.apache.org/jira/browse/PIG-2955
PIG-2643 Use bytecode generation to make a performance replacement for 
InvokeForLong, InvokeForString, etc
https://issues.apache.org/jira/browse/PIG-2643
PIG-2641 Create toJSON function for all complex types: tuples, bags and maps
https://issues.apache.org/jira/browse/PIG-2641
PIG-2591 Unit tests should not write to /tmp but respect java.io.tmpdir
https://issues.apache.org/jira/browse/PIG-2591
PIG-1914 Support load/store JSON data in Pig
https://issues.apache.org/jira/browse/PIG-1914

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384


Bytes to Long/Integer conversions

2013-03-20 Thread Prashant Kommireddi
Daniel and I were discussing the way Pig currently does these conversions
and whether we could simplify/optimize them further.

Long ret = null;
if (sanityCheckIntegerLong(s)) {
    try {
        ret = Long.valueOf(s);
    } catch (NumberFormatException nfe) {
    }
}


The code looks to see if all characters are numeric and then does a
conversion to Long.

private static boolean sanityCheckIntegerLong(String number) {
    for (int i = 0; i < number.length(); i++) {
        if (number.charAt(i) >= '0' && number.charAt(i) <= '9'
                || i == 0 && number.charAt(i) == '-') {
            // valid character
        } else {
            // contains invalid characters, must not be an integer or long
            return false;
        }
    }
    return true;
}

If the input is not numeric (1234abcd), the code still calls
Double.valueOf(String) before finally returning null. Any script that
inadvertently (user's mistake or not) tries to cast an alpha-numeric
column to int or long would result in many wasteful calls.

I think we can avoid this by only falling back to Double.valueOf(String) when
the input looks like a decimal number (1234.56), and returning null otherwise
without calling it at all.
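
A minimal sketch of that flow, for illustration only (the helper and method
names below are hypothetical, not the actual Utf8StorageConverter/CastUtils
code):

// Sketch: reuse the sanityCheckIntegerLong() check above; only strings that
// look like decimals (e.g. "1234.56") ever reach Double.valueOf().
private static Long bytesToLongSketch(String s) {
    if (sanityCheckIntegerLong(s)) {          // all digits, optional leading '-'
        try {
            return Long.valueOf(s);
        } catch (NumberFormatException nfe) {
            // fall through to the decimal path (overflow is still possible)
        }
    }
    if (looksLikeDecimal(s)) {                // digits with a single '.'
        try {
            return Double.valueOf(s).longValue();
        } catch (NumberFormatException nfe) {
            return null;
        }
    }
    return null;                              // "1234abcd" never reaches Double.valueOf()
}

private static boolean looksLikeDecimal(String s) {
    boolean seenDot = false;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (i == 0 && c == '-') continue;
        if (c == '.' && !seenDot) { seenDot = true; continue; }
        if (c < '0' || c > '9') return false;
    }
    return seenDot;
}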

Thoughts/concerns? Just want to make sure such a change does not break
backward-compatibility.

-Prashant


[jira] [Commented] (PIG-3256) Upgrade jython to 2.5.3 (legal concern)

2013-03-20 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608469#comment-13608469
 ] 

Rohini Palaniswamy commented on PIG-3256:
-

+1

> Upgrade jython to 2.5.3 (legal concern)
> ---
>
> Key: PIG-3256
> URL: https://issues.apache.org/jira/browse/PIG-3256
> Project: Pig
>  Issue Type: Bug
>Reporter: Daniel Dai
>Assignee: Daniel Dai
>Priority: Critical
> Fix For: 0.12, 0.11.1
>
> Attachments: PIG-3256-1.patch, PIG-3256-2.patch, PIG-3256-3.patch, 
> PIG-3256-4.patch
>
>
> When we review the legal documents with Microsoft for Windows work, here is 
> the recommend from lawyer:
> Jython 2.5.2 redistributes an external LGPL component (JNA) in a manner that 
> puts Jython out of compliance with the LGPL. As such dependent components 
> like Pig are also arguably out of compliance with the LGPL.
> It appears that this has been quietly found and fixed by the Jython guys, as 
> version 2.5.3 does not include the JNA component. However the status of 2.5.2 
> with respect to the LGPL and Apache legal is still unclear.
> The easiest way to remediate this whole problem is to simply move Pig to 
> Jython 2.5.3 and remove the question.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3256) Upgrade jython to 2.5.3 (legal concern)

2013-03-20 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3256:


Attachment: PIG-3256-4.patch

Attached new patch. Thanks again!

> Upgrade jython to 2.5.3 (legal concern)
> ---
>
> Key: PIG-3256
> URL: https://issues.apache.org/jira/browse/PIG-3256
> Project: Pig
>  Issue Type: Bug
>Reporter: Daniel Dai
>Assignee: Daniel Dai
>Priority: Critical
> Fix For: 0.12, 0.11.1
>
> Attachments: PIG-3256-1.patch, PIG-3256-2.patch, PIG-3256-3.patch, 
> PIG-3256-4.patch
>
>
> When we review the legal documents with Microsoft for Windows work, here is 
> the recommend from lawyer:
> Jython 2.5.2 redistributes an external LGPL component (JNA) in a manner that 
> puts Jython out of compliance with the LGPL. As such dependent components 
> like Pig are also arguably out of compliance with the LGPL.
> It appears that this has been quietly found and fixed by the Jython guys, as 
> version 2.5.3 does not include the JNA component. However the status of 2.5.2 
> with respect to the LGPL and Apache legal is still unclear.
> The easiest way to remediate this whole problem is to simply move Pig to 
> Jython 2.5.3 and remove the question.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3193) Fix "ant docs" warnings

2013-03-20 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3193:
---

Assignee: Cheolsoo Park
  Status: Patch Available  (was: Open)

> Fix "ant docs" warnings
> ---
>
> Key: PIG-3193
> URL: https://issues.apache.org/jira/browse/PIG-3193
> Project: Pig
>  Issue Type: Bug
>  Components: build, documentation
>Affects Versions: 0.11
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>  Labels: newbie
> Fix For: 0.12
>
> Attachments: PIG-3193.patch
>
>
> I see many warnings every time when I run "ant clean docs". They don't break 
> build, but it would be nice if we could clean them if possible.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3193) Fix "ant docs" warnings

2013-03-20 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3193:
---

Attachment: PIG-3193.patch

I am attaching a patch that includes the following changes:
* Fixed Javadoc symbol not found warnings. For example,
{code}
[javadoc] 
/home/cheolsoo/workspace/pig-svn/src/org/apache/pig/tools/grunt/GruntParser.java:302:
 cannot find symbol
{code}
I updated the "javadoc-all" target in build.xml to fix the classpath.
* Fixed Forrest plugin installation error:
{code}
[exec] Can't get 
http://forrest.apache.org/plugins//0.9/org.apache.forrest.plugin.input.simplifiedDocbook.zip
 to 
/home/cheolsoo/workspace/apache-forrest-0.9/build/plugins/org.apache.forrest.plugin.input.simplifiedDocbook.zip
{code}
Forrest 0.9 no longer has the "simplifiedDocbook" plugin, so I removed it 
from "project.required.plugins" in forrest.properties.
* Removed a legacy workaround for Forrest 0.8.
{code}
# PIG-1508: Workaround for http://issues.apache.org/jira/browse/FOR-984
# Remove when forrest-0.9 is available
forrest.validate.sitemap=false
{code} 
Apache Jenkins uses Forrest 0.9, so this should be fine.
* Fixed Forrest unresolved id warnings. For example, 
{code}
[exec] WARN - Page 2: Unresolved id reference "Case-Sensitivity" found.
{code}
These warnings result in broken links in the HTML docs.

I am NOT fixing the following warnings.
* Javadoc comment warnings. e.g.
{code}
LoadCaster.java:143: warning - @param argument "fieldSchema" is not a parameter 
name.
{code}
There are 165 of them, so I will address them in a separate jira.
* Forrest overflow warnings. e.g.
{code}
[exec] WARN - Line 1 of a paragraph overflows the available area by 1750mpt. 
(fo:block, "chararray")
{code}
This is because the text of the source tag is too wide, and it is apparently a 
bug in Forrest 0.9 (FOR-1104).

> Fix "ant docs" warnings
> ---
>
> Key: PIG-3193
> URL: https://issues.apache.org/jira/browse/PIG-3193
> Project: Pig
>  Issue Type: Bug
>  Components: build, documentation
>Affects Versions: 0.11
>Reporter: Cheolsoo Park
>  Labels: newbie
> Fix For: 0.12
>
> Attachments: PIG-3193.patch
>
>
> I see many warnings every time when I run "ant clean docs". They don't break 
> build, but it would be nice if we could clean them if possible.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3256) Upgrade jython to 2.5.3 (legal concern)

2013-03-20 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608455#comment-13608455
 ] 

Rohini Palaniswamy commented on PIG-3256:
-

Sorry, one last comment; I should have caught this earlier. It should be 
project.getProperty('jython.version') instead of ${jython.version}, since this 
is JavaScript. And thanks for fixing this. It was a miss on my part to hardcode 
the version.

{code}
replace(":" + project.getProperty('ivy.default.ivy.user.dir') + 
"/cache/org.python/jython-standalone/jars/jython-standalone-" + 
project.getProperty('jython.version') + ".jar", ""));
{code}

> Upgrade jython to 2.5.3 (legal concern)
> ---
>
> Key: PIG-3256
> URL: https://issues.apache.org/jira/browse/PIG-3256
> Project: Pig
>  Issue Type: Bug
>Reporter: Daniel Dai
>Assignee: Daniel Dai
>Priority: Critical
> Fix For: 0.12, 0.11.1
>
> Attachments: PIG-3256-1.patch, PIG-3256-2.patch, PIG-3256-3.patch
>
>
> When we review the legal documents with Microsoft for Windows work, here is 
> the recommend from lawyer:
> Jython 2.5.2 redistributes an external LGPL component (JNA) in a manner that 
> puts Jython out of compliance with the LGPL. As such dependent components 
> like Pig are also arguably out of compliance with the LGPL.
> It appears that this has been quietly found and fixed by the Jython guys, as 
> version 2.5.3 does not include the JNA component. However the status of 2.5.2 
> with respect to the LGPL and Apache legal is still unclear.
> The easiest way to remediate this whole problem is to simply move Pig to 
> Jython 2.5.3 and remove the question.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3249) Pig startup script prints out a wrong version of hadoop when using fat jar

2013-03-20 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3249:


  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

+1, patch committed to trunk.

> Pig startup script prints out a wrong version of hadoop when using fat jar
> --
>
> Key: PIG-3249
> URL: https://issues.apache.org/jira/browse/PIG-3249
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11
>Reporter: Prashant Kommireddi
>Assignee: Prashant Kommireddi
>  Labels: newbie
> Fix For: 0.12
>
> Attachments: PIG-3249.patch
>
>
> Script suggests 0.20.2 is used with the bundled jar but we are using 1.0 at 
> the moment.
> {code}
> # fall back to use fat pig.jar
> if [ "$debug" == "true" ]; then
> echo "Cannot find local hadoop installation, using bundled hadoop 
> 20.2"
> fi
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3256) Upgrade jython to 2.5.3 (legal concern)

2013-03-20 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3256:


Attachment: PIG-3256-3.patch

Yes, missed that. Uploaded again.

> Upgrade jython to 2.5.3 (legal concern)
> ---
>
> Key: PIG-3256
> URL: https://issues.apache.org/jira/browse/PIG-3256
> Project: Pig
>  Issue Type: Bug
>Reporter: Daniel Dai
>Assignee: Daniel Dai
>Priority: Critical
> Fix For: 0.12, 0.11.1
>
> Attachments: PIG-3256-1.patch, PIG-3256-2.patch, PIG-3256-3.patch
>
>
> When we review the legal documents with Microsoft for Windows work, here is 
> the recommend from lawyer:
> Jython 2.5.2 redistributes an external LGPL component (JNA) in a manner that 
> puts Jython out of compliance with the LGPL. As such dependent components 
> like Pig are also arguably out of compliance with the LGPL.
> It appears that this has been quietly found and fixed by the Jython guys, as 
> version 2.5.3 does not include the JNA component. However the status of 2.5.2 
> with respect to the LGPL and Apache legal is still unclear.
> The easiest way to remediate this whole problem is to simply move Pig to 
> Jython 2.5.3 and remove the question.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3249) Pig startup script prints out a wrong version of hadoop when using fat jar

2013-03-20 Thread Prashant Kommireddi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Kommireddi updated PIG-3249:
-

Patch Info: Patch Available

> Pig startup script prints out a wrong version of hadoop when using fat jar
> --
>
> Key: PIG-3249
> URL: https://issues.apache.org/jira/browse/PIG-3249
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11
>Reporter: Prashant Kommireddi
>Assignee: Prashant Kommireddi
>  Labels: newbie
> Fix For: 0.12
>
> Attachments: PIG-3249.patch
>
>
> Script suggests 0.20.2 is used with the bundled jar but we are using 1.0 at 
> the moment.
> {code}
> # fall back to use fat pig.jar
> if [ "$debug" == "true" ]; then
> echo "Cannot find local hadoop installation, using bundled hadoop 
> 20.2"
> fi
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3256) Upgrade jython to 2.5.3 (legal concern)

2013-03-20 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608437#comment-13608437
 ] 

Rohini Palaniswamy commented on PIG-3256:
-

 is already there in 
build.xml. So the newly introduced line  can be removed. 

> Upgrade jython to 2.5.3 (legal concern)
> ---
>
> Key: PIG-3256
> URL: https://issues.apache.org/jira/browse/PIG-3256
> Project: Pig
>  Issue Type: Bug
>Reporter: Daniel Dai
>Assignee: Daniel Dai
>Priority: Critical
> Fix For: 0.12, 0.11.1
>
> Attachments: PIG-3256-1.patch, PIG-3256-2.patch
>
>
> When we review the legal documents with Microsoft for Windows work, here is 
> the recommend from lawyer:
> Jython 2.5.2 redistributes an external LGPL component (JNA) in a manner that 
> puts Jython out of compliance with the LGPL. As such dependent components 
> like Pig are also arguably out of compliance with the LGPL.
> It appears that this has been quietly found and fixed by the Jython guys, as 
> version 2.5.3 does not include the JNA component. However the status of 2.5.2 
> with respect to the LGPL and Apache legal is still unclear.
> The easiest way to remediate this whole problem is to simply move Pig to 
> Jython 2.5.3 and remove the question.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3110) pig corrupts chararrays with trailing whitespace when converting them to long

2013-03-20 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3110:


  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

Sounds good. 

Patch committed to trunk. Thanks Prashant!

> pig corrupts chararrays with trailing whitespace when converting them to long
> -
>
> Key: PIG-3110
> URL: https://issues.apache.org/jira/browse/PIG-3110
> Project: Pig
>  Issue Type: Bug
>  Components: data
>Affects Versions: 0.10.0
>Reporter: Ido Hadanny
>Assignee: Prashant Kommireddi
> Fix For: 0.12
>
> Attachments: PIG-3110.patch
>
>
> when trying to convert the following string into long, pig corrupts it. data:
> 1703598819951657279 ,44081037
> data1 = load 'data' using CSVLoader as (a: chararray ,b: int);
> data2 = foreach data1 generate (long)a as a;
> dump data2;
> (1703598819951657216)<--- last 2 digits are corrupted
> data2 = foreach data1 generate (long)TRIM(a) as a;
> dump data2;
> (1703598819951657279)<--- correct

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3256) Upgrade jython to 2.5.3 (legal concern)

2013-03-20 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3256:


Attachment: PIG-3256-2.patch

Thanks Rohini. Updated the patch to include your suggestion.

I would like to commit to 0.11 as well. The legal concern should be treated as 
critical.

> Upgrade jython to 2.5.3 (legal concern)
> ---
>
> Key: PIG-3256
> URL: https://issues.apache.org/jira/browse/PIG-3256
> Project: Pig
>  Issue Type: Bug
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.12
>
> Attachments: PIG-3256-1.patch, PIG-3256-2.patch
>
>
> When we review the legal documents with Microsoft for Windows work, here is 
> the recommend from lawyer:
> Jython 2.5.2 redistributes an external LGPL component (JNA) in a manner that 
> puts Jython out of compliance with the LGPL. As such dependent components 
> like Pig are also arguably out of compliance with the LGPL.
> It appears that this has been quietly found and fixed by the Jython guys, as 
> version 2.5.3 does not include the JNA component. However the status of 2.5.2 
> with respect to the LGPL and Apache legal is still unclear.
> The easiest way to remediate this whole problem is to simply move Pig to 
> Jython 2.5.3 and remove the question.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3256) Upgrade jython to 2.5.3 (legal concern)

2013-03-20 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3256:


Fix Version/s: 0.11.1

> Upgrade jython to 2.5.3 (legal concern)
> ---
>
> Key: PIG-3256
> URL: https://issues.apache.org/jira/browse/PIG-3256
> Project: Pig
>  Issue Type: Bug
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.12, 0.11.1
>
> Attachments: PIG-3256-1.patch, PIG-3256-2.patch
>
>
> When we review the legal documents with Microsoft for Windows work, here is 
> the recommend from lawyer:
> Jython 2.5.2 redistributes an external LGPL component (JNA) in a manner that 
> puts Jython out of compliance with the LGPL. As such dependent components 
> like Pig are also arguably out of compliance with the LGPL.
> It appears that this has been quietly found and fixed by the Jython guys, as 
> version 2.5.3 does not include the JNA component. However the status of 2.5.2 
> with respect to the LGPL and Apache legal is still unclear.
> The easiest way to remediate this whole problem is to simply move Pig to 
> Jython 2.5.3 and remove the question.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3256) Upgrade jython to 2.5.3 (legal concern)

2013-03-20 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3256:


Priority: Critical  (was: Major)

> Upgrade jython to 2.5.3 (legal concern)
> ---
>
> Key: PIG-3256
> URL: https://issues.apache.org/jira/browse/PIG-3256
> Project: Pig
>  Issue Type: Bug
>Reporter: Daniel Dai
>Assignee: Daniel Dai
>Priority: Critical
> Fix For: 0.12, 0.11.1
>
> Attachments: PIG-3256-1.patch, PIG-3256-2.patch
>
>
> When we review the legal documents with Microsoft for Windows work, here is 
> the recommend from lawyer:
> Jython 2.5.2 redistributes an external LGPL component (JNA) in a manner that 
> puts Jython out of compliance with the LGPL. As such dependent components 
> like Pig are also arguably out of compliance with the LGPL.
> It appears that this has been quietly found and fixed by the Jython guys, as 
> version 2.5.3 does not include the JNA component. However the status of 2.5.2 
> with respect to the LGPL and Apache legal is still unclear.
> The easiest way to remediate this whole problem is to simply move Pig to 
> Jython 2.5.3 and remove the question.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3256) Upgrade jython to 2.5.3 (legal concern)

2013-03-20 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608356#comment-13608356
 ] 

Rohini Palaniswamy commented on PIG-3256:
-

+1. Should we put it in 0.11 branch also?

> Upgrade jython to 2.5.3 (legal concern)
> ---
>
> Key: PIG-3256
> URL: https://issues.apache.org/jira/browse/PIG-3256
> Project: Pig
>  Issue Type: Bug
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.12
>
> Attachments: PIG-3256-1.patch
>
>
> When we review the legal documents with Microsoft for Windows work, here is 
> the recommend from lawyer:
> Jython 2.5.2 redistributes an external LGPL component (JNA) in a manner that 
> puts Jython out of compliance with the LGPL. As such dependent components 
> like Pig are also arguably out of compliance with the LGPL.
> It appears that this has been quietly found and fixed by the Jython guys, as 
> version 2.5.3 does not include the JNA component. However the status of 2.5.2 
> with respect to the LGPL and Apache legal is still unclear.
> The easiest way to remediate this whole problem is to simply move Pig to 
> Jython 2.5.3 and remove the question.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3110) pig corrupts chararrays with trailing whitespace when converting them to long

2013-03-20 Thread Prashant Kommireddi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608326#comment-13608326
 ] 

Prashant Kommireddi commented on PIG-3110:
--

Yes, we do need that in the case where the input is a decimal number (123.45).

In my previous comment I proposed calling Double.valueOf() directly and then 
turning that into a Long/Integer in such cases. We could also avoid these calls 
altogether by checking for "numericness". Let's do that in a separate JIRA?
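
For illustration, a minimal sketch of that fallback (the placement and variable
names are assumptions, not the actual Utf8StorageConverter change):
{code}
// Sketch: trim first (as discussed above), try Long directly, and only take
// the Double path for decimal-looking input such as "123.45".
String t = s.trim();
Long ret;
try {
    ret = Long.valueOf(t);
} catch (NumberFormatException nfe) {
    try {
        ret = Double.valueOf(t).longValue();   // narrows "123.45" to 123
    } catch (NumberFormatException nfe2) {
        ret = null;                            // not numeric at all
    }
}
{code}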

> pig corrupts chararrays with trailing whitespace when converting them to long
> -
>
> Key: PIG-3110
> URL: https://issues.apache.org/jira/browse/PIG-3110
> Project: Pig
>  Issue Type: Bug
>  Components: data
>Affects Versions: 0.10.0
>Reporter: Ido Hadanny
>Assignee: Prashant Kommireddi
> Fix For: 0.12
>
> Attachments: PIG-3110.patch
>
>
> when trying to convert the following string into long, pig corrupts it. data:
> 1703598819951657279 ,44081037
> data1 = load 'data' using CSVLoader as (a: chararray ,b: int);
> data2 = foreach data1 generate (long)a as a;
> dump data2;
> (1703598819951657216)<--- last 2 digits are corrupted
> data2 = foreach data1 generate (long)TRIM(a) as a;
> dump data2;
> (1703598819951657279)<--- correct

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3255) Avoid extra byte array copy in streaming deserialize

2013-03-20 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-3255:


Status: Open  (was: Patch Available)

Had a chat with Koji. He pointed out HADOOP-6109, which doubles the size of the 
byte[] in Text every time an append happens.

Text.java

Hadoop 1.x
{code}
private void setCapacity(int len, boolean keepData) {
if (bytes == null || bytes.length < len) {
  byte[] newBytes = new byte[len];
  if (bytes != null && keepData) {
System.arraycopy(bytes, 0, newBytes, 0, length);
  }
  bytes = newBytes;
}
  }
{code}

Hadoop 0.23/2.x:
{code}
private void setCapacity(int len, boolean keepData) {
if (bytes == null || bytes.length < len) {
  if (bytes != null && keepData) {
bytes = Arrays.copyOf(bytes, Math.max(len,length << 1));
  } else {
bytes = new byte[len];
  }
}
  }
{code}

So value.getBytes().length == value.getLength() will be true only when the size 
of the line is < io.file.buffer.size. Since a copy of the byte[] needs to be 
created with the right size in any case, we can go with reusing the Text for 
every getNext() in OutputHandler. It will be more beneficial when the record 
sizes are greater than io.file.buffer.size, where value.getBytes().length is 
almost never equal to value.getLength() because of the doubling of the size.

I will modify the patch to reuse the Text object.
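
A rough sketch of what that reuse could look like in OutputHandler (the field
names and surrounding structure are assumptions, not the actual patch):
{code}
// Sketch: keep one reusable Text per handler instead of allocating a new one
// per record; copy out exactly value.getLength() bytes before deserializing.
private final Text value = new Text();

public Tuple getNext() throws IOException {
    value.clear();                       // reuse the (possibly doubled) backing byte[]
    int num = in.readLine(value);
    if (num <= 0) {
        return null;
    }
    byte[] bytes = new byte[value.getLength()];
    System.arraycopy(value.getBytes(), 0, bytes, 0, value.getLength());
    return deserializer.deserialize(bytes);
}
{code}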

> Avoid extra byte array copy in streaming deserialize
> 
>
> Key: PIG-3255
> URL: https://issues.apache.org/jira/browse/PIG-3255
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.12
>
> Attachments: PIG-3255-1.patch
>
>
> PigStreaming.java:
>  public Tuple deserialize(byte[] bytes) throws IOException {
> Text val = new Text(bytes);  
> return StorageUtil.textToTuple(val, fieldDel);
> }
> Should remove new Text(bytes) copy and construct the tuple directly from the 
> bytes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3256) Upgrade jython to 2.5.3 (legal concern)

2013-03-20 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3256:


Attachment: PIG-3256-1.patch

> Upgrade jython to 2.5.3 (legal concern)
> ---
>
> Key: PIG-3256
> URL: https://issues.apache.org/jira/browse/PIG-3256
> Project: Pig
>  Issue Type: Bug
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.12
>
> Attachments: PIG-3256-1.patch
>
>
> When we review the legal documents with Microsoft for Windows work, here is 
> the recommend from lawyer:
> Jython 2.5.2 redistributes an external LGPL component (JNA) in a manner that 
> puts Jython out of compliance with the LGPL. As such dependent components 
> like Pig are also arguably out of compliance with the LGPL.
> It appears that this has been quietly found and fixed by the Jython guys, as 
> version 2.5.3 does not include the JNA component. However the status of 2.5.2 
> with respect to the LGPL and Apache legal is still unclear.
> The easiest way to remediate this whole problem is to simply move Pig to 
> Jython 2.5.3 and remove the question.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (PIG-3256) Upgrade jython to 2.5.3 (legal concern)

2013-03-20 Thread Daniel Dai (JIRA)
Daniel Dai created PIG-3256:
---

 Summary: Upgrade jython to 2.5.3 (legal concern)
 Key: PIG-3256
 URL: https://issues.apache.org/jira/browse/PIG-3256
 Project: Pig
  Issue Type: Bug
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.12


When we reviewed the legal documents with Microsoft for the Windows work, here is 
the recommendation from the lawyer:

Jython 2.5.2 redistributes an external LGPL component (JNA) in a manner that 
puts Jython out of compliance with the LGPL. As such dependent components like 
Pig are also arguably out of compliance with the LGPL.

It appears that this has been quietly found and fixed by the Jython guys, as 
version 2.5.3 does not include the JNA component. However the status of 2.5.2 
with respect to the LGPL and Apache legal is still unclear.

The easiest way to remediate this whole problem is to simply move Pig to Jython 
2.5.3 and remove the question.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3255) Avoid extra byte array copy in streaming deserialize

2013-03-20 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608227#comment-13608227
 ] 

Koji Noguchi commented on PIG-3255:
---

+1 Looks good to me. 
Probably another Jira, but I wonder if we really need to create new Text for 
every streaming outputs.  Can we reuse it with value.clear() ?
(But if we do this, then in most cases value.getBytes().length <> 
value.getLength().)

> Avoid extra byte array copy in streaming deserialize
> 
>
> Key: PIG-3255
> URL: https://issues.apache.org/jira/browse/PIG-3255
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.12
>
> Attachments: PIG-3255-1.patch
>
>
> PigStreaming.java:
>  public Tuple deserialize(byte[] bytes) throws IOException {
> Text val = new Text(bytes);  
> return StorageUtil.textToTuple(val, fieldDel);
> }
> Should remove new Text(bytes) copy and construct the tuple directly from the 
> bytes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3253) Misleading comment w.r.t getSplitIndex() method in PigSplit.java

2013-03-20 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3253:
---

   Resolution: Fixed
Fix Version/s: 0.12
   Status: Resolved  (was: Patch Available)

Committed to trunk. Thank you Daniel for the review!

> Misleading comment w.r.t getSplitIndex() method in PigSplit.java
> 
>
> Key: PIG-3253
> URL: https://issues.apache.org/jira/browse/PIG-3253
> Project: Pig
>  Issue Type: Bug
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>Priority: Minor
> Fix For: 0.12
>
> Attachments: PIG-3253.patch
>
>
> While reviewing the patch for PIG-3141, I noticed that the following comment 
> is out-of-date:
> {code:title=PigSplit.java}
> // package level access because we don't want LoadFunc implementations
> // to get this information - this is to be used only from
> // MergeJoinIndexer
> public int getSplitIndex() {
> return splitIndex;
> }
> {code}
> Looking at the commit history, the public qualifier was added by PIG-1309, 
> but the comment wasn't updated accordingly.
> Provided that more and more LoadFunc implementations use this method (e.g. 
> PIG-3141), we should remove this misleading comment to avoid any confusion.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: Review Request: PIG-3141 [piggybank] Giving CSVExcelStorage an option to handle header rows

2013-03-20 Thread Cheolsoo Park


> On March 20, 2013, 7:05 p.m., Cheolsoo Park wrote:
> > contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java,
> >  line 538
> > 
> >
> > Can you move this line to inside the if block? That's more intuitive to 
> > me.

Actually, I just realized that I was thinking wrong about this. Please ignore 
my comment.


> On March 20, 2013, 7:05 p.m., Cheolsoo Park wrote:
> > contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java,
> >  line 442
> > 
> >
> > Can you move this line to inside the if block? That's more intuitive to 
> > me.

Actually, I just realized that I was thinking wrong about this. Please ignore 
my comment.


- Cheolsoo


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9697/#review18100
---


On March 1, 2013, 2:52 p.m., Jonathan Packer wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/9697/
> ---
> 
> (Updated March 1, 2013, 2:52 p.m.)
> 
> 
> Review request for pig.
> 
> 
> Description
> ---
> 
> Reviewboard for https://issues.apache.org/jira/browse/PIG-3141
> 
> Adds a "header treatment" option to CSVExcelStorage allowing header rows 
> (first row with column names) in files to be skipped when loading, or for a 
> header row with column names to be written when storing. Should be backwards 
> compatible--all unit-tests from the old CSVExcelStorage pass.
> 
> 
> Diffs
> -
> 
>   
> contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java
>  568b3f3 
>   
> contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestCSVExcelStorage.java
>  9bed527 
> 
> Diff: https://reviews.apache.org/r/9697/diff/
> 
> 
> Testing
> ---
> 
> cd contrib/piggybank/java
> ant -Dtestcase=TestCSVExcelStorage test
> 
> 
> Thanks,
> 
> Jonathan Packer
> 
>



[jira] [Updated] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size

2013-03-20 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated PIG-3251:
--

Attachment: pig-3251-trunk-v03.patch

bq. Makes sense, we shall move to the new approach for Hadoop 1.1.0+, use 
Bzip2TextInputFormat otherwise for backward compatibility.

Would something like this work? pig-3251-trunk-v03.patch uses 
PigTextInputFormat even for bzip if TextInputFormat can split them. (I'll 
update the other FileInputLoadFunc if this change looks ok.  Also, this works 
with 'bz2' extension but not for 'bz' unless config is added.)
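
For context, a rough sketch of the splittability check involved (class names
are from Hadoop 1.1.0+/2.x, and conf/path are assumed from the surrounding job
setup; this is not the patch itself):
{code}
// Sketch: TextInputFormat in Hadoop 1.1.0+/2.x can split bzip2 input because
// BZip2Codec implements SplittableCompressionCodec there; a load func could
// key off the same check before falling back to Bzip2TextInputFormat.
// Uses org.apache.hadoop.io.compress.{CompressionCodec, CompressionCodecFactory,
// SplittableCompressionCodec}.
CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path); // e.g. part-0.bz2
boolean splittable = (codec == null) || (codec instanceof SplittableCompressionCodec);
{code}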



> Bzip2TextInputFormat requires double the memory of maximum record size
> --
>
> Key: PIG-3251
> URL: https://issues.apache.org/jira/browse/PIG-3251
> Project: Pig
>  Issue Type: Improvement
>Reporter: Koji Noguchi
>Assignee: Koji Noguchi
>Priority: Minor
> Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch, 
> pig-3251-trunk-v03.patch
>
>
> While looking at user's OOM heap dump, noticed that pig's 
> Bzip2TextInputFormat consumes memory at both
> Bzip2TextInputFormat.buffer (ByteArrayOutputStream) 
> and actual Text that is returned as line.
> For example, when having one record with 160MBytes, buffer was 268MBytes and 
> Text was 160MBytes.  
> We can probably eliminate one of them.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3255) Avoid extra byte array copy in streaming deserialize

2013-03-20 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-3255:


Attachment: PIG-3255-1.patch

> Avoid extra byte array copy in streaming deserialize
> 
>
> Key: PIG-3255
> URL: https://issues.apache.org/jira/browse/PIG-3255
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.12
>
> Attachments: PIG-3255-1.patch
>
>
> PigStreaming.java:
>  public Tuple deserialize(byte[] bytes) throws IOException {
> Text val = new Text(bytes);  
> return StorageUtil.textToTuple(val, fieldDel);
> }
> Should remove new Text(bytes) copy and construct the tuple directly from the 
> bytes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3255) Avoid extra byte array copy in streaming deserialize

2013-03-20 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-3255:


Status: Patch Available  (was: Open)

> Avoid extra byte array copy in streaming deserialize
> 
>
> Key: PIG-3255
> URL: https://issues.apache.org/jira/browse/PIG-3255
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.12
>
> Attachments: PIG-3255-1.patch
>
>
> PigStreaming.java:
>  public Tuple deserialize(byte[] bytes) throws IOException {
> Text val = new Text(bytes);  
> return StorageUtil.textToTuple(val, fieldDel);
> }
> Should remove new Text(bytes) copy and construct the tuple directly from the 
> bytes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3110) pig corrupts chararrays with trailing whitespace when converting them to long

2013-03-20 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608125#comment-13608125
 ] 

Daniel Dai commented on PIG-3110:
-

Looks good. Do we still need to check the "ret == null" case if we do the trim 
and sanity check before processing? That code is very confusing.

> pig corrupts chararrays with trailing whitespace when converting them to long
> -
>
> Key: PIG-3110
> URL: https://issues.apache.org/jira/browse/PIG-3110
> Project: Pig
>  Issue Type: Bug
>  Components: data
>Affects Versions: 0.10.0
>Reporter: Ido Hadanny
>Assignee: Prashant Kommireddi
> Fix For: 0.12
>
> Attachments: PIG-3110.patch
>
>
> when trying to convert the following string into long, pig corrupts it. data:
> 1703598819951657279 ,44081037
> data1 = load 'data' using CSVLoader as (a: chararray ,b: int);
> data2 = foreach data1 generate (long)a as a;
> dump data2;
> (1703598819951657216)<--- last 2 digits are corrupted
> data2 = foreach data1 generate (long)TRIM(a) as a;
> dump data2;
> (1703598819951657279)<--- correct
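For reference, here is a standalone illustration of the failure mode (not the actual 
Pig converter code; it just mimics the parse-then-double-fallback behavior that, as I 
understand it, causes the precision loss):

{code}
// Toy demo: a trailing space makes Long.parseLong throw, and a fallback through
// double rounds the value past 2^53, which reproduces the corrupted digits above.
// Trimming before the parse avoids the fallback entirely.
public class TrimCastDemo {
    static long castWithFallback(String s) {
        try {
            return Long.parseLong(s);
        } catch (NumberFormatException e) {
            return (long) Double.parseDouble(s);   // lossy for 19-digit values
        }
    }

    public static void main(String[] args) {
        String raw = "1703598819951657279 ";              // note the trailing space
        System.out.println(castWithFallback(raw));        // 1703598819951657216 (corrupted)
        System.out.println(castWithFallback(raw.trim())); // 1703598819951657279 (correct)
    }
}
{code}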

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Assigned] (PIG-3255) Avoid extra byte array copy in streaming deserialize

2013-03-20 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy reassigned PIG-3255:
---

Assignee: Rohini Palaniswamy

> Avoid extra byte array copy in streaming deserialize
> 
>
> Key: PIG-3255
> URL: https://issues.apache.org/jira/browse/PIG-3255
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
> Fix For: 0.12
>
>
> PigStreaming.java:
>  public Tuple deserialize(byte[] bytes) throws IOException {
> Text val = new Text(bytes);  
> return StorageUtil.textToTuple(val, fieldDel);
> }
> Should remove new Text(bytes) copy and construct the tuple directly from the 
> bytes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-3255) Avoid extra byte array copy in streaming deserialize

2013-03-20 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-3255:


Description: 
PigStreaming.java:

 public Tuple deserialize(byte[] bytes) throws IOException {
Text val = new Text(bytes);  
return StorageUtil.textToTuple(val, fieldDel);
}

Should remove new Text(bytes) copy and construct the tuple directly from the 
bytes

  was:
 public Tuple deserialize(byte[] bytes) throws IOException {
Text val = new Text(bytes);  
return StorageUtil.textToTuple(val, fieldDel);
}

Should remove new Text(bytes) copy and construct the tuple directly from the 
bytes


OutputHandler.java:
{code}
Text value = new Text();
int num = in.readLine(value);
if (num <= 0) {
return null;
}

byte[] newBytes = new byte[value.getLength()];
System.arraycopy(value.getBytes(), 0, newBytes, 0, value.getLength());
return deserializer.deserialize(newBytes);
{code}

  We can cut down another copy here if value.getLength() == 
value.getBytes().length as that would be the case mostly. 

> Avoid extra byte array copy in streaming deserialize
> 
>
> Key: PIG-3255
> URL: https://issues.apache.org/jira/browse/PIG-3255
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11
>Reporter: Rohini Palaniswamy
> Fix For: 0.12
>
>
> PigStreaming.java:
>  public Tuple deserialize(byte[] bytes) throws IOException {
> Text val = new Text(bytes);  
> return StorageUtil.textToTuple(val, fieldDel);
> }
> Should remove new Text(bytes) copy and construct the tuple directly from the 
> bytes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (PIG-3255) Avoid extra byte array copy in streaming deserialize

2013-03-20 Thread Rohini Palaniswamy (JIRA)
Rohini Palaniswamy created PIG-3255:
---

 Summary: Avoid extra byte array copy in streaming deserialize
 Key: PIG-3255
 URL: https://issues.apache.org/jira/browse/PIG-3255
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11
Reporter: Rohini Palaniswamy
 Fix For: 0.12


 public Tuple deserialize(byte[] bytes) throws IOException {
Text val = new Text(bytes);  
return StorageUtil.textToTuple(val, fieldDel);
}

Should remove new Text(bytes) copy and construct the tuple directly from the 
bytes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: Anybody using custom Serializer/Deserializer in Pig Streaming?

2013-03-20 Thread Rohini Palaniswamy
Nice summary, Koji. I wish we had an object that carries a byte[] and a length
instead of a bare byte[] as the return type of serialize() and the method param of
deserialize(). That would enable reuse and cut down on some of the copies.

At least there is one copy we can cut down without any API changes, by
adding a new function StorageUtil.textToTuple(bytes, fieldDel):

@Override

public Tuple deserialize(byte[] bytes) throws IOException {

Text val = new Text(bytes);  //Remove this copy and construct the
tuple directly from the bytes

return StorageUtil.textToTuple(val, fieldDel);

}
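
A rough sketch of what such an overload could look like (the names and the
DataByteArray slicing are my assumptions, not the existing StorageUtil API):

import java.util.ArrayList;
import java.util.List;
import org.apache.pig.data.DataByteArray;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Hypothetical textToTuple-style overload: split the raw bytes on the field
// delimiter and build the tuple without first wrapping the bytes in a Text.
public class BytesToTupleSketch {
    private static final TupleFactory TF = TupleFactory.getInstance();

    public static Tuple bytesToTuple(byte[] buf, byte fieldDel) {
        List<DataByteArray> fields = new ArrayList<DataByteArray>();
        int start = 0;
        for (int i = 0; i < buf.length; i++) {
            if (buf[i] == fieldDel) {
                fields.add(new DataByteArray(buf, start, i));   // field is [start, i)
                start = i + 1;
            }
        }
        fields.add(new DataByteArray(buf, start, buf.length));  // last field
        return TF.newTupleNoCopy(fields);
    }
}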


Regards,
Rohini


On Wed, Mar 20, 2013 at 11:51 AM, Koji Noguchi wrote:

> Hi.
>
> Do you know anyone using custom serializer/deserializer in pig streaming?
>
> I was looking at http://wiki.apache.org/pig/PigStreamingFunctionalSpec and was
> impressed by the various features it supports.
> Then, looking at the code, I was sad to see how much additional data copying
> is done to support those features, when the simple case should need only one
> copy out to stdin and another copy in from stdout.
>
> So far, this is my understanding: 2 extra copies on the sender side and
> 3 extra copies on the receiver side.
>
> Assuming Default(Input/Output)Handler + PigStreaming, then
>
> PigInputHandler.putNext(Tuple t)
> --> serializer.serialize(t)
> -->--> COPY to out(ByteArrayOutputStream)
> -->--> COPY by out.toByteArray()
> --> write to stdin (copy but necessary)
>
> Streaming
>
> --> OutputHandler.getNext()
> -->--> Text value = readLine(stdin)   (copy but necessary)
> -->--> System.arraycopy(value.getBytes(), 0, newBytes, 0,
> value.getLength());   COPY just because deserialize require exact size byte
> array?
> -->-->deserializer.deserialize(byte [])
> -->-->-->  Text val = new Text(bytes); COPY since Text somehow does not
> want to reuse the byte array
> -->-->-->  StorageUtil.textToTuple(val, fieldDel)
> -->-->-->--> Create ArrayList of DataByteArraysCOPY.
>
> Now wondering if we can simplify it somehow.
>
> Thanks,
> Koji
>
>


Re: Review Request: [PIG-3173] - Partition filter pushdown does not happen if partition keys condition include a AND and OR construct

2013-03-20 Thread Dmitriy Ryaboy


> On March 20, 2013, 12:42 a.m., Dmitriy Ryaboy wrote:
> > http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/PColFilterExtractor.java,
> >  line 224
> > 
> >
> > (A and B) or (C and D)
> > 
> > is impossible if (A or C) is false. We can push this up while still
> > applying the original filter.
> 
> Rohini Palaniswamy wrote:
> So just to confirm, you want to extract A and C from each AND condition 
> and push (A OR C) as the partition filter for optimization and still leave 
> ((A AND B) or (C AND D)) to be applied on each tuple?

correct, unless my logic is wrong.

I actually think we made a bad decision when we decided that if we can push 
partitions down, we can drop the filter on the pig side -- it means we can't 
take advantage of partial filters loaders might support (for example, a bloom 
filter a loader can consult to return just the rows that "probably" match the 
condition, as opposed to those that definitely match). With the filter removed, 
loaders have to implement a second pass of filtering on top of such filters.
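
For what it's worth, a quick brute-force check of the implication above (throwaway
code, not part of the patch):

// Enumerate all 16 truth assignments and verify that whenever
// (A && B) || (C && D) holds, (A || C) holds too, so pushing (A or C) to the
// loader can never drop a partition that the original filter would keep.
public class ImplicationCheck {
    public static void main(String[] args) {
        for (int bits = 0; bits < 16; bits++) {
            boolean a = (bits & 1) != 0, b = (bits & 2) != 0;
            boolean c = (bits & 4) != 0, d = (bits & 8) != 0;
            boolean original = (a && b) || (c && d);
            boolean pushed = a || c;
            if (original && !pushed) {
                throw new AssertionError("counterexample at bits=" + bits);
            }
        }
        System.out.println("(A and B) or (C and D) implies (A or C) for all assignments");
    }
}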


- Dmitriy


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10035/#review18127
---


On March 20, 2013, 12:16 a.m., Rohini Palaniswamy wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10035/
> ---
> 
> (Updated March 20, 2013, 12:16 a.m.)
> 
> 
> Review request for pig.
> 
> 
> Description
> ---
> 
> 1) Fixed cases where partition pushdown was not happening for AND and OR 
> construct
> 2) Commented out the negative test cases as they were actually not asserting 
> anything.
> 
> 
> This addresses bug PIG-3173.
> https://issues.apache.org/jira/browse/PIG-3173
> 
> 
> Diffs
> -
> 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/PColFilterExtractor.java
>  1458047 
>   
> http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/TestPartitionFilterPushDown.java
>  1458047 
> 
> Diff: https://reviews.apache.org/r/10035/diff/
> 
> 
> Testing
> ---
> 
> Unit tests added and tested few cases manually with hcat.
> 
> 
> Thanks,
> 
> Rohini Palaniswamy
> 
>



[jira] [Commented] (PIG-3254) Fail a failed Pig script quicker

2013-03-20 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608070#comment-13608070
 ] 

Dmitriy V. Ryaboy commented on PIG-3254:


Can I add a request for whoever will work on this ticket?

Right now we die with "MR Job Failed" but don't say which job. When multiple 
jobs are launched and one of them fails, the other ones are killed, and 
users find it hard to figure out which job was the cause of all the badness. It 
would be nice to print out the job id of the failed job.
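
A rough illustration of the kind of reporting meant here (a sketch against the
org.apache.hadoop.mapred.jobcontrol API as I remember it; method names may differ
by version, and this is not a patch):

{code}
import java.util.List;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

// Hypothetical helper: surface the ids of the jobs that actually failed
// instead of a bare "MR Job Failed".
public class FailedJobReporter {
    private static final Log LOG = LogFactory.getLog(FailedJobReporter.class);

    public static void logFailedJobs(JobControl jc) {
        List<Job> failed = jc.getFailedJobs();
        for (Job job : failed) {
            LOG.error("MapReduce job failed: id=" + job.getAssignedJobID()
                    + ", name=" + job.getJobName()
                    + ", reason=" + job.getMessage());
        }
    }
}
{code}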

> Fail a failed Pig script quicker
> 
>
> Key: PIG-3254
> URL: https://issues.apache.org/jira/browse/PIG-3254
> Project: Pig
>  Issue Type: Improvement
>Reporter: Daniel Dai
> Fix For: 0.12
>
>
> Credit to [~asitecn]. Currently Pig can launch several mapreduce jobs 
> simultaneously. When one mapreduce job fails, we wait for the simultaneous 
> mapreduce jobs to finish. In addition, we could potentially launch additional 
> jobs that are doomed to fail. However, this is unnecessary in some cases:
> * If "stop.on.failure==true", we can kill the parallel jobs and fail the whole 
> script
> * If "stop.on.failure==false" and no "store" can succeed, we can also kill the 
> parallel jobs and fail the whole script
> Since simultaneous jobs may take a long time to finish, this could 
> significantly improve turnaround in some cases.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size

2013-03-20 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608030#comment-13608030
 ] 

Daniel Dai commented on PIG-3251:
-

Makes sense. We should move to the new approach for Hadoop 1.1.0+ and keep using 
Bzip2TextInputFormat otherwise, for backward compatibility.
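
A rough sketch of the kind of version gate this would take (only VersionInfo is
existing Hadoop API; the chooser class and the cutoffs -- 0.23+, 1.1+, 2.x -- are
my assumptions based on this thread, not part of any patch):

{code}
import org.apache.hadoop.util.VersionInfo;

// Hypothetical gate: use the plain text input format where Hadoop can split
// bzip2 natively, and fall back to Bzip2TextInputFormat on older releases.
public class BzipInputFormatGate {
    public static boolean useNativeSplitting() {
        String[] v = VersionInfo.getVersion().split("[.-]");
        int major = Integer.parseInt(v[0]);
        int minor = v.length > 1 ? Integer.parseInt(v[1]) : 0;
        return major >= 2 || (major == 1 && minor >= 1) || (major == 0 && minor >= 23);
    }
}
{code}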

> Bzip2TextInputFormat requires double the memory of maximum record size
> --
>
> Key: PIG-3251
> URL: https://issues.apache.org/jira/browse/PIG-3251
> Project: Pig
>  Issue Type: Improvement
>Reporter: Koji Noguchi
>Assignee: Koji Noguchi
>Priority: Minor
> Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch
>
>
> While looking at a user's OOM heap dump, I noticed that Pig's 
> Bzip2TextInputFormat holds each record in memory twice: in
> Bzip2TextInputFormat.buffer (a ByteArrayOutputStream) 
> and in the actual Text that is returned as the line.
> For example, for a single 160 MB record, the buffer was 268 MB and 
> the Text was 160 MB.  
> We can probably eliminate one of them.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3141) Giving CSVExcelStorage an option to handle header rows

2013-03-20 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608029#comment-13608029
 ] 

Cheolsoo Park commented on PIG-3141:


Thank you, Jonathan P., for the patch! I made some comments on the RB.

> Giving CSVExcelStorage an option to handle header rows
> --
>
> Key: PIG-3141
> URL: https://issues.apache.org/jira/browse/PIG-3141
> Project: Pig
>  Issue Type: Improvement
>  Components: piggybank
>Affects Versions: 0.11
>Reporter: Jonathan Packer
>Assignee: Jonathan Packer
> Fix For: 0.12
>
> Attachments: csv.patch, csv_updated.patch, PIG-3141_update_3.diff
>
>
> Adds an argument to CSVExcelStorage to skip the header row when loading. This 
> works properly with multiple small files each with a header being combined 
> into one split, or a large file with a single header being split into 
> multiple splits.
> Also fixes a few bugs with CSVExcelStorage, including PIG-2470 and a bug 
> involving quoted fields at the end of a line not escaping properly.
> Removes the choice of delimiter, since a CSV file ought to only use a comma 
> delimiter, hence the name.
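
For anyone skimming the thread, a minimal sketch of the usual "skip the header
only on the split that starts at byte 0" idea (the field names are mine; this is
not the patch's code, which also has to handle combined splits):

{code}
// Only the split that begins at the start of the file can contain the header,
// so it is dropped exactly once there and never in later splits.
public class HeaderSkipSketch {
    private final long splitStart;      // byte offset of this split in the file
    private boolean headerConsumed = false;

    public HeaderSkipSketch(long splitStart) {
        this.splitStart = splitStart;
    }

    /** Returns true when the current line is the header and should be dropped. */
    public boolean shouldSkip() {
        if (splitStart == 0 && !headerConsumed) {
            headerConsumed = true;
            return true;
        }
        return false;
    }
}
{code}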

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3223) AvroStorage does not handle comma separated input paths

2013-03-20 Thread Johnny Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608025#comment-13608025
 ] 

Johnny Zhang commented on PIG-3223:
---

Thanks for the clarification; I will post a patch soon.

> AvroStorage does not handle comma separated input paths
> ---
>
> Key: PIG-3223
> URL: https://issues.apache.org/jira/browse/PIG-3223
> Project: Pig
>  Issue Type: Bug
>  Components: piggybank
>Affects Versions: 0.10.0, 0.11
>Reporter: Michael Kramer
>Assignee: Johnny Zhang
> Attachments: AvroStorage.patch, AvroStorage.patch-2, 
> AvroStorageUtils.patch, AvroStorageUtils.patch-2, PIG-3223.patch.txt
>
>
> In pig 0.11, a patch was issued to AvroStorage to support globs and comma 
> separated input paths (PIG-2492).  While this function works fine for 
> glob-formatted input paths, it fails when issued a standard comma separated 
> list of paths.  fs.globStatus does not seem to be able to parse out such a 
> list, and a java.net.URISyntaxException is thrown when toURI is called on the 
> path.  
> I have a working fix for this, but it's extremely ugly (basically checking if 
> the string of input paths is globbed, otherwise splitting on ",").  I'm sure 
> there's a more elegant solution.  I'd be happy to post the relevant methods 
> and "fixes" if necessary.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: Review Request: PIG-3141 [piggybank] Giving CSVExcelStorage an option to handle header rows

2013-03-20 Thread Cheolsoo Park

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9697/#review18100
---


Hi Jonathan,

Overall it looks good to me. I made a few minor comments inline:


contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java


This variable is no longer used.



contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java


Why don't you use "FIELD_DELIMITER_DEFAULT_STR" instead of "new String(new 
byte[] { (byte) ',' })"?



contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java


Can you remove 'DEFAULT' from the error message? I think it's unnecessary 
to let the user know about this string.



contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java


Can you log a warning message here?



contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java


Can you move this line to inside the if block? That's more intuitive to me.



contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java


Can you remove "modified from Pig 0.10 version" and just describe what's 
fixed here? e.g. "Substitute a null value with an empty string. See PIG-2470."

When the code was modified can easily be found using "git blame", so it's 
unnecessary to make a comment about it.



contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java


Can you move this line to inside the if block? That's more intuitive to me.



contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java


Can you remove "modified from Pig 0.10 version" and just describe what's 
fixed here?

When the code was modified can easily be found using "git blame", so it's 
unnecessary to make a comment about it.



contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestCSVExcelStorage.java


Can you remove this? This isn't a useful comment.



contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestCSVExcelStorage.java


Can you remove this?



contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestCSVExcelStorage.java


Can you remove this?



contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestCSVExcelStorage.java


Can you instead add the description to each test case? This isn't a useful 
comment.


- Cheolsoo Park


On March 1, 2013, 2:52 p.m., Jonathan Packer wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/9697/
> ---
> 
> (Updated March 1, 2013, 2:52 p.m.)
> 
> 
> Review request for pig.
> 
> 
> Description
> ---
> 
> Reviewboard for https://issues.apache.org/jira/browse/PIG-3141
> 
> Adds a "header treatment" option to CSVExcelStorage allowing header rows 
> (first row with column names) in files to be skipped when loading, or for a 
> header row with column names to be written when storing. Should be backwards 
> compatible--all unit-tests from the old CSVExcelStorage pass.
> 
> 
> Diffs
> -
> 
>   
> contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java
>  568b3f3 
>   
> contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestCSVExcelStorage.java
>  9bed527 
> 
> Diff: https://reviews.apache.org/r/9697/diff/
> 
> 
> Testing
> ---
> 
> cd contrib/piggybank/java
> ant -Dtestcase=TestCSVExcelStorage test
> 
> 
> Thanks,
> 
> Jonathan Packer
> 
>



Re: Review Request: [PIG-3173] - Partition filter pushdown does not happen if partition keys condition include a AND and OR construct

2013-03-20 Thread Rohini Palaniswamy


> On March 20, 2013, 12:42 a.m., Dmitriy Ryaboy wrote:
> > http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/PColFilterExtractor.java,
> >  line 229
> > 
> >
> > looks like tabulation is off?

The whole file uses tabs for indentation instead of spaces. The code I added uses 
spaces, which makes the indentation look wrong in the patch, but viewed in an 
editor it is fine. Based on the discussion in PIG-3008, new code or modifications 
should adhere to "No tabs, No trailing white spaces, 4 spaces to indent the code" 
without changing parts of the code that were not touched, applying common sense as 
applicable. But in this case, changing tabs to spaces would change every line of 
the file, and the if blocks are not indented properly in many places anyway, so I 
did not make further whitespace changes. 


> On March 20, 2013, 12:42 a.m., Dmitriy Ryaboy wrote:
> > http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/PColFilterExtractor.java,
> >  line 220
> > 
> >
> > remove dead code?

Will do


> On March 20, 2013, 12:42 a.m., Dmitriy Ryaboy wrote:
> > http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/PColFilterExtractor.java,
> >  line 240
> > 
> >
> > please fix whitespace

Will do


> On March 20, 2013, 12:42 a.m., Dmitriy Ryaboy wrote:
> > http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/TestPartitionFilterPushDown.java,
> >  line 433
> > 
> >
> > is this new code or is this RB being funny?
> > 
> > either way, dead code can be deleted; this is preferable to commenting 
> > it out.

I didn't want those test cases to get lost, so I commented them out for now. I will 
revert that in this patch and create a separate jira to write equivalent test 
cases that actually assert those conditions. This class needs to be well tested 
to avoid full table scans with hcatalog.


> On March 20, 2013, 12:42 a.m., Dmitriy Ryaboy wrote:
> > http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/PColFilterExtractor.java,
> >  line 224
> > 
> >
> > (A and B) or (C and D)
> > 
> > is impossible if (A or C) is false. We can push this up while still
> > applying the original filter.

So just to confirm, you want to extract A and C from each AND condition and 
push (A OR C) as the partition filter for optimization and still leave ((A AND 
B) or (C AND D)) to be applied on each tuple?


- Rohini


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10035/#review18127
---


On March 20, 2013, 12:16 a.m., Rohini Palaniswamy wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10035/
> ---
> 
> (Updated March 20, 2013, 12:16 a.m.)
> 
> 
> Review request for pig.
> 
> 
> Description
> ---
> 
> 1) Fixed cases where partition pushdown was not happening for AND and OR 
> construct
> 2) Commented out the negative test cases as they were actually not asserting 
> anything.
> 
> 
> This addresses bug PIG-3173.
> https://issues.apache.org/jira/browse/PIG-3173
> 
> 
> Diffs
> -
> 
>   
> http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/newplan/PColFilterExtractor.java
>  1458047 
>   
> http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/TestPartitionFilterPushDown.java
>  1458047 
> 
> Diff: https://reviews.apache.org/r/10035/diff/
> 
> 
> Testing
> ---
> 
> Unit tests added and tested few cases manually with hcat.
> 
> 
> Thanks,
> 
> Rohini Palaniswamy
> 
>



Anybody using custom Serializer/Deserializer in Pig Streaming?

2013-03-20 Thread Koji Noguchi
Hi.

Do you know anyone using custom serializer/deserializer in pig streaming?

I was looking at http://wiki.apache.org/pig/PigStreamingFunctionalSpec and was 
impressed by the various features it supports.
Then, looking at the code, I was sad to see how much additional data copying is done 
to support those features, when the simple case should need only one copy out to stdin 
and another copy in from stdout.

So far, this is my understanding: 2 extra copies on the sender side and 3 
extra copies on the receiver side.

Assuming Default(Input/Output)Handler + PigStreaming, then

PigInputHandler.putNext(Tuple t)
--> serializer.serialize(t)
-->--> COPY to out(ByteArrayOutputStream)
-->--> COPY by out.toByteArray()
--> write to stdin (copy but necessary)

Streaming

--> OutputHandler.getNext()
-->--> Text value = readLine(stdin)   (copy but necessary)
-->--> System.arraycopy(value.getBytes(), 0, newBytes, 0, value.getLength());   
COPY just because deserialize require exact size byte array?
-->-->deserializer.deserialize(byte [])
-->-->-->  Text val = new Text(bytes); COPY since Text somehow does not want to 
reuse the byte array
-->-->-->  StorageUtil.textToTuple(val, fieldDel)
-->-->-->--> Create ArrayList of DataByteArraysCOPY.

Now wondering if we can simplify it somehow.
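
One of the receiver-side copies looks avoidable without any API change; a sketch
(my own, not a patch) of skipping the exact-size copy when the Text's backing
array already has no slack, which should be the common case:

import org.apache.hadoop.io.Text;

// Return the content bytes without copying when the backing array is already
// exactly the content length; otherwise trim the slack with one arraycopy.
public class ExactSizeBytes {
    static byte[] toExactBytes(Text value) {
        byte[] backing = value.getBytes();   // may be longer than the content
        int len = value.getLength();
        if (backing.length == len) {
            return backing;                  // no copy needed
        }
        byte[] exact = new byte[len];
        System.arraycopy(backing, 0, exact, 0, len);
        return exact;
    }
}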

Thanks,
Koji



[jira] [Commented] (PIG-3222) New UDFContextSignature assignments in Pig 0.11 breaks HCatalog.HCatStorer

2013-03-20 Thread Feng Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607904#comment-13607904
 ] 

Feng Peng commented on PIG-3222:


I modified the test class to add instrumentation for all StoreFuncInterface functions; 
here is the sequence of calls on the frontend.

The first number is an instance id. We can see that:
* the storer is instantiated three times
* the first instance is assigned a different signature than the later two
* the 2nd instance is given two different Job objects in its two 
setStoreLocation calls

{noformat}
[0]setStoreFuncUDFContextSignature(samples_1-3)
[0]relToAbsPathForStoreLocation(testdb.samples,hdfs://..)
[1]setStoreFuncUDFContextSignature(samples_1-6)
[1]checkSchema(number:int)
[1]setStoreLocation(testdb.samples,Job@1817929566)
[1]getOutputFormat
[1]setStoreLocation(testdb.samples,Job@603950563)
[2]setStoreFuncUDFContextSignature(samples_1-6)
[2]setStoreLocation(testdb.samples,Job@1203999762)
[2]getOutputFormat
{noformat}
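
For readers less familiar with how storers use the signature, here is the pattern
the trace above breaks (a hypothetical fragment, not HCatStorer's actual code):

{code}
import java.util.Properties;
import org.apache.pig.impl.util.UDFContext;

// Properties saved under the signature on the front end are looked up under the
// same signature on the back end, so the signature has to be stable across
// invocations of the same store statement.
public class SignatureKeyedStorerFragment {
    private String signature;

    public void setStoreFuncUDFContextSignature(String signature) {
        this.signature = signature;
    }

    public void setStoreLocation(String location) {
        Properties props = UDFContext.getUDFContext()
                .getUDFProperties(this.getClass(), new String[] { signature });
        if (props.getProperty("output.location") == null) {
            props.setProperty("output.location", location);   // front end writes
        }
        // the back end (and later front-end calls) read "output.location" back
    }
}
{code}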

> New UDFContextSignature assignments in Pig 0.11 breaks HCatalog.HCatStorer 
> ---
>
> Key: PIG-3222
> URL: https://issues.apache.org/jira/browse/PIG-3222
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.11
>Reporter: Feng Peng
>  Labels: hcatalog
> Attachments: PigStorerDemo.java
>
>
> Pig 0.11 assigns a different UDFContextSignature to different invocations of 
> the same load/store statement. This change breaks HCatStorer, which 
> assumes that all front-end and back-end invocations of the same store statement 
> have the same UDFContextSignature so that it can read the previously stored 
> information correctly.
> The related HCatalog code is in 
> https://svn.apache.org/repos/asf/incubator/hcatalog/branches/branch-0.5/hcatalog-pig-adapter/src/main/java/org/apache/hcatalog/pig/HCatStorer.java
>  (the setStoreLocation() function).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size

2013-03-20 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607886#comment-13607886
 ] 

Koji Noguchi commented on PIG-3251:
---

bq. With HADOOP-7823, can we remove Bzip2TextInputFormat and just use 
PigTextInputFormat?
Since our platform has moved to 0.23, I'll be happy if we can simply remove 
Bzip2TextInputFormat just for hadoop 0.23 or later.

> Bzip2TextInputFormat requires double the memory of maximum record size
> --
>
> Key: PIG-3251
> URL: https://issues.apache.org/jira/browse/PIG-3251
> Project: Pig
>  Issue Type: Improvement
>Reporter: Koji Noguchi
>Assignee: Koji Noguchi
>Priority: Minor
> Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch
>
>
> While looking at a user's OOM heap dump, I noticed that Pig's 
> Bzip2TextInputFormat holds each record in memory twice: in
> Bzip2TextInputFormat.buffer (a ByteArrayOutputStream) 
> and in the actual Text that is returned as the line.
> For example, for a single 160 MB record, the buffer was 268 MB and 
> the Text was 160 MB.  
> We can probably eliminate one of them.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size

2013-03-20 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607860#comment-13607860
 ] 

Koji Noguchi commented on PIG-3251:
---

bq. With HADOOP-7823, can we remove Bzip2TextInputFormat and just use 
PigTextInputFormat?
That'll (almost) have the same effect as my initial patch, 
pig-3251-trunk-v01.patch, which takes us to state (2) in my previous comment.  
With HADOOP-7823 + HADOOP-6109, it would be (3).
Without a doubt, HADOOP-7823 + HADOOP-6109 is the cleanest approach.



> Bzip2TextInputFormat requires double the memory of maximum record size
> --
>
> Key: PIG-3251
> URL: https://issues.apache.org/jira/browse/PIG-3251
> Project: Pig
>  Issue Type: Improvement
>Reporter: Koji Noguchi
>Assignee: Koji Noguchi
>Priority: Minor
> Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch
>
>
> While looking at a user's OOM heap dump, I noticed that Pig's 
> Bzip2TextInputFormat holds each record in memory twice: in
> Bzip2TextInputFormat.buffer (a ByteArrayOutputStream) 
> and in the actual Text that is returned as the line.
> For example, for a single 160 MB record, the buffer was 268 MB and 
> the Text was 160 MB.  
> We can probably eliminate one of them.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size

2013-03-20 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607820#comment-13607820
 ] 

Richard Ding commented on PIG-3251:
---

With HADOOP-7823, can we remove Bzip2TextInputFormat and just use 
PigTextInputFormat?

> Bzip2TextInputFormat requires double the memory of maximum record size
> --
>
> Key: PIG-3251
> URL: https://issues.apache.org/jira/browse/PIG-3251
> Project: Pig
>  Issue Type: Improvement
>Reporter: Koji Noguchi
>Assignee: Koji Noguchi
>Priority: Minor
> Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch
>
>
> While looking at a user's OOM heap dump, I noticed that Pig's 
> Bzip2TextInputFormat holds each record in memory twice: in
> Bzip2TextInputFormat.buffer (a ByteArrayOutputStream) 
> and in the actual Text that is returned as the line.
> For example, for a single 160 MB record, the buffer was 268 MB and 
> the Text was 160 MB.  
> We can probably eliminate one of them.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Fwd: Call for papers: Management of Big Data track - ICAC'13 by USENIX/ACM-SIGARCH

2013-03-20 Thread Alan Gates


Begin forwarded message:

> From: Dani Abel Rayan 
> Date: March 14, 2013 10:55:00 AM PDT
> To: user 
> Subject: Call for papers: Management of Big Data track - ICAC'13 by 
> USENIX/ACM-SIGARCH
> Reply-To: u...@hadoop.apache.org
> 
> Hi,
> 
> Join us for the 10th International Conference on Autonomic Computing - 
> Sponsored by USENIX in cooperation with ACM SIGARCH.
> 
> Call for Papers for Management of Big Data (MBDS) track is open. Submit your 
> paper by April 2, 2013, 11:59 p.m. PDT.
> The objective of the MBDS track at ICAC '13 is to bring together researchers, 
> practitioners, system administrators, system programmers, and others 
> interested in sharing and presenting their perspectives on the effective 
> management of Big Data systems.
> The focus of the track is on novel and practical systems-oriented work. MBDS 
> offers an opportunity for researchers and practitioners from industry, 
> academia, and national labs to showcase the latest advances in this area and 
> also to discuss and identify future directions and challenges in all aspects 
> on autonomic management of Big Data systems.
> Two types of contributions are solicited on all aspects of Big Data 
> management: (1) short papers and (2) panel presentations. Short papers should 
> be no more than 6 pages, including the abstract, and will appear in the ICAC 
> '13 conference proceedings. 
> The committee comprises academic and industry leaders.
> For more information, please visit: 
> https://www.usenix.org/conference/icac13/mbds-management-big-data-systems
> -Thanks and Regards,
> Dani Abel Rayan
> 



[jira] [Updated] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size

2013-03-20 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated PIG-3251:
--

Attachment: pig-3251-trunk-v02.patch

(1) Current status (before any patch):
||hadoop version||PigTextInputFormat||Bzip2TextInputFormat.java||
|0.20|(i) SLOW due to HADOOP-6109|(iii) Needs EXTRA MEMORY. This jira.|
|0.23|(ii) Good|(iv) Needs EXTRA MEMORY. This jira.|

(2) My initial patch (pig-3251-trunk-v01.patch) changes this to:
||hadoop version||PigTextInputFormat||Bzip2TextInputFormat.java||
|0.20|(i) SLOW due to HADOOP-6109|(iii) Slow due to HADOOP-6109|
|0.23|(ii) Good|(iv) Good|

(3) If we can backport HADOOP-6109 to 0.20 and apply my pig-3251-trunk-v01.patch, 
it solves all the problems:
||hadoop version||PigTextInputFormat||Bzip2TextInputFormat.java||
|0.20 + HADOOP-6109|(i) Good|(iii) Good|
|0.23|(ii) Good|(iv) Good|

However, I've seen a discussion about pig supporting 0.20.2 users.  
So I guess we can't ask them to backport HADOOP-6109 then.


I think my remaining options are:
(a) Give up.  Wait till everyone upgrades to 0.23/2.0, or backport HADOOP-6109 
to hadoop 1.2* and wait till pig moves off 0.20.2/1.0.*. 
(b) Try to work around it without touching hadoop code.

I think (a) is reasonable, but I tried (b) anyway.  This patch changes the status as below.

(4) Patch (pig-3251-trunk-v02.patch):
||hadoop version||PigTextInputFormat||Bzip2TextInputFormat.java||
|0.20|(i) SLOW due to HADOOP-6109|(iii) Good|
|0.23|(ii) Good|(iv) Good|


The penalty of not touching the hadoop code is that my patch adds two extra 
bytearray copies when extending the Text size.  But the frequency is low because 
the sizes grow exponentially, so I hope the overall overhead is negligible.
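
As a sanity check on that claim, a toy calculation (not part of the patch) of the
copy cost when a buffer grows by doubling, using the 160 MB record from the
description and an assumed 64 KB starting size:

{code}
// Doubling growth keeps the total bytes copied proportional to the final size,
// which is why the extra copies on resize stay cheap even for huge records.
public class GrowthCostDemo {
    public static void main(String[] args) {
        long capacity = 64 * 1024;                  // assumed starting buffer size
        final long recordSize = 160L * 1024 * 1024; // the 160 MB record
        long copied = 0;
        while (capacity < recordSize) {
            copied += capacity;                     // bytes moved on each resize
            capacity *= 2;
        }
        System.out.println("final capacity = " + capacity + " bytes, "
                + "total copied = " + copied + " bytes");
        // total copied stays under 2x the record size (amortized O(n))
    }
}
{code}

Incidentally, the final doubled capacity in this toy run (268,435,456 bytes) lines up 
with the ~268 MB buffer reported in the description.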


> Bzip2TextInputFormat requires double the memory of maximum record size
> --
>
> Key: PIG-3251
> URL: https://issues.apache.org/jira/browse/PIG-3251
> Project: Pig
>  Issue Type: Improvement
>Reporter: Koji Noguchi
>Assignee: Koji Noguchi
>Priority: Minor
> Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch
>
>
> While looking at a user's OOM heap dump, I noticed that Pig's 
> Bzip2TextInputFormat holds each record in memory twice: in
> Bzip2TextInputFormat.buffer (a ByteArrayOutputStream) 
> and in the actual Text that is returned as the line.
> For example, for a single 160 MB record, the buffer was 268 MB and 
> the Text was 160 MB.  
> We can probably eliminate one of them.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira