[Pig Wiki] Update of "Pig070IncompatibleChanges" by PradeepKamath

2010-03-17 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "Pig070IncompatibleChanges" page has been changed by PradeepKamath.
http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=34&rev2=35

--

  || Change || Section || Impact || Steps to address || Comments ||
  || Load/Store interface changes || Changes to the Load and Store Functions || 
High || [[LoadStoreMigrationGuide || Load Store Migration Guide]] 
[[Pig070LoadStoreHowTo || Pig 0.7.0 Load Store Guide]]|| ||
  || Data compression becomes load/store function specific || Handling 
Compressed Data || Unknown but hopefully low || If compression is needed, the 
underlying Input/Output format would need to support it || ||
- || Bzip compressed files in PigStorage format can no longer have .bz 
extension  || Handling Compressed Data || Low || 1. Rename existing .bz files 
to .bz2 files. 2. Update scripts to read/write files with bz2 extension || This 
change is due to the fact that Text{Input/Output}Format only supports bz2 
extension ||
  || Switching to Hadoop's local mode || Local Mode || Low || None || Main 
change is 10-20x performance slowdown. Also, local mode now uses the same UDF 
interfaces to execute UDFs as the MR mode. ||
  || Removing support for Load-Stream or Stream-Store optimization || Streaming 
|| Low to None || None || This feature was never documented so it is unlikely 
it was ever used ||
  || We no longer support serialization and deserialization via load/store 
functions || Streaming || Unknown but hopefully low to medium || Implement new 
PigToStream and StreamToPig interfaces for non-standard serialization || 
LoadStoreRedesignProposal ||
@@ -30, +29 @@

  
  With Pig 0.7.0 the read/write functionality is taken over by Hadoop's 
Input/OutputFormat, and how compression is handled, or whether it is handled at 
all, depends on the Input/OutputFormat used by the load/store function.
  
- The main input format that supports compression is TextInputFormat. It 
supports bzip files with .bz2 extension and gzip files with .gz extension. 
'''Note that it does not support .bz files'''. PigStorage is the only loader 
that comes with Pig that is derived from TextInputFormat which means it will be 
able to handle .bz2 and .gz files. Other loaders such as BinStorage will no 
longer support compression.
+ The main input format that supports compression is TextInputFormat. 
PigStorage is the only loader that comes with Pig that is derived from 
TextInputFormat which means it will be able to handle .bz2 and .gz files. Other 
loaders such as BinStorage will no longer support compression.
  
  On the store side, TextOutputFormat also supports compression but the store 
function needs to do additional work to enable it. Again, PigStorage will 
support compression while other functions will not.
  


[Pig Wiki] Update of "Pig070IncompatibleChanges" by PradeepKamath

2010-03-11 Thread Apache Wiki

The "Pig070IncompatibleChanges" page has been changed by PradeepKamath.
http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=33&rev2=34

--

  || Switching to Hadoop's local mode || Local Mode || Low || None || Main 
change is 10-20x performance slowdown. Also, local mode now uses the same UDF 
interfaces to execute UDFs as the MR mode. ||
  || Removing support for Load-Stream or Stream-Store optimization || Streaming 
|| Low to None || None || This feature was never documented so it is unlikely 
it was ever used ||
  || We no longer support serialization and deserialization via load/store 
functions || Streaming || Unknown but hopefully low to medium || Implement new 
PigToStream and StreamToPig interfaces for non-standard serialization || 
LoadStoreRedesignProposal ||
+ || Removing BinaryStorage builtin || Streaming || Low to None || None || As 
far as we know, this class was only used internally by streaming ||
  || Output part files now have a "-m-" and "-r" in the name || Output file 
names || Low to medium || If you have a system which depends on output file 
names the names now have changed from part-X to part-m-X if the output 
is being written from the map phase of the job or part-r- if it is being 
written from the reduce phase || ||
- || Removing BinaryStorage builtin || Streaming || Low to None || None || As 
far as we know, this class was only used internally by streaming ||
  || Removing Split by file feature || Split by File || Low to None || Input 
format of the loader would need to support this || We don't believe this 
feature was widely (if ever) used ||
  || Local files no longer accessible from cluster || Access to Local Files 
from Map-Reduce Mode || Low to None || Copy the file to the cluster using the 
copyFromLocal command prior to the load || This feature was not documented ||
  || Removing Custom Comparators || Removing Custom Comparators || Low to None 
|| None || This feature has been deprecated since Pig 0.5.0 release. We don't 
have a single known use case ||
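To make the part-file renaming concrete (file numbers here are illustrative):

{{{
output/part-00000      <- old naming
output/part-m-00000    <- new naming, written from the map phase
output/part-r-00000    <- new naming, written from the reduce phase
}}}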


[Pig Wiki] Update of "Pig070IncompatibleChanges" by PradeepKamath

2010-03-11 Thread Apache Wiki

The "Pig070IncompatibleChanges" page has been changed by PradeepKamath.
http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=32&rev2=33

--

  || Switching to Hadoop's local mode || Local Mode || Low || None || Main 
change is 10-20x performance slowdown. Also, local mode now uses the same UDF 
interfaces to execute UDFs as the MR mode. ||
  || Removing support for Load-Stream or Stream-Store optimization || Streaming 
|| Low to None || None || This feature was never documented so it is unlikely 
it was ever used ||
  || We no longer support serialization and deserialization via load/store 
functions || Streaming || Unknown but hopefully low to medium || Implement new 
PigToStream and StreamToPig interfaces for non-standard serialization || 
LoadStoreRedesignProposal ||
- || Output part files now have a "-m-" and "-r" in the name || Output file 
names || Low to medium || If you have a system which depends on output file 
names tghe names now have changed from part-X to part-m-X if the output 
is being written from the map phase of the job or part-r- if it is being 
written from the reduce phase || || 
+ || Output part files now have a "-m-" and "-r" in the name || Output file 
names || Low to medium || If you have a system which depends on output file 
names the names now have changed from part-X to part-m-X if the output 
is being written from the map phase of the job or part-r- if it is being 
written from the reduce phase || ||
  || Removing BinaryStorage builtin || Streaming || Low to None || None || As 
far as we know, this class was only used internally by streaming ||
  || Removing Split by file feature || Split by File || Low to None || Input 
format of the loader would need to support this || We don't believe this 
feature was widely (if ever) used ||
  || Local files no longer accessible from cluster || Access to Local Files 
from Map-Reduce Mode || Low to None || Copy the file to the cluster using the 
copyFromLocal command prior to the load || This feature was not documented ||


[Pig Wiki] Update of "Pig070IncompatibleChanges" by PradeepKamath

2010-03-11 Thread Apache Wiki

The "Pig070IncompatibleChanges" page has been changed by PradeepKamath.
http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=31&rev2=32

--

  || Switching to Hadoop's local mode || Local Mode || Low || None || Main 
change is 10-20x performance slowdown. Also, local mode now uses the same UDF 
interfaces to execute UDFs as the MR mode. ||
  || Removing support for Load-Stream or Stream-Store optimization || Streaming 
|| Low to None || None || This feature was never documented so it is unlikely 
it was ever used ||
  || We no longer support serialization and deserialization via load/store 
functions || Streaming || Unknown but hopefully low to medium || Implement new 
PigToStream and StreamToPig interfaces for non-standard serialization || 
LoadStoreRedesignProposal ||
+ || Output part files now have a "-m-" and "-r" in the name || Output file 
names || Low to medium || If you have a system which depends on output file 
names tghe names now have changed from part-X to part-m-X if the output 
is being written from the map phase of the job or part-r- if it is being 
written from the reduce phase || || 
  || Removing BinaryStorage builtin || Streaming || Low to None || None || As 
far as we know, this class was only used internally by streaming ||
  || Removing Split by file feature || Split by File || Low to None || Input 
format of the loader would need to support this || We don't believe this 
feature was widely (if ever) used ||
  || Local files no longer accessible from cluster || Access to Local Files 
from Map-Reduce Mode || Low to None || Copy the file to the cluster using the 
copyFromLocal command prior to the load || This feature was not documented ||


[Pig Wiki] Update of "Pig070IncompatibleChanges" by PradeepKamath

2010-03-03 Thread Apache Wiki

The "Pig070IncompatibleChanges" page has been changed by PradeepKamath.
http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=30&rev2=31

--

  == Summary ==
  
  || Change || Section || Impact || Steps to address || Comments ||
- || Load/Store interface changes || Changes to the Load and Store Functions || 
High || [[LoadStoreMigrationGuide || Load Store Migration Guide]] || ||
+ || Load/Store interface changes || Changes to the Load and Store Functions || 
High || [[LoadStoreMigrationGuide || Load Store Migration Guide]] 
[[Pig070LoadStoreHowTo || Pig 0.7.0 Load Store Guide]]|| ||
  || Data compression becomes load/store function specific || Handling 
Compressed Data || Unknown but hopefully low || If compression is needed, the 
underlying Input/Output format would need to support it || ||
  || Bzip compressed files in PigStorage format can no longer have .bz 
extension  || Handling Compressed Data || Low || 1. Rename existing .bz files 
to .bz2 files. 2. Update scripts to read/write files with bz2 extension || This 
change is due to the fact that Text{Input/Output}Format only supports bz2 
extension ||
  || Switching to Hadoop's local mode || Local Mode || Low || None || Main 
change is 10-20x performance slowdown. Also, local mode now uses the same UDF 
interfaces to execute UDFs as the MR mode. ||


[Pig Wiki] Update of "Pig070IncompatibleChanges" by PradeepKamath

2010-02-23 Thread Apache Wiki

The "Pig070IncompatibleChanges" page has been changed by PradeepKamath.
http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=20&rev2=21

--

  
  First, in the initial (0.7.0) release, '''we will not support the 
optimization''' where, if streaming follows a load of a compatible format or is 
followed by a format-compatible store, the data is not parsed but passed in 
chunks from the loader or to the store. The main reason we are not porting the 
optimization is that the work is not trivial and the optimization was never 
documented and so is unlikely to be used.
  
- Second, '''you can no longer use load/store functions for 
(de)serialization.''' A new interface has been defined that has to be 
implemented for custom (de)serializations. The default (PigStorage) format will 
continue to work. This format is now implemented by a class called 
org.apache.pig.impl.streaming.PigStreaming that can be also used directly in 
the streaming statement. Note that this class handles arbitrary delimiters: For 
example your statement could look like:
+ Second, '''you can no longer use load/store functions for 
(de)serialization.''' A new interface has been defined that has to be 
implemented for custom (de)serializations. The default (PigStorage) format will 
continue to work. This format is now implemented by a class called 
org.apache.pig.builtin.PigStreaming that can also be used directly in the 
streaming statement. Note that this class handles arbitrary delimiters; for 
example, your statement could look like:
  {{{
-  `perl StreamScript.pl` input(stdin using 
org.apache.pig.impl.streaming.PigStreaming(',')) output(stdout using 
org.apache.pig.impl.streaming.PigStreaming(';')) <...remaining options...>;
+  `perl StreamScript.pl` input(stdin using PigStreaming(',')) output(stdout 
using PigStreaming(';')) <...remaining options...>;
  }}}Details of the new interface are described in 
http://wiki.apache.org/pig/LoadStoreRedesignProposal.
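As a sketch of what implementing the new interfaces might look like (an 
illustrative, untested example based on the proposal; the class name 
TabStreaming is hypothetical), a tab-delimited serializer could be written as:

{{{
// Illustrative sketch only - assumes the PigToStream/StreamToPig interfaces
// described in LoadStoreRedesignProposal; TabStreaming is a hypothetical name.
import java.io.IOException;

import org.apache.pig.LoadCaster;
import org.apache.pig.PigToStream;
import org.apache.pig.StreamToPig;
import org.apache.pig.builtin.Utf8StorageConverter;
import org.apache.pig.data.DataByteArray;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class TabStreaming implements PigToStream, StreamToPig {

    // Turn a tuple into the bytes sent to the streaming executable
    public byte[] serialize(Tuple t) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < t.size(); i++) {
            if (i > 0) sb.append('\t');
            Object field = t.get(i);
            sb.append(field == null ? "" : field.toString());
        }
        sb.append('\n');
        return sb.toString().getBytes("UTF-8");
    }

    // Turn one line of the executable's output back into a tuple
    public Tuple deserialize(byte[] bytes) throws IOException {
        String line = new String(bytes, "UTF-8");
        String[] fields = line.split("\t", -1);
        Tuple t = TupleFactory.getInstance().newTuple(fields.length);
        for (int i = 0; i < fields.length; i++) {
            t.set(i, new DataByteArray(fields[i]));
        }
        return t;
    }

    // Fields come back as bytearrays; use the standard UTF-8 caster
    public LoadCaster getLoadCaster() throws IOException {
        return new Utf8StorageConverter();
    }
}
}}}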
  
  We have also removed org.apache.pig.builtin.BinaryStorage loader/store 
function and org.apache.pig.builtin.PigDump which were only used from within 
streaming. They can be restored if needed - we would just need to implement the 
corresponding Input/OutputFormats.
@@ -70, +70 @@

  
  == Removing Custom Comparators ==
  
- This functionality was added to deal with gap in Pig's early functionality - 
lack of numeric comparison in order by as well as lack of descending sort. This 
functionality has been present in last 4 releases and custom comparators has 
been depricated in the last several releases. They functionality is removed in 
this release.
+ This functionality was added to deal with a gap in Pig's early functionality 
- lack of numeric comparison in order by as well as lack of descending sort. 
This functionality has been present in the last 4 releases and custom 
comparators have been deprecated in the last several releases. The 
functionality is removed in this release.
  
  == Merge Join ==
  In Pig 0.6.0 there was a pre-condition for merge join: "The loadfunc for the 
right input of the join should implement the !SamplableLoader interface" - 
instead the !LoadFunc should now implement the !OrderedLoadFunc interface in 
Pig 0.7.0. All other pre-conditions still hold.


[Pig Wiki] Update of "Pig070IncompatibleChanges" by PradeepKamath

2010-02-18 Thread Apache Wiki

The "Pig070IncompatibleChanges" page has been changed by PradeepKamath.
http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=19&rev2=20

--

  
  First, in the initial (0.7.0) release, '''we will not support the 
optimization''' where, if streaming follows a load of a compatible format or is 
followed by a format-compatible store, the data is not parsed but passed in 
chunks from the loader or to the store. The main reason we are not porting the 
optimization is that the work is not trivial and the optimization was never 
documented and so is unlikely to be used.
  
- Second, '''you can no longer use load/store functions for 
(de)serialization.''' A new interface has been defined that has to be 
implemented for custom (de)serializations. The default (PigStorage) format will 
continue to work. This format is now implemented by a class called 
org.apache.pig.impl.streaming.PigStreaming that can be also used directly in 
the streaming statement. Details of the new interface are describe in 
http://wiki.apache.org/pig/LoadStoreRedesignProposal.
+ Second, '''you can no longer use load/store functions for 
(de)serialization.''' A new interface has been defined that has to be 
implemented for custom (de)serializations. The default (PigStorage) format will 
continue to work. This format is now implemented by a class called 
org.apache.pig.impl.streaming.PigStreaming that can also be used directly in 
the streaming statement. Note that this class handles arbitrary delimiters; for 
example, your statement could look like:
+ {{{
+  `perl StreamScript.pl` input(stdin using 
org.apache.pig.impl.streaming.PigStreaming(',')) output(stdout using 
org.apache.pig.impl.streaming.PigStreaming(';')) <...remaining options...>;
+ }}}Details of the new interface are described in 
http://wiki.apache.org/pig/LoadStoreRedesignProposal.
  
  We have also removed org.apache.pig.builtin.BinaryStorage loader/store 
function and org.apache.pig.builtin.PigDump which were only used from within 
streaming. They can be restored if needed - we would just need to implement the 
corresponding Input/OutputFormats.
  


[Pig Wiki] Update of "Pig070IncompatibleChanges" by PradeepKamath

2010-02-18 Thread Apache Wiki

The "Pig070IncompatibleChanges" page has been changed by PradeepKamath.
http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=18&rev2=19

--

  
  In the earlier versions of Pig, a user could specify "split by file" on the 
loader statement which would make sure that each map got the entire file rather 
than the file being further divided into blocks. This feature was primarily 
designed for streaming optimization but could also be used with loaders that 
can't deal with incomplete records. We don't believe that this functionality 
has been widely used.
  
- Because the slicing of the data is no longer in Pig's control, we can't 
support this feature generically for every loader. If a particular loader needs 
this functionality, it will need to make sure that the underlying InputFormat 
supports it. 
+ Because the slicing of the data is no longer in Pig's control, we can't 
support this feature generically for every loader. If a particular loader needs 
this functionality, it will need to make sure that the underlying InputFormat 
supports it. (Any !InputFormat based on !FileInputFormat will support this 
through the mapred.min.split.size property - if this property is set to a value 
greater than the size of any of the files to be loaded then each file will be 
loaded as a single split. This property can be provided on the pig command line 
as a Java -D property - note that it will apply to all jobs that are run as 
part of that script.)
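For instance (script name and size are illustrative; 1073741824 bytes = 1 GB, 
chosen larger than any input file), the property can be set for all jobs in a 
script from the command line:

{{{
pig -Dmapred.min.split.size=1073741824 myscript.pig
}}}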
  
  We will have a different approach for streaming optimization if that 
functionality is necessary.
  


[Pig Wiki] Update of "Pig070IncompatibleChanges" by PradeepKamath

2010-02-16 Thread Apache Wiki

The "Pig070IncompatibleChanges" page has been changed by PradeepKamath.
http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=17&rev2=18

--

  This functionality was added to deal with a gap in Pig's early functionality 
- lack of numeric comparison in order by as well as lack of descending sort. 
This functionality has been present in the last 4 releases and custom 
comparators have been deprecated in the last several releases. The 
functionality is removed in this release.
  
  == Merge Join ==
- In Pig.0.6.0 there was a pre-condition for merge join: "The loadfunc for the 
right input of the join should implement the SamplableLoader interface" - this 
condition is no longer required in Pig 0.7.0. All other pre-condtions still 
hold.
+ In Pig 0.6.0 there was a pre-condition for merge join: "The loadfunc for the 
right input of the join should implement the !SamplableLoader interface" - 
instead the !LoadFunc should now implement the !OrderedLoadFunc interface in 
Pig 0.7.0. All other pre-conditions still hold.
  


[Pig Wiki] Update of "Pig070IncompatibleChanges" by PradeepKamath

2010-02-10 Thread Apache Wiki

The "Pig070IncompatibleChanges" page has been changed by PradeepKamath.
http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=16&rev2=17

--

  
  == Changes to the Load and Store Functions ==
  
- TBW [Need to take a load (with and without custom slicer) and a store 
function and create new versions as examples. Can use PigStorage for (1) and 
(3) but need to choose a loader for (2).]
+ See [[LoadStoreMigrationGuide | Load Store Migration Guide]]
  
  
  == Handling Compressed Data ==


[Pig Wiki] Update of "Pig070IncompatibleChanges" by PradeepKamath

2010-02-04 Thread Apache Wiki

The "Pig070IncompatibleChanges" page has been changed by PradeepKamath.
http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=15&rev2=16

--

  
  This functionality was added to deal with a gap in Pig's early functionality 
- lack of numeric comparison in order by as well as lack of descending sort. 
This functionality has been present in the last 4 releases and custom 
comparators have been deprecated in the last several releases. The 
functionality is removed in this release.
  
+ == Merge Join ==
+ In Pig 0.6.0 there was a pre-condition for merge join: "The loadfunc for the 
right input of the join should implement the SamplableLoader interface" - this 
condition is no longer required in Pig 0.7.0. All other pre-conditions still 
hold.
+ 


[Pig Wiki] Update of "Pig070IncompatibleChanges" by PradeepKamath

2010-01-28 Thread Apache Wiki

The "Pig070IncompatibleChanges" page has been changed by PradeepKamath.
http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=13&rev2=14

--

  
  == Access to Local Files from Map-Reduce Mode ==
  
- In the earlier version of Pig, you could access a local file from map-reduce 
mode by prepending file:// to the file location:
+ In the earlier version of Pig, you could access a local file from map-reduce 
mode by prepending file: to the file location:
  
  {{{
  A = load 'file:/mydir/myfile';


[Pig Wiki] Update of "Pig070IncompatibleChanges" by PradeepKamath

2010-01-11 Thread Apache Wiki

The "Pig070IncompatibleChanges" page has been changed by PradeepKamath.
http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=12&rev2=13

--

  
  First, in the initial (0.7.0) release, '''we will not support the 
optimization''' where, if streaming follows a load of a compatible format or is 
followed by a format-compatible store, the data is not parsed but passed in 
chunks from the loader or to the store. The main reason we are not porting the 
optimization is that the work is not trivial and the optimization was never 
documented and so is unlikely to be used.
  
- Second, '''you can no longer use load/store functions for 
(de)serialization.''' A new interface has been defined that needed to be 
implemented for custom (de)serializations. The default (PigStorage) format will 
continue to work. This formar is now implemented by a class called 
org.apache.pig.impl.streaming.PigStreaming that can be also used directly in 
the streaming statement. Details of the new interface are describe in 
http://wiki.apache.org/pig/LoadStoreRedesignProposal.
+ Second, '''you can no longer use load/store functions for 
(de)serialization.''' A new interface has been defined that has to be 
implemented for custom (de)serializations. The default (PigStorage) format will 
continue to work. This format is now implemented by a class called 
org.apache.pig.impl.streaming.PigStreaming that can also be used directly in 
the streaming statement. Details of the new interface are described in 
http://wiki.apache.org/pig/LoadStoreRedesignProposal.
  
  We have also removed org.apache.pig.builtin.BinaryStorage loader/store 
function and org.apache.pig.builtin.PigDump which were only used from within 
streaming. They can be restored if needed - we would just need to implement the 
corresponding Input/OutputFormats.
  


[Pig Wiki] Update of "Pig070IncompatibleChanges" by PradeepKamath

2009-12-22 Thread Apache Wiki

The "Pig070IncompatibleChanges" page has been changed by PradeepKamath.
http://wiki.apache.org/pig/Pig070IncompatibleChanges?action=diff&rev1=9&rev2=10

--

  
  Second, '''you can no longer use load/store functions for 
(de)serialization.''' A new interface has been defined that needs to be 
implemented for custom (de)serializations. The default (PigStorage) format will 
continue to work. This format is now implemented by a class called 
org.apache.pig.impl.streaming.PigStreaming that can also be used directly in 
the streaming statement. Details of the new interface are described in 
http://wiki.apache.org/pig/LoadStoreRedesignProposal.
  
- We have also removed org.apache.pig.builtin.BinaryStorage loader/store 
function and org.apache.pig.builtin.PigDump which were only used from within 
straming. They can be restored if needed - we would just need to implement the 
corresponding Input/OutputFormats.
+ We have also removed org.apache.pig.builtin.BinaryStorage loader/store 
function and org.apache.pig.builtin.PigDump which were only used from within 
streaming. They can be restored if needed - we would just need to implement the 
corresponding Input/OutputFormats.
  
  == Split by File ==
  
@@ -60, +60 @@

  In Pig 0.7.0, you can no longer do this; if this functionality is still 
desired, you can add the copy to your script manually:
  
  {{{
- fs copyFromLocal src dist
+ fs -copyFromLocal src dist
  A = load 'dist';
  
  }}}