[
https://issues.apache.org/jira/browse/PIG-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jerry Chen updated PIG-3404:
----------------------------
Attachment: PIG-3404.patch
Patch for reference
> Improve Pig to ignore bad files or inaccessible files or folders
> ----------------------------------------------------------------
>
> Key: PIG-3404
> URL: https://issues.apache.org/jira/browse/PIG-3404
> Project: Pig
> Issue Type: New Feature
> Components: data
> Affects Versions: 0.11.2
> Reporter: Jerry Chen
> Labels: Rhino
> Attachments: PIG-3404.patch
>
>
> Consider the following use cases in Pig:
> * A directory is used as the input of a load operation. One or more files
> in that directory may be bad (for example, corrupted, or containing bad
> data caused by compression).
> * A directory is used as the input of a load operation. The current user may
> not have permission to access some subdirectories or files of that directory.
> The current Pig implementation aborts the whole Pig job in such cases.
> It would be useful to have an option that lets the job continue and ignore
> the bad or inaccessible files and folders without aborting, ideally logging
> or printing a warning for each such error or violation. This requirement is
> not trivial: for the big data sets used by large analytics applications, it
> is not always possible to sort out the good data before processing, so
> ignoring a few bad files may be a better choice in such situations.
> We propose an “Ignore bad files” flag to address this problem.
> AvroStorage and the related file formats in Pig already have this flag, but
> it does not cover all the cases mentioned above. We would improve
> PigStorage and the related text formats to support this flag, and improve
> AvroStorage and related facilities to support the concept completely.
> The flag is set per storage function (for example, PigStorage or
> AvroStorage) and can be set for each load operation individually. The flag
> defaults to false if it is not explicitly set. Ideally, we can also provide
> a global Pig parameter that changes the default to true for all load
> functions that do not set the flag explicitly in the LOAD statement.
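The behaviour proposed above can be illustrated with a small, language-neutral sketch in Python. It is not Pig code and does not reflect the attached patch: the names resolve_ignore_flag and load_lines are hypothetical, and the sketch only looks at regular files directly inside the input directory. It shows the two ideas in the description: an explicit per-load flag overriding a global default, and bad files being skipped with a warning instead of aborting the whole load.

```python
import logging
import os
from typing import Iterator, Optional

log = logging.getLogger("ignore_bad_files_sketch")


def resolve_ignore_flag(per_load: Optional[bool], global_default: bool = False) -> bool:
    """Hypothetical helper: an explicit per-load setting wins,
    otherwise the global default (false unless overridden) applies."""
    return global_default if per_load is None else per_load


def load_lines(directory: str, ignore_bad_files: bool) -> Iterator[str]:
    """Yield the lines of every regular file directly inside `directory`.

    A file that cannot be opened or decoded is skipped with a warning when
    `ignore_bad_files` is true; otherwise the first failure propagates and
    aborts the load.
    """
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # subdirectories are out of scope for this sketch
        try:
            # Read the whole file first so a bad file contributes no partial data.
            with open(path, encoding="utf-8") as handle:
                lines = handle.read().splitlines()
        except (OSError, UnicodeDecodeError) as err:
            if not ignore_bad_files:
                raise
            log.warning("skipping %s: %s", path, err)
            continue
        yield from lines
```

A caller would compute the effective flag once per load, for example load_lines(path, resolve_ignore_flag(per_load=None, global_default=True)), so that a load without an explicit setting follows the global default.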