Jacob Tolar created PIG-5432:
--------------------------------

             Summary: OrcStorage fails to detect schema in some cases
                 Key: PIG-5432
                 URL: https://issues.apache.org/jira/browse/PIG-5432
             Project: Pig
          Issue Type: New Feature
            Reporter: Jacob Tolar


OrcStorage needs to detect the schema from its input data paths. If some of 
those paths contain no data files, schema detection can fail. 

For example: 

{code}
A = LOAD '/path/to/20230101,/path/to/20230102' USING OrcStorage();
{code}

If {{/path/to/20230101}} contains only a {{_SUCCESS}} marker and 
{{/path/to/20230102}} contains data, OrcStorage fails to detect the schema. 

The code is meant to search all input paths recursively for a data file 
(via Utils.depthFirstSearchForFile), but the search is implemented 
incorrectly and returns early in this scenario.
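
To illustrate the intended behavior, here is a minimal sketch of a search that 
keeps going when one input path yields nothing. It is not the Pig code and not 
the patch: it uses java.nio on a local filesystem instead of Hadoop's FileSystem, 
and the class name, method names, and the rule that entries starting with "_" or 
"." are markers are all assumptions made for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration only; not part of Pig.
final class SchemaFileSearch {

    // Assumption: names starting with '_' or '.' (e.g. _SUCCESS) are markers, not data.
    static boolean isHidden(Path p) {
        Path name = p.getFileName();
        if (name == null) {
            return false; // filesystem root has no name
        }
        String s = name.toString();
        return s.startsWith("_") || s.startsWith(".");
    }

    // Depth-first search under a single root. Returns empty if no data file exists there.
    static Optional<Path> searchOne(Path root) throws IOException {
        if (!Files.exists(root) || isHidden(root)) {
            return Optional.empty();
        }
        if (Files.isRegularFile(root)) {
            return Optional.of(root);
        }
        if (!Files.isDirectory(root)) {
            return Optional.empty();
        }
        List<Path> children;
        try (Stream<Path> entries = Files.list(root)) {
            children = entries.sorted().collect(Collectors.toList());
        }
        for (Path child : children) {
            Optional<Path> hit = searchOne(child);
            if (hit.isPresent()) {
                return hit;
            }
            // No hit under this child: fall through to the next sibling
            // instead of returning early.
        }
        return Optional.empty();
    }

    // Tries every input path in order and stops only at the first real hit.
    static Optional<Path> findFirstDataFile(List<Path> roots) throws IOException {
        for (Path root : roots) {
            Optional<Path> hit = searchOne(root);
            if (hit.isPresent()) {
                return hit;
            }
        }
        return Optional.empty();
    }
}
```

With a first path that holds only {{_SUCCESS}} and a second path that holds a 
data file, findFirstDataFile returns the file under the second path rather than 
giving up after the first.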

See: 
https://github.com/apache/pig/blob/c0d75ba930f9aa5c6454d0264a96f82b45279202/src/org/apache/pig/builtin/OrcStorage.java#L389-L408

https://github.com/apache/pig/blob/59ec4a326079c9f937a052194405415b1e3a2b06/src/org/apache/pig/impl/util/Utils.java#L629-L667


I'll attach a patch.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
