[ https://issues.apache.org/jira/browse/PIG-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884834#action_12884834 ]
Richard Ding commented on PIG-1389:
-----------------------------------
We currently use the Hadoop Path class to parse the input/output locations for
two use cases:
* Determine the scheme of the location (e.g., hdfs, hbase, file, har, ...), and
* Get the short file name of the location
Ashutosh is right that this approach doesn't work with a location string such
as the one in PIG-1229:
{code}
jdbc:hsqldb:file:/tmp/batchtest;hsqldb.default_table_type=cached;hsqldb.cache_rows=100
{code}
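Path throws a RuntimeException when trying to parse it.
The proposal is to use Java URI (java.net.URI) as the parser for these use
cases instead. Its constructor throws a checked exception (URISyntaxException)
for invalid syntax, so callers have to handle bad locations explicitly.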
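For illustration, the two use cases could be sketched with java.net.URI roughly as follows. This is a minimal sketch, not the code from the attached patch: the class name LocationParser, the method names scheme and shortName, and the fallback of returning the whole location when no path component is available are all assumptions made for this example.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Pig: illustrates the two use cases only.
final class LocationParser {
    private LocationParser() {}

    /** Returns the URI scheme, or null if there is none or the syntax is invalid. */
    static String scheme(String location) {
        try {
            return new URI(location).getScheme();
        } catch (URISyntaxException e) {
            // Checked exception: the caller decides how to treat a bad location.
            return null;
        }
    }

    /** Returns the last path segment, or the whole location if no path is available. */
    static String shortName(String location) {
        try {
            // getPath() is null for opaque URIs such as "jdbc:hsqldb:...".
            String path = new URI(location).getPath();
            if (path == null || path.isEmpty()) {
                return location; // assumption: fall back to the full string
            }
            int slash = path.lastIndexOf('/');
            return slash < 0 ? path : path.substring(slash + 1);
        } catch (URISyntaxException e) {
            return location; // assumption: same fallback for invalid syntax
        }
    }

    public static void main(String[] args) {
        String hdfs = "hdfs://namenode:8020/user/pig/input/data.txt";
        String jdbc = "jdbc:hsqldb:file:/tmp/batchtest;"
                + "hsqldb.default_table_type=cached;hsqldb.cache_rows=100";
        System.out.println(scheme(hdfs) + " " + shortName(hdfs));
        System.out.println(scheme(jdbc) + " " + shortName(jdbc));
    }
}
```

With this sketch, the JDBC location from PIG-1229 parses as an opaque URI with scheme "jdbc" instead of raising an unchecked exception, while a location that really is malformed (for example, one containing an unescaped space) surfaces as a URISyntaxException that the code must catch.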
> Implement Pig counter to track number of rows for each input files
> -------------------------------------------------------------------
>
> Key: PIG-1389
> URL: https://issues.apache.org/jira/browse/PIG-1389
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.7.0
> Reporter: Richard Ding
> Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1389.patch, PIG-1389.patch, PIG-1389_1.patch
>
>
> An MR job generated by Pig can have not only multiple outputs (in the case of
> multiquery) but also multiple inputs (in the case of join or cogroup). In
> both cases, the existing Hadoop counters (e.g. MAP_INPUT_RECORDS,
> REDUCE_OUTPUT_RECORDS) cannot be used to count the number of records in a
> given input or output. PIG-1299 addressed the case of multiple outputs. We
> need to add new counters for jobs with multiple inputs.