[
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13868977#comment-13868977
]
Lorand Bendig commented on PIG-3642:
------------------------------------
[~cheolsoo], thanks for pointing out these issues!
{quote}org.apache.pig.test.TestDefaultDateTimeZone.testLocalExecution{quote}
It's because fetch didn't initialize pig.datetime.default.tz with the current
timezone. Fixed.
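A minimal sketch of that kind of fallback, assuming the property is simply filled in with the JVM's current zone ID when nothing has been set. The class and method names are hypothetical and this is not the code from the patch:

```java
import java.util.Properties;
import java.util.TimeZone;

// Hypothetical helper; only the property key comes from the comment above.
final class DefaultTimeZoneSketch {
    static final String TZ_KEY = "pig.datetime.default.tz";

    // Returns a copy of props in which TZ_KEY falls back to the JVM's
    // current time zone ID when the caller has not set it explicitly.
    static Properties withDefaultTimeZone(Properties props) {
        Properties copy = new Properties();
        copy.putAll(props);
        if (copy.getProperty(TZ_KEY) == null) {
            copy.setProperty(TZ_KEY, TimeZone.getDefault().getID());
        }
        return copy;
    }
}
```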
{quote}
org.apache.pig.test.TestEvalPipeline2.testNonStandardDataWithoutFetch
org.apache.pig.test.TestEvalPipeline2.testBinStorageByteArrayCastsSimple
org.apache.pig.test.TestEvalPipeline2.testLoadWithDifferentSchema
{quote}
This was a non-fetch issue, now fixed in PIG-3662.
{quote}org.apache.pig.test.TestStoreInstances.testBackendStoreCommunication{quote}
The problem here was that FetchOptimizer initialized FileLocalizer#relativeRoot
to check whether a POStore is related to a dump.
The initialized temporary path is kept in a thread-local, so a
"Wrong FS: file:/..., expected: hdfs://..." exception can be thrown
when a test is executed in both local and mapreduce mode in the same session:
the temp path is initialized to file:/ for local mode and is then reused
for mapreduce mode, which causes the exception.
Now relativeRoot is not initialized by FetchOptimizer.
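To illustrate the failure mode, here is a self-contained sketch of a lazily initialized thread-local root being reused across modes. All names (TempRootSketch, checkPath, the namenode URI) are hypothetical stand-ins, not Pig or Hadoop code:

```java
import java.net.URI;

// Hypothetical stand-in for a thread-local temp root such as FileLocalizer#relativeRoot.
final class TempRootSketch {
    // Cached once per thread and never refreshed unless reset() is called.
    private static final ThreadLocal<URI> RELATIVE_ROOT = new ThreadLocal<>();

    // Returns the cached root, initializing it from the given default FS on first use.
    static URI relativeRoot(URI defaultFs) {
        URI root = RELATIVE_ROOT.get();
        if (root == null) {
            root = defaultFs.resolve("/tmp/temp-sketch");
            RELATIVE_ROOT.set(root);
        }
        return root;
    }

    // Mimics a file system's path check: the path's scheme must match the FS scheme.
    static void checkPath(URI fs, URI path) {
        if (!fs.getScheme().equals(path.getScheme())) {
            throw new IllegalArgumentException("Wrong FS: " + path + ", expected: " + fs);
        }
    }

    // Clears the cached root for the current thread.
    static void reset() {
        RELATIVE_ROOT.remove();
    }

    public static void main(String[] args) {
        URI local = URI.create("file:///");
        URI hdfs = URI.create("hdfs://namenode:8020/");
        relativeRoot(local);             // local-mode run caches a file: root
        URI stale = relativeRoot(hdfs);  // mapreduce-mode run on the same thread reuses it
        try {
            checkPath(hdfs, stale);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch the stale value disappears once nothing initializes the root as a side effect of an unrelated check, which is the spirit of the fix described above.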
I managed to run test-core successfully with -Dhadoopversion=23.
> Direct HDFS access for small jobs (fetch)
> ------------------------------------------
>
> Key: PIG-3642
> URL: https://issues.apache.org/jira/browse/PIG-3642
> Project: Pig
> Issue Type: Improvement
> Reporter: Lorand Bendig
> Assignee: Lorand Bendig
> Fix For: 0.13.0
>
> Attachments: PIG-3642-3.patch, PIG-3642-4.patch, PIG-3642.patch
>
>
> With this patch I'd like to add the possibility to directly read data from
> HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive
> already has this feature (fetch). This patch shares some similarities with
> the local mode of Pig 0.6. Here, fetching kicks off when the following holds
> for a script:
> * it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM,
> (nested) FOREACH with expression operators, custom UDFs, etc.
> * no scalar aliases
> * no SampleLoader
> * single leaf job
> * DUMP (no STORE)
> The feature is enabled by default and can be toggled with:
> * -N or -no_fetch
> * set opt.fetch true/false;
> There's no STORE support because I wanted to make it explicit that this
> "optimization" is for launching small/simple scripts during development,
> rather than querying and filtering a large number of rows on the client
> machine. However, a threshold could be given on the input size (an
> estimation) to determine whether to prefer fetch over MR jobs, similar to
> what Hive's '{{hive.fetch.task.conversion.threshold}}' does (perhaps through
> Pig's LoadMetadata#getStatistics?).
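The eligibility conditions in the description above, together with the proposed size threshold, could be sketched roughly as follows. This is an illustration only: the Op enum, the canFetch signature and the threshold parameter are hypothetical, the threshold is only a suggestion in the description, and the "no split" condition on UNION is not modelled:

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the fetch conditions; not the FetchOptimizer implementation.
final class FetchEligibilitySketch {
    enum Op { LOAD, LIMIT, FILTER, UNION, STREAM, FOREACH, JOIN, GROUP, ORDER, STORE, DUMP }

    // Operators assumed fetchable: the listed ones plus LOAD and DUMP,
    // which any dumped script needs. STORE is deliberately excluded.
    private static final Set<Op> SIMPLE = EnumSet.of(
            Op.LOAD, Op.LIMIT, Op.FILTER, Op.UNION, Op.STREAM, Op.FOREACH, Op.DUMP);

    static boolean canFetch(boolean fetchEnabled, List<Op> plan, int leafCount,
                            boolean hasScalarAliases, boolean usesSampleLoader,
                            long estimatedInputBytes, long thresholdBytes) {
        if (!fetchEnabled) return false;                         // -N / opt.fetch false
        if (leafCount != 1) return false;                        // single leaf job
        if (hasScalarAliases || usesSampleLoader) return false;  // no scalars, no SampleLoader
        if (!plan.contains(Op.DUMP)) return false;               // DUMP only
        if (estimatedInputBytes > thresholdBytes) return false;  // proposed size cutoff
        return SIMPLE.containsAll(plan);                         // only simple operators
    }
}
```

For example, a LOAD, FILTER, LIMIT, DUMP plan with a single leaf and a small estimated input would pass this check, while the same plan ending in STORE would fall back to MR.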