[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2014-04-09 Thread Lorand Bendig (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13963972#comment-13963972
 ] 

Lorand Bendig commented on PIG-3642:


[~aniket486] Currently the restriction is that only DUMP is allowed, which 
implies that users won't query large amounts of data.
If STORE were supported, then 'fetch.task.conversion.threshold' would make 
sense. My concern, however, is that most of the loaders return null for 
LoadMetadata#getStatistics. 
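
To make the concern concrete, here is a minimal Java sketch of what a 
threshold check might look like. It is not taken from any of the attached 
patches: the class name, the method name and the 'fetchWhenUnknown' flag are 
all hypothetical, and the size estimate is modeled as a plain nullable Long 
rather than whatever a loader really returns.

```java
/**
 * Illustrative sketch only. Every name here is hypothetical and is not part
 * of Pig or of the PIG-3642 patches.
 */
final class FetchThresholdSketch {

    /**
     * Decides whether a script should be fetched directly or run as an MR job.
     *
     * @param estimatedInputBytes input size estimate, or null when the loader
     *                            reports no statistics
     * @param thresholdBytes      assumed upper bound on input size for fetch
     * @param fetchWhenUnknown    policy to apply when no estimate is available
     */
    static boolean preferFetch(Long estimatedInputBytes,
                               long thresholdBytes,
                               boolean fetchWhenUnknown) {
        if (estimatedInputBytes == null) {
            // No statistics: the threshold cannot be evaluated at all,
            // so the outcome depends purely on the fallback policy.
            return fetchWhenUnknown;
        }
        return estimatedInputBytes <= thresholdBytes;
    }

    public static void main(String[] args) {
        long threshold = 1024L * 1024L; // assumed 1 MB limit, demo value only
        System.out.println(preferFetch(4096L, threshold, false));
        System.out.println(preferFetch(10L * threshold, threshold, false));
        System.out.println(preferFetch(null, threshold, false));
    }
}
```

If most loaders provide no estimate, nearly every call takes the null branch, 
so the fallback policy rather than the threshold would decide the outcome.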

> Direct HDFS access for small jobs (fetch)
> -----------------------------------------
>
> Key: PIG-3642
> URL: https://issues.apache.org/jira/browse/PIG-3642
> Project: Pig
> Issue Type: Improvement
> Reporter: Lorand Bendig
> Assignee: Lorand Bendig
> Fix For: 0.13.0
>
> Attachments: PIG-3642-3.patch, PIG-3642-4.patch, PIG-3642-5.patch, 
> PIG-3642-6.patch, PIG-3642.patch
>
>
> With this patch I'd like to add the possibility to directly read data from 
> HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive 
> already has this feature (fetch). This patch shares some similarities with 
> the local mode of Pig 0.6. Here, fetching kicks off when the following holds 
> for a script:
> * it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, 
> (nested) FOREACH with expression operators, custom UDFs, etc.
> * no scalar aliases
> * no SampleLoader
> * single leaf job
> * DUMP (no STORE)
> The feature is enabled by default and can be toggled with:
> * -N or -no_fetch 
> * set opt.fetch true/false; 
> There's no STORE support because I wanted to make it explicit that this 
> "optimization" is for launching small/simple scripts during development, 
> rather than querying and filtering a large number of rows on the client 
> machine. However, a threshold could be given on the input size (an 
> estimation) to determine whether to prefer fetch over MR jobs, similar to 
> what Hive's '{{hive.fetch.task.conversion.threshold}}' does (perhaps through 
> Pig's LoadMetadata#getStatistics?).
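
The eligibility conditions listed in the description can be read as a simple 
predicate. The Java sketch below is only an illustration of that reading and 
is not the patch's code: the Op enum, the flat operator list and every 
parameter name are hypothetical, and the "UNION only if no split is 
generated" condition is deliberately not modeled.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

/**
 * Illustrative sketch only. The types and names are hypothetical and do not
 * mirror Pig's actual plan classes.
 */
final class FetchEligibilitySketch {

    /** A simplified, assumed set of script operators. */
    enum Op { LOAD, LIMIT, FILTER, UNION, STREAM, FOREACH, GROUP, JOIN, ORDER, DUMP, STORE }

    /** Operators the description lists as fetchable, plus LOAD and DUMP. */
    static final Set<Op> ALLOWED = EnumSet.of(
            Op.LOAD, Op.LIMIT, Op.FILTER, Op.UNION, Op.STREAM, Op.FOREACH, Op.DUMP);

    static boolean isFetchable(List<Op> ops,
                               boolean hasScalarAlias,
                               boolean usesSampleLoader,
                               int leafCount,
                               boolean fetchEnabled) {
        if (!fetchEnabled) {
            return false; // corresponds to -N / -no_fetch or opt.fetch false
        }
        if (hasScalarAlias || usesSampleLoader) {
            return false; // no scalar aliases, no SampleLoader
        }
        if (leafCount != 1) {
            return false; // single leaf job
        }
        if (ops.isEmpty() || ops.get(ops.size() - 1) != Op.DUMP) {
            return false; // must end in DUMP
        }
        return ALLOWED.containsAll(ops); // rejects STORE, GROUP, JOIN, ...
    }

    public static void main(String[] args) {
        List<Op> simple = Arrays.asList(Op.LOAD, Op.FILTER, Op.LIMIT, Op.DUMP);
        List<Op> grouped = Arrays.asList(Op.LOAD, Op.GROUP, Op.FOREACH, Op.DUMP);
        System.out.println(isFetchable(simple, false, false, 1, true));
        System.out.println(isFetchable(grouped, false, false, 1, true));
    }
}
```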



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2014-04-08 Thread Aniket Mokashi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13963529#comment-13963529
 ] 

Aniket Mokashi commented on PIG-3642:

bq. However, a threshold could be given on the input size (an estimation) to 
determine whether to prefer fetch over MR jobs, similar to what Hive's 
'hive.fetch.task.conversion.threshold' does. (through Pig's 
LoadMetadata#getStatistic ?)

[~lbendig], we currently do not have such a restriction, correct? Should we add 
it? (The only case that doesn't require a size restriction is 
load-limit-dump.)



[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2014-02-03 Thread Lorand Bendig (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13889824#comment-13889824
 ] 

Lorand Bendig commented on PIG-3642:


[~cheolsoo], great! Thank you for your assistance! I'll take care of the docs.



[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2014-01-28 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884207#comment-13884207
 ] 

Cheolsoo Park commented on PIG-3642:


[~lbendig], yes, I still want to commit this patch. As soon as you rebase your 
patch, I can re-run the tests.

For e2e tests, I think you can add "opt.fetch=false" to 
"./test/e2e/pig/conf/testpropertiesfile.conf". (I haven't tested it.)



[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2014-01-28 Thread Lorand Bendig (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884027#comment-13884027
 ] 

Lorand Bendig commented on PIG-3642:


[~cheolsoo], do you think this patch is a good candidate for trunk? 
If so, I'd rebase it, have a look at the e2e conf, and run the unit tests 
again (e.g. to see whether it really doesn't conflict with PIG-3463).



[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2014-01-21 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13877734#comment-13877734
 ] 

Cheolsoo Park commented on PIG-3642:


Just FYI: e2e tests in trunk pass as of 01/21, except StreamingPythonUDFs_10. I 
will file a jira for this failure.



[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2014-01-20 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876352#comment-13876352
 ] 

Cheolsoo Park commented on PIG-3642:


[~lbendig], you're right. I was reviewing Aniket's patch today and realized 
that these two patches are fairly independent of each other.

[~aniket486], after understanding your patch more, I agree with Lorand 
regarding the complexity. Besides, PIG-3642 makes mapper-only jobs almost 
instant. I couldn't compare runtime between PIG-3463 and PIG-3642 because the 
current patch for PIG-3463 didn't work for mapper-only jobs. However, I imagine 
PIG-3463 would be considerably slower since it still launches local MR jobs, 
etc. So shall we commit both?

In fact, what really concerns me is that these optimizations make many tests 
run differently than before. For example, many e2e tests that currently run as 
MR jobs could now run as fetch jobs. That significantly changes our code 
coverage. So I'd like to explicitly disable these optimizations in all the 
existing e2e tests. It should be trivial to do via conf files. Do you agree?





[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2014-01-20 Thread Lorand Bendig (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876315#comment-13876315
 ] 

Lorand Bendig commented on PIG-3642:


Cheolsoo, thanks for testing it through.
Although I don't see the complexity as crucial, and the patch is fairly well 
separated from the existing code, I see Aniket's point of view. I am of course 
+1, but obviously you guys on board have the right to make the decision.



[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2014-01-19 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876043#comment-13876043
 ] 

Cheolsoo Park commented on PIG-3642:


Actually, discard my e2e results. I found some environment issues and am 
rerunning them.



[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2014-01-19 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875995#comment-13875995
 ] 

Cheolsoo Park commented on PIG-3642:


I will leave the decision to Aniket and Lorand.

Just FYI- I have been running e2e tests, and I found many test failures even 
without this patch. So it was hard to tell whether this patch breaks any tests 
or not.

Here is the result in the current trunk (without this patch)-
{code}
[exec] Final results ,PASSED: 536  FAILED: 22   SKIPPED: 24   ABORTED: 62   FAILED DEPENDENCY: 0
{code}
We should fix these before things get worse.



[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2014-01-16 Thread Aniket Mokashi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873754#comment-13873754
 ] 

Aniket Mokashi commented on PIG-3642:

[~lbendig], I'm not against adding this feature to Pig. However, from a 
feature point of view, PIG-3463 is a superset of PIG-3642 with less complexity 
(pre-tested code, etc.) and less code maintenance. Hence, we should avoid adding 
this extra complexity and maintenance burden, if possible.



[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2014-01-16 Thread Lorand Bendig (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873440#comment-13873440
 ] 

Lorand Bendig commented on PIG-3642:


[~aniket486] PIG-3463 is a great add-on! Although the two patches target the 
same issue, as far as I can see, they do not overlap.
I think that in this case PIG-3642 can be regarded as a further optimization on 
top of PIG-3463 when only a subset of simple operators is used and there isn't 
even a need to kick off the LocalJobRunner. Having both add-ons can result in a 
much faster development cycle, I think. Why not have both?



[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2014-01-15 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873082#comment-13873082
 ] 

Cheolsoo Park commented on PIG-3642:


[~aniket486], are you saying we should commit PIG-3463 instead of this patch 
because the former includes the latter?



[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2014-01-15 Thread Aniket Mokashi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872774#comment-13872774
 ] 

Aniket Mokashi commented on PIG-3642:

We have PIG-3463, which makes small jobs run in Hadoop local mode. If that looks 
good, we can avoid adding the complexity of FetchOptimizer.



[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2014-01-13 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870364#comment-13870364
 ] 

Cheolsoo Park commented on PIG-3642:


Unit tests all pass! Thank you so much!

I am running e2e tests now.



[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2014-01-12 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869187#comment-13869187
 ] 

Cheolsoo Park commented on PIG-3642:


Thank you Lorand! I am running unit tests again with your new patch.



[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2014-01-12 Thread Lorand Bendig (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13868977#comment-13868977
 ] 

Lorand Bendig commented on PIG-3642:


[~cheolsoo], thanks for pointing out these issues!
{quote}org.apache.pig.test.TestDefaultDateTimeZone.testLocalExecution{quote}
It's because fetch didn't initialize pig.datetime.default.tz with the current 
timezone. Fixed.

{quote}
org.apache.pig.test.TestEvalPipeline2.testNonStandardDataWithoutFetch
org.apache.pig.test.TestEvalPipeline2.testBinStorageByteArrayCastsSimple
org.apache.pig.test.TestEvalPipeline2.testLoadWithDifferentSchema
{quote}
These failures were unrelated to fetch and are now fixed in PIG-3662.

{quote}org.apache.pig.test.TestStoreInstances.testBackendStoreCommunication{quote}
The problem here was that FetchOptimizer initialized FileLocalizer#relativeRoot 
to check whether a POStore is related to a dump.
The initialized temporary path is kept in a thread-local, so a 
"Wrong FS: file:/..., expected: hdfs://..." exception can be thrown 
when the test is executed in both local and mapreduce 
mode in the same session: the temp path is initialized 
to file:/ for local mode and then reused for mapreduce mode, which causes 
the exception.
FetchOptimizer no longer initializes relativeRoot.
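
To illustrate the failure mode described for testBackendStoreCommunication, here is a minimal Python sketch of a lazily initialized thread-local path being reused across execution modes. It is not Pig code: `relative_root`, `check_path`, `reset`, and the example paths are hypothetical stand-ins for FileLocalizer#relativeRoot and the file system check.

```python
import threading

# Stands in for the thread-local that holds the temporary root path.
_state = threading.local()


def relative_root(default_fs: str) -> str:
    """Return the cached temp root, initializing it on first use only."""
    if not hasattr(_state, "root"):
        _state.root = f"{default_fs}/tmp/temp-0"
    return _state.root


def check_path(path: str, expected_fs: str) -> None:
    """Mimic a file system check that rejects paths with another scheme."""
    if not path.startswith(expected_fs):
        raise ValueError(f"Wrong FS: {path}, expected: {expected_fs}")


def reset() -> None:
    """Drop the cached root, as a fresh session would."""
    if hasattr(_state, "root"):
        del _state.root
```

In this sketch, calling `relative_root("file:")` first (local mode) caches a `file:/` path; a later `relative_root("hdfs://namenode")` in the same thread still returns the cached `file:/` path, and `check_path` then raises the "Wrong FS" error. Not initializing the root early, or calling `reset()` between modes, avoids the stale value.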

I managed to run test-core successfully with -Dhadoopversion=23.



[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2014-01-05 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862693#comment-13862693
 ] 

Cheolsoo Park commented on PIG-3642:


I see 6 failures in unit tests:
{code}
>>> org.apache.pig.impl.builtin.TestStreamingUDF.testPythonUDF_onCluster
>>> org.apache.pig.test.TestDefaultDateTimeZone.testLocalExecution
>>> org.apache.pig.test.TestEvalPipeline2.testNonStandardDataWithoutFetch
>>> org.apache.pig.test.TestEvalPipeline2.testBinStorageByteArrayCastsSimple
>>> org.apache.pig.test.TestEvalPipeline2.testLoadWithDifferentSchema
>>> org.apache.pig.test.TestStoreInstances.testBackendStoreCommunication
{code}
Can you please take a look at them?




[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2014-01-02 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860531#comment-13860531
 ] 

Alan Gates commented on PIG-3642:

I don't think this will result in the same local mode/mr mode problem that we 
had before.  The issue there was we tried (and failed) to have two modes where 
Pig provided all features.  This is much more limited to doing things locally 
that can easily be done locally.



[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2014-01-02 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860392#comment-13860392
 ] 

Cheolsoo Park commented on PIG-3642:


[~azaroth], thank you for raising a concern. But I still think we should commit 
this patch for the following reasons-

# Fetch optimization happens after physical plan is fully built. If the plan is 
fetchable (i.e. meets all the conditions Lorand listed in the description), Pig 
will launch a job via FetchLauncher instead of via MapReduceLauncher. Given this 
code path, I think the possibility of introducing a weird optimization bug is 
minimal. In addition, the optimization is only applicable to fairly small 
queries.
# There are indeed changes to some backend operators such as POStream. This is 
because the logic about when to pull data from pipeline is different in some 
cases. But these changes are fairly minimal too.
# IMO, the benefit of this optimization is big. I am constantly asked by users 
about this feature. True that it won't improve any performance of production 
ETL jobs, but it will shorten development iteration. In addition, launching a 
full MR job for a simple load/dump query definitely makes a bad impression on 
new users.
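
The code path in the first point can be pictured as a late dispatch step. The Python sketch below is a simplified illustration under that reading: the two classes only borrow the names FetchLauncher and MapReduceLauncher from the comment, and `choose_launcher` and the `launch` return strings are hypothetical.

```python
class MapReduceLauncher:
    """Stand-in for the launcher that submits MR jobs."""

    def launch(self, plan: str) -> str:
        return f"MR job submitted for {plan}"


class FetchLauncher:
    """Stand-in for the launcher that reads data directly on the client."""

    def launch(self, plan: str) -> str:
        return f"fetched directly for {plan}"


def choose_launcher(plan_is_fetchable: bool):
    """Pick a launcher only after the physical plan has been fully built.

    The plan itself is not changed here; only the component that executes
    it differs, which is why the branch point stays small.
    """
    return FetchLauncher() if plan_is_fetchable else MapReduceLauncher()
```

Because the decision is made after plan construction, a non-fetchable plan takes exactly the same path as before in this sketch.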








[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2014-01-02 Thread Gianmarco De Francisci Morales (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860228#comment-13860228
 ] 

Gianmarco De Francisci Morales commented on PIG-3642:

I haven't reviewed the patch thoroughly, so take my comments with due care.
I am just afraid that we will repeat the same "mistake" we made with the local 
mode execution of Pig that you mention in the ticket.
That mode of execution was removed because it was a burden to maintain, and in 
the end the two implementations (MR and local mode) were out of sync, 
resulting in the same script doing different things.
I just want to avoid the same thing happening again.

If [~cheolsoo] has reviewed the patch, I would like to hear his comments on 
this issue.



[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2014-01-02 Thread Lorand Bendig (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860224#comment-13860224
 ] 

Lorand Bendig commented on PIG-3642:


[~azaroth], I took the idea of this patch from HIVE-2925 and PIG-2864. I agree 
that the benefit is limited; however, simple scripts/queries will run 
significantly faster than in local MR mode. As far as I can judge, aside from 
some mocking and initialization, 
the execution logic literally follows Pig's pull-based model. What optimization 
bugs do you think could happen?




[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2014-01-02 Thread Gianmarco De Francisci Morales (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860156#comment-13860156
 ] 

Gianmarco De Francisci Morales commented on PIG-3642:

I am -0 on this idea.
Skipping MR requires rewriting a good part of the execution logic, and might 
introduce weird optimization bugs.
More importantly, the added advantage brought by this feature is small.
Usually, if you want to test your program on a small input, you copy it locally 
and run Pig in local mode.



[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2013-12-29 Thread Lorand Bendig (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858473#comment-13858473
 ] 

Lorand Bendig commented on PIG-3642:


Please find the review request at: https://reviews.apache.org/r/16507/



[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2013-12-29 Thread Lorand Bendig (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858471#comment-13858471
 ] 

Lorand Bendig commented on PIG-3642:


[~cheolsoo] Yes, I'd like to have it reviewed.



[jira] [Commented] (PIG-3642) Direct HDFS access for small jobs (fetch)

2013-12-28 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858261#comment-13858261
 ] 

Cheolsoo Park commented on PIG-3642:


[~lbendig], is this patch ready for review? If so, do you mind uploading it to 
the [review board|https://reviews.apache.org]?

This feature will be very useful!

