[jira] Commented: (HIVE-1016) Ability to access DistributedCache from UDFs

2010-08-31 Thread Carl Steinbach (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904860#action_12904860
 ] 

Carl Steinbach commented on HIVE-1016:
--

@Namit: GenericUDF.initialize() is called both at compile-time and run-time.

> Ability to access DistributedCache from UDFs
>
> Key: HIVE-1016
> URL: https://issues.apache.org/jira/browse/HIVE-1016
> Project: Hadoop Hive
> Issue Type: New Feature
> Components: Query Processor
> Reporter: Carl Steinbach
> Assignee: Carl Steinbach
> Attachments: HIVE-1016.1.patch.txt
>
> There have been several requests on the mailing list for
> information about how to access the DistributedCache from UDFs, e.g.:
> http://www.mail-archive.com/hive-u...@hadoop.apache.org/msg01650.html
> http://www.mail-archive.com/hive-u...@hadoop.apache.org/msg01926.html
>
> While responses to these emails suggested several workarounds, the only
> correct way of accessing the distributed cache is via the static methods
> of Hadoop's DistributedCache class, and all of these methods require that
> the JobConf be passed in as a parameter. Hence, giving UDFs access to the
> distributed cache reduces to giving UDFs access to the JobConf.
>
> I propose the following changes to GenericUDF/UDAF/UDTF:
> * Add an exec_init(Configuration conf) method that is called during
>   Operator initialization at runtime.
> * Change the name of the "initialize" method to "compile_init" to make it
>   clear that this method is called at compile-time.
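
As a rough illustration of the two-phase lifecycle proposed in the issue
description, the sketch below separates a compile-time hook from a runtime
hook that receives the job configuration. Everything in it is hypothetical:
TwoPhaseUdfSketch, compileInit, and execInit are stand-in names (camel-cased
versions of the proposed compile_init/exec_init), and a plain Map stands in
for Hadoop's Configuration so that the example runs without Hive or Hadoop
on the classpath. It is not the code in the attached patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a UDF base class; this is not a Hive class.
abstract class TwoPhaseUdfSketch {
    // Compile-time hook: validate argument types and report the return type.
    // No job configuration is available at this point.
    abstract String compileInit(String argumentType);

    // Runtime hook: called during operator initialization inside the task.
    // The Map is an assumption standing in for Hadoop's Configuration/JobConf.
    void execInit(Map<String, String> conf) {
        // Default: nothing to do, so functions that ignore the conf are unaffected.
    }

    abstract Object evaluate(Object arg);
}

// Example function that reads one setting from the configuration at runtime.
class PrefixUdfSketch extends TwoPhaseUdfSketch {
    private String prefix = "";

    @Override
    String compileInit(String argumentType) {
        if (!"string".equals(argumentType)) {
            throw new IllegalArgumentException("expected string, got " + argumentType);
        }
        return "string";
    }

    @Override
    void execInit(Map<String, String> conf) {
        // "sketch.prefix" is a made-up property name used only for this example.
        prefix = conf.getOrDefault("sketch.prefix", "");
    }

    @Override
    Object evaluate(Object arg) {
        return prefix + arg;
    }
}

class TwoPhaseDemo {
    public static void main(String[] args) {
        PrefixUdfSketch udf = new PrefixUdfSketch();

        // Phase 1: what a query compiler might call.
        udf.compileInit("string");

        // Phase 2: what a task might call once the job configuration exists.
        Map<String, String> conf = new HashMap<>();
        conf.put("sketch.prefix", "row-");
        udf.execInit(conf);

        System.out.println(udf.evaluate("1")); // prints row-1
    }
}
```

Giving execInit an empty default body is one way to keep functions that
never need the configuration unchanged.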

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1016) Ability to access DistributedCache from UDFs

2010-08-30 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904413#action_12904413
 ] 

Namit Jain commented on HIVE-1016:
--

Also, the initialize() method is called at runtime, not at compile time
(in the case of generic UDFs).




[jira] Commented: (HIVE-1016) Ability to access DistributedCache from UDFs

2010-08-30 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904410#action_12904410
 ] 

Namit Jain commented on HIVE-1016:
--

I agree that it will be more work right now, but we should take the hit now
rather than going for a shortcut.




[jira] Commented: (HIVE-1016) Ability to access DistributedCache from UDFs

2010-08-30 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904406#action_12904406
 ] 

John Sichi commented on HIVE-1016:
--

I think passing it in rather than using a singleton is preferable from the 
standpoint of sandboxing (not to mention threading).

The UDFContext approach can still be used--pass that in (rather than individual 
params) and then it's easier to add more accessors incrementally without 
breaking anything.

I agree it would be cleaner if we had a common base, but I don't think that 
lack should stop us from using the pass-in approach.  Is there a reason we 
can't add one?  By accompanying it with a corresponding (optional) interface, 
we can also avoid breaking existing functions.
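
A minimal sketch of the pass-in approach with an optional interface might
look like the following. All names are hypothetical (UdfContextSketch,
ContextAwareSketch, configure, OperatorInitSketch), and a Map again stands
in for the JobConf so the example is self-contained. The point is only that
the caller hands the context to functions that opt in and leaves every
other function untouched.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical context object. New accessors can be added here later without
// changing the signature of configure().
final class UdfContextSketch {
    private final Map<String, String> jobConf;

    UdfContextSketch(Map<String, String> jobConf) {
        // Defensive copy so functions cannot mutate the caller's configuration.
        this.jobConf = Collections.unmodifiableMap(new HashMap<>(jobConf));
    }

    String getConfProperty(String key) {
        return jobConf.get(key);
    }
}

// Optional interface: only functions that want the context implement it.
interface ContextAwareSketch {
    void configure(UdfContextSketch context);
}

// An existing function that knows nothing about the context keeps working.
class LegacyUdfSketch {
    String evaluate(String s) {
        return s.trim();
    }
}

// A function that opts in and reads a property from the context.
class ConfReadingUdfSketch implements ContextAwareSketch {
    private UdfContextSketch context;

    @Override
    public void configure(UdfContextSketch context) {
        this.context = context;
    }

    String evaluate(String key) {
        return context.getConfProperty(key);
    }
}

class OperatorInitSketch {
    // What the calling code might do while initializing an operator.
    static void initFunction(Object udf, UdfContextSketch context) {
        if (udf instanceof ContextAwareSketch) {
            ((ContextAwareSketch) udf).configure(context);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("sketch.owner", "carl"); // made-up property for the example
        UdfContextSketch context = new UdfContextSketch(conf);

        LegacyUdfSketch legacy = new LegacyUdfSketch();
        ConfReadingUdfSketch aware = new ConfReadingUdfSketch();
        initFunction(legacy, context); // no-op: the interface is not implemented
        initFunction(aware, context);

        System.out.println(legacy.evaluate("  x  "));      // prints x
        System.out.println(aware.evaluate("sketch.owner")); // prints carl
    }
}
```

Because each function instance holds its own context reference, nothing is
shared through global state, which is the sandboxing and threading benefit
mentioned above.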






[jira] Commented: (HIVE-1016) Ability to access DistributedCache from UDFs

2010-08-30 Thread Carl Steinbach (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904400#action_12904400
 ] 

Carl Steinbach commented on HIVE-1016:
--

@Namit: I initially preferred that approach too, and I think it would make
sense if all of the UDF classes inherited from the same abstract base class.
However, we have a bunch of unrelated UDF base classes (UDF, UDAF,
GenericUDF, GenericUDAFEvaluator (which already has a runtime init() method),
and GenericUDTF), and taking the approach you suggested would require
modifications to all of these classes as well as the code that calls them.
I also think it's likely that we'll want to make more runtime context
available to UDFs in the future, and it's easier to proxy this through the
UDFContext singleton than to keep adding methods to each of the different
UDF base classes.
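
For comparison, the singleton style can be sketched as below. This is an
illustration under stated assumptions, not the contents of the patch: the
class and method names (UdfContextSingletonSketch, setJobConf,
getConfProperty, GetConfPropUdfSketch) are invented for the example, a Map
stands in for the JobConf, and the idea that the mapper or reducer stores
the configuration before any function runs is inferred only from the patch
touching ExecMapper and ExecReducer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical process-wide holder for the job configuration.
final class UdfContextSingletonSketch {
    private static final UdfContextSingletonSketch INSTANCE =
            new UdfContextSingletonSketch();

    // volatile so that a configuration stored by one thread is visible to others.
    private volatile Map<String, String> jobConf = new HashMap<>();

    private UdfContextSingletonSketch() {
    }

    static UdfContextSingletonSketch get() {
        return INSTANCE;
    }

    // Assumed to be called once by the task (mapper or reducer) at startup.
    void setJobConf(Map<String, String> conf) {
        this.jobConf = new HashMap<>(conf);
    }

    String getConfProperty(String key) {
        return jobConf.get(key);
    }
}

// Any function, whatever its base class, can reach the context the same way.
class GetConfPropUdfSketch {
    String evaluate(String key) {
        return UdfContextSingletonSketch.get().getConfProperty(key);
    }
}

class SingletonDemo {
    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("sketch.job.name", "demo"); // made-up property for the example
        UdfContextSingletonSketch.get().setJobConf(conf);

        GetConfPropUdfSketch udf = new GetConfPropUdfSketch();
        System.out.println(udf.evaluate("sketch.job.name")); // prints demo
    }
}
```

No UDF base class changes in this style, which is the advantage argued
above; the cost is shared global state.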




[jira] Commented: (HIVE-1016) Ability to access DistributedCache from UDFs

2010-08-30 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904382#action_12904382
 ] 

Namit Jain commented on HIVE-1016:
--

Instead of keeping a global context, it might be better to pass the context
at initialize time, and then use it. The same testcase should work.




[jira] Commented: (HIVE-1016) Ability to access DistributedCache from UDFs

2010-08-30 Thread HBase Review Board (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904377#action_12904377
 ] 

HBase Review Board commented on HIVE-1016:
--

Message from: "Carl Steinbach" 

---
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/748/
---

Review request for Hive Developers.


Summary
---

This patch adds UDFContext, a singleton class that provides UDFs with runtime
access to the JobConf. This patch borrows heavily from PIG-1085.


This addresses bug HIVE-1016.
http://issues.apache.org/jira/browse/HIVE-1016


Diffs
-

  ql/src/java/org/apache/hadoop/hive/ql/exec/ExecMapper.java c257217 
  ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java 3d7c798 
  ql/src/java/org/apache/hadoop/hive/ql/exec/UDFContext.java PRE-CREATION 
  ql/src/test/org/apache/hadoop/hive/ql/udf/UDFGetConfProp.java PRE-CREATION 
  ql/src/test/queries/clientpositive/udf_testcontext.q PRE-CREATION 
  ql/src/test/results/clientpositive/udf_testcontext.q.out PRE-CREATION 

Diff: http://review.cloudera.org/r/748/diff


Testing
---


Thanks,

Carl







[jira] Commented: (HIVE-1016) Ability to access DistributedCache from UDFs

2010-08-27 Thread Carl Steinbach (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903748#action_12903748
 ] 

Carl Steinbach commented on HIVE-1016:
--

Yes, I'm working on it. I'll have a patch ready for review by Monday. 
(Reassigned this back to myself).




[jira] Commented: (HIVE-1016) Ability to access DistributedCache from UDFs

2010-08-27 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903707#action_12903707
 ] 

Namit Jain commented on HIVE-1016:
--

Carl, are you working on this?

We need this pretty urgently; otherwise, I can take this.




[jira] Commented: (HIVE-1016) Ability to access DistributedCache from UDFs

2010-02-19 Thread Edward Capriolo (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836118#action_12836118
 ] 

Edward Capriolo commented on HIVE-1016:
---

Files added with 'add file' can then be accessed with "./file". I am
wondering if we should pass other references to the UDF. This could open up
opportunities for UDFs to interact with each other or possibly with other
Hive components.
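
To make the "./file" workaround concrete, here is a small sketch of a
function that loads a lookup list from a file by path. The class name
LookupFileUdfSketch and its load/evaluate methods are hypothetical. Inside a
task, the path would presumably be a relative one such as
Paths.get("./lookup.txt") for a file registered with 'add file' (the file
name is made up); the demo below writes a temporary file instead so that it
runs anywhere.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical function that checks values against a list read from a file.
class LookupFileUdfSketch {
    private Set<String> allowed;

    // Read one entry per line. In a task this would be given a relative path
    // into the working directory, per the comment above.
    void load(Path file) throws IOException {
        allowed = new HashSet<>(Files.readAllLines(file, StandardCharsets.UTF_8));
    }

    boolean evaluate(String value) {
        if (allowed == null) {
            throw new IllegalStateException("load() was not called");
        }
        return allowed.contains(value);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a file that would have been shipped with 'add file'.
        Path file = Files.createTempFile("lookup", ".txt");
        try {
            Files.write(file, Arrays.asList("alpha", "beta"), StandardCharsets.UTF_8);

            LookupFileUdfSketch udf = new LookupFileUdfSketch();
            udf.load(file);
            System.out.println(udf.evaluate("alpha")); // prints true
            System.out.println(udf.evaluate("gamma")); // prints false
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```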
