Without using add files, we’d have to make sure these resources exist on every
node, and would configure a Hive session like this:
set myCustomProperty=/path/to/directory/someSubDir/;
select myCustomUDF('param1','param2');
With shared resources, we can do this instead, at least with the MR engine:
add files file:///path/to/directory;
set myCustomProperty=someSubDir/;
select myCustomUDF('param1','param2');
In both cases, the property myCustomProperty is accessed inside the custom UDF,
interpreted as a path, and used to read the contents of a file within
“someSubDir”. This works fine whenever we supply the full path, or with the
relative path on the MR engine when using add resources. I’m wondering whether
I’m simply getting lucky: perhaps the MR engine downloads the files to the
task’s working directory, so the relative path resolves correctly there, while
Spark behaves differently? I could give a full path if I knew ahead of time
where the file will be available on the remote node, ideally via a property,
something like ${hive.localResourceDir}/someSubDir.
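In case it helps, here is a rough sketch of what the UDF does with the
property (simplified, not our exact code; only myCustomProperty, the
someSubDir layout, and myCustomUDF come from the session examples above, the
rest is illustrative boilerplate):

import java.io.File;

import org.apache.hadoop.hive.ql.exec.MapredContext;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class MyCustomUDF extends GenericUDF {
  private String resourceDir;

  @Override
  public void configure(MapredContext context) {
    // Picks up the value set in the session via "set myCustomProperty=...".
    resourceDir = context.getJobConf().get("myCustomProperty");
  }

  @Override
  public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
    return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
  }

  @Override
  public Object evaluate(DeferredObject[] args) throws HiveException {
    // A relative value like "someSubDir/" is resolved against the process's
    // current working directory. On MR that is the task's working directory,
    // where "add files" resources are localized, which would explain why the
    // relative path works there.
    File dir = new File(resourceDir);
    if (!dir.isDirectory()) {
      throw new HiveException("Resource dir not found: " + dir.getAbsolutePath());
    }
    // ... read the file(s) under dir ...
    return dir.getAbsolutePath();
  }

  @Override
  public String getDisplayString(String[] children) {
    return "myCustomUDF(...)";
  }
}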
Thanks for the quick response and your help with this.
Ray
From: Sahil Takiar <[email protected]>
Sent: Thursday, February 8, 2018 12:45 PM
To: [email protected]
Subject: Re: Resources/Distributed Cache on Spark
It should work. We have tests such as groupby_bigdata.q that run on HoS and
work. They use the "add file" command. What are the exact commands you are
running? What error are you seeing?
On Thu, Feb 8, 2018 at 6:28 AM, Ray Navarette <[email protected]> wrote:
Hello,
I’m hoping to find some information about using “ADD FILES <PATH>” with the
Spark execution engine. I’ve seen a few JIRA tickets reference this
functionality, but little else. We have written some custom UDFs which require
external resources. When using the MR execution engine, we can reference the
file paths using a relative path and they are properly distributed and
resolved. When I try to do the same under the Spark engine, I receive an error
saying the file is unavailable.
Does “ADD FILES <PATH>” work on Spark, and if so, how should I properly
reference those files in order to read them in the executors?
Thanks much for your help,
Ray
--
Sahil Takiar
Software Engineer
[email protected] | (510) 673-0309