Without using add files, we’d have to make sure these resources exist on every
node, and would configure a Hive session like this:
set myCustomProperty=/path/to/directory/someSubDir/;
select myCustomUDF('param1','param2');
With shared resources, we can do this instead, at least with the MR engine:
add files file:///path/to/directory;
set myCustomProperty=someSubDir/;
select myCustomUDF('param1','param2');
In both cases, the property myCustomProperty is accessed inside the custom UDF,
interpreted as a path, and used to read the contents of a file within
“someSubDir”. This works fine whenever we supply the full path, or with the
relative path on the MR engine when using add resources. I’m wondering whether
I’m simply getting lucky: perhaps the MR engine downloads the files to the
task’s working directory, so the relative path resolves correctly there, while
Spark behaves differently? I could give a full path if I knew ahead of time
where the file will be available on the remote node, ideally via a property,
something like ${hive.localResourceDir}/someSubDir.
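In case it helps, here is a rough sketch of what the UDF does with the
property (simplified, not our exact code; only myCustomProperty, the
someSubDir layout, and myCustomUDF come from the session examples above, the
rest is illustrative boilerplate):

import java.io.File;

import org.apache.hadoop.hive.ql.exec.MapredContext;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class MyCustomUDF extends GenericUDF {
  private String resourceDir;

  @Override
  public void configure(MapredContext context) {
    // Picks up the value set in the session via "set myCustomProperty=...".
    resourceDir = context.getJobConf().get("myCustomProperty");
  }

  @Override
  public ObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
    return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
  }

  @Override
  public Object evaluate(DeferredObject[] args) throws HiveException {
    // A relative value like "someSubDir/" is resolved against the process's
    // current working directory. On MR that is the task's working directory,
    // where "add files" resources are localized, which would explain why the
    // relative path works there.
    File dir = new File(resourceDir);
    if (!dir.isDirectory()) {
      throw new HiveException("Resource dir not found: " + dir.getAbsolutePath());
    }
    // ... read the file(s) under dir ...
    return dir.getAbsolutePath();
  }

  @Override
  public String getDisplayString(String[] children) {
    return "myCustomUDF(...)";
  }
}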
Thanks for the quick response and your help with this.
Ray
From: Sahil Takiar <[email protected]>
Sent: Thursday, February 8, 2018 12:45 PM
To: [email protected]
Subject: Re: Resources/Distributed Cache on Spark
It should work. We have tests such as groupby_bigdata.q that run on HoS and
work. They use the "add file" command. What are the exact commands you are
running? What error are you seeing?
On Thu, Feb 8, 2018 at 6:28 AM, Ray Navarette <[email protected]> wrote:
Hello,
I’m hoping to find some information about using “ADD FILES <PATH>” with the
Spark execution engine. I’ve seen a few JIRA tickets reference this
functionality, but little else. We have written some custom UDFs which require
external resources. When using the MR execution engine, we can reference the
file paths using a relative path and they are properly distributed and
resolved. When I try to do the same under the Spark engine, I receive an error
saying the file is unavailable.
Does “ADD FILES <PATH>” work on Spark, and if so, how should I properly
reference those files in order to read them in the executors?
Thanks much for your help,
Ray
--
Sahil Takiar
Software Engineer
[email protected] | (510) 673-0309