Re: Submitting job with external dependencies to pyspark

2020-01-28 Thread Chris Teoh
Usually this isn't done, as the data is meant to be on shared/distributed
storage, e.g. HDFS, S3, etc.

Spark should then read this data into a dataframe, and your code logic is
applied to the dataframe in a distributed manner.
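
For illustration, a minimal sketch of that flow (the paths, format, and
column names here are made up, not from the thread):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("example").getOrCreate()

    # Read the shared dataset; every executor can reach this path.
    df = spark.read.parquet("hdfs:///data/events")  # or s3a://bucket/events

    # Transformations like these run distributed across the cluster.
    df.filter(df["status"] == "ok").groupBy("day").count().show()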

On Wed, 29 Jan 2020 at 09:37, Tharindu Mathew 
wrote:

> That was really helpful. Thanks! I actually solved my problem by
> creating a venv and using the venv flags. Wondering now how to submit the
> data as an archive? Any idea?
>
> On Mon, Jan 27, 2020, 9:25 PM Chris Teoh  wrote:
>
>> Use --py-files
>>
>> See
>> https://spark.apache.org/docs/latest/submitting-applications.html#bundling-your-applications-dependencies
>>
>> I hope that helps.
>>
>> On Tue, 28 Jan 2020, 9:46 am Tharindu Mathew, 
>> wrote:
>>
>>> Hi,
>>>
>>> Newbie to pyspark/spark here.
>>>
>>> I'm trying to submit a job to pyspark with a dependency, Spark DL in
>>> this case. While the local environment has it, pyspark does not see it.
>>> How do I correctly start pyspark so that it sees this dependency?
>>>
>>> Using Spark 2.3.0 in a Cloudera setup.
>>>
>>> --
>>> Regards,
>>> Tharindu Mathew
>>> http://tharindumathew.com
>>>
>>
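
For reference, the --py-files flag quoted above is passed to spark-submit,
e.g. spark-submit --py-files deps.zip my_job.py (file names illustrative).
Dependencies can also be attached from code; a minimal sketch, again with an
illustrative archive name:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("with-deps").getOrCreate()

    # Ship an archive of Python dependencies to the executors so that
    # imports inside UDFs and transformations resolve on the workers.
    spark.sparkContext.addPyFile("deps.zip")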

-- 
Chris


Re: Submitting job with external dependencies to pyspark

2020-01-28 Thread Tharindu Mathew
That was really helpful. Thanks! I actually solved my problem by
creating a venv and using the venv flags. Wondering now how to submit the
data as an archive? Any idea?
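
For anyone following along, a hedged sketch of shipping side data with the
job: Spark can distribute files to every executor, and on YARN spark-submit's
--archives option extracts archives into each executor's working directory.
The file name below is illustrative:

    from pyspark import SparkFiles
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ship-data").getOrCreate()

    # Distribute a small side file to every executor.
    spark.sparkContext.addFile("lookup.csv")

    # On a worker, resolve the local copy of the shipped file.
    rdd = spark.sparkContext.parallelize(range(4))
    print(rdd.map(lambda _: SparkFiles.get("lookup.csv")).collect())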

On Mon, Jan 27, 2020, 9:25 PM Chris Teoh  wrote:

> Use --py-files
>
> See
> https://spark.apache.org/docs/latest/submitting-applications.html#bundling-your-applications-dependencies
>
> I hope that helps.
>
> On Tue, 28 Jan 2020, 9:46 am Tharindu Mathew, 
> wrote:
>
>> Hi,
>>
>> Newbie to pyspark/spark here.
>>
>> I'm trying to submit a job to pyspark with a dependency, Spark DL in this
>> case. While the local environment has it, pyspark does not see it. How do
>> I correctly start pyspark so that it sees this dependency?
>>
>> Using Spark 2.3.0 in a Cloudera setup.
>>
>> --
>> Regards,
>> Tharindu Mathew
>> http://tharindumathew.com
>>
>


Start a standalone server as root and use it with user accounts

2020-01-28 Thread Ben Caine
Hi,

I'd like to have a single standalone Spark server, running as root, on which
jobs can be run from multiple user accounts on the same machine.

However, when I do this, writing files gives me an error similar to the one
in this Stack Overflow question. The first answer to that question explains
why: the server (running as root) creates a temporary file, and the job
(running as the user) tries to move it, but doesn't have access.
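
For concreteness, a write like the following minimal sketch is enough to hit
that rename step (the master URL and output path are made up):

    from pyspark.sql import SparkSession

    # Connect to the root-run standalone master.
    spark = (SparkSession.builder
             .master("spark://localhost:7077")
             .appName("perm-repro")
             .getOrCreate())

    # Task output first lands under /tmp/out/_temporary, created by the
    # worker process (root here); the commit then renames it into
    # /tmp/out, which fails for a submitting user who cannot modify
    # root-owned files.
    spark.range(10).write.mode("overwrite").parquet("file:///tmp/out")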

Is there a way around this? I feel like it should be possible to run a
Spark server as root and run a job on it from a user account.

Thanks,
Ben
