Disabling distribution of local conf files during spark-submit

2023-12-10 Thread Eugene Miretsky
Hello,

It looks like local conf archives always get copied to the target (HDFS) every
time a job is submitted.

   1. Other files/archives don't get sent if they are local - would it make
   sense to allow skipping upload of the local conf files as well?
   2. The archive seems to get copied on every 'distribute' call, which can
   happen multiple times per spark-submit job (at least that's what I got from
   reading the code) - is that the intention? (See the sketch at the end of
   this message.)


The motivation for my questions is:

   1. In some cases, spark-submit may not have direct access to HDFS, and
   hence cannot upload the files.
   2. What would be the use case for distributing the custom config to the
   YARN cluster? The cluster already has all the relevant YARN, Hadoop and
   Spark config. If anything, letting the end user override the configs seems
   dangerous (e.g. if they override resource limits).
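
For reference (question 2 above), here's roughly what I understand the
'distribute' path in org.apache.spark.deploy.yarn.Client to be doing - just an
illustrative sketch with made-up names (ConfArchiveSketch, uploadConfArchive),
not the actual implementation:

// Illustrative sketch only - not the real Client code. It assumes the conf
// directory has already been zipped locally and shows the unconditional
// copy-to-staging-dir step that happens on each submission.
import java.io.File

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ConfArchiveSketch {
  def uploadConfArchive(localConfZip: File, stagingDir: Path): Path = {
    val fs = FileSystem.get(new Configuration())
    val dest = new Path(stagingDir, localConfZip.getName)
    // The archive is copied to HDFS every time, even if an identical copy
    // already exists in the staging directory.
    fs.copyFromLocalFile(false /* delSrc */, true /* overwrite */,
      new Path(localConfZip.toURI), dest)
    dest
  }
}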

Cheers,
Eugene


Re: Encrypting jobs submitted by the client

2016-02-02 Thread eugene miretsky
Thanks Steve!
1. spark-submit submitting the YARN app for launch?  That you get if you
turn Hadoop IPC encryption on, by setting hadoop.rpc.protection=privacy
across the cluster.
> That's what I meant: is there something similar for standalone mode or Mesos?

2. Communications between the Spark driver and executor. That can use HTTPS.
> My understanding is that you can use HTTPS for the jar server on the
driver, and SASL for block transfer. Is there anything else I'm missing?
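
For concreteness, a minimal sketch of the settings we're talking about,
assuming Spark 1.6-era config keys (the object name and the keystore
path/password are just placeholders; hadoop.rpc.protection itself is a Hadoop
core-site.xml property, only noted in a comment):

import org.apache.spark.SparkConf

object EncryptionConfSketch {
  // Sketch only: the encryption-related settings discussed above.
  val conf: SparkConf = new SparkConf()
    .set("spark.authenticate", "true")                      // shared-secret authentication
    .set("spark.authenticate.enableSaslEncryption", "true") // SASL encryption for block transfers
    .set("spark.ssl.enabled", "true")                       // SSL for the driver's HTTP file/jar server
    .set("spark.ssl.keyStore", "/path/to/keystore.jks")     // placeholder path
    .set("spark.ssl.keyStorePassword", "changeit")          // placeholder password
  // Encrypting the YARN submission itself is on the Hadoop side:
  // hadoop.rpc.protection=privacy in core-site.xml across the cluster.
}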

Cheers,
Eugene


On Tue, Feb 2, 2016 at 7:46 AM, Steve Loughran <ste...@hortonworks.com>
wrote:

>
> > On 1 Feb 2016, at 20:48, eugene miretsky <eugene.miret...@gmail.com>
> wrote:
> >
> > Spark supports client authentication via shared secret or Kerberos (on
> > YARN). However, the job itself is sent unencrypted over the network. Is
> > there a way to encrypt the jobs the client submits to the cluster?
>
>
> define submission?


> 1. spark-submit submitting the YARN app for launch?  That you get if
> you turn Hadoop IPC encryption on, by setting
> hadoop.rpc.protection=privacy across the cluster.

> 2. Communications between the Spark driver and executor. That can use HTTPS.
>
> > The rationale for this is very similar to encrypting the HTTP file
> > server traffic - jars may have sensitive data.
> >
> > Cheers,
> > Eugene


Encrypting jobs submitted by the client

2016-02-01 Thread eugene miretsky
Spark supports client authentication via shared secret or Kerberos (on
YARN). However, the job itself is sent unencrypted over the network. Is
there a way to encrypt the jobs the client submits to the cluster?
The rationale for this is very similar to encrypting the HTTP file server
traffic - jars may have sensitive data.

Cheers,
Eugene


Secure multi-tenancy in standalone mode

2016-02-01 Thread eugene miretsky
When multiple users share the same Spark cluster, it's a good idea
to isolate them - make sure that each user runs under a different
Linux account and prevent them from accessing data in jobs submitted by
other users. Is this currently possible with Spark?

The only thing I found about it online is
http://rnowling.github.io/spark/2015/04/07/multiuser-spark-mesos.html, and
some older JIRAs about adding support for YARN.

Cheers,
Eugene


Kafka consumer: Upgrading to use the new Java Consumer

2015-12-23 Thread eugene miretsky
Hi,

The Kafka connector currently uses the older Kafka Scala consumer. Kafka
0.9 came out with a new Java Kafka consumer.

One of the main differences is that the Scala consumer uses
a Decoder trait (kafka.serializer.Decoder) to decode keys/values while
the Java consumer uses the Deserializer interface
(org.apache.kafka.common.serialization.Deserializer).

The main difference between Decoder and Deserializer is that
Deserializer.deserialize accepts a topic and a payload while Decoder.fromBytes
accepts only a payload. Topics in Kafka are pretty useful; for example, the
Confluent Schema Registry uses topic names to find the schema for each
key/value - while Confluent does provide a Decoder implementation, it is
mostly a hack that is incompatible with the new Kafka Java Producer.
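
To make the signature difference concrete, a minimal sketch (made-up class
names, string payloads only), assuming the Kafka 0.8 kafka.serializer.Decoder
trait and the Kafka 0.9 org.apache.kafka.common.serialization.Deserializer
interface:

// Old Scala consumer: the decoder only ever sees the raw payload.
import kafka.serializer.Decoder
// New Java consumer: the deserializer also receives the topic name.
import org.apache.kafka.common.serialization.Deserializer

class StringDecoderSketch extends Decoder[String] {
  override def fromBytes(bytes: Array[Byte]): String =
    new String(bytes, "UTF-8")
}

class StringDeserializerSketch extends Deserializer[String] {
  // The topic argument is what schema-registry-style lookups rely on.
  override def deserialize(topic: String, data: Array[Byte]): String =
    new String(data, "UTF-8")
  override def configure(configs: java.util.Map[String, _], isKey: Boolean): Unit = ()
  override def close(): Unit = ()
}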

Any thoughts about changing the Kafka connector to work with the new Kafka
Java Consumer?

Cheers,
Eugene