Quick question regarding in-memory snapshots of compact table data
Daniel Schulz has shared a OneDrive file with you (timeline.PNG). To view it, click the link: <https://1drv.ms/u/s!Au1TsXeaVy95gThQuyrXhYGOvf7J>

Hello,

I have a quick Spark question for you. We would like to run various calculations based on the very same Thresholds. These are compact, domain-defined data: essentially a key/value map (from Text to Text) with approx. 300 entries, i.e. a (300 x 2) matrix. It is crucial to us that all entries are read and provided immutably at the very start of the runtime -- even if the persisted data on disk change, and even when a Worker Node goes down and needs to restart "from the beginning."

The timeline looks as follows:

    ...---|---------|------|--->  t
        start      t1     t2

    t1: read all Thresholds
    t2: start all calculations based on Thresholds

It is crucial that the Thresholds -- once read -- do not change in memory, even if the persisted data on disk do; the values must be the very same on all Worker Nodes, even if those nodes start very late or restart all over again. Our Spark application is not a streaming application. For the calculations to be correct, all Worker Nodes need the very same data, regardless of their start time.

What do you consider the best way to provide these data to the Worker Nodes for those kinds of calculations?

1. as a Scala variable (no RDD), which is not lazily evaluated and is handed over to the calculations
2. as an RDD/DataFrame in Spark -- no broadcast
3. as an RDD/DataFrame broadcast in Spark
4. a very different approach -- please elaborate a bit

What are possible shortcomings of not broadcasting? What are possible shortcomings of broadcasting?

Thanks a lot.

Kind regards, Daniel.
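Independent of which of the four options is chosen, the snapshot semantics described above can be sketched in plain Python (the file name and threshold keys below are made up for illustration): read the thresholds exactly once at t1, freeze them into a read-only view, and later changes to the file on disk no longer matter. In Spark one would typically wrap such a frozen map in `sc.broadcast(...)` on the driver, so that every executor -- including late-starting or restarted ones -- receives the same driver-side snapshot rather than re-reading the (possibly changed) file.

```python
import json
import tempfile
from types import MappingProxyType

def load_thresholds(path):
    """Read the thresholds once and freeze them into a read-only view."""
    with open(path) as f:
        data = json.load(f)          # ~300 key/value pairs, text -> text
    return MappingProxyType(dict(data))

# Write an initial thresholds file (a stand-in for the persisted data).
tmp = tempfile.NamedTemporaryFile("w", suffix=".json", delete=False)
json.dump({"limit.cpu": "0.8", "limit.mem": "0.9"}, tmp)
tmp.close()

# t1: read all thresholds exactly once, before any calculation starts.
thresholds = load_thresholds(tmp.name)

# The file on disk may change afterwards ...
with open(tmp.name, "w") as f:
    json.dump({"limit.cpu": "0.5"}, f)

# ... but the in-memory snapshot is unaffected and cannot be mutated.
assert thresholds["limit.cpu"] == "0.8"
```

The read-only view also guards against accidental mutation during the calculations: any attempt to assign into `thresholds` raises a `TypeError`.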
Spark Pattern and Anti-Pattern
Hi,

We are currently working on a solution architecture for IoT workloads on Spark. I would therefore like to know whether it is considered an anti-pattern in Spark to fetch records from a database and make a REST call to an external server for each of them. This external server may, and will, be the bottleneck -- but from a Spark point of view: is it potentially harmful to open connections and wait for their responses for vast numbers of rows? In the same vein: is calling an external library (instead of making a REST call) for every row potentially problematic? And how best to embed a C++ library in this workflow: is it advisable to wrap it in a function that makes a JNI call to run it natively -- given that we know we are single-threaded then? Or is there a better way to include C++ code in Spark jobs?

Many thanks in advance.

Kind regards, Daniel.
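The usual mitigation for per-row external calls is to amortize connection setup per partition rather than per row (Spark's `mapPartitions` pattern). A minimal, Spark-free sketch of the difference, where `RestClient` is a made-up stand-in that merely counts how many connections get created:

```python
class RestClient:
    """Made-up stand-in for a real HTTP client; counts connection setups."""
    opened = 0                       # how many connections were created

    def __init__(self):
        RestClient.opened += 1       # stands in for expensive TCP/TLS setup

    def post(self, record):
        return {"record": record, "status": "ok"}   # stands in for the call

def call_per_row(rows):
    # Anti-pattern: a fresh connection for every single record.
    return [RestClient().post(r) for r in rows]

def call_per_partition(partitions):
    # mapPartitions-style: one connection, reused for all rows in a partition.
    results = []
    for part in partitions:
        client = RestClient()
        results.extend(client.post(r) for r in part)
    return results

rows = list(range(6))
partitions = [rows[0:3], rows[3:6]]   # pretend Spark split the data in two

RestClient.opened = 0
call_per_row(rows)
per_row_connections = RestClient.opened          # one per record

RestClient.opened = 0
call_per_partition(partitions)
per_partition_connections = RestClient.opened    # one per partition
```

In real Spark code the same shape applies to JNI-wrapped native libraries: initialize the library once per partition (or once per executor JVM) inside `mapPartitions`, not once per row. The external server remaining the bottleneck is unchanged by this; it only bounds the connection-setup overhead on the Spark side.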
Re: Ranger-like Security on Spark
Hi Matei,

Thanks for your answer. My question is regarding simple authenticated Spark-on-YARN only, without Kerberos. So when I run Spark on YARN and HDFS, will Spark pass through my HDFS user and only be able to access files I am entitled to read/write? Will it enforce HDFS ACLs and Ranger policies as well?

Best regards, Daniel.

> On 03 Sep 2015, at 21:16, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>
> If you run on YARN, you can use Kerberos, be authenticated as the right user, etc in the same way as MapReduce jobs.
>
> Matei
>
>> On Sep 3, 2015, at 1:37 PM, Daniel Schulz <danielschulz2...@hotmail.com> wrote:
>>
>> Hi,
>>
>> I really enjoy using Spark. An obstacle to selling it to our clients currently is the missing Kerberos-like security on a Hadoop cluster with simple authentication. Are there plans, a proposal, or a project to deliver a Ranger plugin or something similar for Spark? The target is to differentiate users and their privileges when reading and writing data to HDFS. Is Kerberos my only option then?
>>
>> Kind regards, Daniel.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Ranger-like Security on Spark
Hi,

I really enjoy using Spark. An obstacle to selling it to our clients currently is the missing Kerberos-like security on a Hadoop cluster with simple authentication. Are there plans, a proposal, or a project to deliver a Ranger plugin or something similar for Spark? The target is to differentiate users and their privileges when reading and writing data to HDFS. Is Kerberos my only option then?

Kind regards, Daniel.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Data Security on Spark-on-HDFS
Hi guys,

In a nutshell: does Spark check and respect user privileges when reading/writing data?

I am curious about data security when Spark runs on top of HDFS -- maybe through YARN. Does Spark run its long-running JVM processes as a single Spark user that makes no distinction when accessing data? Is there, then, a shortcoming in using Spark because the JVM processes are already running, so the launching user is ignored by Spark when accessing data residing on HDFS? Or does Spark only read/write data that the user who launched the job has access to? What about the local store when running in Standalone mode? And what about access calls to HBase or Hive then?

Thanks for taking time.

Best regards, Daniel.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org