Swift question regarding in-memory snapshots of compact table data

2016-11-09 Thread Daniel Schulz
[Attachment: timeline.PNG, shared via OneDrive]



Hello,

I have a quick question for you regarding Spark.

We'd like to run various calculations against one and the same set of 
Thresholds. These are compact, domain-defined data: essentially a key/value map 
(from Text to Text) with approx. 300 entries, i.e. a (300 x 2) matrix. It is 
crucial to us that all entries are read once at the very start of the run and 
then provided immutably, even if the persisted data on disk changes afterwards, 
and even when a Worker Node goes down and has to restart "from the beginning." 
The timeline looks as follows:


 ----|-----------|----------|--------------------|-->  t
   start         t1         t2                  end

t1: read all Thresholds
t2: start all calculations based on the Thresholds


It is crucial that the Thresholds, once read, never change in memory, even if 
the persisted data on disk does. The values must be identical on all Worker 
Nodes, including nodes that start very late or restart all over again. Our 
Spark application is not a streaming application. For the calculations to be 
correct, all Worker Nodes need to see exactly the same data, regardless of 
their start time.

What do you consider the best way to provide the Thresholds to the Worker Nodes 
for this kind of calculation?

  1.  as a Scala variable (no RDD), eagerly evaluated and handed over to the 
calculations
  2.  as an RDD/DataFrame in Spark, without a Broadcast
  3.  as an RDD/DataFrame Broadcast in Spark
  4.  a very different approach -- please elaborate a bit

What are possible shortcomings of not broadcasting? What are possible 
shortcomings of broadcasting?
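For what it's worth, option 3 could look like the following sketch. This is an 
assumption-laden illustration, not your actual code: `loadThresholds`, 
`compute`-style logic, and the HDFS paths are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object ThresholdJob {
  // Placeholder: in reality this would read the ~300 persisted Key/Value
  // pairs exactly once, at t1, on the driver.
  def loadThresholds(): Map[String, String] =
    Map("upper" -> "100", "lower" -> "10")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("thresholds").getOrCreate()
    val sc = spark.sparkContext

    // Read once on the driver, then ship the frozen snapshot to every executor.
    val bc = sc.broadcast(loadThresholds())

    val result = sc.textFile("hdfs:///input").map { line =>
      val thresholds = bc.value  // same immutable snapshot on every Worker Node
      s"$line -> ${thresholds.getOrElse("upper", "?")}"
    }
    result.saveAsTextFile("hdfs:///output")
    spark.stop()
  }
}
```

A restarted executor should re-fetch the broadcast blocks from the driver (or 
its peers) rather than re-read the data on disk, which matches your "same 
snapshot even for late starters" requirement. The main cost of broadcasting is 
memory on driver and executors, which is negligible at ~300 entries; without a 
broadcast, a closure-captured Scala Map is serialized into every task, which at 
this size is also cheap.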

Thanks a lot.

Kind regards, Daniel.





Spark Pattern and Anti-Pattern

2016-01-26 Thread Daniel Schulz
Hi,
We are currently working on a solution architecture for IoT workloads on 
Spark. I am therefore interested in whether it is considered an anti-pattern in 
Spark to read records from a database and make a ReST call to an external 
server for each of them. This external server may and will be the bottleneck -- 
but from a Spark point of view: is it potentially harmful to open connections 
and wait for their responses for vast amounts of rows?
In the same manner: is calling an external library (instead of making a ReST 
call) for every row potentially problematic?
And how would we best embed a C++ library in this workflow: is it best to wrap 
it in a function that calls it natively via JNI -- provided we know we are 
single-threaded at that point? Or is there a better way to include C++ code in 
Spark jobs?
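One common pattern for per-row external calls is to set up one client per 
partition via `mapPartitions`, instead of one connection per row. Below is a 
minimal, Spark-free sketch of the function you would hand to 
`rdd.mapPartitions(enrich)`; `RestClient` is a hypothetical stand-in for a real 
HTTP client (or a JNI-backed wrapper around your C++ library).

```scala
// Hypothetical client; a real one would hold an open HTTP connection
// or a handle into a natively loaded library.
final class RestClient {
  def call(record: String): String = s"enriched:$record" // placeholder for the remote call
  def close(): Unit = ()                                  // release the connection
}

// One client per partition, reused for every row in it.
def enrich(records: Iterator[String]): Iterator[String] = {
  val client = new RestClient()
  val out = records.map(client.call).toList // materialize so we can close safely
  client.close()
  out.iterator
}
```

Note the trade-off: materializing with `toList` buffers the partition in memory 
so the client can be closed eagerly; for very large partitions you would close 
lazily when the iterator is exhausted instead. If the native library is not 
thread-safe, guard the call with a lock or size executors to one core so only 
one task uses it at a time.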
Many thanks in advance.
Kind regards, Daniel. 

Re: Ranger-like Security on Spark

2015-09-03 Thread Daniel Schulz
Hi Matei,

Thanks for your answer.

My question is regarding simply authenticated Spark-on-YARN only, without 
Kerberos. So when I run Spark on YARN and HDFS, will Spark pass through my HDFS 
user and only be able to access files I am entitled to read/write? Will it 
enforce HDFS ACLs and Ranger policies as well?

Best regards, Daniel.

> On 03 Sep 2015, at 21:16, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> 
> If you run on YARN, you can use Kerberos, be authenticated as the right user, 
> etc in the same way as MapReduce jobs.
> 
> Matei
> 
>> On Sep 3, 2015, at 1:37 PM, Daniel Schulz <danielschulz2...@hotmail.com> 
>> wrote:
>> 
>> Hi,
>> 
>> I really enjoy using Spark. An obstacle to selling it to our clients 
>> currently is the missing Kerberos-like security on a Hadoop with simple 
>> authentication. Are there plans, a proposal, or a project to deliver a 
>> Ranger plugin or something similar for Spark? The goal is to differentiate 
>> users and their privileges when reading and writing data to HDFS. Is 
>> Kerberos my only option then?
>> 
>> Kind regards, Daniel.
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
> 
> 
> 







Data Security on Spark-on-HDFS

2015-08-31 Thread Daniel Schulz
Hi guys,

In a nutshell: does Spark check and respect user privileges when 
reading/writing data?

I am curious about data security when Spark runs on top of HDFS, possibly 
through YARN. Does Spark run its long-running JVM processes as a single Spark 
user that makes no distinction when accessing data? In other words, is there a 
shortcoming in that the JVM processes are already running, so the launching 
user is ignored by Spark when accessing data residing on HDFS? Or does Spark 
only read/write data that the user who launched the job has access to?

What about local storage when running in Standalone mode? And what about 
access to HBase or Hive then?

Thanks for taking time.

Best regards, Daniel.