What about concurrent access (read/update) to the small file with the same key?
That can get a bit tricky.
On Thu, Sep 3, 2015 at 2:47 PM, Jörn Franke <jornfra...@gmail.com> wrote:
> Well, it is the same as in normal HDFS: deleting the file and putting a new one
> with the same name works.
The process as such does not need any distributed file system.
Now, we do want to start distributing this processing across a few machines
and make a real cluster out of it. However, I am not sure if HDFS is a hard
requirement for that to happen. I am thinking about the Shuffle spills,
DStream/RDD persistence, and checkpoint info.
Shuffle spills will use local disk; HDFS is not needed.
Spark and Spark Streaming checkpoint info WILL need HDFS for fault-tolerance,
so that state can be recovered even if the Spark cluster nodes go down.
TD
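For context, a minimal sketch of pointing Spark Streaming checkpoints at HDFS; the path and batch interval here are assumptions, not from the thread:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
// Checkpoint metadata (and state for stateful streams) goes to HDFS so it survives node loss
ssc.checkpoint("hdfs:///user/spark/checkpoints")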
On Fri, Sep 4, 2015 at 2:45 PM, N B <nb.nos...@gmail.com> wrote:
> Hello,
Were you able to find a solution to your problem?
My main question in case of HAR usage is: is it possible to use Pig on it, and
what about performance?
----- Original Message -----
From: "Jörn Franke" <jornfra...@gmail.com>
To: nib...@free.fr, user@spark.apache.org
Sent: Thursday, September 3, 2015 15:54:42
Subject: Re: Small File to HDFS
Store them as hadoop archive (har)
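For reference, a HAR archive is created with the hadoop archive tool; a sketch, where all paths and names are assumptions:

hadoop archive -archiveName events.har -p /user/nicolas/events /user/nicolas/archives

The archived files are then addressed through the har:// scheme (e.g. har:///user/nicolas/archives/events.har), which anything reading via the Hadoop FileSystem API can consume.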
On Wed, Sep 2, 2015 at 18:07, <nib...@free.fr> wrote:
Regarding update and delete:
As far as I know, HDFS does not support update and delete. Tools like HBase
realize this by using several HDFS files and rewriting them from time to
time. Depending on the frequency with which you need to update / delete data,
you can think about housekeeping your HDFS files.
by new content (remove/replace)?
Tks a lot
Nicolas
----- Original Message -----
From: "Jörn Franke" <jornfra...@gmail.com>
To: nib...@free.fr
Cc: user@spark.apache.org
Sent: Thursday, September 3, 2015 19:29:42
Subject: Re: Small File to HDFS
HAR is transparent and adds hardly any performance overhead.
Well, it is the same as in normal HDFS: deleting the file and putting a new one
with the same name works.
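A sketch of that delete-then-put replacement via the FileSystem API; the paths are assumptions:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val target = new Path("/user/nicolas/archives/events.har")
fs.delete(target, true)  // recursive delete of the old file/archive
// ...then write or re-archive new content under the same name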
On Thu, Sep 3, 2015 at 21:18, <nib...@free.fr> wrote:
> HAR archive seems a good idea, but just one last question to be sure to make
> the best choice:
> - Is it possible to override
OK, but then some questions:
- Sometimes I have to remove some messages from HDFS (cancel/replace cases);
is it possible?
- In the case of a big zip file, is it possible to easily run Pig on it
directly?
Tks
Nicolas
----- Original Message -----
From: "Tao Lu" <taolu2...@gmail.com>
Hi Nibiau,
HBase seems to be a good solution to your problem. As you may know, storing
your messages as key-value pairs in HBase saves you the overhead of manually
resizing blocks of data using zip files.
The added advantage, along with the fact that HBase uses HDFS for storage, is
the capability of updating your records, for example with the "put" function.
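A minimal sketch of that put-based update, assuming an HBase 1.x client and a hypothetical "messages" table:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = connection.getTable(TableName.valueOf("messages"))
val put = new Put(Bytes.toBytes("message-key-42"))  // row key = your message key
put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("body"), Bytes.toBytes("<event>...</event>"))
table.put(put)  // a later put with the same row key updates the stored value
table.close()
connection.close()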
You may consider storing it all in one big HDFS file and keep appending new
messages to it.
For instance:
one message -> zip it -> append it to the HDFS file as one line
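A sketch of that zip-and-append idea; the path is an assumption, and HDFS append support must be enabled on the cluster:

import java.io.ByteArrayOutputStream
import java.util.Base64
import java.util.zip.GZIPOutputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Compress one message and Base64-encode it so it fits on a single text line
def zipToLine(msg: String): String = {
  val bos = new ByteArrayOutputStream()
  val gz = new GZIPOutputStream(bos)
  gz.write(msg.getBytes("UTF-8"))
  gz.close()
  Base64.getEncoder.encodeToString(bos.toByteArray)
}

val fs = FileSystem.get(new Configuration())
val out = fs.append(new Path("/events/messages.log"))  // the file must already exist
out.write((zipToLine("<event>...</event>") + "\n").getBytes("UTF-8"))
out.close()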
On Wed, Sep 2, 2015 at 12:43 PM, <nib...@free.fr> wrote:
> Hi,
> I already store them in MongoDB in parallel
Hello,
I'm currently using Spark Streaming to collect small messages (events), each
under 50 KB in size; the volume is high (several million per day) and I have
to store those messages in HDFS.
I understood that storing small files can be problematic in HDFS, so how can I
manage it?
Tks
Nico
Instead of storing those messages in HDFS, have you considered storing them
in a key-value store (e.g. HBase)?
Cheers
On Wed, Sep 2, 2015 at 9:07 AM, <nib...@free.fr> wrote:
Hi, I have a Spark dataframe which I want to save as a Hive table with
partitions. I tried the following two statements, but they don't work: I don't
see any ORC files in the HDFS directory; it's empty. I can see baseTable is
there in the Hive console, but obviously it's empty because there are no files
inside HDFS.
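Since the two statements were cut off by the archive, here is only a sketch of the usual DataFrame write for this; the partition column and save mode are assumptions:

// Writes a partitioned, ORC-backed Hive table; files land under the warehouse dir in HDFS
df.write
  .format("orc")
  .partitionBy("date")  // the partition column must exist in df
  .mode("overwrite")
  .saveAsTable("baseTable")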
Hi guys,
In a nutshell: does Spark check and respect user privileges when
reading/writing data?
I am curious about the data security when Spark runs on top of HDFS, maybe
through YARN. Is Spark running its long-running JVM processes as a Spark user
that makes no distinction when accessing data? So is there a shortcoming when
using Spark because the JVM processes are already running?
You can also mount HDFS through the NFS gateway and access it that way, I think.
Thanks
Best Regards
On Tue, Aug 25, 2015 at 3:43 AM, Dino Fancellu d...@felstar.com wrote:
http://hortonworks.com/blog/windows-explorer-experience-hdfs/
Seemed to exist, now no sign.
Anything similar to tie HDFS into Windows Explorer?
See
https://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html
FYI
On Sat, Aug 29, 2015 at 1:04 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
If HDFS is on a linux VM, you could also mount it with FUSE and export it
with samba
2015-08-29 2:26 GMT-07:00 Ted Yu yuzhih...@gmail.com:
It depends: if HDFS is running under Windows, FUSE won't work, but if HDFS
is on a Linux VM, box, or cluster, then you can have the Linux box/VM mount
HDFS through FUSE and at the same time export its mount point over Samba. At
that point, your Windows machine can just connect to the Samba share.
R
I'm using Windows.
Are you saying it works with Windows?
Dino.
On 29 August 2015 at 09:04, Akhil Das ak...@sigmoidanalytics.com wrote:
Port 8020 is not the only port you need tunnelled for HDFS to work. If you
only list the contents of a directory, port 8020 is enough... for instance,
using something like
val p = new org.apache.hadoop.fs.Path("hdfs://localhost:8020/")
val fs = p.getFileSystem(sc.hadoopConfiguration)
fs.listStatus(p)
you should see the file list.
I can't imagine I'm the only person on the planet wanting to do this.
Anyway, thanks for trying to help.
Dino.
On 25 August 2015 at 08:22, Roberto Congiu roberto.con...@gmail.com wrote:
Based on what I've read, it appears that when using Spark Streaming there is
no good way of optimizing the files on HDFS. Spark Streaming writes many
small files, which is not scalable in Apache Hadoop. The only other way seems
to be to read the files after they have been written and merge them into a
bigger file.
I have a file in HDFS inside my HortonWorks HDP 2.3_1 VirtualBox VM.
If I go into the guest spark-shell and refer to the file thus, it works fine:
val words = sc.textFile("hdfs:///tmp/people.txt")
words.count
However, if I try to access it from a local Spark app on my Windows host, it
doesn't work.
If you use Vagrant, there's also a Vagrant module that can do that automatically.
Also, I am not sure how the default HDP VM is set up, that is, if it only
binds HDFS to 127.0.0.1 or to all addresses. You can check that with
netstat -a.
R.
2015-08-24 11:46 GMT-07:00 Dino Fancellu d...@felstar.com:
http://hortonworks.com/blog/windows-explorer-experience-hdfs/
Seemed to exist, now no sign.
Anything similar to tie HDFS into Windows Explorer?
Thanks,
into HDFS in Parquet at a longer interval. One problem is that storing
Parquet is sometimes time consuming, and that causes delays in my regular
stats-generating tasks. I am thinking of splitting my streaming job into
two, one for Parquet output and one for stats generation, but obviously
this would
Hi Sunil,
Have you seen this fix in Spark 1.5 that may fix the locality issue?:
https://issues.apache.org/jira/browse/SPARK-4352
On Thu, Aug 20, 2015 at 4:09 AM, Sunil sdhe...@gmail.com wrote:
Hello. I am seeing some unexpected issues with achieving HDFS data
locality. I expect the tasks to be executed only on the node which has the
data, but this is not happening (of course, unless the node is busy, in which
case I understand tasks can go to some other node). Could anyone
to it (whether it has anything or not). There is no direct append call as of
now, but you can achieve this either with FileUtil.copyMerge
http://apache-spark-user-list.1001560.n3.nabble.com/save-spark-streaming-output-to-single-file-on-hdfs-td21124.html#a21167
or have a separate program which will do the clean-up for you.
Thanks
Best Regards
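A sketch of the FileUtil.copyMerge approach (paths are assumptions; copyMerge exists up to Hadoop 2.x and was removed in 3.0):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)
FileUtil.copyMerge(
  fs, new Path("/streaming/out-1439648400000"),  // source dir full of small part files
  fs, new Path("/streaming/merged/out.txt"),     // single merged destination file
  false,                                         // do not delete the source files
  conf,
  null)                                          // no separator string between files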
Spark Streaming seems to be creating 0-byte files even when there is no data.
Also, I have 2 concerns here:
1) Extra unnecessary files are being created in the output
2) Hadoop doesn't work really well with too many files, and I see that it is
creating a directory with a timestamp every 1 second.
Did you try this way?
/usr/local/spark/bin/spark-submit --master mesos://mesos.master:5050 --conf
spark.mesos.executor.docker.image=docker.repo/spark:latest --class
org.apache.spark.examples.SparkPi --jars
hdfs://hdfs1/tmp/spark-examples-1.4.1-hadoop2.6.0-cdh5.4.4.jar 100
Thanks
Best Regards
On Fri, Aug 7, 2015 at 5:51 AM, Alan Braithwaite a...@cloudflare.com wrote:
I did, and got
Currently, I use rdd.isEmpty()
Thanks,
Patanachai
On 08/06/2015 12:02 PM, gpatcham wrote:
Is there a way to filter out empty partitions before I write to HDFS, other
than using repartition and coalesce?
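A sketch of the rdd.isEmpty() approach inside a streaming job (the stream and output path are assumptions):

dstream.foreachRDD { rdd =>
  // Skip the write entirely when a micro-batch is empty, avoiding 0-byte part files
  if (!rdd.isEmpty()) {
    rdd.saveAsTextFile(s"/output/batch-${System.currentTimeMillis()}")
  }
}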
Not that I'm aware of. We ran into a similar issue where we didn't want
to keep accumulating all these empty part files in storage on S3 or HDFS.
There didn't seem to be any performance-free way to do it with an RDD, so
we just run a non-Spark post-batch operation to delete empty files from
This isn't really a Spark question. You're trying to parse a string to an
integer, but it contains an invalid character. The exception message
explains this.
On Wed, Aug 5, 2015 at 11:34 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
Hi All,
We're trying to run Spark with Mesos and Docker in client mode (since Mesos
doesn't support cluster mode) and load the application JAR from HDFS. The
following is the command we're running:
We're getting the following warning before an exception from that command:
Before I debug
Code:
import java.text.SimpleDateFormat
import java.util.Calendar
import java.sql.Date
import org.apache.spark.storage.StorageLevel

def formatStringAsDate(dateStr: String) =
  new java.sql.Date(new SimpleDateFormat("yyyy-MM-dd").parse(dateStr).getTime())
Please see the comments at the tail of SPARK-2356
Cheers
On Wed, Aug 5, 2015 at 6:04 PM, Ashish Dutt ashish.du...@gmail.com wrote:
Use Case: To automate the process of data extraction (HDFS), data
analysis (pySpark/sparkR) and saving the data back to HDFS
programmatically.
Prospective solutions:
1. Create a remote server connectivity program in an IDE like PyCharm or
RStudio and use it to retrieve the data from HDFS
Just to add: rdd.take(1) won't trigger the entire computation; it will just
pull out the first record. You need to do an rdd.count() or rdd.saveAs*Files
to trigger the complete pipeline. How many partitions do you see in the
last stage?
Thanks
Best Regards
On Tue, Aug 4, 2015 at 7:10 AM, ayan guha
Hi Spark users and developers,
I have been trying to use spark-ec2. After I launched the Spark cluster
(1.4.1) with ephemeral HDFS (using Hadoop 2.4.0), I tried to execute a job
where the data is stored in the ephemeral HDFS. No matter what I tried to do,
there is no data locality at all
Is your data skewed? What happens if you do rdd.count()?
On 4 Aug 2015 05:49, Jasleen Kaur jasleenkaur1...@gmail.com wrote:
I am executing a Spark job on a cluster as a yarn-client (YARN cluster mode is
not an option due to permission issues).
- num-executors 800
- spark.akka.frameSize=1024
- spark.default.parallelism=25600
- driver-memory=4G
- executor-memory=32G
- My input size is around 1.5 TB.
My problem
The file can be a directory (look at all children) or even a glob
(/path/*.ext, for example).
On Fri, Jul 31, 2015 at 11:35 AM, swetha swethakasire...@gmail.com wrote:
Hi,
How do I add multiple sequence files from HDFS to a Spark Context to do batch
processing? I have something like the following in my code. Do I have to add
a comma-separated list of sequence file paths to the Spark Context?
val data = if (args.length > 0 && args(0) != null) sc.sequenceFile(file
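For what it's worth, sc.sequenceFile accepts a comma-separated path list or a glob, since it goes through Hadoop's FileInputFormat; a sketch where the key/value types and paths are assumptions:

// Two explicit paths in one call
val data = sc.sequenceFile[String, String]("hdfs:///data/part1,hdfs:///data/part2")
// or with a glob:
val all = sc.sequenceFile[String, String]("hdfs:///data/*.seq")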
Hi,
After I create a table in Spark SQL and load an HDFS file into it, the file
is no longer visible if I do hadoop fs -ls.
Is this expected?
Thanks,
Ron
of the RDD (xml2json + some enrichments)
- Spark stores the transformed/enriched messages inside MongoDB and HDFS
(Mongo key as file name)
Basically, I would say that I have to manage the messages one by one inside a
foreach loop of the RDD and write each message one by one into MongoDB and
HDFS.
Do you think that is the best way to do it?
Tks
Nicolas
Hi,
When running two experiments with the same application, we see a 50%
performance difference between using HDFS and files on disk, both using the
textFile/saveAsTextFile call. Almost all performance loss is in Stage 1.
Input (in Stage 0):
The file is read in using val input = sc.textFile
Hi.
Assuming you have the data in an RDD, you can save your RDD (regardless of
structure) with nameRDD.saveAsObjectFile(path), where path can be
hdfs:///myfolderonHDFS or on the local file system.
Alternatively, you can also use .saveAsTextFile().
Regards,
Gylfi.
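A round-trip sketch of that API; the element type and path are assumptions:

val rdd = sc.parallelize(Seq((1, "a"), (2, "b")))
rdd.saveAsObjectFile("hdfs:///myfolderonHDFS/objects")  // writes serialized objects as SequenceFiles
val restored = sc.objectFile[(Int, String)]("hdfs:///myfolderonHDFS/objects")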
in this service. Can anyone tell me how to solve this issue?
I have Spark 1.4 on my local machine and I would like to connect to our local
4-node Cloudera cluster. But how?
In the example it says text_file = spark.textFile("hdfs://..."), but can you
advise me on where to get this hdfs://... address?
Thanks!
Elina
On Wed, Jul 15, 2015 at 5:36 AM, Jeskanen, Elina elina.jeska...@cgi.com
wrote:
Assuming you run Spark locally (i.e. either local mode or a standalone cluster
on your local m/c):
1. You need to have the Hadoop binaries locally.
2. You need to have hdfs-site on the Spark classpath of your local m/c.
I would suggest you start off with local files to play around.
If you need to run Spark
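On Elina's original question: the hdfs://... address is the fs.defaultFS (or legacy fs.default.name) value in the cluster's core-site.xml; a sketch with an assumed namenode hostname:

val textFile = sc.textFile("hdfs://namenode.example.com:8020/user/elina/data.txt")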
Hi,
I have several issues related to HDFS that may have different roots. I'm
posting as much information as I can, with the hope that I can get your
opinion on at least some of them. Basically the cases are:
- HDFS classes not found
- Connections with some datanodes seem to be slow
On 11 Jul 2015, at 19:20, Aaron Davidson ilike...@gmail.com wrote:
Note that if you use multi-part upload, each part becomes 1 block, which
allows for multiple concurrent readers. One would typically use fixed-size
block sizes which align with Spark's default HDFS block size (64 MB, I
think) to ensure the reads are aligned.
On Sat, Jul 11, 2015 at 11:14 AM
Are there any significant performance differences between reading text
files from S3 and HDFS?
Latency is much higher for S3 (if that matters), and with HDFS you'd get data
locality that will boost your app performance.
I did some light experimenting on this.
See my presentation here for some benchmark numbers etc.:
http://www.slideshare.net/sujee/hadoop-to-sparkv2
from slide 34
cheers
I recommend testing it for yourself. Even if you have no application, you
can just run the spark-ec2 script, log in, run spark-shell, and try reading
files from an S3 bucket and from hdfs://<master IP>:9000/. (This is the
ephemeral HDFS cluster, which uses SSD.)
I just tested our application
Hello,
How should I write a text file stream DStream to HDFS?
I tried the following:
val lines = ssc.textFileStream("hdfs:/user/hadoop/spark/streaming/input/")
lines.saveAsTextFile("hdfs:/user/hadoop/output1")
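A sketch of what may have been intended: on a DStream the method is saveAsTextFiles (plural), which writes one directory of part files per batch interval; the paths are assumptions:

val lines = ssc.textFileStream("hdfs:///user/hadoop/spark/streaming/input/")
lines.saveAsTextFiles("hdfs:///user/hadoop/output/lines", "txt")  // one dir per batch: lines-<timestamp>.txt
ssc.start()
ssc.awaitTermination()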
- Driver authenticates to Kerberos via
UserGroupInformation.loginUserFromKeytab(principal, keytab)
- Driver instantiates a Hadoop configuration via hdfs-site.xml and core-site.xml
- Driver instantiates the Hadoop file system from a path based on the Hadoop
root URI (hdfs://hadoop-cluster.site.org
, thanks everyone.
From: Steve Loughran <ste...@hortonworks.com>
Sent: Monday, June 29, 2015 10:32 AM
To: Dave Ariens
Cc: Tim Chen; Marcelo Vanzin; Olivier Girardot; user@spark.apache.org
Subject: Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos
On 29 Jun 2015, at 14:18, Dave
, and serialize/cache it for the executors to use instead of them having to
instantiate their own.
I am running a Spark Streaming example from the Learning Spark book with one
change. The change I made was for streaming a file from HDFS:
val lines = ssc.textFileStream("hdfs:/user/hadoop/spark/streaming/input")
I ran the application a number of times, and every time I dropped a new file
in the input
to the request.
2. The YARN RM uses the HDFS token for the localisation, so the node managers
can access the content the user has the rights to.
3. There's some other stuff related to token refresh of restarted app masters,
essentially guaranteeing that even an AM restarted 3 days after the first
launch
Sent: Friday, June 26, 2015 6:20 PM
To: Dave Ariens
Cc: Tim Chen; Olivier Girardot; user@spark.apache.org
Subject: Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos
On Fri, Jun 26, 2015 at 3:09 PM, Dave Ariens dari...@blackberry.com
wrote:
having to kinit first), but not on Mesos. Is there a way to have the slaves
running the tasks perform the same Kerberos login before they attempt to
access HDFS?
Putting aside the security of Spark/Mesos and how that keytab would get
distributed, I'm just looking for a working POC.
Is there a way to leverage the Broadcast capability
Mesos does support running containers as specific users passed to it.
Thanks for chiming in. What else does YARN do with Kerberos besides the keytab
file and user?
Tim
On Fri, Jun 26, 2015 at 1:20 PM, Marcelo Vanzin van...@cloudera.com wrote:
On Fri, Jun 26, 2015 at 1:13 PM, Tim Chen
runs as the requesting user, or in a separate container that cannot be
accessed by other applications in any way).
On top of that, for HDFS and other Hadoop services, the applications
themselves need to be aware that Kerberos is enabled and that they need to
do certain things. For example, they need
On Fri, Jun 26, 2015 at 1:13 PM, Tim Chen t...@mesosphere.io wrote:
So correct me if I'm wrong: it sounds like all you need is a principal user
name and also a keytab file downloaded, right?
I'm not familiar with Mesos, so I don't know what kinds of features it has,
but at the very least it would
of the executor
isn't hard, it's being able to call the login method before the HDFS resources
are accessed.
See the gist below. That login completes successfully but it's only on the
driver. Once that HDFS resource is read with the Avro input format and key
and the tasks are created
Hi Timothy,
Because I'm running Spark on Mesos alongside a secured Hadoop cluster, I need
to ensure that my tasks running on the slaves perform a Kerberos login before
accessing any HDFS resources. To login, they just need the name of the
principal (username) and a keytab file. Then they just need to invoke the
following Java:
import org.apache.hadoop.security.UserGroupInformation
UserGroupInformation.loginUserFromKeytab(adminPrincipal, adminKeytab)
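A self-contained sketch of that login step; the principal, keytab path, and the explicit setConfiguration call are assumptions added for completeness:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

val conf = new Configuration()
conf.set("hadoop.security.authentication", "kerberos")  // must match the cluster's core-site.xml
UserGroupInformation.setConfiguration(conf)
// After this call, Hadoop FileSystem operations in this JVM act as the given principal
UserGroupInformation.loginUserFromKeytab("admin@EXAMPLE.ORG", "/etc/security/keytabs/admin.keytab")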
On Fri, Jun 26, 2015 at 3:09 PM, Dave Ariens dari...@blackberry.com wrote:
Would there be any way to have the task instances in the slaves call the
UGI login with a principal/keytab provided to the driver?
That would only work with a very small number of executors. If you have
many login