Re: Small File to HDFS

2015-09-04 Thread Ted Yu
What about concurrent access (read / update) to the small file with the same key? That can get a bit tricky. On Thu, Sep 3, 2015 at 2:47 PM, Jörn Franke <jornfra...@gmail.com> wrote: > Well it is the same as in normal hdfs, delete file and put a new one with > the same name works.

Re: Small File to HDFS

2015-09-04 Thread Ted Yu
>>> That can get a bit tricky. >>> >>> On Thu, Sep 3, 2015 at 2:47 PM, Jörn Franke <jornfra...@gmail.com> >>> wrote: >>> >>>> Well it is the same as in normal hdfs, delete file and put a new one >>>> with the same name w

Is HDFS required for Spark streaming?

2015-09-04 Thread N B
The process as such does not need any distributed file system. Now, we do want to start distributing this processing across a few machines and make a real cluster out of it. However, I am not sure if HDFS is a hard requirement for that to happen. I am thinking about the Shuffle spills, DStream/RDD p

Re: Small File to HDFS

2015-09-04 Thread Jörn Franke
> wrote: > >> Hello, >> I'm currently using Spark Streaming to collect small messages (events), >> size being <50 KB, volume is high (several millions per day) and I have to >> store those messages in HDFS. >> I understood that storing small files can be problem

Re: Is HDFS required for Spark streaming?

2015-09-04 Thread Tathagata Das
Shuffle spills will use local disk, HDFS not needed. Spark and Spark Streaming checkpoint info WILL NEED HDFS for fault-tolerance. So that stuff can be recovered even if the spark cluster nodes go down. TD On Fri, Sep 4, 2015 at 2:45 PM, N B <nb.nos...@gmail.com> wrote: > Hello,
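
A minimal sketch of the checkpointing TD describes, assuming a made-up application name, batch interval and HDFS checkpoint path:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // Hypothetical checkpoint directory; any fault-tolerant filesystem URI works here.
  val checkpointDir = "hdfs://namenode:8020/user/spark/checkpoints/myApp"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("checkpointed-streaming")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir) // streaming metadata/RDD checkpoints land on HDFS
    // ... define the DStreams here ...
    ssc
  }

  // On restart, recover from the checkpoint instead of rebuilding the context.
  val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)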

Spark Streaming - Small file in HDFS

2015-09-04 Thread Pravesh Jain
Were you able to find a solution to your problem?

Re: Small File to HDFS

2015-09-03 Thread nibiau
My main question in case of HAR usage is, is it possible to use Pig on it and what about performance? - Original Message - From: "Jörn Franke" <jornfra...@gmail.com> To: nib...@free.fr, user@spark.apache.org Sent: Thursday, September 3, 2015 15:54:42 Subject: Re: Small File t

Re: Small File to HDFS

2015-09-03 Thread Jörn Franke
Store them as a hadoop archive (har) On Wed, Sep 2, 2015 at 18:07, <nib...@free.fr> wrote: > Hello, > I'm currently using Spark Streaming to collect small messages (events), > size being <50 KB, volume is high (several millions per day) and I have to > store those

Re: Small File to HDFS

2015-09-03 Thread Tao Lu
HAR usage is, is it possible to use Pig on it > and what about performance? > > - Original Message - > From: "Jörn Franke" <jornfra...@gmail.com> > To: nib...@free.fr, user@spark.apache.org > Sent: Thursday, September 3, 2015 15:54:42 > Subject: Re: Small File to HD

Re: Small File to HDFS

2015-09-03 Thread Martin Menzel
Regarding update and delete: As far as I know, HDFS does not support update and delete. Tools like HBase realize this by using several HDFS files and rewriting them from time to time. Depending on the frequency with which you need to update / delete data, you can think about housekeeping your HDFS file

Re: Small File to HDFS

2015-09-03 Thread nibiau
by a new content (remove/replace) Tks a lot Nicolas - Original Message - From: "Jörn Franke" <jornfra...@gmail.com> To: nib...@free.fr Cc: user@spark.apache.org Sent: Thursday, September 3, 2015 19:29:42 Subject: Re: Small File to HDFS Har is transparent and hardly any performance o

Re: Small File to HDFS

2015-09-03 Thread Jörn Franke
to use Pig on it > and what about performance? > > - Original Message - > From: "Jörn Franke" <jornfra...@gmail.com> > To: nib...@free.fr, user@spark.apache.org > Sent: Thursday, September 3, 2015 15:54:42 > Subject: Re: Small File to HDFS > > > > > Store them as h

Re: Small File to HDFS

2015-09-03 Thread Jörn Franke
Well it is the same as in normal hdfs, delete file and put a new one with the same name works. On Thu, Sep 3, 2015 at 21:18, <nib...@free.fr> wrote: > HAR archive seems a good idea, but just a last question to be sure to make > the best choice: - Is it possible to override

Re: Small File to HDFS

2015-09-03 Thread nibiau
Ok, but then some questions: - Sometimes I have to remove some messages from HDFS (cancel/replace cases), is it possible? - In the case of a big zip file, is it possible to easily run Pig on it directly? Tks Nicolas - Original Message - From: "Tao Lu" <taolu2...@gmai

Re: Small File to HDFS

2015-09-03 Thread Ndjido Ardo Bar
Hi Nibiau, Hbase seems to be a good solution to your problems. As you may know, storing your messages as key-value pairs in Hbase saves you the overhead of manually resizing blocks of data using zip files. The added advantage, along with the fact that Hbase uses HDFS for storage
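
A rough sketch of the key-value approach suggested here, assuming the HBase 1.x client API and made-up table/column names:

  import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
  import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
  import org.apache.hadoop.hbase.util.Bytes

  val hbaseConf = HBaseConfiguration.create() // picks up hbase-site.xml from the classpath
  val connection = ConnectionFactory.createConnection(hbaseConf)
  val table = connection.getTable(TableName.valueOf("events")) // hypothetical table

  // One small message stored (and later overwritten/updated) under its business key.
  val put = new Put(Bytes.toBytes("message-key-123"))
  put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("<xml>...</xml>"))
  table.put(put)

  table.close()
  connection.close()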

Re: Small File to HDFS

2015-09-03 Thread Ted Yu
messages as a key-value pairs in Hbase saves you the overhead of > manually resizing blocks of data using zip files. > The added advantage along with the fact that Hbase uses HDFS for storage, is > the capability of updating your records for example with the "put" function. >

Re: Small File to HDFS

2015-09-02 Thread Tao Lu
You may consider storing it in one big HDFS file, and to keep appending new messages to it. For instance, one message -> zip it -> append it to the HDFS as one line On Wed, Sep 2, 2015 at 12:43 PM, <nib...@free.fr> wrote: > Hi, > I already store them in MongoDB in parral
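
A sketch of that append idea with hypothetical paths, assuming append is enabled on the cluster (dfs.support.append); in practice you would batch writes per partition rather than reopen the file per message:

  import java.net.URI
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  val hadoopConf = new Configuration()
  val fs = FileSystem.get(URI.create("hdfs://namenode:8020"), hadoopConf)
  val target = new Path("/user/nicolas/events/all-messages.txt")

  // Append if the file already exists, otherwise create it.
  val out = if (fs.exists(target)) fs.append(target) else fs.create(target)
  out.write(("encoded-or-zipped-message-here" + "\n").getBytes("UTF-8"))
  out.close()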

Small File to HDFS

2015-09-02 Thread nibiau
Hello, I'm currently using Spark Streaming to collect small messages (events), size being <50 KB, volume is high (several millions per day) and I have to store those messages in HDFS. I understood that storing small files can be problematic in HDFS, how can I manage it? Tks Nico

Re: Small File to HDFS

2015-09-02 Thread nibiau
pache.org> Sent: Wednesday, September 2, 2015 18:34:17 Subject: Re: Small File to HDFS Instead of storing those messages in HDFS, have you considered storing them in a key-value store (e.g. HBase)? Cheers On Wed, Sep 2, 2015 at 9:07 AM, < nib...@free.fr > wrote: Hello, I'm curr

Re: Small File to HDFS

2015-09-02 Thread Ted Yu
Instead of storing those messages in HDFS, have you considered storing them in a key-value store (e.g. HBase)? Cheers On Wed, Sep 2, 2015 at 9:07 AM, <nib...@free.fr> wrote: > Hello, > I'm currently using Spark Streaming to collect small messages (events), > size being <50 K

Spark DataFrame saveAsTable with partitionBy creates no ORC file in HDFS

2015-09-02 Thread unk1102
Hi, I have a Spark DataFrame which I want to save as a Hive table with partitions. I tried the following two statements but they don't work: I don't see any ORC files in the HDFS directory, it is empty. I can see baseTable is there in the Hive console but obviously it is empty because there are no files inside HDFS

Re: Spark DataFrame saveAsTable with partitionBy creates no ORC file in HDFS

2015-09-02 Thread Michael Armbrust
at 1:34 PM, unk1102 <umesh.ka...@gmail.com> wrote: > Hi I have a Spark dataframe which I want to save as hive table with > partitions. I tried the following two statements but they dont work I dont > see any ORC files in HDFS directory its empty. I can see baseTable is there
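
For reference, a hedged sketch of the Spark 1.4+ DataFrameWriter call being discussed; df and the partition column names are assumptions, and whether partitioned ORC saveAsTable behaves as expected depends on the Spark/Hive versions in play:

  import org.apache.spark.sql.SaveMode

  df.write
    .format("orc")
    .partitionBy("entity", "date") // partition columns must exist in df
    .mode(SaveMode.Append)
    .saveAsTable("baseTable")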

Data Security on Spark-on-HDFS

2015-08-31 Thread Daniel Schulz
Hi guys, In a nutshell: does Spark check and respect user privileges when reading/writing data? I am curious about the data security when Spark runs on top of HDFS — maybe through YARN. Is Spark running its long-running JVM processes as a Spark user, that makes no distinction when accessing

Re: Data Security on Spark-on-HDFS

2015-08-31 Thread Steve Loughran
urious about the data security when Spark runs on top of HDFS — maybe > though YARN. Is Spark running it's long-running JVM processes as a Spark > user, that makes no distinction when accessing data? So is there a > shortcoming when using Spark because the JVM processes are already runn

Re: Where is Redgate's HDFS explorer?

2015-08-29 Thread Akhil Das
You can also mount HDFS through the NFS gateway and access it, I think. Thanks Best Regards On Tue, Aug 25, 2015 at 3:43 AM, Dino Fancellu d...@felstar.com wrote: http://hortonworks.com/blog/windows-explorer-experience-hdfs/ Seemed to exist, now no sign. Anything similar to tie HDFS

Re: Where is Redgate's HDFS explorer?

2015-08-29 Thread Ted Yu
See https://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html FYI On Sat, Aug 29, 2015 at 1:04 AM, Akhil Das ak...@sigmoidanalytics.com wrote: You can also mount HDFS through the NFS gateway and access i think. Thanks Best Regards On Tue, Aug 25, 2015 at 3

Re: Where is Redgate's HDFS explorer?

2015-08-29 Thread Roberto Congiu
If HDFS is on a Linux VM, you could also mount it with FUSE and export it with Samba. 2015-08-29 2:26 GMT-07:00 Ted Yu yuzhih...@gmail.com: See https://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html FYI On Sat, Aug 29, 2015 at 1:04 AM, Akhil Das ak

Re: Where is Redgate's HDFS explorer?

2015-08-29 Thread Roberto Congiu
It depends: if HDFS is running under Windows, FUSE won't work, but if HDFS is on a Linux VM, box, or cluster, then you can have the Linux box/VM mount HDFS through FUSE and at the same time export its mount point over Samba. At that point, your Windows machine can just connect to the Samba share. R

Re: Where is Redgate's HDFS explorer?

2015-08-29 Thread Dino Fancellu
I'm using Windows. Are you saying it works with Windows? Dino. On 29 August 2015 at 09:04, Akhil Das ak...@sigmoidanalytics.com wrote: You can also mount HDFS through the NFS gateway and access i think. Thanks Best Regards On Tue, Aug 25, 2015 at 3:43 AM, Dino Fancellu d...@felstar.com

Re: Local Spark talking to remote HDFS?

2015-08-25 Thread Steve Loughran
is not the only port you need tunnelled for HDFS to work. If you only list the contents of a directory, port 8020 is enough... for instance, using something val p = new org.apache.hadoop.fs.Path("hdfs://localhost:8020/") val fs = p.getFileSystem(sc.hadoopConfiguration) fs.listStatus(p) you

Re: Local Spark talking to remote HDFS?

2015-08-25 Thread Roberto Congiu
Port 8020 is not the only port you need tunnelled for HDFS to work. If you only list the contents of a directory, port 8020 is enough... for instance, using something val p = new org.apache.hadoop.fs.Path("hdfs://localhost:8020/") val fs = p.getFileSystem(sc.hadoopConfiguration) fs.listStatus(p

Re: Local Spark talking to remote HDFS?

2015-08-25 Thread Roberto Congiu
. I can't imagine I'm the only person on the planet wanting to do this. Anyway, thanks for trying to help. Dino. On 25 August 2015 at 08:22, Roberto Congiu roberto.con...@gmail.com wrote: Port 8020 is not the only port you need tunnelled for HDFS to work. If you only list

Re: Too many files/dirs in hdfs

2015-08-25 Thread Mohit Anchlia
Based on what I've read, it appears that when using Spark Streaming there is no good way of optimizing the files on HDFS. Spark Streaming writes many small files, which is not scalable in Apache Hadoop. The only other way seems to be to read the files after they have been written and merge them into a bigger file

Re: Local Spark talking to remote HDFS?

2015-08-25 Thread Dino Fancellu
for HDFS to work. If you only list the contents of a directory, port 8020 is enough... for instance, using something val p = new org.apache.hadoop.fs.Path("hdfs://localhost:8020/") val fs = p.getFileSystem(sc.hadoopConfiguration) fs.listStatus(p) you should see the file list. But then, when

Local Spark talking to remote HDFS?

2015-08-24 Thread Dino Fancellu
I have a file in HDFS inside my HortonWorks HDP 2.3_1 VirtualBox VM. If I go into the guest spark-shell and refer to the file thus, it works fine val words = sc.textFile("hdfs:///tmp/people.txt") words.count However if I try to access it from a local Spark app on my Windows host, it doesn't

Re: Local Spark talking to remote HDFS?

2015-08-24 Thread Dino Fancellu
the default HDP VM is set up, that is, if it only binds HDFS to 127.0.0.1 or to all addresses. You can check that with netstat -a. R. 2015-08-24 11:46 GMT-07:00 Dino Fancellu d...@felstar.com: I have a file in HDFS inside my HortonWorks HDP 2.3_1 VirtualBox VM. If I go into the guest spark

Re: Local Spark talking to remote HDFS?

2015-08-24 Thread Roberto Congiu
if you use vagrant, there's also a vagrant module that can do that automatically. Also, I am not sure how the default HDP VM is set up, that is, if it only binds HDFS to 127.0.0.1 or to all addresses. You can check that with netstat -a. R. 2015-08-24 11:46 GMT-07:00 Dino Fancellu d...@felstar.com: I

Where is Redgate's HDFS explorer?

2015-08-24 Thread Dino Fancellu
http://hortonworks.com/blog/windows-explorer-experience-hdfs/ Seemed to exist, now no sign. Anything similar to tie HDFS into Windows Explorer? Thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Where-is-Redgate-s-HDFS-explorer-tp24431.html Sent

Re: Too many files/dirs in hdfs

2015-08-24 Thread Mohit Anchlia
.nabble.com/save-spark-streaming-output-to-single-file-on-hdfs-td21124.html#a21167 or have a separate program which will do the clean up for you. Thanks Best Regards On Sat, Aug 15, 2015 at 5:20 AM, Mohit Anchlia mohitanch...@gmail.com wrote: Spark stream seems to be creating 0 bytes files even when

Re: Using spark streaming to load data from Kafka to HDFS

2015-08-22 Thread Xu (Simon) Chen
into HDFS in Parquet at a longer interval. One problem is that storing Parquet is sometimes time-consuming, and that causes delays for my regular stats-generating tasks. I am thinking of splitting my streaming job into two, one for Parquet output and one for stats generation, but obviously this would

Re: Data locality with HDFS not being seen

2015-08-21 Thread Sameer Farooqui
Hi Sunil, Have you seen this fix in Spark 1.5 that may fix the locality issue?: https://issues.apache.org/jira/browse/SPARK-4352 On Thu, Aug 20, 2015 at 4:09 AM, Sunil sdhe...@gmail.com wrote: Hello . I am seeing some unexpected issues with achieving HDFS data locality. I expect

Data locality with HDFS not being seen

2015-08-20 Thread Sunil
Hello. I am seeing some unexpected issues with achieving HDFS data locality. I expect the tasks to be executed only on the node which has the data, but this is not happening (of course, unless the node is busy, in which case I understand tasks can go to some other node). Could anyone

Re: Too many files/dirs in hdfs

2015-08-19 Thread Mohit Anchlia
to it (whether it has anything or not). There is no direct append call as of now, but you can achieve this either with FileUtil.copyMerge http://apache-spark-user-list.1001560.n3.nabble.com/save-spark-streaming-output-to-single-file-on-hdfs-td21124.html#a21167 or have a separate program which
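
A small sketch of the FileUtil.copyMerge approach mentioned above, with made-up paths; copyMerge exists in Hadoop 2.x but was dropped in later Hadoop releases, so treat it as version-dependent:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

  val conf = new Configuration()
  val fs = FileSystem.get(conf)
  val srcDir = new Path("/user/mohit/streaming-output/batch-1439500000000")
  val dstFile = new Path("/user/mohit/merged/output-1439500000000.txt")

  // copyMerge(srcFS, srcDir, dstFS, dstFile, deleteSource, conf, addString)
  FileUtil.copyMerge(fs, srcDir, fs, dstFile, true, conf, null)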

Re: Too many files/dirs in hdfs

2015-08-18 Thread Mohit Anchlia
-to-single-file-on-hdfs-td21124.html#a21167 or have a separate program which will do the clean up for you. Thanks Best Regards On Sat, Aug 15, 2015 at 5:20 AM, Mohit Anchlia mohitanch...@gmail.com wrote: Spark stream seems to be creating 0 bytes files even when there is no data. Also, I have

Re: Too many files/dirs in hdfs

2015-08-18 Thread UMESH CHAUDHARY
.1001560.n3.nabble.com/save-spark-streaming-output-to-single-file-on-hdfs-td21124.html#a21167 or have a separate program which will do the clean up for you. Thanks Best Regards On Sat, Aug 15, 2015 at 5:20 AM, Mohit Anchlia mohitanch...@gmail.com wrote: Spark stream seems to be creating 0 bytes

Re: Too many files/dirs in hdfs

2015-08-17 Thread UMESH CHAUDHARY
with FileUtil.copyMerge http://apache-spark-user-list.1001560.n3.nabble.com/save-spark-streaming-output-to-single-file-on-hdfs-td21124.html#a21167 or have a separate program which will do the clean up for you. Thanks Best Regards On Sat, Aug 15, 2015 at 5:20 AM, Mohit Anchlia mohitanch...@gmail.com wrote: Spark

Re: Too many files/dirs in hdfs

2015-08-17 Thread Akhil Das
-streaming-output-to-single-file-on-hdfs-td21124.html#a21167 or have a separate program which will do the clean up for you. Thanks Best Regards On Sat, Aug 15, 2015 at 5:20 AM, Mohit Anchlia mohitanch...@gmail.com wrote: Spark stream seems to be creating 0 bytes files even when there is no data. Also

Too many files/dirs in hdfs

2015-08-14 Thread Mohit Anchlia
Spark Streaming seems to be creating 0-byte files even when there is no data. Also, I have 2 concerns here: 1) Extra unnecessary files are being created from the output 2) Hadoop doesn't work really well with too many files and I see that it is creating a directory with a timestamp every 1 second.

Re: Spark-submit fails when jar is in HDFS

2015-08-09 Thread Akhil Das
Did you try this way? /usr/local/spark/bin/spark-submit --master mesos://mesos.master:5050 --conf spark.mesos.executor.docker.image=docker.repo/spark:latest --class org.apache.spark.examples.SparkPi --jars hdfs://hdfs1/tmp/spark-examples-1.4.1-hadoop2.6.0-cdh5.4.4.jar 100 Thanks Best

Re: Spark-submit fails when jar is in HDFS

2015-08-09 Thread Dean Wampler
org.apache.spark.examples.SparkPi --jars hdfs://hdfs1/tmp/spark-examples-1.4.1-hadoop2.6.0-cdh5.4.4.jar 100 Thanks Best Regards On Fri, Aug 7, 2015 at 5:51 AM, Alan Braithwaite a...@cloudflare.com wrote: Hi All, We're trying to run spark with mesos and docker in client mode (since mesos

Re: Spark-submit fails when jar is in HDFS

2015-08-09 Thread Alan Braithwaite
Did you try this way? /usr/local/spark/bin/spark-submit --master mesos://mesos.master:5050 --conf spark.mesos.executor.docker.image=docker.repo/spark:latest --class org.apache.spark.examples.SparkPi --jars hdfs://hdfs1/tmp/spark-examples-1.4.1-hadoop2.6.0-cdh5.4.4.jar 100 I did, and got

Re: Removing empty partitions before we write to HDFS

2015-08-06 Thread Patanachai Tangchaisin
Currently, I use rdd.isEmpty() Thanks, Patanachai On 08/06/2015 12:02 PM, gpatcham wrote: Is there a way to filter out empty partitions before I write to HDFS other than using reparition and colasce ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com
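
A minimal sketch of that isEmpty() guard in a streaming write path; dstream and the output path are assumptions:

  dstream.foreachRDD { (rdd, time) =>
    if (!rdd.isEmpty()) {
      // coalesce is optional; it just cuts down the number of part files per batch
      rdd.coalesce(1).saveAsTextFile(s"hdfs:///user/spark/output/batch-${time.milliseconds}")
    }
  }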

Re: Removing empty partitions before we write to HDFS

2015-08-06 Thread Richard Marscher
Not that I'm aware of. We ran into a similar issue where we didn't want to keep accumulating all these empty part files in storage on S3 or HDFS. There didn't seem to be any performance-free way to do it with an RDD, so we just run a non-Spark post-batch operation to delete empty files from

Re: Unable to persist RDD to HDFS

2015-08-06 Thread Philip Weaver
This isn't really a Spark question. You're trying to parse a string to an integer, but it contains an invalid character. The exception message explains this. On Wed, Aug 5, 2015 at 11:34 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Code: import java.text.SimpleDateFormat import
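
A hedged sketch of guarding that parse so one malformed record doesn't kill the job; rows and the column index are made up:

  import scala.util.Try

  def safeToInt(s: String): Option[Int] = Try(s.trim.toInt).toOption

  // Keep only rows whose third column is a clean integer.
  val cleaned = rows.flatMap(cols => safeToInt(cols(2)).map(v => (cols(0), v)))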

Spark-submit fails when jar is in HDFS

2015-08-06 Thread abraithwaite
Hi All, We're trying to run spark with mesos and docker in client mode (since mesos doesn't support cluster mode) and load the application Jar from HDFS. The following is the command we're running: We're getting the following warning before an exception from that command: Before I debug

Unable to persist RDD to HDFS

2015-08-06 Thread ๏̯͡๏
Code: import java.text.SimpleDateFormat import java.util.Calendar import java.sql.Date import org.apache.spark.storage.StorageLevel def formatStringAsDate(dateStr: String) = new java.sql.Date(new SimpleDateFormat("yyyy-MM-dd").parse(dateStr).getTime())

Re: How to connect to remote HDFS programmatically to retrieve data, analyse it and then write the data back to HDFS?

2015-08-05 Thread Ted Yu
Please see the comments at the tail of SPARK-2356 Cheers On Wed, Aug 5, 2015 at 6:04 PM, Ashish Dutt ashish.du...@gmail.com wrote: *Use Case:* To automate the process of data extraction (HDFS), data analysis (pySpark/sparkR) and saving the data back to HDFS programmatically. *Prospective

How to connect to remote HDFS programmatically to retrieve data, analyse it and then write the data back to HDFS?

2015-08-05 Thread Ashish Dutt
*Use Case:* To automate the process of data extraction (HDFS), data analysis (pySpark/sparkR) and saving the data back to HDFS programmatically. *Prospective solutions:* 1. Create a remote server connectivity program in an IDE like pyCharm or RStudio and use it to retrieve the data from HDFS

Re: Writing to HDFS

2015-08-04 Thread Akhil Das
Just to add rdd.take(1) won't trigger the entire computation, it will just pull out the first record. You need to do a rdd.count() or rdd.saveAs*Files to trigger the complete pipeline. How many partitions do you see in the last stage? Thanks Best Regards On Tue, Aug 4, 2015 at 7:10 AM, ayan guha

Poor HDFS Data Locality on Spark-EC2

2015-08-04 Thread Jerry Lam
Hi Spark users and developers, I have been trying to use spark-ec2. After I launched the spark cluster (1.4.1) with ephemeral hdfs (using hadoop 2.4.0), I tried to execute a job where the data is stored in the ephemeral hdfs. It does not matter what I tried to do, there is no data locality at all

Re: Writing to HDFS

2015-08-03 Thread ayan guha
Is your data skewed? What happens if you do rdd.count()? On 4 Aug 2015 05:49, Jasleen Kaur jasleenkaur1...@gmail.com wrote: I am executing a spark job on a cluster as a yarn-client(Yarn cluster not an option due to permission issues). - num-executors 800 - spark.akka.frameSize=1024

Writing to HDFS

2015-08-03 Thread Jasleen Kaur
I am executing a spark job on a cluster as a yarn-client(Yarn cluster not an option due to permission issues). - num-executors 800 - spark.akka.frameSize=1024 - spark.default.parallelism=25600 - driver-memory=4G - executor-memory=32G. - My input size is around 1.5TB. My problem

Re: How to add multiple sequence files from HDFS to a Spark Context to do Batch processing?

2015-07-31 Thread Marcelo Vanzin
file can be a directory (look at all children) or even a glob (/path/*.ext, for example). On Fri, Jul 31, 2015 at 11:35 AM, swetha swethakasire...@gmail.com wrote: Hi, How to add multiple sequence files from HDFS to a Spark Context to do Batch processing? I have something like the following
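
A sketch of both forms, assuming Text keys and values and made-up paths; the comma-separated form relies on the underlying FileInputFormat accepting comma-separated path lists, so verify it on your version:

  import org.apache.hadoop.io.Text

  // A glob covering many sequence files:
  val byGlob = sc.sequenceFile("hdfs:///data/events/2015/07/*/part-*", classOf[Text], classOf[Text])

  // Or an explicit comma-separated list of paths:
  val byList = sc.sequenceFile("hdfs:///data/a.seq,hdfs:///data/b.seq", classOf[Text], classOf[Text])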

How to add multiple sequence files from HDFS to a Spark Context to do Batch processing?

2015-07-31 Thread swetha
Hi, How do I add multiple sequence files from HDFS to a Spark context to do batch processing? I have something like the following in my code. Do I have to add a comma-separated list of sequence file paths to the Spark context? val data = if (args.length > 0 && args(0) != null) sc.sequenceFile(file

Losing files in hdfs after creating spark sql table

2015-07-30 Thread Ron Gonzalez
Hi, After I create a table in Spark SQL and load an HDFS file into it (load infile), the file no longer shows up when I do hadoop fs -ls. Is this expected? Thanks, Ron

Best practice for transforming and storing from Spark to Mongo/HDFS

2015-07-25 Thread nibiau
MongoDB and HDFS (Mongo key as file name) Basically, I would say that I have to handle messages one by one inside a foreach loop over the RDD and write each message one by one to MongoDB and HDFS. Do you think that is the best way to do it? Tks Nicolas

Re: Best practice for transforming and storing from Spark to Mongo/HDFS

2015-07-25 Thread Cody Koeninger
of the RDD (xml2json + some enrichments) - Spark store the transformed/enriched messages inside MongoDB and HDFS (Mongo Key as file name) Basically, I would say that I have to manage message one by one inside a foreach loop of the RDD and write each message one by one in MongoDB and HDFS. Do

50% performance decrease when using local file vs hdfs

2015-07-24 Thread Tom Hubregtsen
Hi, When running two experiments with the same application, we see a 50% performance difference between using HDFS and files on disk, both using the textFile/saveAsTextFile call. Almost all performance loss is in Stage 1. Input (in Stage 0): The file is read in using val input = sc.textFile

Re: write a HashMap to HDFS in Spark

2015-07-18 Thread Gylfi
Hi. Assuming you have the data in an RDD, you can save your RDD (regardless of structure) with nameRDD.saveAsObjectFile(path) where path can be hdfs:///myfolderonHDFS or the local file system. Alternatively you can also use .saveAsTextFile() Regards, Gylfi. -- View this message
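
A small round trip along those lines, assuming the HashMap is an in-memory Scala map with String keys and values and a hypothetical path:

  val rdd = sc.parallelize(myHashMap.toSeq)          // RDD[(String, String)]
  rdd.saveAsObjectFile("hdfs:///user/me/myHashMap")

  // Reading it back needs the element type:
  val restored = sc.objectFile[(String, String)]("hdfs:///user/me/myHashMap").collectAsMap()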

Invalid HDFS path exception

2015-07-16 Thread wazza
in this service. Can anyone tell me how to solve this issue? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Invalid-HDFS-path-exception-tp23875.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Spark and HDFS

2015-07-15 Thread Jeskanen, Elina
I have Spark 1.4 on my local machine and I would like to connect to our local 4-node Cloudera cluster. But how? In the example it says text_file = spark.textFile("hdfs://..."), but can you advise me where to get this hdfs://... address? Thanks! Elina

Re: Spark and HDFS

2015-07-15 Thread Marcelo Vanzin
On Wed, Jul 15, 2015 at 5:36 AM, Jeskanen, Elina elina.jeska...@cgi.com wrote: I have Spark 1.4 on my local machine and I would like to connect to our local 4 nodes Cloudera cluster. But how? In the example it says text_file = spark.textFile(hdfs://...), but can you advise me in where

Re: Spark and HDFS

2015-07-15 Thread ayan guha
Assuming you run Spark locally (i.e. either local mode or a standalone cluster on your local m/c): 1. You need to have the Hadoop binaries locally 2. You need to have hdfs-site on the Spark classpath of your local m/c I would suggest you start off with local files to play around. If you need to run spark
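
A minimal sketch of reading from the remote cluster once that is in place; the namenode host/port are placeholders that should come from fs.defaultFS in the cluster's core-site.xml:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf().setAppName("remote-hdfs-test").setMaster("local[*]")
  val sc = new SparkContext(conf)

  val textFile = sc.textFile("hdfs://namenode.example.com:8020/user/elina/sample.txt")
  println(textFile.count())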

Re: Spark and HDFS

2015-07-15 Thread Naveen Madhire
: Assuming you run spark locally (ie either local mode or standalone cluster on your localm/c) 1. You need to have hadoop binaries locally 2. You need to have hdfs-site on Spark Classpath of your local m/c I would suggest you to start off with local files to play around. If you need to run spark

Re: HDFS performances + unexpected death of executors.

2015-07-14 Thread Max Demoulin
wrote: Hi, I have several issues related to HDFS, that may have different roots. I'm posting as much information as I can, with the hope that I can get your opinion on at least some of them. Basically the cases are: - HDFS classes not found - Connections with some datanode seems to be slow

HDFS performances + unexpected death of executors.

2015-07-13 Thread maxdml
Hi, I have several issues related to HDFS, that may have different roots. I'm posting as much information as I can, with the hope that I can get your opinion on at least some of them. Basically the cases are: - HDFS classes not found - Connections with some datanode seems to be slow

Re: S3 vs HDFS

2015-07-12 Thread Steve Loughran
On 11 Jul 2015, at 19:20, Aaron Davidson ilike...@gmail.com wrote: Note that if you use multi-part upload, each part becomes 1 block, which allows for multiple concurrent readers. One would typically use fixed-size block sizes which align with Spark's default HDFS

Re: S3 vs HDFS

2015-07-11 Thread Aaron Davidson
Note that if you use multi-part upload, each part becomes 1 block, which allows for multiple concurrent readers. One would typically use fixed-size block sizes which align with Spark's default HDFS block size (64 MB, I think) to ensure the reads are aligned. On Sat, Jul 11, 2015 at 11:14 AM

S3 vs HDFS

2015-07-09 Thread Brandon White
Are there any significant performance differences between reading text files from S3 and hdfs?

Re: S3 vs HDFS

2015-07-09 Thread Sujee Maniyam
latency is much higher for S3 (if that matters). And with HDFS you'd get data locality that will boost your app performance. I did some light experimenting on this; see my presentation here for some benchmark numbers etc. http://www.slideshare.net/sujee/hadoop-to-sparkv2 from slide# 34 cheers

Re: S3 vs HDFS

2015-07-09 Thread Daniel Darabos
I recommend testing it for yourself. Even if you have no application, you can just run the spark-ec2 script, log in, run spark-shell and try reading files from an S3 bucket and from hdfs://<master IP>:9000/. (This is the ephemeral HDFS cluster, which uses SSD.) I just tested our application

Re: text file stream to HDFS

2015-07-04 Thread Ted Yu
, ravi tella ddpis...@gmail.com wrote: Hello, How should I write a text file stream DStream to HDFS? I tried the following val lines = ssc.textFileStream("hdfs:/user/hadoop/spark/streaming/input/") lines.saveAsTextFile("hdfs:/user/hadoop/output1") val lines = ssc.textFileStream("hdfs

text file stream to HDFS

2015-07-04 Thread ravi tella
Hello, How should I write a text file stream DStream to HDFS? I tried the following val lines = ssc.textFileStream("hdfs:/user/hadoop/spark/streaming/input/") lines.saveAsTextFile("hdfs:/user/hadoop/output1") val lines = ssc.textFileStream("hdfs:/user/hadoop/spark/streaming/input
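
One way this is commonly written (a sketch, not necessarily the fix given in the thread): use the DStream output operation saveAsTextFiles, which writes one directory per batch; sc is assumed to be an existing SparkContext and the paths are examples:

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(sc, Seconds(30))

  // Monitor an HDFS directory for newly created files.
  val lines = ssc.textFileStream("hdfs:///user/hadoop/spark/streaming/input/")

  // Writes <prefix>-<batch time>.<suffix> directories, one per batch interval.
  lines.saveAsTextFiles("hdfs:///user/hadoop/output/batch", "txt")

  ssc.start()
  ssc.awaitTermination()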

RE: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-29 Thread Dave Ariens
. - Driver authenticates to Kerberos via UserGroupInformation.loginUserFromKeytab(principal, keytab) - Driver instantiates a Hadoop configuration via hdfs-site.xml and core-site.xml - Driver instantiates the Hadoop file system from a path based on the Hadoop root URI (hdfs://hadoop-cluster.site.org
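
A sketch of that driver-side sequence with placeholder principal, keytab and URI; as the rest of the thread points out, this only authenticates the JVM it runs in (the driver), so executors on Mesos still need their own login or delegation tokens:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.hadoop.security.UserGroupInformation

  // The Configuration is expected to carry the cluster's core-site.xml / hdfs-site.xml
  // with hadoop.security.authentication=kerberos.
  val conf = new Configuration()
  UserGroupInformation.setConfiguration(conf)
  UserGroupInformation.loginUserFromKeytab("someuser@SITE.ORG", "/etc/security/keytabs/someuser.keytab")

  val fs = FileSystem.get(new java.net.URI("hdfs://hadoop-cluster.site.org:8020"), conf)
  fs.listStatus(new Path("/user/someuser")).foreach(status => println(status.getPath))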

RE: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-29 Thread Dave Ariens
, thanks everyone. From: Steve Loughran [mailto:ste...@hortonworks.com] Sent: Monday, June 29, 2015 10:32 AM To: Dave Ariens Cc: Tim Chen; Marcelo Vanzin; Olivier Girardot; user@spark.apache.org Subject: Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos On 29 Jun 2015, at 14:18, Dave

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-29 Thread Steve Loughran
, and serialize/cache it for the executors to use instead of them having to instantiate their own. - Driver authenticates to Kerberos via UserGroupInformation.loginUserFromKeytab(principal, keytab) - Driver instantiates a Hadoop configuration via hdfs-site.xml and core-site.xml - Driver instantiates

spark streaming HDFS file issue

2015-06-29 Thread ravi tella
I am running a Spark Streaming example from the Learning Spark book with one change. The change I made was for streaming a file from HDFS. val lines = ssc.textFileStream("hdfs:/user/hadoop/spark/streaming/input") I ran the application a number of times and every time dropped a new file in the input

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-28 Thread Iulian Dragoș
: Accessing Kerberos Secured HDFS Resources from Spark on Mesos On Fri, Jun 26, 2015 at 3:09 PM, Dave Ariens dari...@blackberry.com wrote: Would there be any way to have the task instances in the slaves call the UGI login with a principal/keytab provided to the driver? That would only work

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-28 Thread Steve Loughran
to the request. 2. The YARN RM uses the HDFS token for the localisation, so the node managers can access the content the user has the rights to. 3. There's some other stuff related to token refresh of restarted app masters, essentially guaranteeing that even an AM restarted 3 days after the first launch

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-27 Thread Tim Chen
Sent: Friday, June 26, 2015 6:20 PM To: Dave Ariens Cc: Tim Chen; Olivier Girardot; user@spark.apache.org Subject: Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos On Fri, Jun 26, 2015 at 3:09 PM, Dave Ariens dari...@blackberry.com wrote: Would there be any way

Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Dave Ariens
having to kinit first), but not on Mesos. Is there a way to have the slaves running the tasks perform the same kerberos login before they attempt to access HDFS? Putting aside the security of Spark/Mesos and how that keytab would get distributed, I'm just looking for a working POC

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Timothy Chen
. Is there a way to have the slaves running the tasks perform the same kerberos login before they attempt to access HDFS? Putting aside the security of Spark/Mesos and how that keytab would get distributed, I'm just looking for a working POC. Is there a way to leverage the Broadcast capability

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Tim Chen
Mesos do support running containers as specific users passed to it. Thanks for chiming in, what else does YARN do with Kerberos besides keytab file and user? Tim On Fri, Jun 26, 2015 at 1:20 PM, Marcelo Vanzin van...@cloudera.com wrote: On Fri, Jun 26, 2015 at 1:13 PM, Tim Chen

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Marcelo Vanzin
runs as requesting user, or in a separate container that cannot be accessed by other applications in any way). On top of that, for HDFS and other Hadoop services, the applications themselves need to be aware that Kerberos is enabled and that they need to do certain things. For example, they need

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Olivier Girardot
login before accessing any HDFS resources. To login, they just need the name of the principal (username) and a keytab file. Then they just need to invoke the following java: import org.apache.hadoop.security.UserGroupInformation UserGroupInformation.loginUserFromKeytab(adminPrincipal

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Marcelo Vanzin
On Fri, Jun 26, 2015 at 1:13 PM, Tim Chen t...@mesosphere.io wrote: So correct me if I'm wrong, sounds like all you need is a principal user name and also a keytab file downloaded right? I'm not familiar with Mesos so don't know what kinds of features it has, but at the very least it would

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Dave Ariens
of the executor isn't hard, it's being able to call the login method before the HDFS resources are accessed. See the gist below. That login completes successfully but it's only on the driver. Once that HDFS resource is read with the Avro input format and key and the tasks are created

RE: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Dave Ariens
Hi Timothy, Because I'm running Spark on Mesos alongside a secured Hadoop cluster, I need to ensure that my tasks running on the slaves perform a Kerberos login before accessing any HDFS resources. To log in, they just need the name of the principal (username) and a keytab file

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Tim Chen
perform a Kerberos login before accessing any HDFS resources. To login, they just need the name of the principal (username) and a keytab file. Then they just need to invoke the following java: import org.apache.hadoop.security.UserGroupInformation UserGroupInformation.loginUserFromKeytab

Re: Accessing Kerberos Secured HDFS Resources from Spark on Mesos

2015-06-26 Thread Marcelo Vanzin
On Fri, Jun 26, 2015 at 3:09 PM, Dave Ariens dari...@blackberry.com wrote: Would there be any way to have the task instances in the slaves call the UGI login with a principal/keytab provided to the driver? That would only work with a very small number of executors. If you have many login
