Re: Spark dataframe hdfs vs s3

2020-05-30 Thread Anwar AliKhan
Optimisation of Spark applications Apache Spark is an in-memory data processing tool widely used in companies to deal with Big Data issues. Running a Spark application in production requires user-defined resources. This article presents several Spark

Re: Spark dataframe hdfs vs s3

2020-05-30 Thread Dark Crusader
Thanks all for the replies. I am switching to hdfs since it seems like an easier solution. To answer some of your questions, my hdfs space is a part of my nodes I use for computation on spark. From what I understand, this helps because of the data locality advantage. Which means that there is

Re: Spark dataframe hdfs vs s3

2020-05-29 Thread Jörn Franke
Maybe some aws network optimized instances with higher bandwidth will improve the situation. > On 27.05.2020 at 19:51, Dark Crusader wrote: > >  > Hi Jörn, > > Thanks for the reply. I will try to create an easier example to reproduce the > issue. > > I will also try your suggestion to look

Re: Spark dataframe hdfs vs s3

2020-05-29 Thread randy clinton
finally persist outside of HDFS. On Fri, May 29, 2020 at 2:09 PM Bin Fan wrote: > Try to deploy Alluxio as a caching layer on top of S3, providing Spark a > similar HDFS interface? > Like in this article: > > https://www.alluxio.io/blog/accelerate-spark-and-hive-jobs-o

Re: Spark dataframe hdfs vs s3

2020-05-29 Thread Bin Fan
Try to deploy Alluxio as a caching layer on top of S3, providing Spark a similar HDFS interface? Like in this article: https://www.alluxio.io/blog/accelerate-spark-and-hive-jobs-on-aws-s3-by-10x-with-alluxio-tiered-storage/ On Wed, May 27, 2020 at 6:52 PM Dark Crusader wrote: > Hi Ra
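A read-path sketch for the Alluxio suggestion above; the master hostname, mount path, and input path are placeholders, and this assumes the Alluxio client jar is already on Spark's classpath:

```scala
// With Alluxio mounted over the S3 bucket, Spark reads through the cache
// tier instead of hitting S3 directly. 19998 is Alluxio's default master
// RPC port; the host and paths below are placeholders.
val df = spark.read.parquet("alluxio://alluxio-master:19998/mnt/s3/input/")
df.count()
```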

Re: Spark dataframe hdfs vs s3

2020-05-28 Thread Kanwaljit Singh
You can’t play much if it is a streaming job. But in case of batch jobs, sometimes teams will copy their S3 data to HDFS in prep for the next run :D From: randy clinton Date: Thursday, May 28, 2020 at 5:50 AM To: Dark Crusader Cc: Jörn Franke , user Subject: Re: Spark dataframe hdfs vs s3

Re: Spark dataframe hdfs vs s3

2020-05-28 Thread randy clinton
See if this helps: "That is to say, on a per node basis, HDFS can yield 6X higher read throughput than S3. Thus, given that S3 is 10x cheaper than HDFS, we find that S3 is almost 2x better compared to HDFS on performance per dollar."

Re: Spark dataframe hdfs vs s3

2020-05-27 Thread Dark Crusader
Hi Randy, Yes, I'm using parquet on both S3 and hdfs. On Thu, 28 May, 2020, 2:38 am randy clinton, wrote: > Is the file Parquet on S3 or is it some other file format? > > In general I would assume that HDFS read/writes are more performant for > spark jobs. > > For instance, consider how well

Re: Spark dataframe hdfs vs s3

2020-05-27 Thread randy clinton
Is the file Parquet on S3 or is it some other file format? In general I would assume that HDFS read/writes are more performant for spark jobs. For instance, consider how well partitioned your HDFS file is vs the S3 file. On Wed, May 27, 2020 at 1:51 PM Dark Crusader wrote: > Hi Jörn, > >

Re: Spark dataframe hdfs vs s3

2020-05-27 Thread Dark Crusader
Hi Jörn, Thanks for the reply. I will try to create an easier example to reproduce the issue. I will also try your suggestion to look into the UI. Can you guide me on what I should be looking for? I was already using the s3a protocol to compare the times. My hunch is that multiple reads from S3

Re: Spark dataframe hdfs vs s3

2020-05-27 Thread Jörn Franke
Have you looked in the Spark UI to see why this is the case? S3 reading can take more time - it depends also on which s3 url you are using: s3a vs s3n vs s3. It could help, after some calculation, to persist in-memory or on HDFS. You can also initially load from S3, store on HDFS, and work from there.
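The suggestions above (prefer the s3a connector, persist after the first S3 read, or stage S3 data onto HDFS once) can be sketched as follows; the bucket name and paths are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object S3ToHdfsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("s3-vs-hdfs").getOrCreate()

    // Prefer the s3a:// scheme; s3n:// and s3:// are older, slower connectors.
    val df = spark.read.parquet("s3a://my-bucket/input/") // placeholder bucket

    // Option 1: persist after the first S3 read so later stages reuse
    // the cached copy instead of re-reading from S3.
    df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count() // materialize the cache

    // Option 2: stage the data onto HDFS once and work from there.
    df.write.mode("overwrite").parquet("hdfs:///staging/input/")
    val staged = spark.read.parquet("hdfs:///staging/input/")
    staged.count()

    spark.stop()
  }
}
```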

Spark dataframe hdfs vs s3

2020-05-27 Thread Dark Crusader
Hi all, I am reading data from hdfs in the form of parquet files (around 3 GB) and running an algorithm from the spark ml library. If I create the same spark dataframe by reading data from S3, the same algorithm takes considerably more time. I don't understand why this is happening. Is this a

Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-28 Thread Miguel Morales
If you're using Kubernetes you can group spark and hdfs to run in the same stack. Meaning they'll basically run in the same network space and share ips. Just gotta make sure there's no port conflicts. On Wed, Dec 28, 2016 at 5:07 AM, Karamba <phantom...@web.de> wrote: > > Good

Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-28 Thread Karamba
> >>> Although the Spark task scheduler is aware of rack-level data locality, it >>> seems that only YARN implements the support for it. >> This explains why the script that I configured in core-site.xml >> topology.script.file.name is not called in by the sp

Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-28 Thread Miguel Morales
container. > But at time of reading from hdfs in a spark program, the script is > called in my hdfs namenode container. > >> However, node-level locality can still work for Standalone. > > I have a couple of physical hosts that run spark and hdfs docker > containers. How does spar

Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-28 Thread Karamba
spark container. But at time of reading from hdfs in a spark program, the script is called in my hdfs namenode container. > However, node-level locality can still work for Standalone. I have a couple of physical hosts that run spark and hdfs docker containers. How does spark standalone knows t

Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-27 Thread Sun Rui
, which means executors are available on a subset of the cluster nodes? > On Dec 27, 2016, at 01:39, Karamba <phantom...@web.de> wrote: > > Hi, > > I am running a couple of docker hosts, each with an HDFS and a spark > worker in a spark standalone cluster. > In

[Spark 2.0.2 HDFS]: no data locality

2016-12-26 Thread Karamba
Hi, I am running a couple of docker hosts, each with an HDFS and a spark worker in a spark standalone cluster. In order to get data locality awareness, I would like to configure Racks for each host, so that a spark worker container knows from which hdfs node container it should load its data

Spark + Secure HDFS Cluster

2016-04-12 Thread Vijay Srinivasaraghavan
Hello, I am trying to understand Spark support for accessing a secure HDFS cluster. My plan is to deploy Spark on Mesos, which will access a secure HDFS cluster running elsewhere in the network. I am trying to understand how much support exists as of now.  My understanding is Spark as of now

Data Security on Spark-on-HDFS

2015-08-31 Thread Daniel Schulz
data? So is there a shortcoming when using Spark because the JVM processes are already running and therefore the launching user is omitted by Spark when accessing data residing on HDFS? Or is Spark only reading/writing data that the launching user had access to? What about

Re: Data Security on Spark-on-HDFS

2015-08-31 Thread Steve Loughran
ing > and therefore the launching user is omitted by Spark when accessing data > residing on HDFS? Or is Spark only reading/writing data, that the user had > access to, that launched this Thread? in a kerberized YARN cluster, the processes run as the specific user submitting the job (or wh

Spark and HDFS

2015-07-15 Thread Jeskanen, Elina
I have Spark 1.4 on my local machine and I would like to connect to our local 4-node Cloudera cluster. But how? In the example it says text_file = spark.textFile(hdfs://...), but can you advise me on where to get this hdfs://... address? Thanks! Elina

Re: Spark and HDFS

2015-07-15 Thread Marcelo Vanzin
On Wed, Jul 15, 2015 at 5:36 AM, Jeskanen, Elina elina.jeska...@cgi.com wrote: I have Spark 1.4 on my local machine and I would like to connect to our local 4 nodes Cloudera cluster. But how? In the example it says text_file = spark.textFile(hdfs://...), but can you advise me in where to

Re: Spark and HDFS

2015-07-15 Thread ayan guha
Assuming you run spark locally (i.e. either local mode or a standalone cluster on your local m/c): 1. You need to have hadoop binaries locally. 2. You need to have hdfs-site on the Spark classpath of your local m/c. I would suggest you start off with local files to play around. If you need to run spark
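The hdfs://... prefix the thread asks about is the value of fs.defaultFS from the cluster's core-site.xml (on CDH, typically hdfs://<namenode-host>:8020). A minimal sketch, with the hostname and path as placeholders:

```scala
// The URI scheme, host, and port come from fs.defaultFS in the cluster's
// core-site.xml; 8020 is the usual CDH NameNode port. Placeholders below.
val textFile = sc.textFile("hdfs://namenode.example.com:8020/user/elina/data.txt")
println(textFile.count())
```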

Re: Spark and HDFS

2015-07-15 Thread Naveen Madhire
: Assuming you run spark locally (ie either local mode or standalone cluster on your localm/c) 1. You need to have hadoop binaries locally 2. You need to have hdfs-site on Spark Classpath of your local m/c I would suggest you to start off with local files to play around. If you need to run spark

spark streaming HDFS file issue

2015-06-29 Thread ravi tella
I am running a spark streaming example from the Learning Spark book with one change. The change I made was for streaming a file from HDFS. val lines = ssc.textFileStream("hdfs:/user/hadoop/spark/streaming/input") I ran the application a number of times and every time dropped a new file in the input
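A minimal version of this pattern with the monitored path as a full URI (the namenode host is a placeholder); note that Spark Streaming only detects files that appear atomically in the monitored directory, e.g. written elsewhere and then moved in:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("hdfs-file-stream")
val ssc = new StreamingContext(conf, Seconds(10))

// Use a full hdfs:// URI; files must be moved/renamed into this directory
// after they are fully written, or the stream will not pick them up.
val lines = ssc.textFileStream("hdfs://namenode:8020/user/hadoop/spark/streaming/input")
lines.count().print()

ssc.start()
ssc.awaitTermination()
```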

Spark and HDFS ( Worker and Data Nodes Combination )

2015-06-22 Thread Ashish Soni
Hi All, What is the best way to install a Spark cluster alongside a Hadoop cluster? Any recommendation for the below deployment topology would be a great help. *Also, is it necessary to put the Spark Worker on DataNodes, as when it reads blocks from HDFS they will be local to the Server / Worker, or

Re: Spark and HDFS ( Worker and Data Nodes Combination )

2015-06-22 Thread Akhil Das
Option 1 should be fine, Option 2 would bound a lot on network as the data increase in time. Thanks Best Regards On Mon, Jun 22, 2015 at 5:59 PM, Ashish Soni asoni.le...@gmail.com wrote: Hi All , What is the Best Way to install and Spark Cluster along side with Hadoop Cluster , Any

Re: Spark and HDFS ( Worker and Data Nodes Combination )

2015-06-22 Thread ayan guha
I have a basic question: how does spark assign partitions to an executor? Does it respect data locality? Does this behaviour depend on the cluster manager, i.e. yarn vs standalone? On 22 Jun 2015 22:45, Akhil Das ak...@sigmoidanalytics.com wrote: Option 1 should be fine, Option 2 would bound a lot on network as

Spark + Mesos + HDFS resource split

2015-04-27 Thread Ankur Chauhan
Hi, I am building a mesos cluster for the purposes of using it to run spark workloads (in addition to other frameworks). I am under the impression that it is preferable/recommended to run hdfs datanode process, spark slave on the same physical node

Re: Running spark over HDFS

2015-04-21 Thread madhvi
and have installed spark cluster over my system having hadoop cluster.I want to process data stored in HDFS through spark. When I am running code in eclipse it is giving the following warning repeatedly: scheduler.TaskSchedulerImpl: Initial

Re: Running spark over HDFS

2015-04-21 Thread Akhil Das
installed spark cluster over my system having hadoop cluster.I want to process data stored in HDFS through spark. When I am running code in eclipse it is giving the following warning repeatedly: scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure

Re: Running spark over HDFS

2015-04-20 Thread SURAJ SHETH
Regards On Mon, Apr 20, 2015 at 12:22 PM, madhvi madhvi.gu...@orkash.com wrote: Hi All, I am new to spark and have installed spark cluster over my system having hadoop cluster.I want to process data stored in HDFS through spark. When I am running code in eclipse it is giving the following

Re: Running spark over HDFS

2015-04-20 Thread Archit Thakur
There are a lot of similar problems shared and resolved by users on this same portal. I have been part of those discussions before. Search those, please try them, and let us know if you still face problems. Thanks and Regards, Archit Thakur. On Mon, Apr 20, 2015 at 3:05 PM, madhvi

Re: Running spark over HDFS

2015-04-20 Thread madhvi
On Monday 20 April 2015 03:18 PM, Archit Thakur wrote: There are lot of similar problems shared and resolved by users on this same portal. I have been part of those discussions before, Search those, Please Try them and let us know, if you still face problems. Thanks and Regards, Archit

Re: Running spark over HDFS

2015-04-20 Thread Akhil Das
wrote: Hi All, I am new to spark and have installed spark cluster over my system having hadoop cluster.I want to process data stored in HDFS through spark. When I am running code in eclipse it is giving the following warning repeatedly: scheduler.TaskSchedulerImpl: Initial job has not accepted

Re: Running spark over HDFS

2015-04-20 Thread madhvi
On Monday 20 April 2015 02:52 PM, SURAJ SHETH wrote: Hi Madhvi, I think the memory requested by your job, i.e. 2.0 GB is higher than what is available. Please request for 256 MB explicitly while creating Spark Context and try again. Thanks and Regards, Suraj Sheth Tried the same but still

Re: Running spark over HDFS

2015-04-20 Thread madhvi
: Hi All, I am new to spark and have installed spark cluster over my system having hadoop cluster.I want to process data stored in HDFS through spark. When I am running code in eclipse it is giving the following warning repeatedly: scheduler.TaskSchedulerImpl: Initial job has

Re: Running spark over HDFS

2015-04-20 Thread madhvi
installed spark cluster over my system having hadoop cluster.I want to process data stored in HDFS through spark. When I am running code in eclipse it is giving the following warning repeatedly: scheduler.TaskSchedulerImpl: Initial job has not accepted any

Re: Running spark over HDFS

2015-04-20 Thread SURAJ SHETH
Hi Madhvi, I think the memory requested by your job, i.e. 2.0 GB is higher than what is available. Please request for 256 MB explicitly while creating Spark Context and try again. Thanks and Regards, Suraj Sheth
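The suggestion above, requesting 256 MB explicitly when creating the SparkContext, looks like this in Spark 1.x; the master URL is a placeholder:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Ask for less executor memory than the cluster has free; the 256m figure
// follows the suggestion above. Master URL below is a placeholder.
val conf = new SparkConf()
  .setAppName("hdfs-test")
  .setMaster("spark://master-host:7077")
  .set("spark.executor.memory", "256m")
val sc = new SparkContext(conf)
```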

Running spark over HDFS

2015-04-20 Thread madhvi
Hi All, I am new to spark and have installed a spark cluster over my system having a hadoop cluster. I want to process data stored in HDFS through spark. When I am running code in eclipse it is giving the following warning repeatedly: scheduler.TaskSchedulerImpl: Initial job has not accepted any

Re: Running spark over HDFS

2015-04-20 Thread Akhil Das
and have installed spark cluster over my system having hadoop cluster.I want to process data stored in HDFS through spark. When I am running code in eclipse it is giving the following warning repeatedly: scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI

Spark on HDFS vs. Lustre vs. other file systems - formal research and performance evaluation

2015-03-13 Thread Edmon Begoli
All, Does anyone have any reference to a publication or other, informal sources (blogs, notes), showing performance of Spark on HDFS vs. other shared (Lustre, etc.) or other file system (NFS). I need this for formal performance research. We are currently doing a research into this on a very

Re: Questions about Spark and HDFS co-location

2015-01-09 Thread Sean Owen
on the same machine as a Spark worker, will it read directly off of disk, or does that data have to travel through the network in some way? Is there a distinct advantage to putting HDFS and Spark on the same box if it is possible or, due to the way blocks are distributed about a cluster, are we so likely

Re: Questions about Spark and HDFS co-location

2015-01-09 Thread Ted Yu
in a HDFS DataNode on the same machine as a Spark worker, will it read directly off of disk, or does that data have to travel through the network in some way? Is there a distinct advantage to putting HDFS and Spark on the same box if it is possible or, due to the way blocks are distributed

Re: Questions about Spark and HDFS co-location

2015-01-09 Thread Andrew Ash
and that file (or most of its blocks) is located in a HDFS DataNode on the same machine as a Spark worker, will it read directly off of disk, or does that data have to travel through the network in some way? Is there a distinct advantage to putting HDFS and Spark on the same box

Questions about Spark and HDFS co-location

2015-01-09 Thread zfry
by a Spark executor and that file (or most of its blocks) is located in a HDFS DataNode on the same machine as a Spark worker, will it read directly off of disk, or does that data have to travel through the network in some way? Is there a distinct advantage to putting HDFS and Spark on the same box

Read a HDFS file from Spark using HDFS API

2014-11-14 Thread rapelly kartheek
Hi, I am trying to read a HDFS file from Spark scheduler code. I could find how to write hdfs read/writes in java. But I need to access hdfs from spark using scala. Can someone please help me in this regard.

Re: Read a HDFS file from Spark using HDFS API

2014-11-14 Thread Akhil Das
Like this? val file = sc.textFile("hdfs://localhost:9000/sigmoid/input.txt") Thanks Best Regards On Fri, Nov 14, 2014 at 9:02 PM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, I am trying to read a HDFS file from Spark scheduler code. I could find how to write hdfs read/writes in java

Re: Read a HDFS file from Spark using HDFS API

2014-11-14 Thread Akhil Das
trying to read a HDFS file from Spark scheduler code. I could find how to write hdfs read/writes in java. But I need to access hdfs from spark using scala. Can someone please help me in this regard.

Re: Read a HDFS file from Spark using HDFS API

2014-11-14 Thread Akhil Das
...@gmail.com] *Sent:* Friday, November 14, 2014 9:42 AM *To:* Akhil Das; user@spark.apache.org *Subject:* Re: Read a HDFS file from Spark using HDFS API No. I am not accessing hdfs from either shell or a spark application. I want to access from spark Scheduler code. I face an error when I

Re: Read a HDFS file from Spark using HDFS API

2014-11-14 Thread rapelly kartheek
@spark.apache.org *Subject:* Re: Read a HDFS file from Spark using HDFS API No. I am not accessing hdfs from either shell or a spark application. I want to access from spark Scheduler code. I face an error when I use sc.textFile() as SparkContext wouldn't have been created yet. So, error says: sc
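When no SparkContext is available, HDFS can be read directly with the Hadoop FileSystem client API from Scala. A sketch under that assumption; the namenode URI and file path are placeholders:

```scala
import java.io.{BufferedReader, InputStreamReader}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Read an HDFS file without a SparkContext, using the Hadoop client API.
val conf = new Configuration()
conf.set("fs.defaultFS", "hdfs://namenode:8020") // placeholder namenode

val fs = FileSystem.get(conf)
val in = new BufferedReader(
  new InputStreamReader(fs.open(new Path("/user/kartheek/input.txt"))))
try {
  // Print the file line by line.
  Iterator.continually(in.readLine()).takeWhile(_ != null).foreach(println)
} finally {
  in.close()
}
```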

Re: Read a HDFS file from Spark using HDFS API

2014-11-14 Thread rapelly kartheek
*From:* rapelly kartheek [mailto:kartheek.m...@gmail.com] *Sent:* Friday, November 14, 2014 9:42 AM *To:* Akhil Das; user@spark.apache.org *Subject:* Re: Read a HDFS file from Spark using HDFS API No. I am not accessing hdfs from either shell or a spark application. I want to access from

Spark using HDFS data [newb]

2014-10-23 Thread matan
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: spark-ec2 - HDFS doesn't start on AWS EC2 cluster

2014-10-08 Thread mrm
They reverted to a previous version of the spark-ec2 script and things are working again!

Re: spark-ec2 - HDFS doesn't start on AWS EC2 cluster

2014-10-08 Thread Nicholas Chammas
and things are working again!

Re: spark-ec2 - HDFS doesn't start on AWS EC2 cluster

2014-10-08 Thread Jan Warchoł
hasn’t changed, obviously. On Wed, Oct 8, 2014 at 12:20 PM, mrm ma...@skimlinks.com wrote: They reverted to a previous version of the spark-ec2 script and things are working again!

Re: spark-ec2 - HDFS doesn't start on AWS EC2 cluster

2014-10-08 Thread Akhil Das
They reverted to a previous version of the spark-ec2 script and things are working again!

Spark Stream + HDFS Append

2014-08-24 Thread Dean Chen
We are using HDFS for log storage, where logs are flushed to HDFS every minute, with a new file created for each hour. We would like to consume these logs using spark streaming.  The docs state that new HDFS files will be picked up, but does Spark Streaming support HDFS appends? — Dean Chen

Re: Spark Stream + HDFS Append

2014-08-24 Thread Tobias Pfeiffer
will be picked up, but does Spark Streaming support HDFS appends? I don't think so. The docs at http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.streaming.StreamingContext say that even for new files, Files must be written to the monitored directory by 'moving' them from another
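The "moving" requirement above suggests a staging pattern: land the finished log in a temporary directory first, then rename it into the monitored directory. A sketch with the hdfs CLI; both paths and the filename are placeholders:

```shell
# Land the completed hourly log in a staging directory first...
hdfs dfs -put access-2014082413.log /logs/_staging/

# ...then move it into the directory Spark Streaming monitors. The rename
# is atomic within HDFS, so the stream only ever sees a complete file.
hdfs dfs -mv /logs/_staging/access-2014082413.log /logs/incoming/
```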

Re: error when spark access hdfs with Kerberos enable

2014-07-09 Thread Sandy Ryza
steal and get to your data. -Sandy On Tue, Jul 8, 2014 at 7:28 PM, Cheney Sun sun.che...@gmail.com wrote: Hi Sandy, We are also going to grep data from a security enabled (with kerberos) HDFS in our Spark application. Per you answer, we have to switch Spark on YARN to achieve this. We

error when spark access hdfs with Kerberos enable

2014-07-08 Thread 许晓炜
Hi all, I encounter a strange issue when using spark 1.0 to access hdfs with Kerberos. I just have one spark test node, and HADOOP_CONF_DIR is set to the location containing the hdfs configuration files (hdfs-site.xml and core-site.xml). When I use spark-shell in local mode, the access

Re: error when spark access hdfs with Kerberos enable

2014-07-08 Thread Marcelo Vanzin
Someone might be able to correct me if I'm wrong, but I don't believe standalone mode supports kerberos. You'd have to use Yarn for that. On Tue, Jul 8, 2014 at 1:40 AM, 许晓炜 xuxiao...@qiyi.com wrote: Hi all, I encounter a strange issue when using spark 1.0 to access hdfs with Kerberos I
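Under that constraint, the usual approach is to authenticate with kinit and submit in YARN mode so the delegation tokens for the kerberized HDFS are obtained at submit time. A sketch, with the principal, realm, class, and jar as placeholders:

```shell
# Obtain a Kerberos ticket for the submitting user (placeholder principal).
kinit xuxiaowei@EXAMPLE.COM

# HADOOP_CONF_DIR must point at the cluster's hdfs-site.xml / core-site.xml.
export HADOOP_CONF_DIR=/etc/hadoop/conf

# Submit on YARN (yarn-cluster was the Spark 1.x master value); the
# application picks up the HDFS delegation tokens from the Kerberos ticket.
spark-submit --master yarn-cluster --class com.example.App app.jar
```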

Re: error when spark access hdfs with Kerberos enable

2014-07-08 Thread Cheney Sun
Hi Sandy, We are also going to grep data from a security enabled (with kerberos) HDFS in our Spark application. Per your answer, we have to switch to Spark on YARN to achieve this. We plan to deploy a different Hadoop cluster (with YARN) only to run Spark. Is it necessary to deploy YARN with security

Re: Spark on HBase vs. Spark on HDFS

2014-05-23 Thread Mayur Rustagi
Also, I am unsure if Spark on HBase leverages locality. When you cache processed data, do you see node_local jobs in the process list? Spark on HDFS leverages locality quite well and can really boost performance by 3-4x in my experience. If you are loading all your data from HBase to spark then you

Spark on HBase vs. Spark on HDFS

2014-05-22 Thread Limbeck, Philip
of Spark on HDFS? Best Philip Automic Software GmbH, Hauptstrasse 3C, 3012 Wolfsgraben Firmenbuchnummer/Commercial Register No. 275184h Firmenbuchgericht/Commercial Register Court: Landesgericht St. Poelten This email (including any attachments) may contain information which is privileged

Re: Spark on HBase vs. Spark on HDFS

2014-05-22 Thread Nick Pentreath
of file structure and the like. What is the true advantage of Spark on HBase in favor of Spark on HDFS? Best Philip

Re: Applications for Spark on HDFS

2014-03-11 Thread Sandy Ryza
...@gmail.com wrote: Hello Folks, I was wondering if anyone had experience placing application jars for Spark onto HDFS. Currently I have been distributing the jars manually and would love to source the jar via HDFS a la distributed caching with MR. Any ideas? Regards, Paul