Re: Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-21 Thread Prem Sahoo
I am looking for writer/committer optimization which can make the Spark write faster. On Tue, May 21, 2024 at 9:15 PM eab...@163.com wrote: > Hi, > I think you should write to HDFS then copy file (parquet or orc) from > HDFS to MinIO. > > -- >

Re: Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-21 Thread eab...@163.com
Hi, I think you should write to HDFS then copy file (parquet or orc) from HDFS to MinIO. eabour From: Prem Sahoo Date: 2024-05-22 00:38 To: Vibhor Gupta; user Subject: Re: EXT: Dual Write to HDFS and MinIO in faster way On Tue, May 21, 2024 at 6:58 AM Prem Sahoo wrote: Hello Vibhor
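For illustration, a minimal Scala sketch of that suggestion - write once to HDFS, then copy the committed files to MinIO through the s3a connector. The endpoint, bucket and paths are placeholders, and the s3a credentials are assumed to be configured elsewhere:
  import org.apache.hadoop.fs.{FileUtil, Path}
  import org.apache.spark.sql.SparkSession
  val spark = SparkSession.builder().appName("hdfs-then-minio").getOrCreate()
  val df = spark.read.parquet("hdfs:///data/input")              // assumed input
  // 1. Write once to HDFS (data-local and fast).
  val hdfsOut = "hdfs:///data/output/parquet"
  df.write.mode("overwrite").parquet(hdfsOut)
  // 2. Copy the committed files from HDFS to MinIO via s3a.
  val conf = spark.sparkContext.hadoopConfiguration
  conf.set("fs.s3a.endpoint", "http://minio:9000")               // assumed MinIO endpoint
  conf.set("fs.s3a.path.style.access", "true")
  val src = new Path(hdfsOut)
  val dst = new Path("s3a://my-bucket/output/parquet")           // assumed bucket
  FileUtil.copy(src.getFileSystem(conf), src, dst.getFileSystem(conf), dst, false, conf)
  // Note: FileUtil.copy runs single-threaded on the driver; hadoop distcp is the usual
  // choice when the output is large.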

Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-21 Thread Prem Sahoo
help me in scenario 2 ? > How to make spark write to MinIO faster ? > Sent from my iPhone > > On May 21, 2024, at 1:18 AM, Vibhor Gupta > wrote: > > > > Hi Prem, > > > > You can try to write to HDFS then read from HDFS and write to MinIO. > > >

Re: [EXTERNAL] Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-12-11 Thread Eugene Miretsky
be the >> default. What's the use case for uploading the local pyspark.zip every >> time? >> 2) It seems like the localConfigs are meant to be copied every time (code) >> what's the use case for that? Why not just use the cluster config? >> >> >> >>

Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-12-11 Thread Mich Talebzadeh
e (code) > what's the use case for that? Why not just use the cluster config? > > > > On Sun, Dec 10, 2023 at 1:15 PM Eugene Miretsky wrote: > >> Thanks Mich, >> >> Tried this and still getting >> INF Client: "Uploading resource >> file:/opt/spark

Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-12-10 Thread Eugene Miretsky
use the cluster config? On Sun, Dec 10, 2023 at 1:15 PM Eugene Miretsky wrote: > Thanks Mich, > > Tried this and still getting > INF Client: "Uploading resource > file:/opt/spark/spark-3.5.0-bin-hadoop3/python/lib/pyspark.zip -> > hdfs:/". It is also doing it for (

Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-12-10 Thread Eugene Miretsky
Thanks Mich, Tried this and still getting INF Client: "Uploading resource file:/opt/spark/spark-3.5.0-bin-hadoop3/python/lib/pyspark.zip -> hdfs:/". It is also doing it for (py4j-0.10.9.7-src.zip and __spark_conf__.zip). It is working now because I enabled direct access to HDFS to

Re: Spark-submit without access to HDFS

2023-11-17 Thread Mich Talebzadeh
Hi, How are you submitting your spark job from your client? Your files can either be on HDFS or HCFS such as gs, s3 etc. With reference to --py-files hdfs://yarn-master-url hdfs://foo.py', I assume you want your spark-submit --verbose \ --deploy-mode cluster

Re: Spark-submit without access to HDFS

2023-11-16 Thread Jörn Franke
would recommend against it though for various reasons, such as reliability) On 15.11.2023 at 22:33, Eugene Miretsky wrote: Hey All, We are running Pyspark spark-submit from a client outside the cluster. The client has network connectivity only to the Yarn Master, not the HDFS Datanodes. How can we

Re: Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-11-15 Thread eab...@163.com
Hi Eugene, As the logs indicate, when executing spark-submit, Spark will package and upload spark/conf to HDFS, along with uploading spark/jars. These files are uploaded to HDFS unless you specify uploading them to another OSS. To do so, you'll need to modify the configuration in hdfs

Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-11-15 Thread Eugene Miretsky
:32 PM eab...@163.com wrote: > Hi Eugene, > I think you should Check if the HDFS service is running properly. From > the logs, it appears that there are two datanodes in HDFS, > but none of them are healthy. > Please investigate the reasons why the datanodes are not funct

Re: Spark-submit without access to HDFS

2023-11-15 Thread eab...@163.com
Hi Eugene, I think you should check if the HDFS service is running properly. From the logs, it appears that there are two datanodes in HDFS, but none of them are healthy. Please investigate the reasons why the datanodes are not functioning properly. It seems that the issue might be due

Spark-submit without access to HDFS

2023-11-15 Thread Eugene Miretsky
Hey All, We are running Pyspark spark-submit from a client outside the cluster. The client has network connectivity only to the Yarn Master, not the HDFS Datanodes. How can we submit the jobs? The idea would be to preload all the dependencies (job code, libraries, etc) to HDFS, and just submit

Re: [External Email] Re: About /mnt/hdfs/current/BP directories

2023-09-08 Thread Nebi Aydin
Usually the job never reaches that point; it fails during shuffle. And storage memory and executor memory when it failed were usually low. On Fri, Sep 8, 2023 at 16:49 Jack Wells wrote: > Assuming you’re not writing to HDFS in your code, Spark can spill to HDFS > if it runs out of memory on a per-ex

Re: [External Email] Re: About /mnt/hdfs/current/BP directories

2023-09-08 Thread Jack Wells
Assuming you’re not writing to HDFS in your code, Spark can spill to HDFS if it runs out of memory on a per-executor basis. This could happen when evaluating a cache operation like you have below or during shuffle operations in joins, etc. You might try to increase executor memory, tune shuffle

Re: [External Email] Re: About /mnt/hdfs/current/BP directories

2023-09-08 Thread Nebi Aydin
> > On Sep 8, 2023 at 10:59:59, Nebi Aydin > wrote: > >> Hi all, >> I am using spark on EMR to process data. Basically i read data from AWS >> S3 and do the transformation and post transformation i am loading/writing >> data to s3. >> >> Recently we

Re: About /mnt/hdfs/current/BP directories

2023-09-08 Thread Jack Wells
a > to s3. > > Recently we have found that hdfs(/mnt/hdfs) utilization is going too high. > > I disabled `yarn.log-aggregation-enable` by setting it to False. > > I am not writing any data to hdfs(/mnt/hdfs) however is that spark is > creating blocks and writing data into it

About /mnt/hdfs/current/BP directories

2023-09-08 Thread Nebi Aydin
Hi all, I am using Spark on EMR to process data. Basically I read data from AWS S3, do the transformation, and post-transformation I am loading/writing data to S3. Recently we have found that hdfs (/mnt/hdfs) utilization is going too high. I disabled `yarn.log-aggregation-enable` by setting

Spark Thrift Server issue with external HDFS table

2023-02-01 Thread Kalhara Gurugamage
Hello Team, We are using the Spark 3.3.0 version. We've created external HDFS tables using Beeline via the Spark Thrift Server. Here we have multiple parquet files in one partition that need to be attached to the external HDFS table. Note that the HDFS data is stored in a distributed setup

Fwd: [Spark Standalone Mode] How to read from kerberised HDFS in spark standalone mode

2023-01-31 Thread Wei Yan
Glad to hear that! And hope it can help any other guys facing the same problem. -- Forwarded message - From: Bansal, Jaimita Date: Wed, Feb 1, 2023, 03:15 Subject: RE: [Spark Standalone Mode] How to read from kerberised HDFS in spark standalone mode To: Wei Yan Cc: Chittajallu, Rajiv

[Spark Standalone Mode] How to read from kerberised HDFS in spark standalone mode

2023-01-19 Thread Bansal, Jaimita
Hi Spark Team, We are facing an issue when trying to read from HDFS via Spark running in a standalone cluster. The issue comes from the executor node not being able to authenticate. It is using auth:SIMPLE when we have actually set up auth as Kerberos. Could you please help in resolving this? Caused

Re: Spark equivalent to hdfs groups

2022-09-07 Thread phiroc
Many thanks, Sean. - Original Message - From: "Sean Owen" To: phi...@free.fr Cc: "User" Sent: Wednesday, September 7, 2022 17:05:55 Subject: Re: Spark equivalent to hdfs groups No, because this is a storage concept, and Spark is not a storage system. You would appeal to

Re: Spark equivalent to hdfs groups

2022-09-07 Thread Sean Owen
No, because this is a storage concept, and Spark is not a storage system. You would appeal to tools and interfaces that the storage system provides, like hdfs. Where or how the hdfs binary is available depends on how and where you deploy Spark; it would be available on a Hadoop cluster. It's just

Re: Spark equivalent to hdfs groups

2022-09-07 Thread phiroc
Hi Sean, I'm talking about HDFS groups. On Linux, you can type "hdfs groups user1" to get the list of the groups user1 belongs to. In Zeppelin/Spark, the hdfs executable is not accessible. As a result, I wondered if there was a class in Spark (eg. Security or ACL) which would let
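Not a Spark API, but the Hadoop client classes already on the Zeppelin/Spark classpath expose the group-mapping service; a hedged sketch (it answers from the group mapping configured on the client, so it only matches `hdfs groups user1` when that mapping is the same one the NameNode uses):
  import scala.collection.JavaConverters._
  import org.apache.hadoop.security.Groups
  val hadoopConf = spark.sparkContext.hadoopConfiguration        // reuse Spark's Hadoop config
  val groups = Groups.getUserToGroupsMappingService(hadoopConf)
  // Roughly the programmatic counterpart of `hdfs groups user1`.
  println(groups.getGroups("user1").asScala.mkString(", "))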

Re: Spark equivalent to hdfs groups

2022-09-07 Thread Sean Owen
Spark isn't a storage system or user management system; no, there is no notion of groups (groups for what?) On Wed, Sep 7, 2022 at 8:36 AM wrote: > Hello, > is there a Spark equivalent to "hdfs groups <user>"? > Many

Spark equivalent to hdfs groups

2022-09-07 Thread phiroc
Hello, is there a Spark equivalent to "hdfs groups <user>"? Many thanks. Philippe - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Choosing architecture for on-premise Spark & HDFS on Kubernetes cluster

2021-11-25 Thread JHI Star
Thanks, I'll have a closer look at GKE and compare it with what some other sites running similar setups have used (OpenStack). Well, no, I don't envisage any public cloud integration. There is no plan to use Hive, just PySpark using HDFS! On Wed, Nov 24, 2021 at 10:31 AM Mich Talebzadeh wrote

Re: Choosing architecture for on-premise Spark & HDFS on Kubernetes cluster

2021-11-24 Thread Mich Talebzadeh
n > on-premise Spark & HDFS on Kubernetes cluster.. > > Kubernetes is really a cloud-native technology. However, the > cloud-native concept does not exclude the use of on-premises infrastructure > in cases where it makes sense. So the question is are you going to use a > mesh

Re: Choosing architecture for on-premise Spark & HDFS on Kubernetes cluster

2021-11-23 Thread Mich Talebzadeh
OK to your point below "... We are going to deploy 20 physical Linux servers for use as an on-premise Spark & HDFS on Kubernetes cluster.. Kubernetes is really a cloud-native technology. However, the cloud-native concept does not exclude the use of on-premises infrastructure in ca

Choosing architecture for on-premise Spark & HDFS on Kubernetes cluster

2021-11-23 Thread JHI Star
We are going to deploy 20 physical Linux servers for use as an on-premise Spark & HDFS on Kubernetes cluster. My question is: within this architecture, is it best to have the pods run directly on bare metal or under VMs or system containers like LXC and/or under an on-premise instance of somet

Accessing a kerberized HDFS using Spark on Openshift

2021-10-13 Thread Gal Shinder
/spark-py:v3.1.1), Spark runs fine but I'm unable to connect to a kerberized Cloudera HDFS. I've tried the examples outlined in the security documentation (https://github.com/apache/spark/blob/master/docs/security.md#secure-interaction-with-kubernetes) and numerous other combinations but nothing

Re: How to read multiple HDFS directories

2021-05-05 Thread Kapil Garg
ave to union multiple RDDs. You can read files from multiple > directories in a single read call. Spark will manage partitioning of the > data across directories. > > > > *From: *Kapil Garg > *Date: *Wednesday, May 5, 2021 at 10:45 AM > *To: *spark users > *Subject: *[EXTER

Re: How to read multiple HDFS directories

2021-05-05 Thread Lalwani, Jayesh
You don’t have to union multiple RDDs. You can read files from multiple directories in a single read call. Spark will manage partitioning of the data across directories. From: Kapil Garg Date: Wednesday, May 5, 2021 at 10:45 AM To: spark users Subject: [EXTERNAL] How to read multiple HDFS
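For illustration, a small sketch of the single-call read (paths are placeholders):
  val dirs = Seq("hdfs:///data/2021-05-01", "hdfs:///data/2021-05-02", "hdfs:///data/2021-05-03")
  // DataFrame API: pass all directories to one reader; Spark plans partitions across them.
  val df = spark.read.parquet(dirs: _*)
  // RDD API: textFile also accepts a comma-separated list of paths.
  val rdd = spark.sparkContext.textFile(dirs.mkString(","))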

Re: How to read multiple HDFS directories

2021-05-05 Thread Kapil Garg
2. Loop over N directories > 1. read unprocessed new data from HDFS > 2. union them and do a `reduceByKey` operation > 3. output a new version of the snapshot > > HTH > >view my Linkedin profile > <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

Re: How to read multiple HDFS directories

2021-05-05 Thread Mich Talebzadeh
This is my take: 1. read the current snapshot (provide empty if it doesn't exist yet); 2. loop over N directories: 2.1. read unprocessed new data from HDFS, 2.2. union them and do a `reduceByKey` operation; 3. output a new version of the snapshot. HTH view my Linkedin profile
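A rough sketch of those steps, assuming a tab-separated (key, count) layout and placeholder paths:
  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.spark.rdd.RDD
  val sc = spark.sparkContext
  val newDirs = Seq("hdfs:///incoming/d1", "hdfs:///incoming/d2")          // the N directories
  def load(path: String): RDD[(String, Long)] =
    sc.textFile(path).map { line => val Array(k, v) = line.split("\t"); (k, v.toLong) }
  // 1. read the current snapshot, or start empty if it doesn't exist yet
  val fs = FileSystem.get(sc.hadoopConfiguration)
  val current = if (fs.exists(new Path("/warehouse/snapshot"))) load("hdfs:///warehouse/snapshot")
                else sc.emptyRDD[(String, Long)]
  // 2. union the unprocessed directories and reduceByKey
  val merged = sc.union(current +: newDirs.map(load)).reduceByKey(_ + _)
  // 3. output a new version of the snapshot
  merged.map { case (k, v) => s"$k\t$v" }.saveAsTextFile("hdfs:///warehouse/snapshot_v2")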

Re: How to read multiple HDFS directories

2021-05-05 Thread Kapil Garg
ndence with an HDFS directory), do you have a common key across all? > > >view my Linkedin profile > <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> > > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or de

Re: How to read multiple HDFS directories

2021-05-05 Thread Mich Talebzadeh
When you are doing union on these RDDs (each RDD has a one-to-one correspondence with an HDFS directory), do you have a common key across all? view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> *Disclaimer:* Use it at your own risk. Any a

Re: How to read multiple HDFS directories

2021-05-05 Thread Kapil Garg
destruction. > > > > > On Wed, 5 May 2021 at 15:46, Kapil Garg > wrote: > >> Hi, >> I am facing issues while reading multiple HDFS directories. Please read >> the problem statement and current approach below >> >> *Problem Statement* >> There ar

Re: How to read multiple HDFS directories

2021-05-05 Thread Mich Talebzadeh
l Garg wrote: > Hi, > I am facing issues while reading multiple HDFS directories. Please read > the problem statement and current approach below > > *Problem Statement* > There are N HDFS directories each having K files. We want to read data > from all directories such

How to read multiple HDFS directories

2021-05-05 Thread Kapil Garg
Hi, I am facing issues while reading multiple HDFS directories. Please read the problem statement and current approach below. *Problem Statement* There are N HDFS directories, each having K files. We want to read data from all directories such that when we read data from directory D, we map all

Re: Spark standalone - reading kerberos hdfs

2021-01-24 Thread jelmer
The only way I ever got it to work with Spark standalone is via WebHDFS. See https://issues.apache.org/jira/browse/SPARK-5158?focusedCommentId=16516856&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16516856 On Fri, 8 Jan 2021 at 18:49, Sudhir Babu Pothineni wrote
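For reference, a minimal sketch of the WebHDFS route (host and port are placeholders; 9870 is the usual NameNode HTTP port on Hadoop 3, 50070 on Hadoop 2):
  // webhdfs:// goes to the NameNode/DataNodes over HTTP rather than the HDFS RPC port.
  val df = spark.read.parquet("webhdfs://namenode.example.com:9870/user/me/table")
  df.show()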

Re: Spark standalone - reading kerberos hdfs

2021-01-23 Thread Gábor Rőczei
eytab).getPath() } If you want to test your application with Kerberos, I recommend local mode. https://spark.apache.org/docs/latest/submitting-applications.html#master-urls For example: spark-shell --master local and if you want to access an HDFS filesystem, the

Re: Spark standalone - reading kerberos hdfs

2021-01-21 Thread Sudhir Babu Pothineni
> I think incase of spark stand alone the token is not shared to executor, any > example how to get the HDFS token for executor? > >> On Fri, Jan 8, 2021 at 12:13 PM Gabor Somogyi >> wrote: >> TGT is not enough, you need HDFS token which can be obtained by Spark. >

Re: Spark standalone - reading kerberos hdfs

2021-01-08 Thread Sudhir Babu Pothineni
In case of Spark on YARN, the Application Master shares the token. I think in case of Spark standalone the token is not shared with the executor; any example how to get the HDFS token for the executor? On Fri, Jan 8, 2021 at 12:13 PM Gabor Somogyi wrote: > TGT is not enough, you need an HDFS token which

Re: Spark standalone - reading kerberos hdfs

2021-01-08 Thread Gabor Somogyi
TGT is not enough, you need an HDFS token which can be obtained by Spark. Please check the logs... On Fri, 8 Jan 2021, 18:51 Sudhir Babu Pothineni, wrote: > I spin up a spark standalone cluster (spark.authenticate=false), submitted > a job which reads remote kerberized HDFS, >

Spark standalone - reading kerberos hdfs

2021-01-08 Thread Sudhir Babu Pothineni
I spun up a Spark standalone cluster (spark.authenticate=false) and submitted a job which reads remote kerberized HDFS: val spark = SparkSession.builder() .master("spark://spark-standalone:7077") .getOrCreate() UserGroupInformation.loginUserFromKeytab
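For context, a tidied sketch of the pattern being quoted (principal, keytab and paths are placeholders, and the client-side Hadoop config is assumed to set hadoop.security.authentication=kerberos); as the replies note, this logs in the driver only, so standalone executors can still fail:
  import org.apache.hadoop.security.UserGroupInformation
  import org.apache.spark.sql.SparkSession
  // Kerberos login for the driver JVM before any HDFS access.
  UserGroupInformation.loginUserFromKeytab("user@EXAMPLE.COM", "/etc/security/keytabs/user.keytab")
  val spark = SparkSession.builder()
    .master("spark://spark-standalone:7077")
    .getOrCreate()
  spark.read.parquet("hdfs://secure-nn:8020/data/input").show()            // assumed path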

RE: Spark on Kubernetes : unable to write files to HDFS

2020-12-16 Thread Loic DESCOTTE
Everything is working fine now. Thanks again Loïc From: German Schiavon Sent: Wednesday, December 16, 2020 19:23 To: Loic DESCOTTE Cc: user@spark.apache.org Subject: Re: Spark on Kubernetes : unable to write files to HDFS We've all been there! no reason

Re: Spark on Kubernetes : unable to write files to HDFS

2020-12-16 Thread German Schiavon
embre 2020 18:01 > *To:* Loic DESCOTTE > *Cc:* user@spark.apache.org > *Subject:* Re: Spark on Kubernetes : unable to write files to HDFS > > Hi, > > seems that you have a typo, no? > > Exception in thread "main" java.io.IOException: No FileSystem for scheme: >

RE: Spark on Kubernetes : unable to write files to HDFS

2020-12-16 Thread Loic DESCOTTE
Oh thank you, you're right!! I feel ashamed. From: German Schiavon Sent: Wednesday, December 16, 2020 18:01 To: Loic DESCOTTE Cc: user@spark.apache.org Subject: Re: Spark on Kubernetes : unable to write files to HDFS Hi, seems that you have a typo

Re: Spark on Kubernetes : unable to write files to HDFS

2020-12-16 Thread German Schiavon
Hi, seems that you have a typo, no? Exception in thread "main" java.io.IOException: No FileSystem for scheme: hfds data.write.mode("overwrite").format("text").save("hfds://hdfs-namenode/user/loic/result.txt") On Wed, 16 Dec 2020 at 17:02, Loic DESCO

RE: Spark on Kubernetes : unable to write files to HDFS

2020-12-16 Thread Loic DESCOTTE
So I've tried several other things, including building a fat jar with the HDFS dependency inside my app jar, and added this to the Spark configuration in the code: val spark = SparkSession .builder() .appName("Hello Spark 7") .config("fs.hdfs.
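For reference, a hedged sketch of that configuration approach using the stock Hadoop class names (the replies above show the actual culprit in this thread was the hfds/hdfs typo rather than a missing FileSystem binding):
  import org.apache.hadoop.fs.LocalFileSystem
  import org.apache.hadoop.hdfs.DistributedFileSystem
  import org.apache.spark.sql.SparkSession
  val spark = SparkSession.builder()
    .appName("Hello Spark 7")
    // Pin the FileSystem implementations in case the fat-jar merge dropped
    // the META-INF/services registrations from hadoop-hdfs.
    .config("spark.hadoop.fs.hdfs.impl", classOf[DistributedFileSystem].getName)
    .config("spark.hadoop.fs.file.impl", classOf[LocalFileSystem].getName)
    .getOrCreate()
  import spark.implicits._
  val data = Seq("hello", "spark").toDF("value")
  data.write.mode("overwrite").format("text").save("hdfs://hdfs-namenode/user/loic/result.txt")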

Spark on Kubernetes : unable to write files to HDFS

2020-12-16 Thread Loic DESCOTTE
Hello, I am using Spark On Kubernetes and I have the following error when I try to write data on HDFS : "no filesystem for scheme hdfs" More details : I am submitting my application with Spark submit like this : spark-submit --master k8s://https://myK8SMaster:6443 \ --deploy-mo

Re: Kafka Topic to Parquet HDFS with Structured Streaming

2020-11-19 Thread AlbertoMarq
Hi Chetan, I'm having the exact same issue with Spark Structured Streaming and Kafka trying to write to HDFS. Can you please tell me how you fixed it? I'm using Spark 3.0.1 and Hadoop 3.3.0 Thanks! -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com

Re: how to disable replace HDFS checkpoint location in structured streaming in spark3.0.1

2020-10-13 Thread lec ssmi
m.option("checkpointLocation","file:///C:\\Users\\Administrator >> \\Desktop\\test") > > But the app still throws an exception about the HDFS file system. > Is it not possible to specify the local file system as a checkpoint > location now? >

how to disable replace HDFS checkpoint location in structured streaming in spark3.0.1

2020-10-13 Thread lec ssmi
I have written a demo using spark3.0.0, and the location where the checkpoint file is saved has been explicitly specified like > > stream.option("checkpointLocation","file:///C:\\Users\\Administrator\\ > Desktop\\test") But the app still throws an excepti

Re: Spark3 on k8S reading encrypted data from HDFS with KMS in HA

2020-08-19 Thread Michel Sumbul
of any k8s cluster reading data to any hadoop3 with kms should be fine. I'm using an HDP3 cluster, but there is probably an easier way to test. Michel On Wed, Aug 19, 2020 at 09:50, Prashant Sharma wrote: > -dev > Hi, > > I have used Spark with HDFS encrypted with Hadoop KMS, and it

Re: Spark3 on k8S reading encrypted data from HDFS with KMS in HA

2020-08-19 Thread Prashant Sharma
-dev Hi, I have used Spark with HDFS encrypted with Hadoop KMS, and it worked well. Somehow, I cannot recall if I had Kubernetes in the mix. Seeing the error, it is not clear what caused the failure. Can I reproduce this somehow? Thanks, On Sat, Aug 15, 2020 at 7:18 PM Michel

Re: Spark3 on k8S reading encrypted data from HDFS with KMS in HA

2020-08-15 Thread Michel Sumbul
Hi guys, Does anyone have an idea on this issue? Even some tips to troubleshoot it? I got the impression that after the creation of the delegation token for the KMS, the token is not sent to the executor or maybe not saved? I'm sure I'm not the only one using Spark with HDFS encrypted with KMS

Spark3 on k8S reading encrypted data from HDFS with KMS in HA

2020-08-13 Thread Michel Sumbul
Hi guys, Has anyone tried Spark 3 on k8s reading data from HDFS encrypted with KMS in HA mode (with Kerberos)? I have a wordcount job running with Spark 3 reading data on HDFS (Hadoop 3.1), everything secured with Kerberos. Everything works fine if the data folder is not encrypted (spark on k8s

Write to same hdfs dir from multiple spark jobs

2020-07-29 Thread Deepak Sharma
Hi, Is there any design pattern around writing to the same HDFS directory from multiple Spark jobs? -- Thanks Deepak www.bigdatabig.com

Re: Spark 3.0 with Hadoop 2.6 HDFS/Hive

2020-07-20 Thread DB Tsai
8s in standalone mode. > We access HDFS/Hive running on a Hadoop 2.6 cluster. > We've been using Spark 2.4.5 and planning on upgrading to Spark 3.0.0 > However, we dont have any control over the Hadoop cluster and it will remain > in 2.6 > > Is Spark 3.0 still compatible with HDF

Re: Spark 3.0 with Hadoop 2.6 HDFS/Hive

2020-07-20 Thread DB Tsai
If it's standalone mode, it's even easier. You should be able to connect to Hadoop 2.6 HDFS using the 3.2 client. In your k8s cluster, just don't put Hadoop 2.6 into your classpath. On Sun, Jul 19, 2020 at 10:25 PM Ashika Umanga Umagiliya wrote: > > Hello > > "spark.yarn.popula

Re: Spark 3.0 with Hadoop 2.6 HDFS/Hive

2020-07-19 Thread Ashika Umanga Umagiliya
Hello "spark.yarn.populateHadoopClasspath" is used in YARN mode correct? However our Spark cluster is standalone cluster not using YARN. We only connect to HDFS/Hive to access data.Computation is done on our spark cluster running on K8s (not Yarn) On Mon, Jul 20, 2020 at 2:04 PM DB T

Re: Spark 3.0 with Hadoop 2.6 HDFS/Hive

2020-07-19 Thread Prashant Sharma
s.apache.org/jira/browse/SPARK-25016 > > We run our Spark cluster on K8s in standalone mode. > We access HDFS/Hive running on a Hadoop 2.6 cluster. > We've been using Spark 2.4.5 and planning on upgrading to Spark 3.0.0 > However, we dont have any control over the Hadoop cluster

Spark 3.0 with Hadoop 2.6 HDFS/Hive

2020-07-19 Thread Ashika Umanga
Greetings, Hadoop 2.6 support has been removed according to this ticket https://issues.apache.org/jira/browse/SPARK-25016 We run our Spark cluster on K8s in standalone mode. We access HDFS/Hive running on a Hadoop 2.6 cluster. We've been using Spark 2.4.5 and are planning on upgrading to Spark 3.0.0 However

Re: Spark dataframe hdfs vs s3

2020-05-30 Thread Anwar AliKhan
ion rules <https://spoddutur.github.io/spark-notes/distribution_of_executors_cores_and_memory_for_spark_application.html> are important: - Partition size should be at least 128MB and, if possible, based on a key attribute. - The number of CPUs/Executor should be between 4 and 6. In

Re: Spark dataframe hdfs vs s3

2020-05-30 Thread Dark Crusader
Thanks all for the replies. I am switching to HDFS since it seems like an easier solution. To answer some of your questions, my HDFS space is part of the nodes I use for computation on Spark. From what I understand, this helps because of the data locality advantage. Which me

Re: Spark dataframe hdfs vs s3

2020-05-29 Thread Jörn Franke
try your suggestion to look into the UI. Can you guide on what I > should be looking for? > > I was already using the s3a protocol to compare the times. > > My hunch is that multiple reads from S3 are required because of improper > caching of intermediate data. And mayb

Re: Spark dataframe hdfs vs s3

2020-05-29 Thread randy clinton
HDFS is simply a better place to make performant reads, and on top of that the data is closer to your Spark job. The Databricks link from above shows that they find a 6x read throughput difference between the two. If your HDFS is part of the same Spark cluster then it should

Re: Spark dataframe hdfs vs s3

2020-05-29 Thread Bin Fan
Try to deploy Alluxio as a caching layer on top of S3, providing Spark a similar HDFS interface? Like in this article: https://www.alluxio.io/blog/accelerate-spark-and-hive-jobs-on-aws-s3-by-10x-with-alluxio-tiered-storage/ On Wed, May 27, 2020 at 6:52 PM Dark Crusader wrote: > Hi Ra

Re: Spark dataframe hdfs vs s3

2020-05-28 Thread Kanwaljit Singh
You can’t play much if it is a streaming job. But in case of batch jobs, sometimes teams will copy their S3 data to HDFS in prep for the next run :D From: randy clinton Date: Thursday, May 28, 2020 at 5:50 AM To: Dark Crusader Cc: Jörn Franke , user Subject: Re: Spark dataframe hdfs vs s3

Re: Spark dataframe hdfs vs s3

2020-05-28 Thread randy clinton
See if this helps "That is to say, on a per node basis, HDFS can yield 6X higher read throughput than S3. Thus, *given that the S3 is 10x cheaper than HDFS, we find that S3 is almost 2x better compared to HDFS on performance per dollar."* *https://databricks.com/blog/2017/05/31/top

Re: Spark dataframe hdfs vs s3

2020-05-27 Thread Dark Crusader
Hi Randy, Yes, I'm using parquet on both S3 and hdfs. On Thu, 28 May, 2020, 2:38 am randy clinton, wrote: > Is the file Parquet on S3 or is it some other file format? > > In general I would assume that HDFS read/writes are more performant for > spark jobs. > > For instance,

Re: Spark dataframe hdfs vs s3

2020-05-27 Thread randy clinton
Is the file Parquet on S3 or is it some other file format? In general I would assume that HDFS read/writes are more performant for spark jobs. For instance, consider how well partitioned your HDFS file is vs the S3 file. On Wed, May 27, 2020 at 1:51 PM Dark Crusader wrote: > Hi J

Re: Spark dataframe hdfs vs s3

2020-05-27 Thread Dark Crusader
are required because of improper caching of intermediate data. And maybe hdfs is doing a better job at this. Does this make sense? I would also like to add that we built an extra layer on S3 which might be adding to even slower times. Thanks for your help. On Wed, 27 May, 2020, 11:03 pm Jörn Franke

Re: Spark dataframe hdfs vs s3

2020-05-27 Thread Jörn Franke
Have you looked in the Spark UI to see why this is the case? S3 reading can take more time - it also depends on which S3 URL scheme you are using: s3a vs s3n vs s3. It could help to persist in-memory or on HDFS after some calculation. You can also initially load from S3 and store on HDFS and work from
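A small sketch of both suggestions (paths assumed): land the S3 data on HDFS once, then persist the intermediate result instead of re-reading S3 on every pass:
  import org.apache.spark.sql.functions.col
  import org.apache.spark.storage.StorageLevel
  // One-time copy from S3 to HDFS, then iterate against the local copy.
  spark.read.parquet("s3a://my-bucket/input")
    .write.mode("overwrite").parquet("hdfs:///staging/input")
  val local = spark.read.parquet("hdfs:///staging/input")
  val prepared = local.filter(col("label").isNotNull).persist(StorageLevel.MEMORY_AND_DISK)
  prepared.count()   // materialise the cache before the ML iterations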

Spark dataframe hdfs vs s3

2020-05-27 Thread Dark Crusader
Hi all, I am reading data from hdfs in the form of parquet files (around 3 GB) and running an algorithm from the spark ml library. If I create the same spark dataframe by reading data from S3, the same algorithm takes considerably more time. I don't understand why this is happening

Re: How can I add extra mounted disk to HDFS

2020-04-30 Thread Chetan Khatri
I have 3 disks now: disk1 - already has data; disk2 - newly added. I want to shift the data from disk1 to disk2; obviously both are datanodes. Please suggest the steps for hot datanode disk migration. On Wed, Apr 29, 2020 at 2:38 AM JB Data31 wrote: > Use Hadoop NFSv3 gateway to mount FS. > >

Re: How can I add extra mounted disk to HDFS

2020-04-29 Thread JB Data31
Use Hadoop NFSv3 gateway to mount FS. @*JB*Δ On Tue, Apr 28, 2020 at 23:18, Chetan Khatri wrote: > Hi Spark Users, > > My spark job gave me an error No Space left on the device >

How can I add extra mounted disk to HDFS

2020-04-28 Thread Chetan Khatri
Hi Spark Users, My Spark job gave me an error: No Space left on the device

Re: HDFS file hdfs://127.0.0.1:9000/hdfs/spark/examples/README.txt

2020-04-06 Thread jane thorpe
Hi Som, The HdfsWordCount program counts words from files you place in a directory with the name of argv[args.length - 1], while the program is running in a for (;;) loop until the user presses CTRL-C. Why does the program name have the prefix HDFS? HADOOP distributed File

Re: HDFS file hdfs://127.0.0.1:9000/hdfs/spark/examples/README.txt

2020-04-06 Thread Som Lima
> > jane thorpe > janethor...@aol.com > > > -Original Message- > From: jane thorpe > To: somplasticllc ; user > Sent: Fri, 3 Apr 2020 2:44 > Subject: Re: HDFS file hdfs:// > 127.0.0.1:9000/hdfs/spark/examples/README.txt > > > Thanks darling > >

Fwd: HDFS file hdfs://127.0.0.1:9000/hdfs/spark/examples/README.txt

2020-04-06 Thread jane thorpe
: HDFS file hdfs://127.0.0.1:9000/hdfs/spark/examples/README.txt Thanks darling I tried this and it worked hdfs getconf -confKey fs.defaultFS hdfs://localhost:9000 scala> :paste // Entering paste mode (ctrl-D to finish) val textFile = sc.textFile("hdfs://127.0.0.1:9000/hdfs/spark/

Re: HDFS file hdfs://127.0.0.1:9000/hdfs/spark/examples/README.txt

2020-04-02 Thread jane thorpe
Thanks darling I tried this and it worked hdfs getconf -confKey fs.defaultFS hdfs://localhost:9000 scala> :paste // Entering paste mode (ctrl-D to finish) val textFile = sc.textFile("hdfs://127.0.0.1:9000/hdfs/spark/examples/README.txt") val counts = textFile.flatMap(line
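For reference, the complete form of the standard word-count snippet being quoted here (the output path is an assumption):
  val textFile = sc.textFile("hdfs://127.0.0.1:9000/hdfs/spark/examples/README.txt")
  val counts = textFile
    .flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
  counts.saveAsTextFile("hdfs://127.0.0.1:9000/hdfs/spark/output")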

Re: HDFS file

2020-03-31 Thread Som Lima
preview2-bin-hadoop2.7 > I can run same program for hdfs format > > val textFile = sc.textFile("hdfs://...")val counts = textFile.flatMap(line => > line.split(" ")) > .map(word => (word, 1)) > .reduceByKey(_ + _)counts.saveAsT

HDFS file

2020-03-31 Thread jane thorpe
hi, Are there setup instructions on the website for spark-3.0.0-preview2-bin-hadoop2.7? I can run the same program for hdfs format: val textFile = sc.textFile("hdfs://...") val counts = textFile.flatMap(line => line.split(" ")) .map(word => (word, 1))

Ceph / Lustre VS hdfs comparison

2020-02-12 Thread Nicolas PARIS
Hi, Does anyone have experience with Ceph / Lustre as a replacement for HDFS for Spark storage (parquet, orc..)? Is HDFS still far superior to the former? Thanks -- nicolas paris - To unsubscribe e-mail: user-unsubscr

Re: Structured Streaming - HDFS State Store Performance Issues

2020-01-14 Thread Gourav Sengupta
Hi Will, Have you tried using S3 as the state store, with the option in EMR enabled for faster file sync? Also, there is now an option of using FSx for Lustre. Thanks and Regards, Gourav Sengupta On Wed, Jan 15, 2020 at 5:17 AM William Briggs wrote: > Hi all, I've got a problem that really has me

Structured Streaming - HDFS State Store Performance Issues

2020-01-14 Thread William Briggs
Hi all, I've got a problem that really has me stumped. I'm running a Structured Streaming query that reads from Kafka, performs some transformations and stateful aggregations (using flatMapGroupsWithState), and outputs any updated aggregates to another Kafka topic. I'm running this job using

Re: Out of memory HDFS Read and Write

2019-12-22 Thread Ruijing Li
ap, I have an ETL data pipeline that does some logic, repartitions >> to reduce the amount of files written, writes the output to HDFS as parquet >> files. After, it reads the output and writes it to other locations, doesn’t >> matter if on the same hadoop cluster or multiple. This

Re: Out of memory HDFS Read and Write

2019-12-22 Thread Chris Teoh
rs. So it is purely a dataframe read > and write issue > — > To recap, I have an ETL data pipeline that does some logic, repartitions > to reduce the amount of files written, writes the output to HDFS as parquet > files. After, it reads the output and writes it to oth

Re: Out of memory HDFS Read and Write

2019-12-22 Thread Ruijing Li
, writes the output to HDFS as parquet files. After, it reads the output and writes it to other locations, doesn’t matter if on the same hadoop cluster or multiple. This is a simple piece of code ``` destPaths.foreach(path => Try(spark.read.parquet(sourceOutputPath).write.mode(SaveMode.Overwr

Re: Out of memory HDFS Multiple Cluster Write

2019-12-21 Thread Chris Teoh
ete, I repartition the files to 20 after having >>>> spark.sql.shuffle.partitions = 2000 so we don’t have too many small files. >>>> Data is small about 130MB per file. When spark reads it reads in 40 >>>> partitions and tries to output that to the dif

Out of memory HDFS Multiple Cluster Write

2019-12-21 Thread Ruijing Li
after logic is >>> complete, I repartition the files to 20 after having >>> spark.sql.shuffle.partitions = 2000 so we don’t have too many small files. >>> Data is small about 130MB per file. When spark reads it reads in 40 >>> partitions and tries to output that

Re: Out of memory HDFS Multiple Cluster Write

2019-12-21 Thread Ruijing Li
d tries to output that to the different cluster. Unfortunately >> during that read and write stage executors drop off. >> >> We keep hdfs block 128Mb >> >> On Fri, Dec 20, 2019 at 3:01 PM Chris Teoh wrote: >> >>> spark.sql.shuffle.partitions might be a start. &

Re: Out of memory HDFS Multiple Cluster Write

2019-12-20 Thread Chris Teoh
after having > spark.sql.shuffle.partitions = 2000 so we don’t have too many small files. > Data is small about 130MB per file. When spark reads it reads in 40 > partitions and tries to output that to the different cluster. Unfortunately > during that read and write stage executors drop off. >

Re: Out of memory HDFS Multiple Cluster Write

2019-12-20 Thread Ruijing Li
130MB per file. When spark reads it reads in 40 partitions and tries to output that to the different cluster. Unfortunately during that read and write stage executors drop off. We keep hdfs block 128Mb On Fri, Dec 20, 2019 at 3:01 PM Chris Teoh wrote: > spark.sql.shuffle.partitions might be a st

Re: Out of memory HDFS Multiple Cluster Write

2019-12-20 Thread Chris Teoh
countered a strange executor OOM error. I have a data pipeline > using Spark 2.3 Scala 2.11.12. This pipeline writes the output to one HDFS > location as parquet then reads the files back in and writes to multiple > hadoop clusters (all co-located in the same datacenter). It should be a

Out of memory HDFS Multiple Cluster Write

2019-12-20 Thread Ruijing Li
Hi all, I have encountered a strange executor OOM error. I have a data pipeline using Spark 2.3 Scala 2.11.12. This pipeline writes the output to one HDFS location as parquet then reads the files back in and writes to multiple hadoop clusters (all co-located in the same datacenter). It should

Problem of how to retrieve file from HDFS

2019-10-08 Thread Ashish Mittal
Hi, I am trying to store and retrieve a csv file from HDFS. I have successfully stored the csv file in HDFS using LinearRegressionModel in Spark using Java, but I cannot retrieve the csv file from HDFS. How do I retrieve a csv file from HDFS? code-- SparkSession sparkSession = SparkSession.builder().appName
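For illustration, the read-back side in Scala (the original is Java, but the DataFrameReader calls are the same; path and options are assumptions):
  import org.apache.spark.sql.SparkSession
  val spark = SparkSession.builder().appName("csv-on-hdfs").getOrCreate()
  // Read the csv back from HDFS into a DataFrame.
  val df = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("hdfs:///user/ashish/lr-input.csv")       // assumed path used when storing
  df.show()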
