Out of memory HDFS Multiple Cluster Write

2019-12-20 Thread Ruijing Li
Hi all, I have encountered a strange executor OOM error. I have a data pipeline using Spark 2.3 and Scala 2.11.12. This pipeline writes the output to one HDFS location as Parquet, then reads the files back in and writes them to multiple Hadoop clusters (all co-located in the same datacenter). It should

Problem of how to retrieve file from HDFS

2019-10-08 Thread Ashish Mittal
Hi, I am trying to store and retrieve a CSV file from HDFS. I have successfully stored the CSV file in HDFS using LinearRegressionModel in Spark with Java, but I cannot retrieve the CSV file from HDFS. How do I retrieve a CSV file from HDFS? code-- SparkSession sparkSession = SparkSession.builder().appName

How to use HDFS >3.1.1 with spark 2.3.3 to output parquet files to S3?

2019-07-14 Thread Alexander Czech
o I just set in the flintrock config HDFS to 3.1.1 and everything "just works"? Or do I also have to set a committer algorithm like this when I create my spark context in pyspark: .set('spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version','some_kind_of_Version') thanks for the help!

AWS EMR slow write to HDFS

2019-06-11 Thread Femi Anthony
I'm writing a large dataset in Parquet format to HDFS using Spark and it runs rather slowly in EMR vs say Databricks. I realize that if I was able to use Hadoop 3.1, it would be much more performant because it has a high performance output committer. Is this the case, and if so - when

Re: Read hdfs files in spark streaming

2019-06-11 Thread nitin jain
>>>>> https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html >>>>> >>>>> Plz start using stackoverflow to ask question to other ppl so get >>>>> benefits of answer >>>>> >>>>> >>>>> Regards, >&

Re: Kafka Topic to Parquet HDFS with Structured Streaming

2019-06-10 Thread Chetan Khatri
Hello Deng, Thank you for your email. Issue was with Spark - Hadoop / HDFS configuration settings. Thanks On Mon, Jun 10, 2019 at 5:28 AM Deng Ching-Mallete wrote: > Hi Chetan, > > Best to check if the user account that you're using to run the job has > permission to write to the

Re: Read hdfs files in spark streaming

2019-06-10 Thread Deepak Sharma
>>>> https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html >>>> >>>> Plz start using stackoverflow to ask question to other ppl so get >>>> benefits of answer >>>> >>>> >>>> Regards, >>

Re: Read hdfs files in spark streaming

2019-06-10 Thread Shyam P
On Sun, Jun 9, 2019, 8:08 AM Deepak Sharma wrote: >> >>> I am using spark streaming application to read from kafka. >>> The value coming from kafka message is path to hdfs file. >>> I am using spark 2.x , spark.read.stream. >>> What is the best way to rea

Re: Kafka Topic to Parquet HDFS with Structured Streaming

2019-06-10 Thread Deng Ching-Mallete
Hi Chetan, Best to check if the user account that you're using to run the job has permission to write to the path in HDFS. I would suggest to write the parquet files to a different path, perhaps to a project space or user home, rather than at the root directory. HTH, Deng On Sat, Jun 8, 2019

Re: Read hdfs files in spark streaming

2019-06-09 Thread Deepak Sharma
, > Vaquar khan > > On Sun, Jun 9, 2019, 8:08 AM Deepak Sharma wrote: > >> I am using spark streaming application to read from kafka. >> The value coming from kafka message is path to hdfs file. >> I am using spark 2.x , spark.read.stream. >> What is the best way

Re: Read hdfs files in spark streaming

2019-06-09 Thread vaquar khan
ark streaming application to read from kafka. > The value coming from kafka message is path to hdfs file. > I am using spark 2.x , spark.read.stream. > What is the best way to read this path in spark streaming and then read > the json stored at the hdfs path , may be using spark.read.json , int

Read hdfs files in spark streaming

2019-06-09 Thread Deepak Sharma
I am using spark streaming application to read from kafka. The value coming from kafka message is path to hdfs file. I am using spark 2.x , spark.read.stream. What is the best way to read this path in spark streaming and then read the json stored at the hdfs path , may be using spark.read.json

Re: Kafka Topic to Parquet HDFS with Structured Streaming

2019-06-07 Thread Chetan Khatri
rom Kafka Topic to Parquet HDFS with Structured > Streaming but Getting failures. Please do help. > > val spark: SparkSession = > SparkSession.builder().appName("DemoSparkKafka").getOrCreate() > import spark.implicits._ > val dataFromTopicDF = spark > .rea

Kafka Topic to Parquet HDFS with Structured Streaming

2019-06-07 Thread Chetan Khatri
Hello Dear Spark Users, I am trying to write data from Kafka Topic to Parquet HDFS with Structured Streaming but Getting failures. Please do help. val spark: SparkSession = SparkSession.builder().appName("DemoSparkKafka").getOrCreate() import spark.implicits._ val dataFromTopic

spark checkpoint between 2 jobs and HDFS ramfs with storage policy

2019-05-21 Thread Julien Laurenceau
a a few hundred GB in ramfs, but I cannot find any feedback on this kind of configuration... and the Hadoop doc that tells me "network replication negates the benefits of writing to memory" doesn't give me much confidence about the performance improvement. My HDFS is configured with r

Re: [GraphX] Preserving Partitions when reading from HDFS

2019-04-25 Thread M Bilal
If I understand correctly this would set the split size in the Hadoop configuration when reading file. I can see that being useful when you want to create more partitions than what the block size in HDFS might dictate. Instead what I want to do is to create a single partition for each file written

Re: [GraphX] Preserving Partitions when reading from HDFS

2019-04-15 Thread Manu Zhang
You may try `sparkContext.hadoopConfiguration().set("mapred.max.split.size", "33554432")` to tune the partition size when reading from HDFS. Thanks, Manu Zhang On Mon, Apr 15, 2019 at 11:28 PM M Bilal wrote: > Hi, > > I have implemented a custom partitioning
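A minimal Scala sketch of the setting suggested above, assuming a spark-shell style session where sc is already defined; the HDFS path is a placeholder and 33554432 (32 MB) is simply the value quoted in the reply:

  // Cap the input split size before reading; on current Hadoop the key
  // mapreduce.input.fileinputformat.split.maxsize is the non-deprecated equivalent.
  sc.hadoopConfiguration.set("mapred.max.split.size", "33554432")

  val edges = sc.textFile("hdfs:///user/demo/graph/edges")   // placeholder path
  println(s"partitions = ${edges.getNumPartitions}")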

[GraphX] Preserving Partitions when reading from HDFS

2019-04-15 Thread M Bilal
Hi, I have implemented a custom partitioning algorithm to partition graphs in GraphX. Saving the partitioning graph (the edges) to HDFS creates separate files in the output folder with the number of files equal to the number of Partitions. However, reading back the edges creates number

RE: How to print DataFrame.show(100) to text file at HDFS

2019-04-14 Thread email
to print DataFrame.show(100) to text file at HDFS Use .limit on the dataframe followed by .write On Apr 14, 2019, at 5:10 AM, Chetan Khatri <chetan.opensou...@gmail.com> wrote: Nuthan, Thank you for the reply. The solution proposed will give everything. For me it is like one Datafram
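A short Scala sketch of the ".limit followed by .write" suggestion, assuming an existing SparkSession named spark; the source table, output format and HDFS path are illustrative choices, not from the thread:

  val df = spark.table("my_table")           // placeholder source
  df.limit(100)                              // same 100 rows that .show(100) would print
    .write
    .mode("overwrite")
    .option("header", "true")
    .csv("hdfs:///user/demo/df_top100")      // any HDFS directory works here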

Re: How to print DataFrame.show(100) to text file at HDFS

2019-04-14 Thread Brandon Geise
txt >> >> showDF.py: >> >> from pyspark.sql import SparkSession >> >> >> spark = SparkSession.builder.appName("Write stdout").getOrCreate() >> >> spark.sparkContext.setLogLevel("OFF") >> >> >> spark.table("").show(100

Re: How to print DataFrame.show(100) to text file at HDFS

2019-04-14 Thread Chetan Khatri
el("OFF") > > > spark.table("").show(100,truncate=false) > > But is there any specific reason you want to write it to hdfs? Is this for > human consumption? > > Regards, > Nuthan > > On Sat, Apr 13, 2019 at 6:41 PM Chetan Khatri > wrote: > &

Re: How to print DataFrame.show(100) to text file at HDFS

2019-04-13 Thread Nuthan Reddy
ncate=false) But is there any specific reason you want to write it to hdfs? Is this for human consumption? Regards, Nuthan On Sat, Apr 13, 2019 at 6:41 PM Chetan Khatri wrote: > Hello Users, > > In spark when I have a DataFrame and do .show(100) the output which gets > printed, I

How to print DataFrame.show(100) to text file at HDFS

2019-04-13 Thread Chetan Khatri
Hello Users, In Spark, when I have a DataFrame and do .show(100), I want to save the printed output as-is to a txt file in HDFS. How can I do this? Thanks

Re: Load Time from HDFS

2019-04-10 Thread Mich Talebzadeh
Have you tried looking at the Spark GUI to see the time it takes to load from HDFS? The Spark GUI by default runs on port 4040. However, you can set it in spark-submit: ${SPARK_HOME}/bin/spark-submit \ …... --conf "spark.ui.port=" and access it through hostname:port HTH Dr Mich
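For completeness, the same setting can be applied programmatically instead of on the spark-submit command line; this is only a sketch, and 4050 is an arbitrary example port:

  import org.apache.spark.sql.SparkSession

  // Equivalent to --conf spark.ui.port=4050; the UI otherwise defaults to 4040
  // (falling back to 4041, 4042, ... if the port is taken).
  val spark = SparkSession.builder()
    .appName("ui-port-demo")
    .config("spark.ui.port", "4050")
    .getOrCreate()

  // Load times for the HDFS read then show up under the Jobs/Stages tabs at hostname:4050.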

Re:Load Time from HDFS

2019-04-10 Thread yeikel valdes
koloka...@ics.forth.gr wrote Hello, I want to ask if there is any way to measure HDFS data loading time at the start of my program. I tried to add an action, e.g. count(), after the val data = sc.textFile() call. But I notice that my program takes more time to finish than before adding the count call

Load Time from HDFS

2019-04-02 Thread Jack Kolokasis
Hello, I want to ask if there is any way to measure HDFS data loading time at the start of my program. I tried to add an action, e.g. count(), after the val data = sc.textFile() call. But I notice that my program takes more time to finish than before adding the count call. Is there any other way to do

Re: writing a small csv to HDFS is super slow

2019-03-27 Thread Gezim Sejdiu
Hi Lian, many thanks for the detailed information and sharing the solution with us. I will forward this to a student and hopefully will resolve the issue. Best regards, On Wed, Mar 27, 2019 at 1:55 AM Lian Jiang wrote: > Hi Gezim, > > My execution plan of the data frame to write

Re: writing a small csv to HDFS is super slow

2019-03-26 Thread Lian Jiang
Hi Gezim, The execution plan of the data frame I write into HDFS is a union of 140 child dataframes. None of these child data frames are materialized before writing to HDFS. It is not saving the file that takes time; it is materializing the dataframes. My solution

Re: writing a small csv to HDFS is super slow

2019-03-26 Thread Gezim Sejdiu
Hi Lian, I was following the thread since one of my students had the same issue. The problem was that, when trying to save a larger XML dataset into HDFS, the output couldn't be displayed due to the connectivity timeout between Spark and HDFS. I also suggested that he do the same as @Apostolos

Re: writing a small csv to HDFS is super slow

2019-03-25 Thread Lian Jiang
coalesce instead? > > Kathleen > > On Fri, Mar 22, 2019 at 2:43 PM Lian Jiang wrote: > >> Hi, >> >> Writing a csv to HDFS takes about 1 hour: >> >> >> df.repartition(1).write.format('com.databricks.spark.csv').mode('overwrite').options(header=

How to control batch size while reading from hdfs files?

2019-03-22 Thread kant kodali
Hi All, What determines the batch size while reading from a file from HDFS? I am trying to read files from HDFS and ingest into Kafka using Spark Structured Streaming 2.3.1. I get an error saying the Kafka batch size is too big and that I need to increase max.request.size. Sure I can increase
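The message is cut off before any resolution, but two knobs commonly come up for this HDFS-to-Kafka pattern: maxFilesPerTrigger on the file source to cap each micro-batch, and "kafka."-prefixed options on the sink, which are passed straight to the Kafka producer (so max.request.size can also be raised on the producer side). A hedged Scala sketch with placeholder broker, topic and paths:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("hdfs-to-kafka").getOrCreate()

  // Limit how many new files the source picks up per micro-batch.
  val lines = spark.readStream
    .option("maxFilesPerTrigger", "10")
    .text("hdfs:///user/demo/incoming")

  val query = lines.selectExpr("value")        // Kafka sink expects a value column
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("topic", "ingest")
    .option("kafka.max.request.size", "5242880")   // producer-side limit, 5 MB here
    .option("checkpointLocation", "hdfs:///user/demo/ingest_chk")
    .start()

  query.awaitTermination()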

Re: writing a small csv to HDFS is super slow

2019-03-22 Thread kathy Harayama
Hi Lian, Since you are using repartition(1), do you want to decrease the number of partitions? If so, have you tried using coalesce instead? Kathleen On Fri, Mar 22, 2019 at 2:43 PM Lian Jiang wrote: > Hi, > > Writing a csv to HDFS takes about 1 hour: > > > df.repartiti
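A sketch of the coalesce variant in Scala, assuming an existing SparkSession named spark (the thread itself uses PySpark with the databricks csv package; on Spark 2.x the built-in csv writer is equivalent). Input and output paths are placeholders:

  val df = spark.read.parquet("hdfs:///user/demo/input")   // placeholder source

  // coalesce(1) narrows existing partitions without a full shuffle, which is
  // usually cheaper than repartition(1) when the goal is just a single output file.
  df.coalesce(1)
    .write
    .mode("overwrite")
    .option("header", "true")
    .csv("hdfs:///user/demo/output_csv")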

Re: writing a small csv to HDFS is super slow

2019-03-22 Thread Apostolos N. Papadopoulos
Is it also slow when you do not repartition? (i.e., to get multiple output files) Also did you try simply saveAsTextFile? Also, before repartition, how many partitions are there? a. On 22/3/19 23:34, Lian Jiang wrote: Hi, Writing a csv to HDFS takes about 1 hour: df.repartition(1

writing a small csv to HDFS is super slow

2019-03-22 Thread Lian Jiang
Hi, Writing a csv to HDFS takes about 1 hour: df.repartition(1).write.format('com.databricks.spark.csv').mode('overwrite').options(header='true').save(csv) The generated csv file is only about 150kb. The job uses 3 containers (13 cores, 23g mem). Other people have similar issues but I don't

spark-submit: Warning: Skip remote jar hdfs

2019-01-23 Thread Neo Chien
Hi Experts, I would like to submit a spark job with an additional jar configured on HDFS, however Hadoop gives me a warning about skipping the remote jar. Although I can still get my final results on HDFS, I cannot obtain the effect of the additional remote jar. I would appreciate if you can give me some

Re: Persist Dataframe to HDFS considering HDFS Block Size.

2019-01-21 Thread Shivam Sharma
ck size" > > Arnaud > > On Mon, Jan 21, 2019 at 9:01 AM Shivam Sharma <28shivamsha...@gmail.com> > wrote: > >> Don't we have any property for it? >> >> One more quick question that if files created by Spark is less than HDFS >> block size then the r

Re: Persist Dataframe to HDFS considering HDFS Block Size.

2019-01-21 Thread Arnaud LARROQUE
e have any property for it? > > One more quick question that if files created by Spark is less than HDFS > block size then the rest of Block space will become unavailable and remain > unutilized or it will be shared with other files? > > On Mon, Jan 21, 2019 at 1:30 PM Shivam Shar

Re: Persist Dataframe to HDFS considering HDFS Block Size.

2019-01-21 Thread Shivam Sharma
Don't we have any property for it? One more quick question: if files created by Spark are smaller than the HDFS block size, will the rest of the block space become unavailable and remain unutilized, or will it be shared with other files? On Mon, Jan 21, 2019 at 1:30 PM Shivam Shar

Re: Persist Dataframe to HDFS considering HDFS Block Size.

2019-01-19 Thread Hichame El Khalfi
You can do this in 2 passes (not one): A) Save your dataset into HDFS with what you have. B) Calculate the number of partitions, n = (size of your dataset) / (HDFS block size). Then run a simple Spark job to read and repartition based on 'n'. Hichame From: felixcheun...@hotmail.com Sent: January 19, 2019 2:06 PM
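A rough Scala sketch of that two-pass recipe, assuming an existing SparkSession named spark; the staging path, final path and source table are placeholders, and rounding up with ceil is my own choice:

  import org.apache.hadoop.fs.{FileSystem, Path}

  // Pass A: write the dataset as-is to a staging location.
  spark.table("my_hive_table").write.mode("overwrite").parquet("hdfs:///tmp/stage1")

  // Pass B: derive n from the materialized size and the HDFS block size, then rewrite.
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val staged = new Path("hdfs:///tmp/stage1")
  val bytes = fs.getContentSummary(staged).getLength
  val n = math.max(1, math.ceil(bytes.toDouble / fs.getDefaultBlockSize(staged)).toInt)

  spark.read.parquet("hdfs:///tmp/stage1")
    .repartition(n)
    .write.mode("overwrite").parquet("hdfs:///user/demo/final")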

Re: Persist Dataframe to HDFS considering HDFS Block Size.

2019-01-19 Thread Felix Cheung
You can call coalesce to combine partitions.. From: Shivam Sharma <28shivamsha...@gmail.com> Sent: Saturday, January 19, 2019 7:43 AM To: user@spark.apache.org Subject: Persist Dataframe to HDFS considering HDFS Block Size. Hi All, I wanted to persist dat

Persist Dataframe to HDFS considering HDFS Block Size.

2019-01-19 Thread Shivam Sharma
Hi All, I wanted to persist a dataframe on HDFS. Basically, I am inserting data into a HIVE table using Spark. Currently, at the time of writing to the HIVE table I have set total shuffle partitions = 400, so 400 files are being created, which does not even take the HDFS block size into account. How can I tell

Re: Re:Writing RDDs to HDFS is empty

2019-01-07 Thread yeikel valdes
-0800 fyyleej...@163.com wrote Sorry, the code is too long; simply put, look at the photo <http://apache-spark-user-list.1001560.n3.nabble.com/file/t9811/%E9%97%AE%E9%A2%98%E5%9B%9E%E7%AD%94.png> I define an ArrayBuffer; there are "1 2", "2 3", "4 5"

Re: Re:Writing RDDs to HDFS is empty

2019-01-07 Thread Jian Lee
Sorry, the code is too long; simply put, look at the photo <http://apache-spark-user-list.1001560.n3.nabble.com/file/t9811/%E9%97%AE%E9%A2%98%E5%9B%9E%E7%AD%94.png> I define an ArrayBuffer; there are "1 2", "2 3", "4 5" in it. I want to save it in HDFS, so

Re:Writing RDDs to HDFS is empty

2019-01-07 Thread yeikel valdes
cluster, the result in HDFS is empty. Why? How to solve it? <http://apache-spark-user-list.1001560.n3.nabble.com/file/t9811/%E9%97%AE%E9%A2%98.jpg> Thanks! Jian Li -- Sent from: http://apache-spark-user-list.1001560.n3.nabb

Writing RDDs to HDFS is empty

2019-01-07 Thread Jian Lee
Hi all, In my experiment program I used Spark GraphX. When running in IDEA on Windows, the result is right, but when running on the Linux distributed cluster, the result in HDFS is empty. Why? How to solve it? <http://apache-spark-user-list.1001560.n3.nabble.com/file/t9811/%E9%97%AE%E9%A2%98.

Getting FileNotFoundException and LeaseExpired Exception while writing a df to hdfs path

2018-12-24 Thread Gaurav Gupta
Hi I am receiving FileNotFoundException and LeaseExpired Exception while writing a data frame to an HDFS path. I am using Spark 1.6 and reading messages from Tibco in my streaming application. I am doing some transformations on each RDD, converting it to a data frame, and writing to an HDFS path

Spark App Write nothing on HDFS

2018-12-17 Thread Soheil Pourbafrani
Hi, I submit an app on a Spark 2 cluster using the standalone scheduler in client mode. The app saves the results of the processing to HDFS. There is no error in the output logs and the app finishes successfully. But the problem is that it just creates a _SUCCESS file and an empty part-0 file in the save directory

Re: How to read remote HDFS from Spark using username?

2018-10-03 Thread Aakash Basu
ktrace is below - > > --- >> Py4JJavaError Traceback (most recent call last) >> in () >> > 1 df = spark.read.load("hdfs:// >> 35.154.242.76:9000/auto-ml/projects/auto-ml-test__8503cdc4-21fc-4fae-87c1-5b879cafff71/data/breast-cancer-wisconsin.csv >> ") >>

Re: How to read remote HDFS from Spark using username?

2018-10-03 Thread Jörn Franke
Looks like a firewall issue > Am 03.10.2018 um 09:34 schrieb Aakash Basu : > > The stacktrace is below - > >> --- >> Py4JJavaError Traceback (most recent call last) >> in () >> --

Re: How to read remote HDFS from Spark using username?

2018-10-03 Thread Aakash Basu
The stacktrace is below - --- > Py4JJavaError Traceback (most recent call last) > in () > > 1 df = spark.read.load("hdfs:// > 35.154.242.76:9000/auto-ml/projects/auto-ml-test__8503cdc4-21fc-4fae-87

How to read remote HDFS from Spark using username?

2018-10-03 Thread Aakash Basu
Hi, I have to read data stored in the HDFS of a different machine, and it needs to be accessed through Spark. How to do that? The full HDFS address along with the port doesn't seem to work. Has anyone done this before? Thanks, AB.

Re: How to read json data from kafka and store to hdfs with spark structued streaming?

2018-07-27 Thread Arbab Khalil
Please try adding another option for the starting offset. I have done the same thing many times with different versions of Spark that support structured streaming. The other thing I am seeing is that it could be something at write time. Can you please confirm it by doing the printSchema function after

Re: How to read json data from kafka and store to hdfs with spark structued streaming?

2018-07-27 Thread dddaaa
This is a mistake in the code snippet I posted. The right code that is actually running and producing the error is: df = spark \ .readStream \ .format("kafka") \ .option("kafka.bootstrap.servers", "kafka_broker") \ .option("subscribe", "test_hdfs3") \ .load()
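Putting the corrected pieces of this thread together, a Scala sketch of the read-stream-to-parquet pipeline with the explicit startingOffsets suggested above; the broker and topic come from the snippet, while the output and checkpoint paths are assumptions:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("kafka-to-parquet").getOrCreate()

  // readStream (not read), with an explicit starting offset.
  val df = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka_broker")
    .option("subscribe", "test_hdfs3")
    .option("startingOffsets", "earliest")
    .load()

  // Kafka values arrive as binary, so cast before writing them out.
  val json = df.selectExpr("CAST(value AS STRING) AS value")

  val query = json.writeStream
    .format("parquet")
    .option("path", "hdfs:///user/demo/kafka_json")              // placeholder
    .option("checkpointLocation", "hdfs:///user/demo/kafka_chk")  // required for a file sink
    .start()

  query.awaitTermination()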

Re: How to read json data from kafka and store to hdfs with spark structued streaming?

2018-07-27 Thread Arbab Khalil
Why are you reading batch from kafka and writing it as stream? On Fri, Jul 27, 2018, 1:40 PM dddaaa wrote: > No, I just made sure I'm not doing it. > changed the path in .start() to another path and the same still occurs. > > > > -- > Sent from:

Re: How to read json data from kafka and store to hdfs with spark structued streaming?

2018-07-27 Thread dddaaa
No, I just made sure I'm not doing it. changed the path in .start() to another path and the same still occurs. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To unsubscribe e-mail:

Re: How to read json data from kafka and store to hdfs with spark structued streaming?

2018-07-26 Thread Tathagata Das
hem in hdfs with > spark > structured streaming. > > I followed the example here: > https://spark.apache.org/docs/2.1.0/structured-streaming- > kafka-integration.html > > and when my code looks like this: > > df = spark \ > .read \ >

How to read json data from kafka and store to hdfs with spark structued streaming?

2018-07-24 Thread dddaaa
I'm trying to read json messages from kafka and store them in hdfs with spark structured streaming. I followed the example here: https://spark.apache.org/docs/2.1.0/structured-streaming-kafka-integration.html and when my code looks like this: df = spark \ .read \ .format("

Data from HDFS

2018-04-22 Thread Zois Theodoros
Hello, I am reading data from HDFS in a Spark application and as far as I read each HDFS block is 1 partition for Spark by default. Is there any way to select only 1 block from HDFS to read in my Spark application? Thank you, Thodoris

hdfs file partition

2018-04-19 Thread 崔苗
Hi, when I create a dataset by reading a json file from HDFS, I found that the partition number of the dataset does not equal the number of file blocks. So what defines the partition number of the dataset when I read a file from HDFS?
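The thread does not show an answer, but broadly speaking, for DataFrame/Dataset file sources the split size is governed by spark.sql.files.maxPartitionBytes (128 MB by default) together with spark.sql.files.openCostInBytes and the default parallelism, rather than directly by the HDFS block count. A small sketch for inspecting and tuning it; the path is a placeholder:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("json-partitions")
    .config("spark.sql.files.maxPartitionBytes", 64L * 1024 * 1024)  // cap splits at ~64 MB
    .getOrCreate()

  val ds = spark.read.json("hdfs:///user/demo/events.json")
  println(s"partitions = ${ds.rdd.getNumPartitions}")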

Re: Broadcasting huge array or persisting on HDFS to read on executors - both not working

2018-04-12 Thread surender kumar
The question was not about what kind of sampling, but about random sampling per user. There's no value associated with the items to create strata from. If you read Matteo's answer, that's the way to go about it. -Surender On Thursday, 12 April, 2018, 5:49:43 PM IST, Gourav Sengupta

Re: Broadcasting huge array or persisting on HDFS to read on executors - both not working

2018-04-12 Thread Gourav Sengupta
Hi, There is an option for Stratified Sampling available in SPARK: https://spark.apache.org/docs/latest/mllib-statistics.html#stratified-sampling . Also there is a method called randomSplit which may be called on dataframes in case we want to split them into training and test data. Please let

Re: Broadcasting huge array or persisting on HDFS to read on executors - both not working

2018-04-12 Thread surender kumar
Thanks Matteo, this should work! -Surender On Thursday, 12 April, 2018, 1:13:38 PM IST, Matteo Cossu wrote: I don't think it's trivial. Anyway, the naive solution would be a cross join between user x items. But this can be very very expensive. I've encountered

Re: Broadcasting huge array or persisting on HDFS to read on executors - both not working

2018-04-12 Thread Matteo Cossu
I don't think it's trivial. Anyway, the naive solution would be a cross join between users x items. But this can be very, very expensive. I've encountered a similar problem once; here is how I solved it: - create a new RDD with (itemID, index) where the index is a unique integer between 0 and
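The reply is cut off right after the indexing step. One way that idea is often completed -- my reading, not necessarily Matteo's exact recipe -- is to zipWithIndex the items and have each user draw k random indices, so a join replaces the full users x items cross join. A toy Scala sketch, assuming a spark-shell session with sc defined:

  import scala.util.Random

  val items = sc.parallelize(Seq(0.1, 0.2, 0.3, 0.4, 0.5))   // toy item list
  val users = sc.parallelize(Seq("u1", "u2", "u3"))          // toy user list
  val k = 2                                                  // samples per user

  // Step from the reply: give every item a unique index in [0, numItems).
  val indexedItems = items.zipWithIndex().map { case (item, idx) => (idx, item) }
  val numItems = indexedItems.count()

  // Each user draws k random indices (with replacement in this simple version)...
  val draws = users.flatMap { u =>
    val rng = new Random(u.hashCode)
    Seq.fill(k)(((rng.nextDouble() * numItems).toLong, u))
  }

  // ...and the join pulls in the matching items.
  val sampled = draws.join(indexedItems).map { case (_, (user, item)) => (user, item) }
  sampled.collect().foreach(println)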

Re: Broadcasting huge array or persisting on HDFS to read on executors - both not working

2018-04-11 Thread surender kumar
Right, this is what I did when I said I tried to persist it and create an RDD out of it to sample from. But how to do it for each user? You have one RDD of users on one hand and an RDD of items on the other. How to go from here? Am I missing something trivial?  On Thursday, 12 April, 2018, 2:10:51

Re: Broadcasting huge array or persisting on HDFS to read on executors - both not working

2018-04-11 Thread Matteo Cossu
Why broadcasting this list then? You should use an RDD or DataFrame. For example, RDD has a method sample() that returns a random sample from it. On 11 April 2018 at 22:34, surender kumar wrote: > I'm using pySpark. > I've list of 1 million items (all float values

Broadcasting huge array or persisting on HDFS to read on executors - both not working

2018-04-11 Thread surender kumar
I'm using PySpark. I've a list of 1 million items (all float values) and 1 million users. For each user I want to randomly sample some items from the item list. Broadcasting the item list results in an OutOfMemory error on the driver; I tried setting driver memory up to 10G. I tried to persist this

DataFrameWriter in pyspark ignoring hdfs attributes (using spark-2.2.1-bin-hadoop2.7)?

2018-03-10 Thread Chuan-Heng Hsiao
Hi all, I am using spark-2.2.1-bin-hadoop2.7 in stand-alone mode. (python version: 3.5.2 from ubuntu 16.04) I intended to have a DataFrame write to HDFS with a customized block size but failed. However, the corresponding RDD can successfully write with the customized block size. Could you help me

Re: Writing data in HDFS high available cluster

2018-01-18 Thread Subhash Sriram
eil Pourbafrani <soheil.i...@gmail.com> wrote: > > I have a HDFS high available cluster with two namenode, one as active > namenode and one as standby namenode. When I want to write data to HDFS I use > the active namenode address. Now, my question is what happened if during > s

Writing data in HDFS high available cluster

2018-01-18 Thread Soheil Pourbafrani
I have an HDFS high-availability cluster with two namenodes, one active and one standby. When I want to write data to HDFS I use the active namenode address. Now, my question is: what happens if the active namenode fails while Spark is writing data? Is there any way to set both active

Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread Gourav Sengupta
mail.com> >>> wrote: >>> >>>> Hi, >>>> >>>> You can monitor a filesystem directory as streaming source as long as >>>> the files placed there are atomically copied/moved into the directory. >>>> Updating the files is not suppo

Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread ayan guha
mail.com> >>> wrote: >>> >>>> Hi, >>>> >>>> You can monitor a filesystem directory as streaming source as long as >>>> the files placed there are atomically copied/moved into the directory. >>>> Updating the files is

Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread kant kodali
; Updating the files is not supported. >>> >>> kr, Gerard. >>> >>> On Mon, Jan 15, 2018 at 11:41 PM, kant kodali <kanth...@gmail.com> >>> wrote: >>> >>>> Hi All, >>>> >>>> I am wondering if HDFS ca

Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread Gourav Sengupta
e atomically copied/moved into the directory. >> Updating the files is not supported. >> >> kr, Gerard. >> >> On Mon, Jan 15, 2018 at 11:41 PM, kant kodali <kanth...@gmail.com> wrote: >> >>> Hi All, >>> >>> I am wondering

Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread kant kodali
ectory. > Updating the files is not supported. > > kr, Gerard. > > On Mon, Jan 15, 2018 at 11:41 PM, kant kodali <kanth...@gmail.com> wrote: > >> Hi All, >> >> I am wondering if HDFS can be a streaming source like Kafka in Spark >> 2.2.0? For example can I

Re: can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread Gerard Maas
> > I am wondering if HDFS can be a streaming source like Kafka in Spark > 2.2.0? For example can I have stream1 reading from Kafka and writing to > HDFS and stream2 to read from HDFS and write it back to Kafka? such that > stream2 will be pulling the latest updates written by stream1. > > Thanks! >
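A sketch of the stream2 side under the constraint quoted upthread (files must be atomically moved into the watched directory, and a streaming file source needs its schema up front). The schema, broker, topic and paths are all placeholders:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.types._

  val spark = SparkSession.builder().appName("hdfs-dir-as-source").getOrCreate()

  val schema = new StructType().add("key", StringType).add("payload", StringType)

  // stream2: watch the directory stream1 writes to; updating files in place is not supported.
  val fromHdfs = spark.readStream
    .schema(schema)
    .parquet("hdfs:///user/demo/stream1_out")

  val query = fromHdfs
    .selectExpr("key", "payload AS value")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("topic", "stream2_out")
    .option("checkpointLocation", "hdfs:///user/demo/stream2_chk")
    .start()

  query.awaitTermination()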

can HDFS be a streaming source like Kafka in Spark 2.2.0?

2018-01-15 Thread kant kodali
Hi All, I am wondering if HDFS can be a streaming source like Kafka in Spark 2.2.0? For example, can I have stream1 reading from Kafka and writing to HDFS, and stream2 reading from HDFS and writing back to Kafka, such that stream2 will be pulling the latest updates written by stream1? Thanks!

Re: Spark loads data from HDFS or S3

2017-12-13 Thread Jörn Franke
S3 can be cheaper to run than HDFS on Amazon. As you correctly describe, it does not support data locality. The data is distributed to the workers. Depending on your use case it can make sense to have HDFS as a temporary “cache” for S3 data. > On 13. Dec 2017, at 09:39, Philip Lee <

Re: Spark loads data from HDFS or S3

2017-12-13 Thread Sebastian Nagel
lelism. Otherwise (e.g., if reading a single gzipped file) only one worker will read the data. > So it might be a trade-off compared to HDFS? Accessing data on S3 from Hadoop is usually slower than HDFS, cf. https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Oth

Spark loads data from HDFS or S3

2017-12-13 Thread Philip Lee
Hi, I have a few questions about the structure of HDFS and S3 when Spark loads data from the two storage systems. Generally, when Spark loads data from HDFS, HDFS supports data locality and already has the files distributed on datanodes, right? Spark could just process data on the workers. What about S3

History server and non-HDFS filesystems

2017-11-17 Thread Paul Mackles
e permissions on the files and skipping the ones which it thinks are not readable. The problem is that it's using a check that appears to be specific to HDFS, and so even though the files are definitely readable, it skips over them. Also, "FSHistoryProvider" is the only place this co

Change the owner of hdfs file being saved

2017-11-02 Thread Sunita Arvind
Hello Experts, I am required to use a specific user id to save files on a remote HDFS cluster. Remote in the sense that Spark jobs run on EMR and write to a CDH cluster. Hence I cannot change hdfs-site.xml etc. to point to the destination cluster. As a result I am using webhdfs to save the files

Re: Write to HDFS

2017-10-20 Thread Deepak Sharma
more than one partition like part-0, > part-1. I want to collect all of them into one file. > > > 2017-10-20 16:43 GMT+03:00 Marco Mistroni <mmistr...@gmail.com>: > >> Hi >> Could you just create an rdd/df out of what you want to save and store >>

Re: Write to HDFS

2017-10-20 Thread Marco Mistroni
you just create an rdd/df out of what you want to save and store it > in hdfs? > Hth > > On Oct 20, 2017 9:44 AM, "Uğur Sopaoğlu" <usopao...@gmail.com> wrote: > >> Hi all, >> >> In word count example, >> >> val textFile = sc.textFile(

Re: Write to HDFS

2017-10-20 Thread Uğur Sopaoğlu
part-0, part-1. I want to collect all of them into one file. 2017-10-20 16:43 GMT+03:00 Marco Mistroni <mmistr...@gmail.com>: > Hi > Could you just create an rdd/df out of what you want to save and store it > in hdfs? > Hth > > On Oct 20, 2017 9:44 AM, "

Re: Write to HDFS

2017-10-20 Thread Marco Mistroni
Hi Could you just create an rdd/df out of what you want to save and store it in hdfs? Hth On Oct 20, 2017 9:44 AM, "Uğur Sopaoğlu" <usopao...@gmail.com> wrote: > Hi all, > > In word count example, > > val textFile = sc.textFile("Sample.txt"

Write to HDFS

2017-10-20 Thread Uğur Sopaoğlu
Hi all, In word count example, val textFile = sc.textFile("Sample.txt") val counts = textFile.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://master:8020/user/ab
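The follow-ups in this thread are about getting a single output file instead of part-0, part-1, ...; coalescing to one partition before saving does that, at the cost of funnelling the data through a single task. A sketch against the same word-count example (the output path is a placeholder):

  val textFile = sc.textFile("Sample.txt")
  val counts = textFile.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)

  // One partition at save time => a single part-00000 under the output directory.
  counts.coalesce(1)
    .saveAsTextFile("hdfs://master:8020/user/demo/wordcount_single")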

Re: Reading from HDFS by increasing split size

2017-10-10 Thread Kanagha Kumar
map-reduce program (as Spark >> uses the same input format) >> >> On Tue, Oct 10, 2017 at 10:21 PM, Jörn Franke <jornfra...@gmail.com> >> wrote: >> >>> Write your own input format/datasource or split the file yourself >>> beforehand (not reco

Re: Reading from HDFS by increasing split size

2017-10-10 Thread Jörn Franke
; wrote: >>> Write your own input format/datasource or split the file yourself >>> beforehand (not recommended). >>> >>> > On 10. Oct 2017, at 09:14, Kanagha Kumar <kpra...@salesforce.com> wrote: >>> > >>> > Hi, >

Re: Reading from HDFS by increasing split size

2017-10-10 Thread ayan guha
>>> Write your own input format/datasource or split the file yourself >>> beforehand (not recommended). >>> >>> > On 10. Oct 2017, at 09:14, Kanagha Kumar <kpra...@salesforce.com> >>> wrote: >>> > >>> > Hi, >>> > >>

Re: Reading from HDFS by increasing split size

2017-10-10 Thread Kanagha Kumar
e or split the file yourself >> beforehand (not recommended). >> >> > On 10. Oct 2017, at 09:14, Kanagha Kumar <kpra...@salesforce.com> >> wrote: >> > >> > Hi, >> > >> > I'm trying to read a 60GB HDFS file using spark >> textFile("

Re: Reading from HDFS by increasing split size

2017-10-10 Thread ayan guha
<jornfra...@gmail.com> wrote: > Write your own input format/datasource or split the file yourself > beforehand (not recommended). > > > On 10. Oct 2017, at 09:14, Kanagha Kumar <kpra...@salesforce.com> wrote: > > > > Hi, > > > > I'm try

Re: Reading from HDFS by increasing split size

2017-10-10 Thread Jörn Franke
Write your own input format/datasource or split the file yourself beforehand (not recommended). > On 10. Oct 2017, at 09:14, Kanagha Kumar <kpra...@salesforce.com> wrote: > > Hi, > > I'm trying to read a 60GB HDFS file using spark textFile("hdfs_file_path", &g

Reading from HDFS by increasing split size

2017-10-10 Thread Kanagha Kumar
Hi, I'm trying to read a 60GB HDFS file using spark textFile("hdfs_file_path", minPartitions). How can I control the no. of tasks by increasing the split size? With the default split size of 250 MB, several tasks are created. But I would like to have a specific no. of tasks created while re
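Neither reply above spells out a configuration-only route, but a knob that is often used for this (an assumption on my part, not taken from the thread) is raising the minimum split size that FileInputFormat will produce, since minPartitions can only increase, never decrease, the partition count. A spark-shell sketch with a placeholder path:

  // Ask for ~1 GB splits instead of one split per HDFS block.
  sc.hadoopConfiguration.set(
    "mapreduce.input.fileinputformat.split.minsize",
    (1024L * 1024 * 1024).toString)

  val big = sc.textFile("hdfs:///user/demo/big_file")
  println(s"tasks = ${big.getNumPartitions}")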

Re: HDFS or NFS as a cache?

2017-10-02 Thread Miguel Morales
From: Steve Loughran [mailto:ste...@hortonworks.com] > > Sent: Saturday, September 30, 2017 6:10 AM > > To: JG Perrin <jper...@lumeris.com> > > Cc: Alexander Czech <alexander.cz...@googlemail.com>; > user@spark.apache.org > > Subject: Re: HDFS or NFS as a

Re: HDFS or NFS as a cache?

2017-10-02 Thread Marcelo Vanzin
> From: Steve Loughran [mailto:ste...@hortonworks.com] > Sent: Saturday, September 30, 2017 6:10 AM > To: JG Perrin <jper...@lumeris.com> > Cc: Alexander Czech <alexander.cz...@googlemail.com>; user@spark.apache.org > Subject: Re: HDFS or NFS as a cache? > > >

RE: HDFS or NFS as a cache?

2017-10-02 Thread JG Perrin
[mailto:ste...@hortonworks.com] Sent: Saturday, September 30, 2017 6:10 AM To: JG Perrin <jper...@lumeris.com> Cc: Alexander Czech <alexander.cz...@googlemail.com>; user@spark.apache.org Subject: Re: HDFS or NFS as a cache? On 29 Sep 2017, at 20:03, JG Perrin <jper...@lumeris.

RE: Error - Spark reading from HDFS via dataframes - Java

2017-10-02 Thread JG Perrin
<kpra...@salesforce.com> Cc: user @spark <user@spark.apache.org> Subject: Re: Error - Spark reading from HDFS via dataframes - Java Hi, Set the inferschema option to true in spark-csv. you may also want to set the mode option. See readme below https://github.com/databricks/spark-csv
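A minimal Scala sketch of that suggestion; on Spark 2.x the same options go directly on the built-in csv reader (the Java API is analogous), and the path plus the DROPMALFORMED mode value are just examples:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("csv-inferschema").getOrCreate()

  val df = spark.read
    .option("header", "true")         // first line holds column names
    .option("inferSchema", "true")    // sample the data to infer types instead of all-strings
    .option("mode", "DROPMALFORMED")  // the "mode" option mentioned in the reply
    .csv("hdfs:///user/demo/data.csv")

  df.printSchema()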

Re: Error - Spark reading from HDFS via dataframes - Java

2017-10-01 Thread Anastasios Zouzias
ing to read data from HDFS in spark as dataframes. Printing the schema, I see all columns are being read as strings. I'm converting it to RDDs and creating another dataframe by passing in the correct schema ( how the rows should be interpreted finally). I'm getting the followin

Error - Spark reading from HDFS via dataframes - Java

2017-09-30 Thread Kanagha Kumar
Hi, I'm trying to read data from HDFS in Spark as dataframes. Printing the schema, I see all columns are being read as strings. I'm converting it to RDDs and creating another dataframe by passing in the correct schema (how the rows should be interpreted finally). I'm getting the following error

Re: HDFS or NFS as a cache?

2017-09-30 Thread Steve Loughran
On 29 Sep 2017, at 20:03, JG Perrin <jper...@lumeris.com<mailto:jper...@lumeris.com>> wrote: You will collect in the driver (often the master) and it will save the data, so for saving, you will not have to set up HDFS. no, it doesn't work quite like that. 1. workers generat
