Re: HDFS or NFS as a cache?

2017-09-30 Thread Steve Loughran
hod: df.write.parquet('s3a://bucket/parquet') Now I want to setup a small cache for the parquet output. One output is about 12-15 GB in size. Would it be enough to setup a NFS-directory on the master, write the output to it and then move it to S3? Or should I setup a HDFS on the Master? Or
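A minimal sketch of the "stage on HDFS first, then push to S3" idea discussed in this thread; the input path, staging path and SparkSession setup are illustrative assumptions, not code from the thread:

    import org.apache.spark.sql.SparkSession

    object StageThenUpload {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("stage-then-upload").getOrCreate()

        val df = spark.read.json("hdfs:///input/events")   // hypothetical input

        // 1) Write the ~12-15 GB parquet output to cluster-local HDFS first.
        val staging = "hdfs:///tmp/parquet-staging"
        df.write.mode("overwrite").parquet(staging)

        // 2) Re-read the committed output and copy it to S3 through the s3a connector
        //    (alternatively: hadoop distcp hdfs:///tmp/parquet-staging s3a://bucket/parquet).
        spark.read.parquet(staging)
          .write.mode("overwrite").parquet("s3a://bucket/parquet")

        spark.stop()
      }
    }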

RE: HDFS or NFS as a cache?

2017-09-29 Thread JG Perrin
You will collect in the driver (often the master) and it will save the data, so for saving, you will not have to set up HDFS. From: Alexander Czech [mailto:alexander.cz...@googlemail.com] Sent: Friday, September 29, 2017 8:15 AM To: user@spark.apache.org Subject: HDFS or NFS as a cache? I have

Re: HDFS or NFS as a cache?

2017-09-29 Thread Alexander Czech
rough the parquet write method: >> >> df.write.parquet('s3a://bucket/parquet') >> >> Now I want to setup a small cache for the parquet output. One output is >> about 12-15 GB in size. Would it be enough to setup a NFS-directory on the >> master, write the output to

Re: HDFS or NFS as a cache?

2017-09-29 Thread Vadim Semenov
quet') > > Now I want to setup a small cache for the parquet output. One output is > about 12-15 GB in size. Would it be enough to setup a NFS-directory on the > master, write the output to it and then move it to S3? Or should I setup a > HDFS on the Master? Or should I even opt for an

HDFS or NFS as a cache?

2017-09-29 Thread Alexander Czech
output is about 12-15 GB in size. Would it be enough to setup a NFS-directory on the master, write the output to it and then move it to S3? Or should I setup a HDFS on the Master? Or should I even opt for an additional cluster running a HDFS solution on more than one node? thanks!

Persist DStream into a single file on HDFS

2017-09-28 Thread Mustafa Elbehery
Hi Folks, I am writing a pipeline which reads from Kafka, applies some transformations, then persists to HDFS. Obviously such an operation is not supported for DStream, since the *DStream.save(Path)* method considers the Path as a directory, not a file. Also using *repartition(1).mode
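A sketch (not from the thread) of one common workaround: write each micro-batch with foreachRDD and coalesce(1) so every interval produces a single part file, although each batch still gets its own output directory. The source and paths are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

    val conf = new SparkConf().setAppName("dstream-single-file")
    val ssc = new StreamingContext(conf, Seconds(30))

    val lines = ssc.socketTextStream("localhost", 9999)   // stand-in for the Kafka source

    lines.foreachRDD { (rdd, time: Time) =>
      if (!rdd.isEmpty()) {
        // coalesce(1) yields one part file per batch; saveAsTextFile still creates a directory
        rdd.coalesce(1).saveAsTextFile(s"hdfs:///data/out/batch-${time.milliseconds}")
      }
    }

    ssc.start()
    ssc.awaitTermination()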

Re: Failing jobs with Spark 2.2 running on Yarn with HDFS

2017-08-31 Thread Jan-Hendrik Zab
iltered.map(e => e.copy(date = > e.date.slice(0, 10) + > "T00:00:00.000-00:00")) .dropDuplicates(Array("src", "date", "dst")) > transformed.write .option("sep", "\t") >.option("header", "false")

Failing jobs with Spark 2.2 running on Yarn with HDFS

2017-08-18 Thread Jan-Hendrik Zab
quot;sep", "\t") .option("header", "false") .option("compression", "gzip") .mode(SaveMode.Append) .csv(config.output) The input data is roughly 2.1TB (~ 500 billion lines I think) and on HDFS. I'm honestly running out of ideas on h

Parquet error while saving in HDFS

2017-07-24 Thread unk1102
Hi, I am getting the following error and am not sure why. It seems like a race condition, but I don't use any threads; just one thread, which owns the Spark context, is writing to HDFS with one parquet partition. I am using Scala 2.10 and Spark 1.5.1. Please guide. Thanks in advance. java.io.IOException: The file

Re: Spark streaming persist to hdfs question

2017-06-25 Thread Naveen Madhire
We are also doing transformations; that's the reason for using Spark Streaming. Does Spark Streaming support tumbling windows? I was thinking I could use a window operation for writing into HDFS. Thanks On Sun, Jun 25, 2017 at 10:23 PM, ayan guha <guha.a...@gmail.com> wrote: > I would sugge
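DStream.window with equal window length and slide interval behaves as a tumbling window, so output can be flushed, say, every 10 minutes instead of every batch. A minimal sketch with a placeholder source and output path:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("windowed-hdfs-sink"), Seconds(60))

    val records = ssc.socketTextStream("localhost", 9999)   // stand-in for the Kafka stream

    // window(length, slide) with equal arguments acts as a tumbling window
    records.window(Minutes(10), Minutes(10))
      .saveAsTextFiles("hdfs:///data/stream/out")           // one output directory per 10-minute window

    ssc.start()
    ssc.awaitTermination()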

Re: Spark streaming persist to hdfs question

2017-06-25 Thread ayan guha
I would suggest using Flume, if possible, as it has built-in HDFS log rolling capabilities On Mon, Jun 26, 2017 at 1:09 PM, Naveen Madhire <vmadh...@umail.iu.edu> wrote: > Hi, > > I am using spark streaming with 1 minute duration to read data from kafka > topic, app

Spark streaming persist to hdfs question

2017-06-25 Thread Naveen Madhire
Hi, I am using spark streaming with 1 minute duration to read data from kafka topic, apply transformations and persist into HDFS. The application is creating a new directory every 1 minute with many partition files (= number of partitions). What parameter do I need to change/configure to persist

Re: Using YARN w/o HDFS

2017-06-23 Thread Steve Loughran
for HDFS in Azure. I think google cloud storage is similar, but haven't played with it. Ask google. You cannot do the same for S3 except on EMR and Amazon's premium emrfs:// offering, which adds the consistency layer. On 22 Jun 2017, at 00:50, Alaa Zubaidi (PDF) <alaa.zuba...@pdf.

Re: Flume DStream produces 0 records after HDFS node killed

2017-06-22 Thread N B
This issue got resolved. I was able to trace it to the fact that the driver program's pom.xml was pulling in Spark 2.1.1 which in turn was pulling in Hadoop 2.2.0. Explicitly adding dependencies on Hadoop libraries 2.7.3 resolves it. The following API in HDFS

Re: Using YARN w/o HDFS

2017-06-21 Thread Chen He
change your fs.defaultFS to point to the local file system and have a try On Wed, Jun 21, 2017 at 4:50 PM, Alaa Zubaidi (PDF) <alaa.zuba...@pdf.com> wrote: > Hi, > > Can we run Spark on YARN without installing HDFS? > If yes, where would HADOOP_CONF_DIR point to? > > Regard
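For illustration only (normally fs.defaultFS is set in core-site.xml under HADOOP_CONF_DIR rather than in code), a sketch of pointing Spark's Hadoop configuration at a non-HDFS default filesystem; all paths are assumptions:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("yarn-without-hdfs").getOrCreate()

    val hadoopConf = spark.sparkContext.hadoopConfiguration
    hadoopConf.set("fs.defaultFS", "file:///")           // local filesystem as the default FS
    // hadoopConf.set("fs.defaultFS", "s3a://my-bucket") // or an object store, if its connector is configured

    spark.read.textFile("file:///data/input.txt").show(5)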

Using YARN w/o HDFS

2017-06-21 Thread Alaa Zubaidi (PDF)
Hi, Can we run Spark on YARN without installing HDFS? If yes, where would HADOOP_CONF_DIR point to? Regards, -- *This message may contain confidential and privileged information. If it has been sent to you in error, please reply to advise the sender of the error and then immediately

Re: Flume DStream produces 0 records after HDFS node killed

2017-06-21 Thread N B
shine a light on > what could be going on. I turned on debug logging for > org.apache.spark.streaming.scheduler in the driver process and this is > what gets thrown in the logs and keeps throwing it even after the downed > HDFS node is restarted. Using Spark 2.1.1 and HDFS 2.7.3 he

Re: Flume DStream produces 0 records after HDFS node killed

2017-06-21 Thread yohann jardin
and this is what gets thrown in the logs and keeps throwing it even after the downed HDFS node is restarted. Using Spark 2.1.1 and HDFS 2.7.3 here. 2017-06-20 22:38:11,302 WARN JobGenerator ReceivedBlockTracker.logWarning - Exception thrown while writing record: BatchCleanupEvent(ArrayBuffer

Re: Flume DStream produces 0 records after HDFS node killed

2017-06-20 Thread N B
Ok some more info about this issue to see if someone can shine a light on what could be going on. I turned on debug logging for org.apache.spark.streaming.scheduler in the driver process and this is what gets thrown in the logs and keeps throwing it even after the downed HDFS node is restarted

Re: Flume DStream produces 0 records after HDFS node killed

2017-06-20 Thread N B
BTW, this is running on Spark 2.1.1. I have been trying to debug this issue and what I have found till now is that it is somehow related to the Spark WAL. The directory named /receivedBlockMetadata seems to stop getting written to after the point of an HDFS node being killed and restarted. I have

Flume DStream produces 0 records after HDFS node killed

2017-06-19 Thread N B
[socketAddress.size()]), StorageLevel.MEMORY_AND_DISK_SER(), *100*, *5*); The checkpoint directory is configured to be on an HDFS cluster and Spark workers have their SPARK_LOCAL_DIRS and SPARK_WORKER_DIR defined to be on their respective local filesystems. What we are seeing is some odd behavior

Re: What is the easiest way for an application to Query parquet data on HDFS?

2017-06-05 Thread Muthu Jayakumar
rstand How SparkSession can use Akka to communicate >> with SparkCluster? >> Let me use your initial requirement as a way to illustrate what I mean -- >> i.e, "I want my Micro service app to be able to query and access data on >> HDFS" >> In order to run a query say a DF quer

Re: What is the easiest way for an application to Query parquet data on HDFS?

2017-06-05 Thread kant kodali
Let me use your initial requirement as a way to illustrate what I mean -- > i.e, "I want my Micro service app to be able to query and access data on > HDFS" > In order to run a query say a DF query (equally possible with SQL as > well), you'll need a sparkSession to build a q

Re: What is the easiest way for an application to Query parquet data on HDFS?

2017-06-05 Thread Muthu Jayakumar
Hello Kant, >I still don't understand How SparkSession can use Akka to communicate with SparkCluster? Let me use your initial requirement as a way to illustrate what I mean -- i.e, "I want my Micro service app to be able to query and access data on HDFS" In order to run a query s
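A minimal sketch of the point being made here: the micro service itself builds a SparkSession connected to the cluster and issues the DataFrame/SQL query. The master URL, paths and column names are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("microservice-query")
      .master("spark://spark-master:7077")    // or yarn, depending on the deployment
      .getOrCreate()

    val events = spark.read.parquet("hdfs:///warehouse/events")
    events.createOrReplaceTempView("events")

    val result = spark.sql("SELECT user_id, count(*) AS n FROM events GROUP BY user_id")
    result.show(20)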

Re: What is the easiest way for an application to Query parquet data on HDFS?

2017-06-04 Thread kant kodali
Hi Muthu, I am actually using Play framework for my Micro service which uses Akka but I still don't understand How SparkSession can use Akka to communicate with SparkCluster? SparkPi or SparkPl? any link? Thanks!

Re: What is the easiest way for an application to Query parquet data on HDFS?

2017-06-04 Thread Muthu Jayakumar
:23 AM, Sandeep Nemuri <nhsande...@gmail.com> >> wrote: >> >>> Check out http://livy.io/ >>> >>> >>> On Sun, Jun 4, 2017 at 11:59 AM, kant kodali <kanth...@gmail.com> wrote: >>> >>>> Hi All, >>>> >>

Re: What is the easiest way for an application to Query parquet data on HDFS?

2017-06-04 Thread Sandeep Nemuri
hanks! > > On Sun, Jun 4, 2017 at 12:23 AM, Sandeep Nemuri <nhsande...@gmail.com> > wrote: > >> Check out http://livy.io/ >> >> >> On Sun, Jun 4, 2017 at 11:59 AM, kant kodali <kanth...@gmail.com> wrote: >> >>> Hi All, >>> &g

Re: What is the easiest way for an application to Query parquet data on HDFS?

2017-06-04 Thread kant kodali
wrote: > Check out http://livy.io/ > > > On Sun, Jun 4, 2017 at 11:59 AM, kant kodali <kanth...@gmail.com> wrote: > >> Hi All, >> >> I am wondering what is the easiest way for a Micro service to query data >> on HDFS? By easiest way I mean using mi

Re: What is the easiest way for an application to Query parquet data on HDFS?

2017-06-04 Thread Sandeep Nemuri
Check out http://livy.io/ On Sun, Jun 4, 2017 at 11:59 AM, kant kodali <kanth...@gmail.com> wrote: > Hi All, > > I am wondering what is the easiest way for a Micro service to query data > on HDFS? By easiest way I mean using minimal number of tools. > > Currentl

What is the easiest way for an application to Query parquet data on HDFS?

2017-06-04 Thread kant kodali
Hi All, I am wondering what is the easiest way for a Micro service to query data on HDFS? By easiest way I mean using minimal number of tools. Currently I use spark structured streaming to do some real time aggregations and store it in HDFS. But now, I want my Micro service app to be able

Re: Temp checkpoint directory for EMR (S3 or HDFS)

2017-05-30 Thread Asher Krim
checkpointDirectory); sparkContext.setCheckpointDir(checkpointPath); Asher Krim Senior Software Engineer On Tue, May 30, 2017 at 12:37 PM, Everett Anderson <ever...@nuna.com.invalid > wrote: > Still haven't found a --conf option. > > Regarding a temporary HDFS checkpoint directory, it looks lik

Re: Temp checkpoint directory for EMR (S3 or HDFS)

2017-05-30 Thread Everett Anderson
Still haven't found a --conf option. Regarding a temporary HDFS checkpoint directory, it looks like when using --master yarn, spark-submit supplies a SPARK_YARN_STAGING_DIR environment variable. Thus, one could do the following when creating a SparkSession: val checkpointPath = new Path
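Filling in the idea sketched above (treat the exact code as an assumption): derive a temporary checkpoint directory from the YARN staging directory that spark-submit exposes, with a fallback when the variable is absent:

    import org.apache.hadoop.fs.Path
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("emr-checkpoint").getOrCreate()

    val stagingDir = sys.env.getOrElse("SPARK_YARN_STAGING_DIR", "/tmp")
    val checkpointPath = new Path(stagingDir, "checkpoints").toString

    spark.sparkContext.setCheckpointDir(checkpointPath)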

Temp checkpoint directory for EMR (S3 or HDFS)

2017-05-26 Thread Everett Anderson
running jobs on AWS EMR (so on YARN + HDFS) and reading and writing non-transient data to S3. Two questions: 1. Is there a Spark --conf option to set the checkpoint directory? Somehow I couldn't find it, but surely it exists. 2. What's a good checkpoint directory for this use case? I imagine it'd

Spark Streaming: NullPointerException when restoring Spark Streaming job from hdfs/s3 checkpoint

2017-05-16 Thread Richard Moorhead
I'm having some difficulty reliably restoring a streaming job from a checkpoint. When restoring a streaming job constructed from the following snippet, I receive NullPointerExceptions when `map` is called on the restored RDD. lazy val ssc = StreamingContext.getOrCreate(checkpointDir,
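Not from the thread, but a common pattern when checkpoint restore throws NPEs: define all DStream operations inside the factory passed to getOrCreate, so nothing is captured lazily outside the restored context. Names and paths here are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/my-job"

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(new SparkConf().setAppName("restorable-job"), Seconds(10))
      ssc.checkpoint(checkpointDir)

      val input = ssc.socketTextStream("localhost", 9999)   // stand-in source
      input.map(_.toUpperCase)                              // all transformations defined in here
           .saveAsTextFiles("hdfs:///data/out/restorable")
      ssc
    }

    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()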

hbase + spark + hdfs

2017-05-08 Thread mathieu ferlay
Hi everybody. I'm totally new to Spark and I want to know one thing that I haven't managed to find out. I have a full Ambari install with hbase, Hadoop and spark. My code reads and writes in hdfs via hbase. Thus, as I understand it, all data stored is in bytes format in hdfs. Now, I know that it's possible

hbase + spark + hdfs

2017-05-05 Thread mathieu ferlay
Hi everybody. I'm totally new to Spark and I want to know one thing that I haven't managed to find out. I have a full Ambari install with hbase, Hadoop and spark. My code reads and writes in hdfs via hbase. Thus, as I understand it, all data stored is in bytes format in hdfs. Now, I know that it's possible

Re: RE: RE: RE: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread Ryan
Hi mo, I don't think it needs a shuffle because the bloom filter only depends on data within each row group, not the whole data. But the HAR solution seems nice. I've thought of combining small files together and storing the offsets; I wasn't aware HDFS provided such functionality. And after some

RE: RE: RE: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread 莫涛
ail.com> Sent: 2017-04-17 16:48:47 To: 莫涛 Cc: user Subject: Re: RE: RE: How to store 10M records in HDFS to speed up further filtering? how about the event timeline on executors? It seems adding more executors could help. 1. I found a jira (https://issues.apache.org/jira/browse/SPARK-11621) that state

RE: RE: RE: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread 莫涛
It's Hadoop Archive. https://hadoop.apache.org/docs/r1.2.1/hadoop_archives.html From: Alonso Isidoro Roman <alons...@gmail.com> Sent: 2017-04-20 17:03:33 To: 莫涛 Cc: Jörn Franke; user@spark.apache.org Subject: Re: RE: RE: How to store 10M records in HDFS to sp

Re: RE: RE: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread Alonso Isidoro Roman
path > list and 0.5 second per thread to read a record). > > Such performance is exactly what I expected: "only the requested BINARY > are scanned". > > Moreover, HAR provides direct access to each record via the hdfs shell > command. > > > Thank you very much!

RE: RE: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread 莫涛
t I expected: "only the requested BINARY are scanned". Moreover, HAR provides direct access to each record via the hdfs shell command. Thank you very much! From: Jörn Franke <jornfra...@gmail.com> Sent: 2017-04-17 22:37:48 To: 莫涛 Cc: user@spark.ap

Questions on HDFS with Spark

2017-04-18 Thread kant kodali
Hi All, I've been using spark standalone for a while and now it's time for me to install HDFS. If a spark worker goes down, the Spark master restarts the worker; similarly, if a datanode process goes down, it looks like it is not the namenode's job to restart the datanode, and if so, 1) should I use

Spark 2.1.0 hanging while writing a table in HDFS in parquet format

2017-04-18 Thread gae123
.n3.nabble.com/Spark-2-1-0-hanging-while-writing-a-table-in-HDFS-in-parquet-format-tp28611.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: RE: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread Jörn Franke
Yes, 5 MB is a difficult size: too small for HDFS, too big for parquet/orc. Maybe you can put the data in a HAR and store id, path in orc/parquet. > On 17. Apr 2017, at 10:52, 莫涛 <mo...@sensetime.com> wrote: > > Hi Jörn, > > > > I do think a 5 MB column is odd but

Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread ayan guha
One possibility is using Hive bucketed on the id column? Another option: build the index in hbase, i.e. store the id and hdfs path in hbase. This way your scans will be fast, and once you have the hdfs path pointers you can read the actual data from hdfs. On Mon, 17 Apr 2017 at 6:52 pm, 莫涛 <

RE: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread 莫涛
: user@spark.apache.org Subject: Re: How to store 10M records in HDFS to speed up further filtering? You need to sort the data by id, otherwise a situation can occur where the index does not work. Aside from this, it sounds odd to put a 5 MB column using those formats. This will also not be so efficient

Re: RE: RE: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread Ryan
epends on the distribution of the > given ID list. No partition could be skipped in the worst case. > > > Mo Tao > > > > -- > *From:* Ryan <ryan.hd@gmail.com> > *Sent:* 2017-04-17 15:42:46 > *To:* 莫涛 > *Cc:* user > *Subject:*

RE: RE: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread 莫涛
f the given ID list. No partition could be skipped in the worst case. Mo Tao From: Ryan <ryan.hd@gmail.com> Sent: 2017-04-17 15:42:46 To: 莫涛 Cc: user Subject: Re: RE: How to store 10M records in HDFS to speed up further filtering? 1. Per my understanding

Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread Jörn Franke
;BINARY") > .write... > > Thanks for any advice! > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/How-to-store-10M-records-in-HDFS-to-speed-up-further-filtering-tp28605.html > Sent from the Apache Spark

Re: RE: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread Ryan
Thanks very much! > > > Mo Tao > > -- > *From:* Ryan <ryan.hd@gmail.com> > *Sent:* 2017-04-17 14:32:00 > *To:* 莫涛 > *Cc:* user > *Subject:* Re: How to store 10M records in HDFS to speed up further filtering? > > you can build a searc

RE: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread 莫涛
ser Subject: Re: How to store 10M records in HDFS to speed up further filtering? you can build a search tree using ids within each partition to act like an index, or create a bloom filter to see if the current partition would have any hit. What's your expected qps and response time for the filter reques

Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread Ryan
kID = udf { ID: String => IDSet(ID) } > spark.read.orc("/path/to/whole/data") > .filter(checkID($"ID")) > .select($"ID", $"BINARY") > .write... > > Thanks for any advice! > > > > > -- > View this message in context:

How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread MoTao
ole/data") .filter(checkID($"ID")) .select($"ID", $"BINARY") .write... Thanks for any advice! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-store-10M-records-in-HDFS-to-speed-up-further-filtering-tp28605.html S

Re: Does Spark uses its own HDFS client?

2017-04-07 Thread Jörn Franke
Maybe using ranger or sentry would be the better choice to intercept those calls? > On 7. Apr 2017, at 16:32, Alvaro Brandon <alvarobran...@gmail.com> wrote: > > I was going through the SparkContext.textFile() and I was wondering at that > point does Spark communicates wit

Re: Does Spark uses its own HDFS client?

2017-04-07 Thread Steve Loughran
On 7 Apr 2017, at 15:32, Alvaro Brandon <alvarobran...@gmail.com<mailto:alvarobran...@gmail.com>> wrote: I was going through the SparkContext.textFile() and I was wondering at that point does Spark communicates with HDFS. Since when you download Spark binaries you also specif

Does Spark uses its own HDFS client?

2017-04-07 Thread Alvaro Brandon
I was going through the SparkContext.textFile() and I was wondering at that point does Spark communicates with HDFS. Since when you download Spark binaries you also specify the Hadoop version you will use, I'm guessing it has its own client that calls HDFS wherever you specify

Re: reading snappy eventlog files from hdfs using spark

2017-04-07 Thread Jörn Franke
app - I get a snappy library not found > error. I am confused as to how spark can write the eventlog in snappy format > without an error, but reading fails with the above error. > > Any help in unblocking myself to read snappy eventlog files from hdfs using > spark? > > > > -

Re: reading snappy eventlog files from hdfs using spark

2017-04-07 Thread Jacek Laskowski
error. Any help in unblocking myself to read snappy eventlog files from hdfs using spark? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/reading-snappy-eventlog-files-from-hdfs-using-spark-tp28577.html Sent from the Apache Spark User List mailing list archi

reading snappy eventlog files from hdfs using spark

2017-04-07 Thread satishl
without an error, but reading fails with the above error. Any help in unblocking myself to read snappy eventlog files from hdfs using spark? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/reading-snappy-eventlog-files-from-hdfs-using-spark-tp28577.html Sent

spark 2.1.0 foreachRDD write slowly to HDFS

2017-03-26 Thread 446463...@qq.com
Hi All: when I use spark-streaming to consume kafka topic data into HDFS with spark 2.1.0, I find it's slow. Why? environment: CDH5.8 scala version: 2.11 kafka version 0.10.1.1 spark-streaming-kafka: spark-streaming-kafka-0.8_2.11 should I replace the spark-streaming-kafka with spark-streaming

Combining reading from Kafka and HDFS w/ Spark Streaming

2017-03-01 Thread Mike Thomsen
(Sorry if this is a duplicate. I got a strange error message when I first tried to send it earlier) I want to pull HDFS paths from Kafka and build text streams based on those paths. I currently have: val lines = KafkaUtils.createStream(/* params here */).map(_._2) val buffer = new ArrayBuffer

Combining reading from Kafka and HDFS w/ Spark Streaming

2017-03-01 Thread Mike Thomsen
I want to pull HDFS paths from Kafka and build text streams based on those paths. I currently have: val lines = KafkaUtils.createStream(/* params here */).map(_._2) val buffer = new ArrayBuffer[String]() lines.foreachRDD(rdd => { if (!rdd.partitions.isEmpty) { rdd.collect().foreach(l
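A sketch of the intent described here (HDFS paths arriving on a stream, their contents read per batch); the Kafka setup is omitted and a socket source stands in for it:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("paths-from-kafka"), Seconds(60))

    val pathStream = ssc.socketTextStream("localhost", 9999)   // each record is an HDFS path

    pathStream.foreachRDD { rdd =>
      val paths = rdd.collect()                                // small: just path strings
      if (paths.nonEmpty) {
        // textFile accepts a comma-separated list of paths
        val text = ssc.sparkContext.textFile(paths.mkString(","))
        println(s"read ${text.count()} lines from ${paths.length} paths")
      }
    }

    ssc.start()
    ssc.awaitTermination()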

RE: RE: spark append files to the same hdfs dir issue for LeaseExpiredException

2017-03-01 Thread Triones,Deng(vip.com)
Thanks for your email. My situation is: there is a hive table partitioned by five minutes, and I want to write data every 30s into the hdfs location where the table is located. So when the first batch is delayed, the next batch may have the chance to touch the _SUCCESS file at the same

Re: RE: spark append files to the same hdfs dir issue for LeaseExpiredException

2017-02-28 Thread Charles O. Bajomo
"Charles O. Bajomo" <charles.baj...@pretechconsulting.co.uk> Cc: "user" <user@spark.apache.org>, d...@spark.apache.org Sent: Tuesday, 28 February, 2017 10:47:47 Subject: RE: spark append files to the same hdfs dir issue for LeaseExpiredException I am writing data to hdfs

RE: spark append files to the same hdfs dir issue for LeaseExpiredException

2017-02-28 Thread Triones,Deng(vip.com)
I am writing data to an hdfs file, and the hdfs dir is a hive partition dir. Hive does not support sub-dirs. For example, my partition folder is ***/dt=20170224/hm=1400, which means I need to write all the data between 1400 and 1500 to the same folder. From: Charles O. Bajomo

Re: spark append files to the same hdfs dir issue for LeaseExpiredException

2017-02-28 Thread Charles O. Bajomo
. Kind Regards From: "Triones,Deng(vip.com)" <triones.d...@vipshop.com> To: "user" <user@spark.apache.org>, d...@spark.apache.org Sent: Tuesday, 28 February, 2017 09:35:00 Subject: spark append files to the same hdfs dir issue for LeaseExpiredException

spark append files to the same hdfs dir issue for LeaseExpiredException

2017-02-28 Thread Triones,Deng(vip.com)
Hi dev and users, I am running spark streaming (spark version 2.0.2) to write files to hdfs. When my spark.streaming.concurrentJobs is more than one, like 20, I meet the exception below. We know that when the batch finishes, there will be a _SUCCESS file. As I guess

how to give hdfs file path as argument to spark-submit

2017-02-17 Thread nancy henry
mFile(args(0)).mkString) System.out.println("Okay") } } This is my spark program and my hivescript is at args(0) $SPARK_HOME/bin/./spark-submit --class com.spark.test.Step1 --master yarn --deploy-mode cluster com.spark.test-0.1-SNAPSHOT.jar hdfs://spirui-d86-f03-06:9229/samp
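A sketch only: Source.fromFile cannot read an hdfs:// URI, so the script path passed as args(0) can instead be opened through the Hadoop FileSystem API. The class and argument layout mirror the post above, but the body is illustrative:

    import java.net.URI
    import scala.io.Source
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SparkSession

    object Step1 {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("run-hive-script").enableHiveSupport().getOrCreate()

        val scriptPath = args(0)                 // e.g. hdfs://namenode:9000/scripts/my.hql
        val fs = FileSystem.get(new URI(scriptPath), spark.sparkContext.hadoopConfiguration)
        val script = Source.fromInputStream(fs.open(new Path(scriptPath))).mkString

        // run each statement of the hive script
        script.split(";").map(_.trim).filter(_.nonEmpty).foreach(spark.sql)
        println("Okay")
        spark.stop()
      }
    }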

Re: Remove dependence on HDFS

2017-02-13 Thread Calvin Jia
Hi Ben, You can replace HDFS with a number of storage systems since Spark is compatible with other storage like S3. This would allow you to scale your compute nodes solely for the purpose of adding compute power and not disk space. You can deploy Alluxio on your compute nodes to offset

Re: Remove dependence on HDFS

2017-02-13 Thread Saisai Shao
IIUC Spark doesn't strongly bind to HDFS; it uses a common FileSystem layer which supports different FS implementations, and HDFS is just one option. You could also use S3 as a backend FS; from Spark's point of view it is transparent to different FS implementations. On Sun, Feb 12, 2017 at 5:32 PM, ayan

Re: Remove dependence on HDFS

2017-02-12 Thread ayan guha
2, 2017 at 4:29 AM Benjamin Kim <bbuil...@gmail.com> wrote: > > Has anyone got some advice on how to remove the reliance on HDFS for > storing persistent data. We have an on-premise Spark cluster. It seems like > a waste of resources to keep adding nodes because of a lack of storage &

Re: Remove dependence on HDFS

2017-02-12 Thread Sean Owen
Data has to live somewhere -- how do you not add storage but store more data? Alluxio is not persistent storage, and S3 isn't on your premises. On Sun, Feb 12, 2017 at 4:29 AM Benjamin Kim <bbuil...@gmail.com> wrote: > Has anyone got some advice on how to remove the relianc

Re: Remove dependence on HDFS

2017-02-12 Thread Jörn Franke
You have to carefully choose whether your strategy makes sense given your users' workloads. Hence, I am not sure your reasoning makes sense. However, you can, for example, install OpenStack Swift as an object store and use this as storage. HDFS in this case can be used as a temporary store

Remove dependence on HDFS

2017-02-11 Thread Benjamin Kim
Has anyone got some advice on how to remove the reliance on HDFS for storing persistent data. We have an on-premise Spark cluster. It seems like a waste of resources to keep adding nodes because of a lack of storage space only. I would rather add more powerful nodes due to the lack

HDFS Shell tool

2017-02-10 Thread Vitásek , Ladislav
Hello Spark fans, I would like to inform you about our tool, which we want to share with the big data community. I think it can also be handy for Spark users. We created a new utility - HDFS Shell to work with HDFS data more easily. https://github.com/avast/hdfs-shell *Feature highlights* - HDFS DFS command

Re: can I use Spark Standalone with HDFS but no YARN

2017-02-03 Thread kant kodali
I have 3 Spark Masters colocated with ZK nodes and 2 Worker nodes, so my NameNodes are the same nodes as my Spark Masters and DataNodes are the same nodes as my Spark Workers. Is that correct? How do I set up HDFS with zookeeper? On Fri, Feb 3, 2017 at 10:27 PM, Mark Hamstra &l

Re: can I use Spark Standalone with HDFS but no YARN

2017-02-03 Thread kant kodali
On Fri, Feb 3, 2017 at 10:27 PM, Mark Hamstra <m...@clearstorydata.com> wrote: > yes > > On Fri, Feb 3, 2017 at 10:08 PM, kant kodali <kanth...@gmail.com> wrote: > >> can I use Spark Standalone with HDFS but no YARN? >> >> Thanks! >> > >

Re: can I use Spark Standalone with HDFS but no YARN

2017-02-03 Thread Mark Hamstra
yes On Fri, Feb 3, 2017 at 10:08 PM, kant kodali <kanth...@gmail.com> wrote: > can I use Spark Standalone with HDFS but no YARN? > > Thanks! >

can I use Spark Standalone with HDFS but no YARN

2017-02-03 Thread kant kodali
can I use Spark Standalone with HDFS but no YARN? Thanks!

Re: Writing Spark SQL output in Local and HDFS path

2017-01-19 Thread smartzjp
not member of org.apache.spark.sql.DataFrameWriter. Regards Prasad On Thu, Jan 19, 2017 at 4:35 PM, smartzjp <zjp_j...@163.com> wrote: Because the reducer count will not be one, it will output a folder on HDFS; you can use “result.write.csv(foldPath)”. -- Hi, Can anyone please

Re: Writing Spark SQL output in Local and HDFS path

2017-01-19 Thread Ravi Prasad
e not one, it will output a folder on > the HDFS; you can use “result.write.csv(foldPath)”. > > > > -- > > Hi, > Can anyone please let us know how to write the output of the Spark SQL > in > Local and HDFS path using Scala code. > > *Code :-* >

Re: Writing Spark SQL output in Local and HDFS path

2017-01-19 Thread smartzjp
Because the reducer count will not be one, it will output a folder on HDFS; you can use “result.write.csv(foldPath)”. -- Hi, Can anyone please let us know how to write the output of the Spark SQL in Local and HDFS path using Scala code. Code :- scala> val res

Writing Spark SQL output in Local and HDFS path

2017-01-19 Thread Ravi Prasad
Hi, Can anyone please let us know how to write the output of Spark SQL to local and HDFS paths using Scala code. *Code :-* scala> val result = sqlContext.sql("select empno , name from emp"); scala > result.show(); If I give the command result.show() then it will
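A sketch of what the replies suggest (write.csv produces a directory of part files rather than a single file); Spark 2.x API and placeholder paths assumed:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("write-csv").enableHiveSupport().getOrCreate()
    val result = spark.sql("select empno, name from emp")

    // HDFS output: a folder containing part-*.csv files
    result.write.option("header", "true").csv("hdfs:///user/output/emp_csv")

    // local filesystem output (the path must be visible to the process doing the writing)
    result.coalesce(1).write.option("header", "true").csv("file:///tmp/emp_csv")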

Re: AVRO Append HDFS using saveAsNewAPIHadoopFile

2017-01-09 Thread Santosh.B
. On Mon, Jan 9, 2017 at 3:17 PM, Jörn Franke <jornfra...@gmail.com> wrote: > Avro itself supports it, but I am not sure if this functionality is > available through the Spark API. Just out of curiosity, if your use case is > only write to HDFS then you might simply use flume. >

Re: AVRO Append HDFS using saveAsNewAPIHadoopFile

2017-01-09 Thread Jörn Franke
Avro itself supports it, but I am not sure if this functionality is available through the Spark API. Just out of curiosity, if your use case is only write to HDFS then you might simply use flume. > On 9 Jan 2017, at 09:58, awkysam <contactsanto...@gmail.com> wrote: > > Cu

AVRO Append HDFS using saveAsNewAPIHadoopFile

2017-01-09 Thread awkysam
Currently for our project we are collecting data and pushing it into Kafka, with messages in Avro format. We need to push this data into HDFS and we are using SparkStreaming, and in HDFS it is also stored in Avro format. We are partitioning the data per day. So when we write data into HDFS we

Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-28 Thread Miguel Morales
If you're using Kubernetes you can group spark and hdfs to run in the same stack. Meaning they'll basically run in the same network space and share ips. Just gotta make sure there's no port conflicts. On Wed, Dec 28, 2016 at 5:07 AM, Karamba <phantom...@web.de> wrote: > > Good

Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-28 Thread Karamba
Good idea, thanks! But unfortunately that's not possible. All containers are connected to an overlay network. Is there any other possibility to tell spark that it is on the same *NODE* as an hdfs data node? On 28.12.2016 12:00, Miguel Morales wrote: > It might have to do with your container

Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-28 Thread Miguel Morales
g! > > >> Although the Spark task scheduler is aware of rack-level data locality, it >> seems that only YARN implements the support for it. > > This explains why the script that I configured in core-site.xml > topology.script.file.name is not called in by the spark

Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-28 Thread Karamba
spark container. But at the time of reading from hdfs in a spark program, the script is called in my hdfs namenode container. > However, node-level locality can still work for Standalone. I have a couple of physical hosts that run spark and hdfs docker containers. How does spark standalone know t

Re: Spark Dataframe: Save to hdfs is taking long time

2016-12-28 Thread Raju Bairishetti
..@gmail.com> >> *Sent:* Thursday, December 15, 2016 7:54:27 PM >> *To:* user @spark >> *Subject:* Spark Dataframe: Save to hdfs is taking long time >> >> Hi, >> >> I am using issue while saving the dataframe back to HDFS. It's taking >>

Re: [Spark 2.0.2 HDFS]: no data locality

2016-12-27 Thread Sun Rui
, which means executors are available on a subset of the cluster nodes? > On Dec 27, 2016, at 01:39, Karamba <phantom...@web.de> wrote: > > Hi, > > I am running a couple of docker hosts, each with an HDFS and a spark > worker in a spark standalone cluster. > In

[Spark 2.0.2 HDFS]: no data locality

2016-12-26 Thread Karamba
Hi, I am running a couple of docker hosts, each with an HDFS and a spark worker in a spark standalone cluster. In order to get data locality awareness, I would like to configure Racks for each host, so that a spark worker container knows from which hdfs node container it should load its data

Re: Ingesting data in elasticsearch from hdfs using spark , cluster setup and usage

2016-12-23 Thread Anastasios Zouzias
for storing the elastic indices; this will boost your elastic cluster performance. Best, Anastasios On Thu, Dec 22, 2016 at 6:35 PM, Rohit Verma <rohit.ve...@rokittech.com> wrote: > I am setting up a spark cluster. I have hdfs data nodes and spark master > nodes on same instances. To add e

Re: Ingesting data in elasticsearch from hdfs using spark , cluster setup and usage

2016-12-22 Thread Rohit Verma
ites... > > One more thing, make sure you have enough network bandwidth... > > Regards, > > Yang > > Sent from my iPhone > >> On Dec 22, 2016, at 12:35 PM, Rohit Verma <rohit.ve...@rokittech.com> wrote: >> >> I am setting up a spark cluste

Re: Ingesting data in elasticsearch from hdfs using spark , cluster setup and usage

2016-12-22 Thread genia...@gmail.com
from my iPhone > On Dec 22, 2016, at 12:35 PM, Rohit Verma <rohit.ve...@rokittech.com> wrote: > > I am setting up a spark cluster. I have hdfs data nodes and spark master > nodes on same instances. To add elasticsearch to this cluster, should I spawn > es on different machi

Ingesting data in elasticsearch from hdfs using spark , cluster setup and usage

2016-12-22 Thread Rohit Verma
I am setting up a spark cluster. I have hdfs data nodes and spark master nodes on same instances. To add elasticsearch to this cluster, should I spawn es on a different machine or on the same machine? I have only 12 machines: 1 master (spark and hdfs), 8 spark workers and hdfs data nodes; I can use 3

Re: Spark Dataframe: Save to hdfs is taking long time

2016-12-15 Thread KhajaAsmath Mohammed
, December 15, 2016 7:54:27 PM > *To:* user @spark > *Subject:* Spark Dataframe: Save to hdfs is taking long time > > Hi, > > I am having an issue while saving the dataframe back to HDFS. It's taking a long > time to run. > > val results_dataframe = sqlContext.sql("se

Re: Spark Dataframe: Save to hdfs is taking long time

2016-12-15 Thread Felix Cheung
What is the format? From: KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> Sent: Thursday, December 15, 2016 7:54:27 PM To: user @spark Subject: Spark Dataframe: Save to hdfs is taking long time Hi, I am having an issue while saving the dataframe back to HDFS

Spark Dataframe: Save to hdfs is taking long time

2016-12-15 Thread KhajaAsmath Mohammed
Hi, I am having an issue while saving the dataframe back to HDFS. It's taking a long time to run. val results_dataframe = sqlContext.sql("select gt.*,ct.* from PredictTempTable pt,ClusterTempTable ct,GamificationTempTable gt where gt.vin=pt.vin and pt.cluster=ct.cluster") results_datafram
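Not from the thread, just a common first step when a save is slow: control the number of output files and write a splittable columnar format explicitly. The SQL is the one quoted above; the output path and partition count are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("slow-save-check").enableHiveSupport().getOrCreate()

    val results = spark.sql(
      "select gt.*, ct.* from PredictTempTable pt, ClusterTempTable ct, GamificationTempTable gt " +
      "where gt.vin = pt.vin and pt.cluster = ct.cluster")

    results
      .repartition(200)                    // tune to cluster size and output volume
      .write.mode("overwrite")
      .parquet("hdfs:///output/gamification")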
