You will collect in the driver (often the master) and it will save the data, so
for saving, you will not have to set up HDFS.

From: Alexander Czech [mailto:alexander.cz...@googlemail.com]
Sent: Friday, September 29, 2017 8:15 AM
To: user@spark.apache.org
Subject: HDFS or NFS as a cache?

> I have […] through the parquet write method:
>
> df.write.parquet('s3a://bucket/parquet')
>
> Now I want to set up a small cache for the parquet output. One output is
> about 12-15 GB in size. Would it be enough to set up an NFS directory on the
> master, write the output to it and then move it to S3? Or should I set up
> HDFS on the master? Or should I even opt for an additional cluster running
> an HDFS solution on more than one node?
>
> thanks!
Hi Folks,

I am writing a pipeline which reads from Kafka, applies some transformations,
then persists to HDFS.

Obviously such an operation is not supported on DStream, since the
*DStream.save(path)* method considers the path as a directory, not a file.
Also using *repartition(1).mode* […]
> filtered.map(e => e.copy(date = e.date.slice(0, 10) + "T00:00:00.000-00:00"))
>   .dropDuplicates(Array("src", "date", "dst"))
>
> transformed.write
>   .option("sep", "\t")
>   .option("header", "false")
>   .option("compression", "gzip")
>   .mode(SaveMode.Append)
>   .csv(config.output)
The input data is roughly 2.1 TB (~500 billion lines, I think) and lives on HDFS.
I'm honestly running out of ideas.
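As an aside, the day-truncation-plus-dedup step in the snippet above can be sketched in plain Python (not Spark). Field names follow the snippet; the sample rows are invented for illustration:

```python
def truncate_to_day(ts: str) -> str:
    """Keep only YYYY-MM-DD and pin the time to midnight UTC, as in the snippet."""
    return ts[:10] + "T00:00:00.000-00:00"

def dedupe(rows):
    """Keep one row per (src, truncated date, dst), preserving first occurrence."""
    seen = set()
    out = []
    for src, ts, dst in rows:
        key = (src, truncate_to_day(ts), dst)
        if key not in seen:
            seen.add(key)
            out.append(key)
    return out

rows = [
    ("a", "2017-06-01T10:11:12.000-00:00", "b"),
    ("a", "2017-06-01T23:59:59.000-00:00", "b"),  # same day -> dropped as duplicate
    ("a", "2017-06-02T00:00:01.000-00:00", "b"),  # next day -> kept
]
print(dedupe(rows))
```

On real data this logic runs distributed via `dropDuplicates`; the sketch only shows what the key is.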
Hi, I am getting the following error and I'm not sure why. It seems like a
race condition, but I don't use any threads: just one thread, which owns the
Spark context, is writing to HDFS with one parquet partition. I am using
Scala 2.10 and Spark 1.5.1. Please guide. Thanks in advance.

java.io.IOException: The file […]
We are also doing transformations; that's the reason for using Spark
Streaming. Does Spark Streaming support tumbling windows? I was thinking I
could use a window operation for writing into HDFS.

Thanks
I would suggest using Flume, if possible, as it has built-in HDFS log-rolling
capabilities.
On Mon, Jun 26, 2017 at 1:09 PM, Naveen Madhire <vmadh...@umail.iu.edu> wrote:
Hi,

I am using Spark Streaming with a 1-minute batch duration to read data from a
Kafka topic, apply transformations and persist into HDFS.

The application is creating a new directory every minute, with many partition
files (= number of partitions). What parameter should I change/configure to
persist […]
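A rough sketch of why this happens: each micro-batch writes to a directory derived from its batch time, so a 1-minute interval yields one directory per minute, while coarsening the path key groups many batches into one directory. The path shapes and function below are hypothetical, not a Spark API:

```python
from datetime import datetime, timezone

def output_dir(base: str, batch_ms: int, granularity: str = "minute") -> str:
    """Map a batch timestamp (ms since epoch) to an output directory."""
    t = datetime.fromtimestamp(batch_ms / 1000, tz=timezone.utc)
    if granularity == "hour":
        return f"{base}/dt={t:%Y%m%d}/hr={t:%H}"
    return f"{base}/dt={t:%Y%m%d}/min={t:%H%M}"

# two batches one minute apart land in different minute-directories...
a = output_dir("/data/out", 1498420800000)  # 2017-06-25 20:00:00 UTC
b = output_dir("/data/out", 1498420860000)  # 2017-06-25 20:01:00 UTC
# ...but in the same hour-directory
print(a, b, output_dir("/data/out", 1498420800000, "hour"))
```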
[…] for HDFS in Azure. I think Google Cloud Storage is similar, but I haven't
played with it. Ask Google. You cannot do the same for S3 except on EMR and
Amazon's premium emrfs:// offering, which adds the consistency layer.

On 22 Jun 2017, at 00:50, Alaa Zubaidi (PDF) <alaa.zuba...@pdf.com> wrote:
This issue got resolved. I was able to trace it to the fact that the driver
program's pom.xml was pulling in Spark 2.1.1, which in turn was pulling in
Hadoop 2.2.0. Explicitly adding dependencies on the Hadoop 2.7.3 libraries
resolves it.

The following API in HDFS […]
Change your fs.defaultFS to point to the local file system and give it a try.

On Wed, Jun 21, 2017 at 4:50 PM, Alaa Zubaidi (PDF) <alaa.zuba...@pdf.com>
wrote:
> Hi,
>
> Can we run Spark on YARN without installing HDFS?
> If yes, where would HADOOP_CONF_DIR point to?
>
> Regards
Hi,

Can we run Spark on YARN without installing HDFS?
If yes, where would HADOOP_CONF_DIR point to?

Regards,
Ok, some more info about this issue, to see if someone can shine a light on
what could be going on. I turned on debug logging for
org.apache.spark.streaming.scheduler in the driver process, and this is what
gets thrown in the logs, and keeps getting thrown even after the downed HDFS
node is restarted. Using Spark 2.1.1 and HDFS 2.7.3 here.

2017-06-20 22:38:11,302 WARN JobGenerator ReceivedBlockTracker.logWarning -
Exception thrown while writing record: BatchCleanupEvent(ArrayBuffer […]

BTW, this is running on Spark 2.1.1.
I have been trying to debug this issue, and what I have found so far is that
it is somehow related to the Spark WAL. The directory named
/receivedBlockMetadata seems to stop getting written to after the point where
an HDFS node is killed and restarted. I have […]
[…][socketAddress.size()]), StorageLevel.MEMORY_AND_DISK_SER(), 100, 5);

The checkpoint directory is configured to be on an HDFS cluster, and the Spark
workers have their SPARK_LOCAL_DIRS and SPARK_WORKER_DIR set to their
respective local filesystems.
What we are seeing is some odd behavior
Hello Kant,

> I still don't understand how SparkSession can use Akka to communicate with
> SparkCluster?

Let me use your initial requirement as a way to illustrate what I mean --
i.e., "I want my Micro service app to be able to query and access data on
HDFS".

In order to run a query, say a DF query (equally possible with SQL as well),
you'll need a SparkSession to build a q[…]
Hi Muthu,

I am actually using the Play framework for my Micro service, which uses Akka,
but I still don't understand how SparkSession can use Akka to communicate with
SparkCluster? SparkPi or SparkPl? Any link?

Thanks!
Check out http://livy.io/
On Sun, Jun 4, 2017 at 11:59 AM, kant kodali <kanth...@gmail.com> wrote:
Hi All,

I am wondering what is the easiest way for a Micro service to query data on
HDFS? By easiest way, I mean using the minimal number of tools.

Currently I use Spark Structured Streaming to do some real-time aggregations
and store them in HDFS. But now I want my Micro service app to be able […]
[…]checkpointDirectory);
sparkContext.setCheckpointDir(checkpointPath);

Asher Krim
Senior Software Engineer

On Tue, May 30, 2017 at 12:37 PM, Everett Anderson <ever...@nuna.com.invalid>
wrote:
Still haven't found a --conf option.

Regarding a temporary HDFS checkpoint directory, it looks like when using
--master yarn, spark-submit supplies a SPARK_YARN_STAGING_DIR environment
variable. Thus, one could do the following when creating a SparkSession:

val checkpointPath = new Path([…]
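The idea above can be sketched outside Spark: derive a per-application checkpoint directory from the SPARK_YARN_STAGING_DIR environment variable that spark-submit sets under --master yarn. The subdirectory name and the simulated value below are invented for illustration:

```python
import os

def checkpoint_dir(subdir: str = "checkpoints") -> str:
    """Build a checkpoint path under the YARN staging directory, if present."""
    staging = os.environ.get("SPARK_YARN_STAGING_DIR")
    if staging is None:
        raise RuntimeError("not running under YARN, or variable not exported")
    return staging.rstrip("/") + "/" + subdir

# simulate what spark-submit would set; the value is made up
os.environ["SPARK_YARN_STAGING_DIR"] = "hdfs:///user/me/.sparkStaging/app_1"
print(checkpoint_dir())
```

Since YARN cleans the staging directory up after the application ends, such a checkpoint directory is naturally temporary.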
[…] running jobs on AWS EMR (so on YARN + HDFS) and reading and writing
non-transient data to S3.

Two questions:

1. Is there a Spark --conf option to set the checkpoint directory? Somehow I
couldn't find it, but surely it exists.
2. What's a good checkpoint directory for this use case? I imagine it'd […]
I'm having some difficulty reliably restoring a streaming job from a
checkpoint. When restoring a streaming job constructed from the following
snippet, I receive NullPointerExceptions when `map` is called on the restored
RDD.

lazy val ssc = StreamingContext.getOrCreate(checkpointDir, […]
Hi everybody.

I'm totally new to Spark and I want to know one thing that I have not managed
to find. I have a full Ambari install with HBase, Hadoop and Spark. My code
reads and writes in HDFS via HBase. Thus, as I understood, all data is stored
in bytes format in HDFS. Now, I know that it's possible […]
Hi mo,

I don't think it needs a shuffle, because the bloom filter only depends on
data within each row group, not the whole dataset. But the HAR solution seems
nice. I've thought of combining small files together and storing the offsets;
I wasn't aware that HDFS provided such functionality. And after some […]
[…]ail.com>
Sent: 17 April 2017 16:48:47
To: 莫涛
Cc: user
Subject: Re: Re: Re: How to store 10M records in HDFS to speed up further
filtering?

How about the event timeline on the executors? It seems adding more executors
could help.

1. I found a JIRA (https://issues.apache.org/jira/browse/SPARK-11621) that
states […]
It's a Hadoop archive.
https://hadoop.apache.org/docs/r1.2.1/hadoop_archives.html

From: Alonso Isidoro Roman <alons...@gmail.com>
Sent: 20 April 2017 17:03:33
To: 莫涛
Cc: Jörn Franke; user@spark.apache.org
Subject: Re: Re: Re: How to store 10M records in HDFS to sp[…]
[…] path list and 0.5 second per thread to read a record).

Such performance is exactly what I expected: "only the requested BINARY are
scanned".

Moreover, HAR provides direct access to each record via hdfs shell commands.

Thank you very much!

From: Jörn Franke <jornfra...@gmail.com>
Sent: 17 April 2017 22:37:48
To: 莫涛
Cc: user@spark.ap[…]
Hi All,

I've been using Spark standalone for a while, and now it's time for me to
install HDFS. If a Spark worker goes down, the Spark master restarts the
worker; similarly, if a datanode process goes down, it looks like it is not
the namenode's job to restart the datanode. If so, 1) should I use […]
.n3.nabble.com/Spark-2-1-0-hanging-while-writing-a-table-in-HDFS-in-parquet-format-tp28611.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Yes, 5 MB is a difficult size: too small for HDFS, too big for Parquet/ORC.
Maybe you can put the data in a HAR and store the id and path in ORC/Parquet.
> On 17. Apr 2017, at 10:52, 莫涛 <mo...@sensetime.com> wrote:
>
> Hi Jörn,
>
> I do think a 5 MB column is odd but […]
One possibility is using Hive with bucketing on the id column.

Another option: build the index in HBase, i.e. store the id and the HDFS path
in HBase. This way your scans will be fast, and once you have the HDFS path
pointers you can read the actual data from HDFS.

On Mon, 17 Apr 2017 at 6:52 pm, 莫涛 <[…]
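A minimal stand-in for the index idea above: a small key/value store (HBase in the suggestion; a plain dict here) maps record id to HDFS path, so lookups return path pointers without scanning the binary data. All ids and paths below are made up:

```python
# toy id -> HDFS-path index; HBase would play this role in practice
index = {}

def put(record_id: str, hdfs_path: str):
    """Register where a record's binary payload lives."""
    index[record_id] = hdfs_path

def lookup(ids):
    """Return path pointers for the ids we know about; unknown ids are skipped."""
    return {i: index[i] for i in ids if i in index}

put("id42", "hdfs:///data/blobs/part-00017")
put("id43", "hdfs:///data/blobs/part-00018")
print(lookup(["id42", "id99"]))
```

The point of the design is that the fast scan happens over the small index, and only the matched paths are then read from HDFS.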
[…]: user@spark.apache.org
Subject: Re: How to store 10M records in HDFS to speed up further filtering?

You need to sort the data by id, otherwise a situation can occur where the
index does not work. Aside from this, it sounds odd to store a 5 MB column
using those formats. This will also not be so efficient.
[…] depends on the distribution of the given ID list. No partition could be
skipped in the worst case.

Mo Tao

From: Ryan <ryan.hd@gmail.com>
Sent: 17 April 2017 15:42:46
To: 莫涛
Cc: user
Subject: Re: Re: How to store 10M records in HDFS to speed up further
filtering?

1. Per my understanding […]
Thanks very much!

Mo Tao

From: Ryan <ryan.hd@gmail.com>
Sent: 17 April 2017 14:32:00
To: 莫涛
Cc: user
Subject: Re: How to store 10M records in HDFS to speed up further filtering?
Cc: user
Subject: Re: How to store 10M records in HDFS to speed up further filtering?

You can build a search tree using the ids within each partition to act like an
index, or create a bloom filter to see if the current partition would have any
hit. What's your expected qps and response time for the filter requests?
val checkID = udf { ID: String => IDSet(ID) }
spark.read.orc("/path/to/whole/data")
  .filter(checkID($"ID"))
  .select($"ID", $"BINARY")
  .write...

Thanks for any advice!

--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-store-10M-records-in-HDFS-to-speed-up-further-filtering-tp28605.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Maybe using Ranger or Sentry would be the better choice to intercept those
calls?

> On 7. Apr 2017, at 16:32, Alvaro Brandon <alvarobran...@gmail.com> wrote:
>
> I was going through SparkContext.textFile() and I was wondering: at that
> point, does Spark communicate with HDFS? […]
I was going through SparkContext.textFile() and I was wondering: at that
point, does Spark communicate with HDFS? Since when you download the Spark
binaries you also specify the Hadoop version you will use, I'm guessing it has
its own client that calls HDFS wherever you specify […]
[…] app - I get a "snappy library not found" error. I am confused as to how
Spark can write the eventlog in snappy format without an error, but reading
fails with the above error.

Any help in unblocking myself to read snappy eventlog files from HDFS using
Spark?

--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/reading-snappy-eventlog-files-from-hdfs-using-spark-tp28577.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi All:

When I use Spark Streaming to consume a Kafka topic's data into HDFS with
Spark 2.1.0, I find it's slow. Why?

Environment: CDH 5.8
Scala version: 2.11
Kafka version: 0.10.1.1
spark-streaming-kafka: spark-streaming-kafka-0.8_2.11

Should I replace spark-streaming-kafka with spark-streaming[…]
(Sorry if this is a duplicate. I got a strange error message when I first
tried to send it earlier.)

I want to pull HDFS paths from Kafka and build text streams based on those
paths. I currently have:

val lines = KafkaUtils.createStream(/* params here */).map(_._2)
val buffer = new ArrayBuffer[String]()
lines.foreachRDD(rdd => {
  if (!rdd.partitions.isEmpty) {
    rdd.collect().foreach(l[…]
Thanks for your email.

My situation is: there is a Hive table partitioned by five minutes, and I want
to write data every 30s into the HDFS location where the table is located. So
when the first batch is delayed, the next batch may have the chance to touch
the _SUCCESS file at the same time.
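The partition layout described in this thread (a folder like ***/dt=20170224/hm=1400 receiving every batch between 14:00 and 15:00) amounts to flooring the batch time to the hour. A small sketch, with the base path invented:

```python
from datetime import datetime

def partition_dir(base: str, t: datetime) -> str:
    """Floor the batch time to the hour: dt=YYYYMMDD/hm=HH00, as in the thread."""
    return f"{base}/dt={t:%Y%m%d}/hm={t.hour:02d}00"

# two 30s batches, 59 minutes apart, both map to the same partition folder
a = partition_dir("/warehouse/tbl", datetime(2017, 2, 24, 14, 0, 30))
b = partition_dir("/warehouse/tbl", datetime(2017, 2, 24, 14, 59, 30))
print(a == b, a)
```

This is also why concurrent batches contend for the same folder (and its _SUCCESS file): many batch times collapse to one partition key.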
[…]"Charles O. Bajomo" <charles.baj...@pretechconsulting.co.uk>
Cc: "user" <user@spark.apache.org>, d...@spark.apache.org
Sent: Tuesday, 28 February, 2017 10:47:47
Subject: Re: spark append files to the same hdfs dir issue for
LeaseExpiredException

I am writing data to an HDFS file; the HDFS dir is also a Hive partition dir.
Hive does not support sub-dirs. For example, my partition folder is
***/dt=20170224/hm=1400, which means I need to write all the data between 1400
and 1500 to the same folder.
From: Charles O. Bajomo

Kind Regards

From: "Triones,Deng(vip.com)" <triones.d...@vipshop.com>
To: "user" <user@spark.apache.org>, d...@spark.apache.org
Sent: Tuesday, 28 February, 2017 09:35:00
Subject: spark append files to the same hdfs dir issue for
LeaseExpiredException
Hi dev and users,

Now I am running Spark Streaming (Spark version 2.0.2) to write files to HDFS.
When my spark.streaming.concurrentJobs is more than one, like 20, I meet the
exception below.

We know that when a batch finishes, there will be a _SUCCESS file. As I
guess […]
[…]fromFile(args(0)).mkString)
    System.out.println("Okay")
  }
}

This is my Spark program, and my Hive script is at args(0):

$SPARK_HOME/bin/spark-submit --class com.spark.test.Step1 --master yarn \
  --deploy-mode cluster com.spark.test-0.1-SNAPSHOT.jar \
  hdfs://spirui-d86-f03-06:9229/samp[…]
Hi Ben,

You can replace HDFS with a number of storage systems, since Spark is
compatible with other storage like S3. This would allow you to scale your
compute nodes solely for the purpose of adding compute power and not disk
space. You can deploy Alluxio on your compute nodes to offset […]
IIUC Spark doesn't strongly bind to HDFS; it uses a common FileSystem layer
which supports different FS implementations, and HDFS is just one option. You
could also use S3 as a backend FS; from Spark's point of view it is
transparent across FS implementations.

On Sun, Feb 12, 2017 at 5:32 PM, ayan […]
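A toy illustration of that FileSystem-layer point: the backend is chosen from the URI scheme, so "hdfs://", "s3a://", and local paths look the same to the caller. The scheme names mirror Hadoop's, but the registry itself is invented:

```python
from urllib.parse import urlparse

# hypothetical scheme -> backend registry; Hadoop resolves this via config
BACKENDS = {"hdfs": "HDFS client", "s3a": "S3A connector", "file": "local FS"}

def backend_for(path: str) -> str:
    """Pick a storage backend from the path's URI scheme; no scheme means local."""
    scheme = urlparse(path).scheme or "file"
    try:
        return BACKENDS[scheme]
    except KeyError:
        raise ValueError(f"no filesystem registered for scheme {scheme!r}")

print(backend_for("s3a://bucket/parquet"))   # S3A connector
print(backend_for("/tmp/local/part-0000"))   # local FS
```

Because the dispatch happens per path, swapping HDFS for S3 is a matter of changing the URI, which is the transparency the reply describes.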
Data has to live somewhere -- how do you not add storage but store more
data? Alluxio is not persistent storage, and S3 isn't on your premises.
On Sun, Feb 12, 2017 at 4:29 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
You have to carefully choose whether your strategy makes sense given your
users' workloads; hence, I am not sure your reasoning makes sense. However,
you can, for example, install OpenStack Swift as an object store and use this
as storage. HDFS in this case can be used as a temporary store […]
Has anyone got advice on how to remove the reliance on HDFS for storing
persistent data? We have an on-premise Spark cluster. It seems like a waste of
resources to keep adding nodes because of a lack of storage space only. I
would rather add more powerful nodes due to the lack […]
Hello Spark fans,

I would like to tell you about a tool we want to share with the big data
community; I think it can also be handy for Spark users. We created a new
utility - HDFS Shell - to work with HDFS data more easily.

https://github.com/avast/hdfs-shell

*Feature highlights*
- HDFS DFS command […]
I have 3 Spark masters colocated with the ZK nodes, and 2 worker nodes, so my
namenodes are the same nodes as my Spark masters and my datanodes are the same
nodes as my Spark workers. Is that correct? How do I set up HDFS with
ZooKeeper?

On Fri, Feb 3, 2017 at 10:27 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
yes
On Fri, Feb 3, 2017 at 10:08 PM, kant kodali <kanth...@gmail.com> wrote:
> can I use Spark Standalone with HDFS but no YARN?
>
> Thanks!
>
can I use Spark Standalone with HDFS but no YARN?
Thanks!
[…] not a member of org.apache.spark.sql.DataFrameWriter.

Regards,
Prasad

On Thu, Jan 19, 2017 at 4:35 PM, smartzjp <zjp_j...@163.com> wrote:

Because the number of reducers will not be one, the output will be a folder on
HDFS. You can use "result.write.csv(foldPath)".
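The reply's point is that a distributed write produces a folder of part files, not a single CSV. If one local file is required, the parts can be concatenated afterwards; a plain-Python sketch (all paths temporary and invented):

```python
import glob
import os
import tempfile

def merge_parts(folder: str, dest: str):
    """Concatenate part-* files (in name order) into one file."""
    with open(dest, "w") as out:
        for part in sorted(glob.glob(os.path.join(folder, "part-*"))):
            with open(part) as f:
                out.write(f.read())

# simulate an output folder with two part files
d = tempfile.mkdtemp()
for i, row in enumerate(("1,a\n", "2,b\n")):
    with open(os.path.join(d, f"part-{i:05d}"), "w") as f:
        f.write(row)

merged = os.path.join(d, "merged.csv")
merge_parts(d, merged)
print(open(merged).read())  # prints the two rows in part-file order
```

On a real cluster the parts live on HDFS, so the equivalent step would go through the HDFS client (e.g. a getmerge-style operation) rather than the local filesystem.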
--
Hi,
Can anyone please
e not one, so it will out put a fold on
> the HDFS, You can use “result.write.csv(foldPath)”.
>
>
>
> --
>
> Hi,
> Can anyone please let us know how to write the output of the Spark SQL
> in
> Local and HDFS path using Scala code.
>
> *Code :-*
&g
Beacause the reduce number will be not one, so it will out put a fold on the
HDFS, You can use “result.write.csv(foldPath)”.
--
Hi,
Can anyone please let us know how to write the output of the Spark SQL in
Local and HDFS path using Scala code.
Code :-
scala> val res
Hi,

Can anyone please let us know how to write the output of Spark SQL to local
and HDFS paths using Scala code?

*Code :-*

scala> val result = sqlContext.sql("select empno , name from emp");
scala> result.show();

If I give the command result.show() then it will […]
On Mon, Jan 9, 2017 at 3:17 PM, Jörn Franke <jornfra...@gmail.com> wrote:
Avro itself supports it, but I am not sure if this functionality is available
through the Spark API. Just out of curiosity: if your use case is only writing
to HDFS, then you might simply use Flume.

> On 9 Jan 2017, at 09:58, awkysam <contactsanto...@gmail.com> wrote:
>
> Cu[…]
Currently, for our project, we are collecting data and pushing it into Kafka,
with messages in Avro format. We need to push this data into HDFS, and we are
using Spark Streaming; in HDFS it is also stored in Avro format. We partition
the data per day. So when we write data into HDFS we […]
If you're using Kubernetes you can group Spark and HDFS to run in the same
stack, meaning they'll basically run in the same network space and share IPs.
Just gotta make sure there are no port conflicts.

On Wed, Dec 28, 2016 at 5:07 AM, Karamba <phantom...@web.de> wrote:
> Good […]
Good idea, thanks! But unfortunately that's not possible: all containers are
connected to an overlay network. Is there any other possibility to tell Spark
that it is on the same *NODE* as an HDFS data node?

On 28.12.2016 12:00, Miguel Morales wrote:
> It might have to do with your container […]
[…]g!

>> Although the Spark task scheduler is aware of rack-level data locality, it
>> seems that only YARN implements the support for it.

This explains why the script that I configured in core-site.xml
(topology.script.file.name) is not called by the Spark
[…] spark container. But at the time of reading from HDFS in a Spark program,
the script is called in my HDFS namenode container.

> However, node-level locality can still work for Standalone.

I have a couple of physical hosts that run Spark and HDFS docker containers.
How does Spark standalone know t[…]
[…], which means executors are available on a subset of the cluster nodes?

> On Dec 27, 2016, at 01:39, Karamba <phantom...@web.de> wrote:
Hi,

I am running a couple of docker hosts, each with an HDFS and a Spark worker in
a Spark standalone cluster. In order to get data-locality awareness, I would
like to configure racks for each host, so that a Spark worker container knows
from which HDFS node container it should load its data.
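For reference, a Hadoop rack-topology script is just an executable that reads node addresses as arguments and prints one rack path per node. A hedged sketch; the address-to-rack table is invented, and a real script would consult your own host inventory:

```python
import sys

# hypothetical inventory: node address -> rack path
RACKS = {
    "10.0.1.11": "/rack-1",
    "10.0.1.12": "/rack-1",
    "10.0.2.21": "/rack-2",
}

def rack_of(addr: str) -> str:
    """Resolve one node to its rack; unknown nodes fall back to /default-rack."""
    return RACKS.get(addr, "/default-rack")

if __name__ == "__main__":
    # Hadoop invokes the script with one or more addresses and reads stdout
    for addr in sys.argv[1:]:
        print(rack_of(addr))
```

As the thread notes, though, wiring such a script into core-site.xml only helps where the scheduler actually consults rack topology (YARN); Standalone mode may ignore it.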
[…] for storing the elastic indices; this will boost your elastic cluster
performance.

Best,
Anastasios

On Thu, Dec 22, 2016 at 6:35 PM, Rohit Verma <rohit.ve...@rokittech.com>
wrote:
[…]ites...

One more thing: make sure you have enough network bandwidth.

Regards,
Yang

Sent from my iPhone

> On Dec 22, 2016, at 12:35 PM, Rohit Verma <rohit.ve...@rokittech.com> wrote:
>
> I am setting up a spark cluster. I have hdfs data nodes and spark master
> nodes on same instances. To add elasticsearch to this cluster, should I
> spawn es on different machi[…]
I am setting up a Spark cluster. I have HDFS data nodes and Spark master nodes
on the same instances. To add Elasticsearch to this cluster, should I spawn ES
on a different machine or on the same machines? I have only 12 machines:

1 master (Spark and HDFS)
8 Spark workers and HDFS data nodes

I can use 3 […]
What is the format?

From: KhajaAsmath Mohammed <mdkhajaasm...@gmail.com>
Sent: Thursday, December 15, 2016 7:54:27 PM
To: user @spark
Subject: Spark Dataframe: Save to hdfs is taking long time
Hi,

I am facing an issue while saving the dataframe back to HDFS. It's taking a
long time to run.

val results_dataframe = sqlContext.sql("select gt.*,ct.* from
PredictTempTable pt,ClusterTempTable ct,GamificationTempTable gt where
gt.vin=pt.vin and pt.cluster=ct.cluster")
results_datafram[…]