Can you paste the piece of code?
Thanks
Best Regards
On Wed, Jul 23, 2014 at 1:22 AM, Bill Jay bill.jaypeter...@gmail.com
wrote:
Hi all,
I am running a spark streaming job. The job hangs on one stage, which
shows as follows:
Details for Stage 4
Summary Metrics No tasks have started
AFAIK you can use the --hadoop-major-version parameter with the spark-ec2
https://github.com/apache/spark/blob/master/ec2/spark_ec2.py script to
switch the hadoop version.
Thanks
Best Regards
On Wed, Jul 23, 2014 at 6:07 AM, durga durgak...@gmail.com wrote:
Hi,
I am trying to create spark
At the moment your best bet for sharing SparkContexts across jobs will be
the Ooyala job server: https://github.com/ooyala/spark-jobserver
It doesn't yet support Spark 1.0, though I did manage to amend it to get it to
build and run on 1.0.
—
Sent from Mailbox
On Wed, Jul 23, 2014 at 1:21 AM, Asaf
foreachRDD is how I extracted values in the first place, so that’s not going to
make a difference. I don’t think it’s related to SPARK-1312 because I’m
generating data every second in the first place and I’m using foreachRDD right
after the window operation. The code looks something like
val
I'm having the exact same problem. I tried Mesos 0.19, 0.14.2, 0.14.1,
hadoop 2.3.0, spark 0.9.1.
# SIGSEGV (0xb) at pc=0x7fab70c55c4d, pid=31012, tid=140366980314880
#
# JRE version: 6.0_31-b31
# Java VM: OpenJDK 64-Bit Server VM (23.25-b01 mixed mode linux-amd64
compressed oops)
#
I have a DStream receiving data from a socket. I'm using local mode.
I set spark.streaming.unpersist to false and leave
spark.cleaner.ttl infinite.
I can see files for input and shuffle blocks under the spark.local.dir
folder, and the size of the folder keeps increasing, although the JVM's memory
usage
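For reference, both of those are plain Spark configuration keys. A minimal sketch of setting them on the SparkConf before the StreamingContext is created (a simple local app is assumed here, not the poster's actual setup):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("SocketStream")
  .set("spark.streaming.unpersist", "false")  // keep generated RDDs instead of unpersisting them
  // spark.cleaner.ttl left unset, i.e. infinite, matching the description above
val ssc = new StreamingContext(conf, Seconds(60))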
Hi,
We have been using standalone Spark for the last 6 months, and I used to run
application jars fine on the Spark cluster with the following command.
java -cp
:/app/data/spark_deploy/conf:/app/data/spark_deploy/lib/spark-assembly-1.0.0-SNAPSHOT-hadoop2.0.0-mr1-cdh4.5.0.jar:./app.jar
-Xms2g -Xmx2g
Hi Haopu,
Please see the inline comments.
Thanks
Jerry
-Original Message-
From: Haopu Wang [mailto:hw...@qilinsoft.com]
Sent: Wednesday, July 23, 2014 3:00 PM
To: user@spark.apache.org
Subject: spark.streaming.unpersist and spark.cleaner.ttl
I have a DStream receiving data from a
If you need to run Spark apps through Hue, see if Ooyala's job server helps:
http://gethue.com/get-started-with-spark-deploy-spark-server-and-compute-pi-from-your-web-browser/
I found the issue...
If you build the assembly jar from Spark git, then
org.apache.hadoop.io.Writable.class is packaged with it.
If you use the assembly jar that ships with CDH in
TD, it looks like your instincts were correct. I misunderstood what you meant.
If I force an eval on the input stream using foreachRDD, the windowing will
work correctly. If I don’t do that, lazy eval somehow screws with window
batches I eventually receive. Any reason the bug is categorized
Okay, I finally got this. The project/SparkBuild needed to be set, and only
0.19.0 seems to work (out of 0.14.1, 0.14.2).
"org.apache.mesos" % "mesos" % "0.19.0"
was the one that worked.
Yes, that's the plan. If you use broadcast, please also make sure
TorrentBroadcastFactory is used, which became the default broadcast
factory very recently. -Xiangrui
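For anyone following along, a minimal sketch of forcing the torrent implementation on a build where HttpBroadcast may still be the default (this is just the standard config key, not code from this thread):
val conf = new org.apache.spark.SparkConf()
  .set("spark.broadcast.factory", "org.apache.spark.broadcast.TorrentBroadcastFactory")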
On Tue, Jul 22, 2014 at 10:47 PM, Makoto Yui yuin...@gmail.com wrote:
Hi Xiangrui,
By using your treeAggregate and broadcast
Hi all,
Sorry for bringing this topic up again; I am still confused about it.
I set SPARK_DAEMON_JAVA_OPTS=-XX:+UseCompressedOops -Xmx8g
When I run my application, I got the following line in the logs:
Spark Command: java -cp
I'm using the spark-shell locally and working on a dataset of size 900 MB. I
initially ran into a java.lang.OutOfMemoryError: GC overhead limit exceeded
error and, after some research, set SPARK_DRIVER_MEMORY to 4g.
Now I run into an ArrayIndexOutOfBoundsException; please let me know if there
is some way to
Is there any documentation from Cloudera on how to run Spark apps on a CDH
Manager-deployed Spark?
Asking the Cloudera community would be a good idea.
http://community.cloudera.com/
In the end only Cloudera will quickly fix issues with CDH...
Bertrand Dechoux
On Wed, Jul 23, 2014 at 9:28 AM,
Hi all,
I was wondering how Spark deals with an execution plan. Using Pig and its
DAG execution as an example, I would like to manage Spark in a similar
way.
For instance, if my code has 3 different parts, with A and B being
self-sufficient parts:
Part A:
..
.
.
var output_a
Part
Jerry, thanks for the response.
For the default storage level of DStream, it looks like Spark's documentation
is wrong. In this link:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#memory-tuning
It mentions:
Default persistence level of DStreams: Unlike RDDs, the default
Thank you!
Hi,
How do I increase the driver memory? These are my configs right now:
sed 's/INFO/ERROR/' spark/conf/log4j.properties.template > ./ephemeral-hdfs/conf/log4j.properties
sed 's/INFO/ERROR/' spark/conf/log4j.properties.template > spark/conf/log4j.properties
# Environment variables and Spark
You specify your own log4j configuration in the usual log4j way --
package it in your assembly, or specify on the command line for
example. See http://logging.apache.org/log4j/1.2/manual.html
The template you can start with is in
core/src/main/resources/org/apache/spark/log4j-defaults.properties
Hi,
Thanks TD for your reply. I am still not able to resolve the problem for my
use case.
I have, let's say, 1000 different RDDs, and I am applying a transformation
function to each RDD, and I want the output of all the RDDs combined into a single
output RDD. For this, I am doing the following:
*Loop
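As an aside, one way to combine the per-RDD outputs without chaining unions one at a time is sc.union over the whole sequence. A rough sketch, where inputRDDs and transformFn are placeholders for the actual 1000 RDDs and the transformation:
val outputs = inputRDDs.map(rdd => rdd.map(transformFn))  // apply the transformation to each RDD
val combined = sc.union(outputs)                          // single combined output RDD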
Hi,
I figured out my problem, so I wanted to share my findings. I was basically
trying to broadcast an array with 4 million elements and a size of
approximately 150 MB. Every time I tried to broadcast, I got an
OutOfMemory error. I fixed my problem by increasing the driver memory using:
In Spark 1.1 it may not be as easy as in Spark 1.0, after this commit:
https://issues.apache.org/jira/browse/SPARK-2446
After this commit only binary columns with the UTF8 annotation are recognized
as strings, but Impala strings are always written without the UTF8 annotation.
Yeah, the documentation may not be precisely aligned with the latest code, so the best
way is to check the code.
-Original Message-
From: Haopu Wang [mailto:hw...@qilinsoft.com]
Sent: Wednesday, July 23, 2014 5:56 PM
To: user@spark.apache.org
Subject: RE: spark.streaming.unpersist and
Hello,
We plan to use Spark on EC2 for our data science pipeline. We successfully
managed to set up a cluster as well as launch and run applications on
remote clusters. To enhance scalability we would like to implement
auto-scaling in EC2 for Spark applications; however, I did not find any
We are having difficulties configuring Spark, partly because we still don't
understand some key concepts. For instance, how many executors are there
per machine in standalone mode? This is after having closely read the
documentation several times:
Thanks for your answer. However, there has been a misunderstanding here.
My question is about controlling the parallel execution of different
parts of the code, similar to Pig, where there is a planning phase before the
execution.
On Wed, Jul 23, 2014 at 1:46 PM, chutium teng@gmail.com
Hi
Currently this is not supported out of the box, but you can of course
add/remove workers in a running cluster. A better option would be to use a
Mesos cluster, where adding/removing nodes is quite simple. But again, I
believe adding a new worker in the middle of a task won't give you better
We are starting to use Spark, but we don't have any existing infrastructure
related to big data, so we decided to set up the standalone cluster rather
than mess around with YARN or Mesos.
But it appears that the driver program has to stay up on the client for the
full duration of the job (client
I am aware that today PySpark cannot load sequence files directly. Are
there workarounds people are using (short of duplicating all the data to
text files) for accessing this data?
Loading from sequenceFile for PySpark is in master, and saving is in this PR
underway (https://github.com/apache/spark/pull/1338).
I hope that Kan will have it ready to merge in time for the 1.1 release window
(it should be; the PR just needs a final review or two).
In the meantime you can check out master
I'm also in the early stages of setting up long-running Spark jobs. The easiest way
I've found is to set up a cluster and submit the job via YARN. Then I can
come back and check on progress when I need to. It seems the trick is tuning
the queue priority and YARN preemption to get the job to run in a
Hi Martin,
In standalone mode, each SparkContext you initialize gets its own set of
executors across the cluster. So for example if you have two shells open,
they'll each get two JVMs on each worker machine in the cluster.
As far as the other docs, you can configure the total number of cores
How can I transform the mapper key at the reducer output? The functions I
have encountered are combineByKey, reduceByKey, etc., which work on the values
and not on the key. In the example below, this is what I want to achieve, but
it seems like I can only have K1 and not K2:
On Wed, Jul 23, 2014 at 4:19 PM, Andrew Ash and...@andrewash.com wrote:
In standalone mode, each SparkContext you initialize gets its own set of
executors across the cluster. So for example if you have two shells open,
they'll each get two JVMs on each worker machine in the cluster.
Dumb
I have the same issue.
val a = sc.textFile("s3n://MyBucket/MyFolder/*.tif")
a.first
works perfectly fine, but
val d = sc.wholeTextFiles("s3n://MyBucket/MyFolder/*.tif")
does not work.
d.first
gives the following error message:
java.io.FileNotFoundException:
I'm using spark 1.0.1 on a quite large cluster, with gobs of memory, etc.
Cluster resources are available to me via Yarn and I am seeing these
errors quite often.
ERROR YarnClientClusterScheduler: Lost executor 63 on host: remote Akka
client disassociated
This is in an interactive shell
Thanks Andrew,
So if there is only one SparkContext there is only one executor per
machine? This seems to contradict Aaron's message from the link above:
If each machine has 16 GB of RAM and 4 cores, for example, you might set
spark.executor.memory between 2 and 3 GB, totaling 8-12 GB used by
You do them sequentially in code; Spark will take care of combining them in
execution.
so something like:
foo.map(fcn to [K1, V1]).reduceByKey(fcn from (V1, V1) to V1).map(fcn from
(K1, V1) to (K2, V2))
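A concrete sketch of that pattern, using word counts re-keyed by word length (a made-up example, not from this thread):
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)            // (K1 = word, V1 = count)
val byLength = counts.map { case (word, n) => (word.length, n) }  // (K2 = length, V2 = count)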
On Wed, Jul 23, 2014 at 11:22 AM, soumick86 sdasgu...@dstsystems.com
wrote:
How can I
Do we have a JIRA issue to track this? I think I've run into a similar
issue.
On Wed, Jul 23, 2014 at 1:12 AM, Yin Huai yh...@databricks.com wrote:
It is caused by a bug in the Spark REPL. I still do not know which part of the
REPL code causes it... I think people working on the REPL may have better
Thanks Steve, but my goal is to hopefully avoid installing yet another
component into my environment. Yarn is cool, but wouldn't be used for
anything but Spark. We have no hadoop in our ecosystem (or HDFS). Ideally
I'd avoid having to learn about yet another tool.
On Wed, Jul 23, 2014 at 11:12
There is a JIRA issue to track adding such functionality to spark-ec2:
SPARK-2008 https://issues.apache.org/jira/browse/SPARK-2008 - Enhance
spark-ec2 to be able to add and remove slaves to an existing cluster
On Wed, Jul 23, 2014 at 10:12 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Hi
Is there a branch somewhere with current Spark on a current version of Akka?
I'm trying to embed Spark into a Spray app. I can probably backport our app
to 2.2.3 for a little while, but I wouldn't want to be stuck there too long.
Related: do I need the protobuf shading if I'm using the
Thanks Akhil
Yes, https://issues.apache.org/jira/browse/SPARK-2576 is used to track it.
On Wed, Jul 23, 2014 at 9:11 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Do we have a JIRA issue to track this? I think I've run into a similar
issue.
On Wed, Jul 23, 2014 at 1:12 AM, Yin Huai
Discussions about how CDH packages Spark aside, you should be using
the spark-class script (assuming you're still in 0.9) instead of
executing Java directly. That will make sure that the environment
needed to run Spark apps is set up correctly.
CDH 5.1 ships with Spark 1.0.0, so it has
Hi
It seems I can only give --hadoop-major-version=2, and it is taking 2.0.0.
How can I say it should use 2.0.2?
Is there any --hadoop-minor-version variable I can use?
Thanks,
D.
with this will be great!
scalac -classpath
/apps/software/secondstring/secondstring/dist/lib/secondstring-20140723.jar:/opt/cloudera/parcels/CDH/lib/spark/core/lib/spark-core_2.10-1.0.0-cdh5.1.0.jar:/opt/cloudera/parcels/CDH/lib/spark/lib/kryo-2.21.jar:/opt/cloudera/parcels/CDH/lib/hadoop/lib/commons-io-2.4.jar
Just wanted to add more info: I was using Spark SQL, reading in the
tab-delimited raw data files and converting the timestamp to Date format:
sc.textFile("rawdata/*").map(_.split("\t")).map(p => Point(df.format(new
Date( p(0).trim.toLong*1000L )), p(1), p(2).trim.toInt ,p(3).trim.toInt,
p(4).trim.toInt
I do not want to always use
schemaRDD.map {
  case Row(xxx) =>
    ...
}
because with case we must rewrite the table schema again.
Is there any plan to implement this?
Thanks
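As a side note, a small sketch of reading fields positionally instead of pattern matching on Row (the column positions and types are made up, and this still hard-codes positions, so it only partly avoids restating the schema):
val pairs = schemaRDD.map(row => (row.getString(0), row.getInt(1)))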
Hi all,
I guess the problem has something to do with the fact that I submit the job to
a remote location.
I submit from an OracleVM running Ubuntu and suspect some NAT issues, maybe?
Akka TCP tries the following address, as shown in the appended STDERR output:
akka.tcp://spark@LinuxDevVM.local:59266
STDERR
Thanks Ankur.
The version with in-memory shuffle is here:
https://github.com/amplab/graphx2/commits/vldb. Unfortunately Spark has
changed a lot since then, and the way to configure and invoke Spark is
different. I can send you the correct configuration/invocation for this if
you're interested in
Yeah, seems to be the case. In general your executors should be able to
reach the driver, which I don't think is the case for you currently
(LinuxDevVM.local:59266 seems very private). What you need is some sort of
gateway node that can be publicly reached from your worker machines to
launch your
Hi Eric,
Have you checked the executor logs? It is possible they died because of
some exception, and the message you see is just a side effect.
Andrew
2014-07-23 8:27 GMT-07:00 Eric Friedman eric.d.fried...@gmail.com:
I'm using spark 1.0.1 on a quite large cluster, with gobs of memory, etc.
I wanted to experiment with using Parquet data with SparkSQL. I got some
tab-delimited files and wanted to know how to convert them to Parquet
format. I'm using standalone spark-shell.
Thanks!
Take a look at the programming guide for spark sql:
http://spark.apache.org/docs/latest/sql-programming-guide.html
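For the archives, a minimal spark-shell sketch of that conversion (column names and file paths here are made up):
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

case class Point(dt: String, uid: String, kw: String)
val points = sc.textFile("rawdata/*")
  .map(_.split("\t"))
  .map(p => Point(p(0), p(1), p(2)))
points.saveAsParquetFile("points.parquet")               // write out as Parquet
val readBack = sqlContext.parquetFile("points.parquet")  // load it back as a SchemaRDD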
On Wed, Jul 23, 2014 at 11:09 AM, buntu buntu...@gmail.com wrote:
I wanted to experiment with using Parquet data with SparkSQL. I got some
tab-delimited files and wanted to know
Does anyone have any idea about this?
On Tue, Jul 22, 2014 at 7:02 PM, hsy...@gmail.com hsy...@gmail.com wrote:
But how do they do the interactive sql in the demo?
https://www.youtube.com/watch?v=dJQ5lV5Tldw
And if it can work in local mode, I think it should be able to work in
cluster mode,
Hi Meethu,
SPARK_DAEMON_JAVA_OPTS is not intended for setting memory. Please use
SPARK_DAEMON_MEMORY instead. It turns out that java respects only the last
-Xms and -Xmx values, and in spark-class we put SPARK_DAEMON_JAVA_OPTS
before the SPARK_DAEMON_MEMORY. In general, memory configuration in
Hello,
Most of the tasks I've accomplished in Spark were fairly straightforward, but
I can't figure out the following problem using the Spark API:
Basically, I have an IP with a bunch of user IDs associated with it. I want to
create a list of all user IDs that are associated together, even if some are
Hi Maria,
SPARK_MEM is actually deprecated because it was too general; the reason
it worked was because SPARK_MEM applies to everything (drivers, executors,
masters, workers, history servers...). In favor of more specific configs,
we broke this down into SPARK_DRIVER_MEMORY and
I am trying to build spark after cloning from github repo:
I am executing:
./sbt/sbt -Dhadoop.version=2.4.0 -Pyarn assembly
I am getting following error:
[warn] ^
[error]
[error] while compiling:
$calculateSortedJaccardScore$1$$anonfun$5.classapproxstrmatch/JaccardScore$$anonfun$calculateJaccardScore$1$$anonfun$1.class
However, when I start my spark shell:
spark-shell --jars
/apps/sameert/software/secondstring/secondstring/dist/lib/secondstring-20140723.jar
/apps/sameert/software/approxstrmatch/target
The streaming program contains the following main stages:
1. receive data from Kafka
2. Preprocess the data. These are all map and filter stages.
3. Group by a field.
4. Process the groupBy results using map. Inside this processing, I use
collect and count.
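A rough sketch of that pipeline shape (the topic, ZooKeeper address, parsing, and grouping key are placeholders, not the original code):
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka.KafkaUtils
val stream = KafkaUtils.createStream(ssc, "zkhost:2181", "mygroup", Map("mytopic" -> 1))  // stage 1
val cleaned = stream.map(_._2).filter(_.nonEmpty)                                          // stage 2: map/filter
val grouped = cleaned.map(line => (line.split("\t")(0), line)).groupByKey()                // stage 3: group by a field
grouped.foreachRDD { rdd => println("groups in this batch: " + rdd.count()) }              // stage 4: per-batch processing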
Thanks!
Bill
On Tue, Jul 22,
The code is pretty long. But the main idea is to consume from Kafka,
preprocess the data, and groupBy a field. I use multiple DStreams to add
parallelism to the consumer. It seems that when the number of DStreams is large,
this happens often.
Thanks,
Bill
On Tue, Jul 22, 2014 at 11:13 PM, Akhil Das
try `sbt/sbt clean` first? -Xiangrui
On Wed, Jul 23, 2014 at 11:23 AM, m3.sharma sharm...@umn.edu wrote:
I am trying to build spark after cloning from github repo:
I am executing:
./sbt/sbt -Dhadoop.version=2.4.0 -Pyarn assembly
I am getting following error:
[warn]
Thanks Michael.
If I read in multiple files and attempt to saveAsParquetFile() I get the
ArrayIndexOutOfBoundsException. I don't see this if I try the same with a
single file:
case class Point(dt: String, uid: String, kw: String, tz: Int, success:
Int, code: String )
val point =
Hello all,
I have noticed what I think is erroneous behavior when using the
WebUI:
1. Launching App from Eclipse to a cluster (with 3 workers)
2. Using Spark 0.9.0 (Cloudera distr 5.0.1)
3. The Application makes the worker write to the stdout using
System.out.println(...)
When the
That seems to be the issue; when I reduce the number of fields it works
perfectly fine.
Thanks again Michael, that was super helpful!
Turns out to be an issue with the number of fields being read; one of the fields
might be missing from the raw data file, causing this error. Michael Armbrust
pointed it out in another thread.
Hi,
Is it feasible to deploy a Spark cluster spanning multiple data centers if
there are fast connections with not-too-high latency (30 ms) between them? I
don't know whether there are any assumptions in the software expecting all
cluster nodes to be local (super low latency, for instance). Has
Hello all,
In case anyone is interested, I just wrote a short blog post about using
shapeless in Spark to optimize data layout in memory.
The blog post is here:
http://tresata.com/tresata-open-sources-spark-columnar
The code is here:
https://github.com/tresata/spark-columnar
Thanks, it works now :)
On Wed, Jul 23, 2014 at 11:45 AM, Xiangrui Meng [via Apache Spark User
List] ml-node+s1001560n10537...@n3.nabble.com wrote:
try `sbt/sbt clean` first? -Xiangrui
On Wed, Jul 23, 2014 at 11:23 AM, m3.sharma [hidden email]
So, given sets, you are joining overlapping sets until all of them are
mutually disjoint, right?
If graphs are out, then I also would love to see a slick distributed
solution, but couldn't think of one. It seems like a cartesian product
won't scale.
You can write a simple method to implement
That worked for me as well. I was using Spark 1.0 compiled against Hadoop
1.0; switching to 1.0.1 compiled against Hadoop 2
Is there anything equivalent to the Hadoop Job
(org.apache.hadoop.mapreduce.Job) in Spark? Once I submit the Spark job I
want to concurrently read the SparkListener interface implementation methods
where I can grab the job status. I am trying to find a way to wrap the spark
submit object into one
Hi all,
I have a question regarding Spark streaming. When we use the
saveAsTextFiles function and my batch is 60 seconds, Spark will generate a
series of files such as:
result-140614896, result-140614802, result-140614808, etc.
I think this is the timestamp for the beginning of each
I'm happy to announce the availability of Spark 0.9.2! Spark 0.9.2 is
a maintenance release with bug fixes across several areas of Spark,
including Spark Core, PySpark, MLlib, Streaming, and GraphX. We
recommend all 0.9.x users to upgrade to this stable release.
Contributions to this release came
Hi All,
I have a question.
For my company, we are planning to use the spark-ec2 scripts to create a cluster
for us.
I understand that persistent HDFS will keep HDFS available across cluster
restarts.
My question is:
1) What happens if I destroy and re-create the cluster; do I lose the data?
a) If I
We removed the PowerGraph abstraction layer
(https://github.com/apache/spark/commit/732333d78e46ee23025d81ca9fbe6d1e13e9f253)
when merging GraphX into Spark to reduce
the maintenance costs. You can still read the code
/JaccardScore.classapproxstrmatch/JaccardScore$$anonfun$calculateSortedJaccardScore$1$$anonfun$5.classapproxstrmatch/JaccardScore$$anonfun$calculateJaccardScore$1$$anonfun$1.class
However, when I start my spark shell:
spark-shell --jars
/apps/sameert/software/secondstring/secondstring/dist/lib/secondstring-20140723.jar
Ah yes, you're quite right: with partitions I could probably process a good
chunk of the data, but I didn't think a reduce would work. Sorry, I'm still
new to Spark and MapReduce in general, but I thought that the reduce result
wasn't an RDD and had to fit into memory. If the result of a reduce can
If I save an RDD as a sequence file such as:
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.foreachRDD( d => {
d.saveAsSequenceFile("tachyon://localhost:19998/files/WordCounts-" +
(new SimpleDateFormat("MMdd-HHmmss") format
Hi TD
You are right, I did not include \n to delimit the flushed string. That's
the reason.
Is there a way for me to define the delimiter, like SOH or ETX instead of
\n?
Regards
kytay
Bill,
Spark Streaming's DStream provides overloaded methods for transform() and
foreachRDD() that allow you to access the timestamp of a batch:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.dstream.DStream
I think the timestamp is the end of the batch, not
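A small sketch of the (rdd, time) overload mentioned above, applied to the file-naming question earlier in the digest (stream and the format string are just examples):
import java.text.SimpleDateFormat
import java.util.Date
stream.foreachRDD { (rdd, time) =>
  val ts = new SimpleDateFormat("yyyyMMdd-HHmmss").format(new Date(time.milliseconds))
  rdd.saveAsTextFile("result-" + ts)   // name each batch's output with its batch time
}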
For what it's worth, I got it to work with a Cartesian product; even if it's
very inefficient, it worked out alright for me. The trick was to flatMap it
(step 4) after the Cartesian product so that I could do a reduceByKey and
find the commonalities. After I was done, I could check if any value
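A very rough sketch of one such merge pass, assuming the data starts as (ip, userId) pairs (all names are placeholders, and a full solution may need to repeat the pass until the sets stop changing):
val sets = ipUsers.groupByKey().map { case (_, users) => users.toSet }  // one user-id set per IP
val merged = sets.cartesian(sets)
  .filter { case (a, b) => (a intersect b).nonEmpty }   // keep overlapping pairs of sets
  .map { case (a, b) => a union b }                     // merge each overlapping pair
  .flatMap(s => s.map(u => (u, s)))                     // key the merged set by each member
  .reduceByKey(_ union _)                               // collapse per user id
  .values
  .distinct()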
Yes, you lose the data.
You can add machines, but that will require you to restart the cluster. Also,
adding is manual; you add the nodes yourself.
Regards
Mayur
On Wednesday, July 23, 2014, durga durgak...@gmail.com wrote:
Hi All,
I have a question,
For my company , we are planning to use spark-ec2 scripts to
Have a look at broadcast variables.
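As a quick sketch of that idea (the large lookup map and the RDD are placeholders):
val lookup = sc.broadcast(bigMap)                          // bigMap: e.g. a Map[String, Int] built on the driver
val resolved = rdd.map(x => lookup.value.getOrElse(x, 0))  // read-only access on the executors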
On Tuesday, July 22, 2014, Parthus peng.wei@gmail.com wrote:
Hi there,
I was wondering if anybody could help me find an efficient way to make a
MapReduce program like this:
1) For each map function, it needs to access some huge files, which are around
Hi Andrew,
Thanks for your note. Yes, I see a stack trace now. It seems to be an
issue with Python interpreting a function I wish to apply to an RDD. The
stack trace is below. The function is a simple factorial:
def f(n):
    if n == 1: return 1
    return n * f(n-1)
and I'm trying to use it
And... PEBCAK
I mistakenly believed I had set PYSPARK_PYTHON to a Python 2.7 install, but
it was a Python 2.6 install on the remote nodes, hence incompatible with
what the master was sending. I have set this to point to the correct version
everywhere and it works.
Apologies for the false
Thanks Mayur.
Is there any documentation/readme with a step-by-step process available for
adding or deleting nodes?
Thanks,
D.
In particular, take a look at the TorrentBroadcast, which should be much
more efficient than HttpBroadcast (which was the default in 1.0) for large
files.
If you find that TorrentBroadcast doesn't work for you, then another way to
solve this problem is to place the data on all nodes' local disks,
See if this helps:
https://github.com/nishkamravi2/SparkAutoConfig/
It's a very simple tool for auto-configuring default parameters in Spark.
Takes as input high-level parameters (like number of nodes, cores per node,
memory per node, etc) and spits out default configuration, user advice and
Hi Tobias,
It seems this parameter is an input to the function. What I am expecting is
output from a function that tells me the starting or ending time of the
batch. For instance, if I use saveAsTextFiles, it seems DStream will
generate a batch every minute and the starting time is a complete