Hi,
Is there any change in the release plan for Spark 1.0.0-rc1 release date
from what is listed in the Proposal for Spark Release Strategy thread?
== Tentative Release Window for 1.0.0 ==
Feb 1st - April 1st: General development
April 1st: Code freeze for new features
April 15th: RC1
Thanks,
We are now testing precisely what you ask about in our environment.
But Sandy's questions are relevant. The bigger issue is not Spark
vs. Yarn but "client" vs. "standalone" and where the client is
located on the network relative to the cluster.
The "client" options
Hey Bhaskar, this is still the plan, though QAing might take longer than 15
days. Right now since we’ve passed April 1st, the only features considered for
a merge are those that had pull requests in review before. (Some big ones are
things like annotating the public APIs and simplifying
Any word on this one?
On Apr 2, 2014, at 12:26 AM, Vipul Pandey vipan...@gmail.com wrote:
I downloaded 0.9.0 fresh and ran the mvn command - the assembly jar thus
generated also has both shaded and real version of protobuf classes
Vipuls-MacBook-Pro-3:spark-0.9.0-incubating vipul$ jar -ftv
Hey,
Does somebody know the kinds of dependencies that the new SQL operators produce?
I’m specifically interested in the relational join operation as it seems
substantially more optimized.
The old join was narrow on two RDDs with the same partitioner.
Is the relational join narrow as well?
I'm sorry, but I don't really understand what you mean when you say wide
in this context. For a HashJoin, the only dependencies of the produced RDD
are the two input RDDs. For BroadcastNestedLoopJoin, the only dependency
will be on the streamed RDD. The other RDD will be distributed to all
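(For reference, a minimal sketch of the co-partitioned "narrow" join the question compares against, using made-up data and the Spark 0.9-era Scala API; when both parent RDDs share the same partitioner, the join needs no extra shuffle.)

  import org.apache.spark.{SparkContext, HashPartitioner}
  import org.apache.spark.SparkContext._   // brings in the pair-RDD functions

  val sc = new SparkContext("local[2]", "join-sketch")
  val part  = new HashPartitioner(8)
  // Co-partition both sides with the same partitioner and cache them.
  val left  = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(part).cache()
  val right = sc.parallelize(Seq((1, 10),  (2, 20))).partitionBy(part).cache()
  // Narrow join: each output partition depends on one partition of each parent.
  left.join(right).collect().foreach(println)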
To run multiple workers with Spark’s standalone mode, set
SPARK_WORKER_INSTANCES and SPARK_WORKER_CORES in conf/spark-env.sh. For
example, if you have 16 cores and want 2 workers, you could add
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_CORES=8
Matei
On Apr 3, 2014, at 12:38 PM,
@Mayur... I am hitting ulimits on the cluster if I go beyond 4 cores per
worker, and I don't think I can change the ulimit due to sudo issues etc...
If I have more workers, in ALS I can go for 20 blocks (right now I am
running 10 blocks on 10 nodes with 4 cores each, and now I can go up to 20
blocks
This ultimately means a problem with SSL in the version of Java you
are using to run SBT. If you look around the internet, you'll see a
bunch of discussion, most of which boils down to reinstalling or
updating Java.
--
Sean Owen | Director, Data Science | London
On Fri, Apr 4, 2014 at 12:21
Hi all,
Could anyone explain the lines below to me?
computer1 - worker
computer8 - driver(master)
14/04/04 14:24:56 INFO BlockManagerMasterActor$BlockManagerInfo: Added
input-0-1396614314800 in memory on computer1.ant-net:60820 (size: 1262.5
KB, free: 540.3 MB)
14/04/04 14:24:56 INFO
Hi all,
I am doing some tests using JavaNetworkWordCount and I have some
questions about machine performance; my tests run for approximately 2
minutes.
Why does the available RAM decrease so markedly? I have done tests with 2 and 3
machines and got the same behavior.
What should I
Hi all,
Say I have an input file which I would like to partition using
HashPartitioner k times.
Calling rdd.saveAsTextFile(hdfs://); will save k files as part-0
part-k
Is there a way to save each partition in specific folders?
i.e. src
part0/part-0
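(One possible approach, sketched here rather than taken from the thread: key each record and write through Hadoop's MultipleTextOutputFormat so every key's records land in their own sub-folder. The keying expression and output path are made up; `rdd` and `k` are the RDD and partition count from the question.)

  import org.apache.spark.SparkContext._
  import org.apache.spark.HashPartitioner
  import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

  // Route each (key, value) record into a sub-folder named after its key,
  // e.g. src/part0/part-00000, src/part1/part-00000, ...
  // (override generateActualKey as well if the key should not appear in the output lines)
  class FolderPerKeyOutput extends MultipleTextOutputFormat[String, String] {
    override def generateFileNameForKeyValue(key: String, value: String, name: String): String =
      key + "/" + name
  }

  val keyed = rdd.map(line => ("part" + ((line.hashCode & Int.MaxValue) % k), line))
  keyed.partitionBy(new HashPartitioner(k))
       .saveAsHadoopFile("hdfs://namenode/src", classOf[String], classOf[String],
                         classOf[FolderPerKeyOutput])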
Do we have a list of things we really want to get in for 1.X? Perhaps move
any JIRAs out to a 1.1 release if we aren't targeting them for 1.0.
It might be nice to send out reminders when these dates are approaching.
Tom
On Thursday, April 3, 2014 11:19 PM, Bhaskar Dutta bhas...@gmail.com
Hi Rahul,
Spark will be available in Fedora 21 (see:
https://fedoraproject.org/wiki/SIGs/bigdata/packaging/Spark), currently
scheduled for 2014-10-14, but they have already produced spec files and source
RPMs.
If you are stuck with EL6 like me, you can have a look at the attached spec
file,
Hi Guys,
Could anyone help me understand this driver behavior when I start the
JavaNetworkWordCount?
computer8
16:24:07 up 121 days, 22:21, 12 users, load average: 0.66, 1.27, 1.55
             total       used       free     shared    buffers     cached
Mem:          5897
Hi All,
I am not sure if this is a 0.9.0 problem to be fixed in 0.9.1 so perhaps
already being addressed, but I am having a devil of a time with a spark
0.9.0 client jar for hadoop 2.X. If I go to the site and download:
- Download binaries for Hadoop 2 (HDP2, CDH5): find an Apache
mirror
Hi Erik,
I am working with TOT branch-0.9 ( 0.9.1) and the following works for me for
maven build:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Pyarn -Dhadoop.version=2.3.0 -Dyarn.version=2.3.0 -DskipTests clean package
And from
I believe you have to set the following:
SPARK_HADOOP_VERSION=2.2.0 (or whatever your version is)
SPARK_YARN=true
then type sbt/sbt assembly
If you are using Maven to compile
mvn -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests clean
package
Hope this helps
-A
On Fri, Apr 4, 2014
Hi Evan,
Could you please provide a code snippet? It's not clear to me: in
Hadoop you need to use the addNamedOutput method, and I'm stuck on how to use
it from Spark.
Thank you,
Konstantin Kudryavtsev
On Fri, Apr 4, 2014 at 5:27 PM, Evan Sparks evan.spa...@gmail.com wrote:
Have a look
Hi,
Can you explain a little more what's going on? Which one submits a job to the
yarn cluster that creates an application master and spawns containers for the
local jobs? I tried yarn-client and submitted to our yarn cluster and it seems
to work that way. Shouldn't Client.scala be running
Thanks all for the update - I have actually built using those options every
which way I can think of, so perhaps this is something about how
I upload the jar to our Artifactory repo server. Does anyone have a working pom
file for publishing a Spark 0.9 Hadoop 2.X build to a Maven repo
Hi all,
I have put this line in my spark-env.sh:
-Dspark.default.parallelism=20
Is this parallelism level correct? The machine has a dual-core processor.
Thanks
--
Privacy Notice: http://www.unibs.it/node/8155
On Wed, Apr 2, 2014 at 7:11 PM, yh18190 yh18...@gmail.com wrote:
Is it always needed that the SparkContext object be created in the main method
of a class? Is it necessary? Can we create the sc object in another class and
use it by passing this object through a function?
The Spark context can
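(The reply is cut off above; to illustrate the point with invented class names: the SparkContext only has to live on the driver, not necessarily in main(), and can be passed to other classes like any object. The one caveat is not to capture it inside closures that run on the workers, since it is not serializable.)

  import org.apache.spark.SparkContext

  class WordCounter(sc: SparkContext) {
    // Uses a context created elsewhere and handed in.
    def count(path: String): Long = sc.textFile(path).flatMap(_.split(" ")).count()
  }

  object Main {
    def main(args: Array[String]) {
      val sc = new SparkContext("local[2]", "ctx-sketch")
      println(new WordCounter(sc).count("README.md"))
      sc.stop()
    }
  }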
Hi Wisely,
Could you please post your pom.xml here.
Thanks
Hi Guys,
Could anyone explain this behavior to me? After 2 minutes of tests:
computer1 - worker
computer10 - worker
computer8 - driver (master)
computer1
18:24:31 up 73 days, 7:14, 1 user, load average: 3.93, 2.45, 1.14
             total       used       free     shared    buffers     cached
If you're running on one machine with 2 cores, I believe all you can get
out of it are 2 concurrent tasks at any one time. So setting your default
parallelism to 20 won't help.
On Fri, Apr 4, 2014 at 11:41 AM, Eduardo Costa Alfaia
e.costaalf...@unibs.it wrote:
Hi all,
I have put this line
Hi Francis,
This might be a long shot, but do you happen to have built spark on an
encrypted home dir?
(I was running into the same error when I was doing that. Rebuilding
on an unencrypted disk fixed the issue. This is a known issue /
limitation with ecryptfs. It's weird that the build doesn't
I'm trying to get a clear idea of how exceptions are handled in Spark.
Is there somewhere I can read about this? I'm on Spark 0.7.
For some reason I was under the impression that such exceptions are
swallowed and the value that produced them ignored but the exception is
logged. However,
Exceptions should be sent back to the driver program and logged there (with a
SparkException thrown if a task fails more than 4 times), but there were some
bugs before where this did not happen for non-Serializable exceptions. We
changed it to pass back the stack traces only (as text), which
In such a construct, each operator builds on the previous one, including any
materialized results etc. If I use a SQL query for each of them, I suspect the
later queries will not leverage the earlier ones in any way - hence these
will be less efficient than the first approach. Let me know if this is not
What do you advise, Nicholas?
On 4/4/14, 19:05, Nicholas Chammas wrote:
If you're running on one machine with 2 cores, I believe all you can
get out of it are 2 concurrent tasks at any one time. So setting your
default parallelism to 20 won't help.
On Fri, Apr 4, 2014 at 11:41 AM,
Is there a way to log exceptions inside a mapping function? logError and
logInfo seem to freeze things.
On Fri, Apr 4, 2014 at 11:02 AM, Matei Zaharia matei.zaha...@gmail.comwrote:
Exceptions should be sent back to the driver program and logged there
(with a SparkException thrown if a task
Btw, thank you for your help.
On Fri, Apr 4, 2014 at 11:49 AM, John Salvatier jsalvat...@gmail.comwrote:
Is there a way to log exceptions inside a mapping function? logError and
logInfo seem to freeze things.
On Fri, Apr 4, 2014 at 11:02 AM, Matei Zaharia matei.zaha...@gmail.comwrote:
Minor typo in the example. The first SELECT statement should actually be:
sql("SELECT * FROM src")
Where `src` is a Hive table with schema (key INT, value STRING).
On Fri, Apr 4, 2014 at 11:35 AM, Michael Armbrust mich...@databricks.comwrote:
In such construct, each operator builds on the
Spark community,
What's the size of the largest Spark cluster ever deployed? I've heard
Yahoo is running Spark on several hundred nodes but don't know the actual
number.
Can someone share?
Thanks
If you want more parallelism, you need more cores. So, use a machine with
more cores, or use a cluster of machines.
spark-ec2 (https://spark.apache.org/docs/latest/ec2-scripts.html) is the
easiest way to do this.
If you're stuck on a single machine with 2 cores, then set your default
parallelism to
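(As a small sketch with a made-up RDD name: the partition count can also be given per operation, but on a 2-core machine no more than 2 of those tasks ever run at once.)

  import org.apache.spark.SparkContext._   // pair-RDD functions

  // 20 reduce partitions are created, but a 2-core machine still runs them 2 at a time.
  val counts = words.map(w => (w, 1)).reduceByKey(_ + _, 20)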
FYI, one thing we’ve added now is support for reading multiple text files from
a directory as separate records: https://github.com/apache/spark/pull/327. This
should remove the need for mapPartitions discussed here.
Avro and SequenceFiles look like they may not make it for 1.0, but there’s a
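(A sketch of the new API added by that pull request, assuming it lands as SparkContext.wholeTextFiles: each file in the directory becomes a single (path, contents) record. The path below is hypothetical.)

  val files = sc.wholeTextFiles("hdfs://namenode/data/mydir")
  // One record per file, so per-file processing needs no mapPartitions tricks.
  files.map { case (path, contents) => (path, contents.length) }
       .collect()
       .foreach { case (path, len) => println(path + " -> " + len + " chars") }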
This can’t be done through the script right now, but you can do it manually as
long as the cluster is stopped. If the cluster is stopped, just go into the AWS
Console, right click a slave and choose “launch more of these” to add more. Or
select multiple slaves and delete them. When you run
Hi Christophe,
Thanks for your reply and the spec file. I have solved my issue for now. I
didn't want to rely on building Spark using the spec file (%build section), as I
don't want to be maintaining the list of files that need to be packaged. I
ended up adding maven build support to
Hi Tathagata,
You are right, this code compiles, but I am having some problems with high
memory consumption; I sent some emails about this today, but no response
so far.
Thanks
On 4/4/14, 22:56, Tathagata Das wrote:
I haven't really compiled the code, but it looks good to me. Why? Is
there any
Logging inside a map function shouldn't freeze things. The messages
should be logged on the worker logs, since the code is executed on the
executors. If you throw a SparkException, however, it'll be propagated to
the driver after it has failed 4 or more times (by default).
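(In practice that looks something like the sketch below, with invented names: catch and log inside the task, and the messages show up in the worker/executor logs rather than on the driver.)

  import org.apache.log4j.Logger

  val parsed = lines.map { line =>
    try {
      line.toInt
    } catch {
      case e: NumberFormatException =>
        // Written to the log of whichever worker runs this task, not the driver's console.
        Logger.getLogger("myapp.tasks").warn("bad record: " + line, e)
        0   // fall back to a default instead of failing the task
    }
  }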
On Fri, Apr 4, 2014 at
There is no compression type setting for Snappy.
Sent from my iPhone5s
On April 4, 2014, at 23:06, Konstantin Kudryavtsev
kudryavtsev.konstan...@gmail.com wrote:
Can anybody suggest how to change compression level (Record, Block) for
Snappy?
if it is possible, of course
thank you in advance
Thank
All
Are there any drawbacks or technical challenges (or any information, really)
related to using Spark directly on a global parallel filesystem like
Lustre/GPFS?
Any idea of what would be involved in doing a minimal proof of concept? Is it
just possible to run Spark unmodified (without the
As long as the filesystem is mounted at the same path on every node, you should
be able to just run Spark and use a file:// URL for your files.
The only downside with running it this way is that Lustre won’t expose data
locality info to Spark, the way HDFS does. That may not matter if it’s a
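(Concretely, that usage is just the sketch below; the mount point is hypothetical.)

  val data = sc.textFile("file:///mnt/lustre/datasets/events.log")
  data.filter(_.contains("ERROR"))
      .saveAsTextFile("file:///mnt/lustre/output/errors")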
Thanks will take a look...
Sent from my iPad
On Apr 3, 2014, at 7:49 AM, FRANK AUSTIN NOTHAFT fnoth...@berkeley.edu
wrote:
We use avro objects in our project, and have a Kryo serializer for generic
Avro SpecificRecords. Take a look at:
Does Spark in general assure exactly-once semantics? What happens to
those guarantees in the presence of updateStateByKey operations - are
they also assured to be exactly-once?
Thanks
manku.timma at outlook dot com
Hey Parviz,
There was a similar thread a while ago... I think that many companies like
to be discreet about the size of large clusters. But of course it would be
great if people wanted to share openly :)
For my part - I can say that Spark has been benchmarked on
hundreds-of-nodes clusters before
We might be able to incorporate the maven rpm plugin into our build. If
that can be done in an elegant way it would be nice to have that
distribution target for people who wanted to try this with arbitrary Spark
versions...
Personally I have no familiarity with that plug-in, so curious if anyone
We run Spark (in Standalone mode) on top of a network-mounted file system
(NFS), rather than HDFS, and find it to work great. It required no modification
or special configuration to set this up; as Matei says, we just point Spark to
data using the file location.
-- Jeremy
On Apr 4, 2014, at
On Fri, Apr 4, 2014 at 5:12 PM, Matei Zaharia matei.zaha...@gmail.comwrote:
As long as the filesystem is mounted at the same path on every node, you
should be able to just run Spark and use a file:// URL for your files.
The only downside with running it this way is that Lustre won't expose
I occasionally see links to pages in the spark.incubator.apache.org domain.
Can we HTTP 301 redirect that whole domain to spark.apache.org now that
the project has graduated? The content seems identical.
That would also make the eventual decommission of the incubator domain much
easier as usage
I'm looking forward to that myself! Seems to be hung up with Apache
infrastructure though.
https://issues.apache.org/jira/plugins/servlet/mobile#issue/INFRA-7398
On Apr 4, 2014 11:19 PM, Andrew Ash and...@andrewash.com wrote:
I occasionally see links to pages in the
@Patrick I think there is a bug... when this timeout happens, I suddenly
see some negative ms numbers in the Spark UI. I tried to send a picture showing
the negative ms numbers, but it was rejected by the mailing list... I will send
it to your Gmail...
From the archive I saw some more suggestions:
It
Interested in a resolution to this. I'm building a large triangular matrix, so
doing something similar to ALS - lots of work on the worker nodes, and it keeps
timing out. Tried a few updates to Akka frame sizes, timeouts and the block
manager, but was unable to complete. Will try the blockManagerSlave property now and
From the documentation this is what I understood:
1. spark.worker.timeout: Number of seconds after which the standalone
deploy master considers a worker lost if it receives no heartbeats.
default: 60
I increased it to 600.
It was pointed out before that if there is GC overload and the worker
Avati, depending on your specific deployment config, there can be up to a
10X difference in data loading time. For example, we routinely parallel
load 10+GB data files across small 8-node clusters in 10-20 seconds, which
would take about 100s if bottlenecked over a 1GigE network. That's about
the
Hi Rahul,
As Christophe pointed out, Spark has been in Fedora Rawhide (which will become
Fedora 21) for a little while now. (I haven't announced it here because
Rawhide is a little too bleeding-edge for most end-users.) With native
packages of any kind, there are a couple of considerations:
This does not seem to help:
export SPARK_JAVA_OPTS="-Dspark.local.dir=/app/spark/tmp -Dspark.worker.timeout=600 -Dspark.akka.timeout=200 -Dspark.storage.blockManagerSlaveTimeoutMs=30"
Getting the message leads to GC failure, followed by the master declaring the
worker dead!
This is related to
Setting spark.worker.timeout should not help you. What this value means is
that the master checks every 60 seconds whether the workers are still
alive, as the documentation describes. But this value also determines how
often the workers send HEARTBEAT messages to notify the master of their
Christopher
Just to clarify - by ‘load ops’ do you mean RDD actions that result in IO?
Venkat
From: Christopher Nguyen c...@adatao.com
Reply-To: user@spark.apache.org user@spark.apache.org
Date: Saturday, April 5,
Hi,
This will work nicely unless you're using spot instances; in that case
the start does not work, as slaves are lost on shutdown.
I feel like the spark-ec2 script needs a major refactor to cope with new
features and more users using it in dynamic environments.
Are there any current plans to migrate it to
Hi All,
I am using Spark 0.9.0 and am able to run my program successfully if the Spark
master and worker are on the same machine.
If I run the same program with the Spark master on Machine A and the worker on
Machine B, I get the exception below.
I am running the program with java -cp ... instead of the scala command
It worked when I converted the nested RDD to an array
--
case class TradingTier(tierId:String, lowerLimit:Int,upperLimit:Int ,
transactionFees:Double)
//userTransactions Seq[(accountId,numTransactions)]
val userTransactionsRDD =
I am seeing a small standalone cluster (master, slave) hang when I reach a
certain memory threshold, but I cannot detect how to configure memory to avoid
this.
I added memory by configuring SPARK_DAEMON_MEMORY=2G and I can see this
allocated, but it does not help.
The reduce is by key to get
Hi Will,
For issue #2 I was concerned that the build packaging had to be
internal. So I am using the already packaged make-distribution.sh
(modified to use a Maven build) to create a tarball, which I then package
using an RPM spec file.
Although on a side note, it would be interesting to learn
Can you provide an example?
Dear all,
We have a Spark 0.8.1 cluster on Mesos 0.15. Some of my colleagues are
familiar with Python, but some features are developed in Java. I am
looking for a way to integrate Java and Python on Spark.
I notice that the initialization of PySpark does not include a field to
distribute
Hi,
Any thoughts on this? Thanks.
-Suren
On Thu, Apr 3, 2014 at 8:27 AM, Surendranauth Hiraman
suren.hira...@velos.io wrote:
Hi,
I know if we call persist with the right options, we can have Spark
persist an RDD's data on disk.
I am wondering what happens in intermediate operations
Hi Shark,
Should I assume that Shark users should not use the Shark APIs, since there
is no documentation for them? If there is documentation, can you point it
out?
Best Regards,
Jerry
On Thu, Apr 3, 2014 at 9:24 PM, Jerry Lam chiling...@gmail.com wrote:
Hello everyone,
I have
Hi,
I was going through Matei's Advanced Spark presentation at
https://www.youtube.com/watch?v=w0Tisli7zn4 , and had few questions.
The presentation of this video is at
http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf
The PageRank example
Hi All,
I wanted to get Spark on YARN up and running.
I did *SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true ./sbt/sbt assembly*
Then i ran
*SPARK_JAR=./assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop2.3.0.jar
Hi Rahul, so the issue for downstream operating system
Hi,
We have a situation where a Pyspark script works fine as a local process
(local url) on the Master and the Worker nodes, which would indicate that
all python dependencies are set up properly on each machine.
But when we try to run the script at the cluster level (using the master's
url), if
It might help if I clarify my questions. :-)
1. Is persist() applied during the transformation right before the
persist() call in the graph? Or is it applied after the transform's
processing is complete? In the case of things like GroupBy, is the Seq
backed by disk as it is being created? We're
Hi TD,
Could you explain this part of the code to me?
.reduceByKeyAndWindow(
    new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) { return i1 + i2; }
    },
    new Function2<Integer, Integer, Integer>() {
      public
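(The same windowed reduce written in Scala, as a sketch with made-up durations: the first function merges counts entering the window, and an optional second, inverse function subtracts counts leaving it; `pairs` is assumed to be a DStream of (word, count) pairs as in the quoted example.)

  import org.apache.spark.streaming.Seconds

  val windowedCounts = pairs.reduceByKeyAndWindow(
    (a: Int, b: Int) => a + b,   // add counts entering the window
    (a: Int, b: Int) => a - b,   // subtract counts leaving the window
    Seconds(30), Seconds(10))    // window length, slide interval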
Hi all,
On the old Amazon Linux EC2 images, the user 'root' was enabled for ssh.
Also, it is the default user for the Spark-EC2 script.
Currently, the Amazon Linux images have an 'ec2-user' set up for ssh
instead of 'root'.
I can see that the Spark-EC2 script allows you to specify which user to
Hi,
I'm trying to use SparkContext.addFile() to propagate a file to worker
nodes, in a standalone cluster (2 nodes, 1 master, 1 worker connected to the
master). I don't have HDFS or any distributed file system. Just playing with
basic stuff.
Here's the code in my driver (actually spark-shell
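(The poster's code is cut off above; for reference, a minimal sketch of the addFile pattern with a hypothetical file name: the driver registers the file, and each task resolves its local copy through SparkFiles.)

  import org.apache.spark.SparkFiles

  sc.addFile("/home/me/lookup.txt")
  val lineCounts = sc.parallelize(1 to 4).map { i =>
    // Path of the copy shipped to whichever worker runs this task.
    val localPath = SparkFiles.get("lookup.txt")
    scala.io.Source.fromFile(localPath).getLines().size
  }
  lineCounts.collect()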
That work is under submission at an academic conference and will be made
available if/when the paper is published.
In terms of algorithms for hyperparameter tuning, we consider Grid Search,
Random Search, a couple of older derivative-free optimization methods, and
a few newer methods - TPE (aka
I might be wrong here but I don't believe it's discouraged. Maybe part
of the reason there's not a lot of examples is that sql2rdd returns an
RDD (TableRDD that is
https://github.com/amplab/shark/blob/master/src/main/scala/shark/SharkContext.scala).
I haven't done anything too complicated yet but
Hi Shivaram,
OK so let's assume the script CANNOT take a different user and that it must
be 'root'. The typical workaround is as you said, allow the ssh with the
root user. Now, don't laugh, but, this worked last Friday, but today
(Monday) it no longer works. :D Why? ...
...It seems that NOW,
Hi Guys,
I would like to understand why the driver's RAM goes down. Does the
processing occur only in the workers?
Thanks
# Start Tests
computer1(Worker/Source Stream)
23:57:18 up 12:03, 1 user, load average: 0.03, 0.31, 0.44
             total       used       free     shared
Hmm -- That is strange. Can you paste the command you are using to launch
the instances ? The typical workflow is to use the spark-ec2 wrapper script
using the guidelines at http://spark.apache.org/docs/latest/ec2-scripts.html
Shivaram
On Mon, Apr 7, 2014 at 1:53 PM, Marco Costantini
Hello, Spark community! My name is Paul. I am a Spark newbie, evaluating
version 0.9.0 without any Hadoop at all, and need some help. I run into the
following error with the StatefulNetworkWordCount example (and similarly in my
prototype app, when I use the updateStateByKey operation). I get
Thanks Shivaram! Will give it a try and let you know.
Regards,
Pawan Venugopal
On Mon, Apr 7, 2014 at 3:38 PM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
You can create standalone jobs in SparkR as just R files that are run
using the sparkR script. These commands will be sent
Hi,
I am looking for users of Spark to join my teams here at Amazon. If you are
reading this, you probably qualify.
I am looking for developers of ANY level, but with an interest in Spark. My
teams are leveraging Spark to solve real business scenarios.
If you are interested, just shoot me a note
Few things that would be helpful.
1. Environment settings - you can find them on the environment tab in the
Spark application UI
2. Are you setting the HDFS configuration correctly in your Spark program?
For example, can you write a HDFS file from a Spark program (say
spark-shell) to your HDFS
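(For example, the check in point 2 might look like the following from spark-shell; the namenode address is a placeholder for your cluster's fs.default.name.)

  val nums = sc.parallelize(1 to 1000)
  nums.saveAsTextFile("hdfs://namenode:8020/tmp/spark-hdfs-check")
  // Read it back; this should print 1000 if HDFS access is configured correctly.
  println(sc.textFile("hdfs://namenode:8020/tmp/spark-hdfs-check").count())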
Any reason why RDDInfo suddenly became private in SPARK-1132?
We are using it to show users the status of RDDs.
OK, yeah, we are using StageInfo and TaskInfo too...
On Mon, Apr 7, 2014 at 8:51 PM, Andrew Or and...@databricks.com wrote:
Hi Koert,
Other users have expressed interest for us to expose similar classes too
(i.e. StageInfo, TaskInfo). In the newest release, they will be available
as part of
1.: I will paste the full content of the environment page of the example
application running against the cluster at the end of this message.
2. and 3.: Following #2 I was able to see that the count was incorrectly 0
when running against the cluster, and following #3 I was able to get the
Great!!!
When I built it on another disk formatted as ext4, it works now.
hadoop@ubuntu-1:~$ df -Th
Filesystem     Type      Size  Used Avail Use% Mounted on
/dev/sdb6      ext4      135G  8.6G  119G   7% /
udev           devtmpfs  7.7G  4.0K  7.7G   1% /dev
tmpfs
On Mon, Apr 7, 2014 at 7:37 PM, Brad Miller bmill...@eecs.berkeley.eduwrote:
I am running the latest version of PySpark branch-0.9 and having some
trouble with join.
One RDD is about 100G (25GB compressed and serialized in memory) with
130K records, the other RDD is about 10G (2.5G
Hi all,
Here I am sharing a blog post for beginners about creating a standalone Spark
Streaming application and bundling the app as a single runnable jar. Take a
look and drop your comments on the blog page.
http://prabstechblog.blogspot.in/2014/04/a-standalone-spark-application-in-scala.html
Hi Everyone,
I saved a 2GB PDF file into MongoDB using GridFS. Now I want to process that
GridFS collection data using Java Spark MapReduce. Previously I
successfully processed normal MongoDB collections (not GridFS) with Apache
Spark using the Mongo-Hadoop connector. Now I'm unable to handle input
Yes, that is correct. If you are executing a Spark program across multiple
machines, then you need to use a distributed file system (HDFS API
compatible) for reading and writing data. In your case, your setup is
across multiple machines. So what is probably happening is that the RDD
data is
Hello,
I am running Cloudera 4 node cluster with 1 Master and 3 Slaves. I am
connecting with Spark Master from scala using SparkContext. I am trying to
execute a simple java function from the distributed jar on every Spark
Worker but haven't found a way to communicate with each worker or a Spark
In my application, data parts inside an RDD partition have relations, so I
need to do some operations between them.
For example:
RDD T1 has several partitions, each partition has three parts A, B and C.
Then I transform T1 to T2. After the transform, T2 also has three parts D, E and
F: D = A+B, E =
So, the data structure looks like:
D consists of D1, D2, D3 (DX is a partition)
and
DX consists of d1, d2, d3 (dx is the part in your context)?
What you want to do is to transform
DX to (d1 + d2, d1 + d3, d2 + d3)?
Best,
--
Nan Zhu
On Tuesday, April 8, 2014 at 8:09 AM, wxhsdp wrote:
Yes, how can I do this conveniently? I can use filter, but there will be so
many RDDs and it's not concise.
If that's the case, I think mapPartitions is what you need, but it seems that
you have to load the partition into memory as a whole via toArray:
rdd.mapPartitions { D => val p = D.toArray; ... }
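(A slightly fuller sketch of that suggestion, where the pairwise sum stands in for whatever relation actually holds between the parts, and t1 is assumed to be an RDD of numeric parts:)

  val t2 = t1.mapPartitions { iter =>
    val parts = iter.toArray                 // materialize this partition: d1, d2, d3, ...
    parts.combinations(2).map(_.sum)         // d1+d2, d1+d3, d2+d3
  }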
--
Nan Zhu
On Tuesday, April 8, 2014 at 8:40 AM, wxhsdp wrote:
yes, how can i do this