Simone, here are some thoughts. Please check out the understanding closures
section of the Spark Programming Guide. Secondly, broadcast variables do not
propagate updates to the underlying data. You must either create a new
broadcast variable or, alternatively, if you simply wish to accumulate
Have you tried reading the Spark documentation?
http://spark.apache.org/docs/latest/programming-guide.html
Thank you,
Ilya Ganelin
-Original Message-
From: ÐΞ€ρ@Ҝ (๏̯͡๏) [deepuj...@gmail.com]
Sent: Thursday, August 06, 2015 12:41 AM Eastern Standard Time
To be unpersisted, the RDD must be persisted first. If it's set to None, then
it's not persisted, and as such does not need to be freed. Does that make
sense?
Thank you,
Ilya Ganelin
-Original Message-
From: Stahlman, Jonathan
Hi all – I’m just wondering if anyone has had success integrating Spark
Streaming with Zeppelin and actually dynamically updating the data in near
real-time. From my investigation, it seems that Zeppelin will only allow you to
display a snapshot of data, not a continuously updating table. Has
You may pass an optional parameter (blocking = false) to make it lazy.
Thank you,
Ilya Ganelin
-Original Message-
From: Jem Tucker [jem.tuc...@gmail.com]
Sent: Thursday, July 02, 2015 04:06 AM Eastern Standard Time
To: Akhil Das
Cc: user
Subject: Re: Making
The parallelize operation accepts as input a data structure in memory. When you
call it, you are necessarily operating in the memory space of the driver, since
that is where user code executes. Until you have an RDD, you can't really
operate in a distributed way.
If your files are stored in a
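The parallelize point above can be sketched without Spark. This plain-Python analogy shows how a driver-side collection gets sliced into partitions; the index-range slicing arithmetic is illustrative, not a claim about Spark's exact internals:

```python
# Plain-Python analogy (no Spark): parallelize chunks an in-memory,
# driver-side collection into num_slices partitions.
data = list(range(1, 11))      # the collection held in driver memory
num_slices = 3

def slice_bounds(i, n, slices):
    # Partition i covers the index range [i*n//slices, (i+1)*n//slices).
    return (i * n // slices, (i + 1) * n // slices)

partitions = [data[lo:hi]
              for lo, hi in (slice_bounds(i, len(data), num_slices)
                             for i in range(num_slices))]
# Every element lands in exactly one partition: [[1,2,3], [4,5,6], [7,8,9,10]]
```

Until this chunking happens and the pieces are shipped to executors, all the data lives in one process, which is why parallelize is bounded by driver memory.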
I actually talk about this exact thing in a blog post here:
http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/
Keep in mind, you're actually doing a ton of math. Even with proper caching
and use of broadcast variables this will
Also check out the spark.cleaner.ttl property. Otherwise, you will accumulate
shuffle metadata in the memory of the driver.
Sent with Good (www.good.com)
-Original Message-
From: Silvio Fiorito
[silvio.fior...@granturing.com]
Sent: Monday, May 11,
Marco - why do you want data sorted both within and across partitions? If you
need to take an ordered sequence across all your data you need to either
aggregate your RDD on the driver and sort it, or use zipWithIndex to apply an
ordered index to your data that matches the order it was stored on
What command are you using to untar? Are you running out of disk space?
-Original Message-
From: ÐΞ€ρ@Ҝ (๏̯͡๏) [deepuj...@gmail.com]
Sent: Monday, April 27, 2015 11:44 AM Eastern Standard Time
To: user
Subject: Spark 1.3.1 Hadoop
michal.michal...@boxever.com]
Sent: Friday, April 24, 2015 11:04 AM Eastern Standard Time
To: Ganelin, Ilya
Cc: Spico Florin; user
Subject: Re: Does HadoopRDD.zipWithIndex method preserve the order of the input
data from Hadoop?
The problem I'm facing is that I need to process lines from input
Michael - you need to sort your RDD. Check out the shuffle documentation on the
Spark Programming Guide. It talks about this specifically. You can resolve this
in a couple of ways - either by collecting your RDD and sorting it, using
sortBy, or not worrying about the internal ordering. You can
-Original Message-
From: Michal Michalski
[michal.michal...@boxever.com]
Sent: Friday, April 24, 2015 11:18 AM Eastern Standard Time
To: Ganelin, Ilya
Cc: Spico Florin; user
Subject: Re: Does HadoopRDD.zipWithIndex method preserve the order of the input
data from Hadoop?
I
You need to expose that variable the same way you'd expose any other variable
in Python that you wanted to see across modules. As long as you share a spark
context all will work as expected.
http://stackoverflow.com/questions/142545/python-how-to-make-a-cross-module-variable
Write Kafka stream to HDFS via Spark streaming then ingest files via Spark from
HDFS.
-Original Message-
From: Shushant Arora
[shushantaror...@gmail.com]
Sent: Saturday, April 18, 2015 06:44 AM Eastern Standard Time
To:
Map partitions works as follows:
1) For each partition of your RDD, it provides an iterator over the values
within that partition
2) You then define a function that operates on that iterator
Thus if you do the following:
val parallel = sc.parallelize(1 to 10, 3)
parallel.mapPartitions(x => Iterator(x.sum))
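The iterator-in / iterator-out contract can be modeled without a cluster. This is a plain-Python analogy only; the partition contents and the per-partition sum are illustrative:

```python
# Model an RDD's partitions as a list of lists and apply a per-partition
# function over each partition's iterator, like mapPartitions does.
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]

def per_partition(it):
    # Takes an iterator over one partition and yields an iterator of
    # results -- here a single partial sum per partition.
    yield sum(it)

result = [x for part in partitions for x in per_partition(iter(part))]
# result == [6, 15, 34], one value per partition
```

The key point is that the function runs once per partition, not once per element, which is why mapPartitions is the right hook for per-partition setup costs (connections, buffers, and so on).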
You're not using the broadcast variable within your map operations. You're
attempting to modify myObject directly, which won't work because you are
modifying the serialized copy on the executor. You want to do
myObjectBroadcasted.value.insert and myObjectBroadcasted.value.lookup.
When writing to HDFS, Spark will not overwrite existing files or directories.
You must either delete these manually or use Hadoop's FileSystem class (Java
API) to remove them.
-Original Message-
From: Pavel Velikhov
As Ted Yu points out, default block size is 128MB as of Hadoop 2.1.
-Original Message-
From: Ilya Ganelin [ilgan...@gmail.com]
Sent: Thursday, February 19, 2015 12:13 PM Eastern Standard Time
To: Alessandro Lulli;
Hi all - I've spent a while playing with this. Two significant sources of
speedup that I've achieved are:
1) Manually multiplying the feature vectors and caching either the user or
product vector
2) By doing so, if one of the RDDs is a global it becomes possible to
parallelize this step by
Make a copy of your RDD with an extra entry in the beginning to offset. Then
you can zip the two RDDs and run a map to generate an RDD of differences.
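The offset-and-zip trick can be sketched driver-side in plain Python (the values are illustrative):

```python
# Prepend a sentinel copy of the first element, zip the shifted sequence
# with the original, and map each pair to a difference.
xs = [3, 7, 12, 20]
shifted = [xs[0]] + xs          # extra entry at the front as the offset
diffs = [cur - prev for cur, prev in zip(xs, shifted)]
# zip stops at the shorter sequence, so diffs == [0, 4, 5, 8]
```

On actual RDDs the same idea needs matching partitioning on both sides before zip, but the pairwise arithmetic is identical.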
-Original Message-
From: derrickburns [derrickrbu...@gmail.com]
Sent:
]
Sent: Tuesday, January 13, 2015 08:06 PM Eastern Standard Time
To: Kevin Burton
Cc: Ganelin, Ilya; user@spark.apache.org
Subject: Re: quickly counting the number of rows in a partition?
Hi,
On Mon, Jan 12, 2015 at 8:09 PM, Ganelin, Ilya
ilya.gane...@capitalone.com
There are two related options:
To solve your problem directly try:
val conf = new SparkConf().set("spark.yarn.driver.memoryOverhead", "1024")
val sc = new SparkContext(conf)
And the second, which increases the overall memory available on the driver, as
part of your spark-submit script add:
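The message is truncated here; as an assumption about what followed, the same property can also be passed on the spark-submit command line, along these lines (the class and jar names are placeholders, not from the thread):

```shell
# Hypothetical completion (assumption): pass the same property shown in
# the code snippet above as a --conf flag on spark-submit.
spark-submit \
  --conf spark.yarn.driver.memoryOverhead=1024 \
  --class com.example.MyApp \
  myapp.jar
```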
Try loading the features as:
val userFeatures = sc.objectFile(path1)
val productFeatures = sc.objectFile(path2)
and then call the constructor of MatrixFactorizationModel with those.
-Original Message-
From: wanbo [gewa...@163.com]
Hi all.
In order to get Spark to properly release memory during batch processing, as a
workaround to issue https://issues.apache.org/jira/browse/SPARK-4927, I tear
down and re-initialize the SparkContext with:
context.stop() and
context = new SparkContext()
The problem I run into is that
Free memory readouts: 426.7 MB, 402.5 MB, 402.5 MB, 426.7 MB
From: Ganelin, Ilya ilya.gane...@capitalone.com
Date: Tuesday, December 30, 2014 at 7:30 PM
To: Ilya Ganelin ilgan...@gmail.com, Patrick Wendell pwend...@gmail.com
From: Ilya Ganelin ilgan...@gmail.com
Date: Sunday, December 28, 2014 at 4:02 PM
To: Patrick Wendell pwend...@gmail.com, Ganelin, Ilya
ilya.gane...@capitalone.com
Cc: user@spark.apache.org
Hi all, I have a long-running job iterating over a huge dataset. Parts of this
operation are cached. Since the job runs for so long, eventually the overhead
of Spark shuffles starts to accumulate, culminating in the driver starting to
swap.
I am aware of the spark.cleaner.ttl parameter that
Hi all – I’m running a long-running batch-processing job with Spark through
YARN. I am doing the following:
Batch Process
val resultsArr =
sc.accumulableCollection(mutable.ArrayBuffer[ListenableFuture[Result]]())
InMemoryArray.forEach{
1) Using a thread pool, generate callable jobs that
Also, this may be related to this issue
https://issues.apache.org/jira/browse/SPARK-3885.
Further, to clarify, data is being written to Hadoop on the data nodes.
Would really appreciate any help. Thanks!
From: Ganelin, Ilya ilya.gane...@capitalone.com
Hi all – I’ve been storing the model userFeatures and productFeatures vectors
that are generated internally serialized on disk and importing them as a
separate job.
From: Sonal Goyal sonalgoy...@gmail.com
Date: Wednesday, December 10, 2014 at 5:31 AM
To: Yanbo Liang
You want to look further up the stack (there are almost certainly other errors
before this happens), and those other errors may give you a better idea of what
is going on. Also, if you are running on YARN, you can run yarn logs
-applicationId yourAppId to get the logs from the data nodes.
Hi all, as the last stage of execution, I am writing out a dataset to disk.
Before I do this, I force the DAG to resolve so this is the only job left in
the pipeline. The dataset in question is not especially large (a few
gigabytes). During this step, however, HDFS will inevitably crash. I will
Hi Bharath – I’m unsure if this is your problem, but the
MatrixFactorizationModel in MLlib, which is the underlying component for ALS,
expects your User/Product fields to be integers. Specifically, the input to ALS
is an RDD[Rating], and Rating is an (Int, Int, Double). I am wondering if
perhaps
Hello all. I have been running a Spark Job that eventually needs to do a large
join.
24 million x 150 million
A broadcast join is infeasible in this instance clearly, so I am instead
attempting to do it with Hash Partitioning by defining a custom partitioner as:
class
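The class definition is cut off above. As a minimal sketch of the shape such a modulo-style partitioner might take (plain Python, mirroring the getPartition contract rather than extending Spark's actual Partitioner base class; the class name is hypothetical):

```python
class ModuloPartitioner:
    """Sketch of a hash partitioner: route a key to one of N partitions."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions

    def get_partition(self, key):
        # Python's % with a positive modulus always yields a value in
        # [0, num_partitions), even for negative hashes.
        return hash(key) % self.num_partitions

p = ModuloPartitioner(8)
```

Keys that hash to the same bucket on both sides of a join land in the same partition, which is what makes a partitioned (shuffle) join of 24M x 150M rows feasible where a broadcast join is not.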
Why do you only want the third partition? You can access individual partitions
using the partitions() function. You can also filter your data using the
filter() function to only contain the data you care about. Moreover, when you
create your RDDs unless you define a custom partitioner you have
To set the number of Spark cores used, you must set two parameters in the
actual spark-submit script: num-executors (the number of executors to allocate)
and executor-cores (the number of cores per executor). Please see the Spark
configuration and tuning pages for more details.
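A sketch of how those two flags might appear in a spark-submit invocation (the class and jar names are placeholders):

```shell
# 10 executors x 4 cores each => up to 40 concurrently running tasks.
spark-submit \
  --num-executors 10 \
  --executor-cores 4 \
  --class com.example.MyApp \
  myapp.jar
```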
You’ve basically got it.
The deployment step can simply be scp-ing the file to a known location on the
server and then executing a run script on the server that actually runs
spark-submit.
From: Ashic Mahtab as...@live.com
Date: Thursday, November 6, 2014 at 5:01 PM
To:
Very straightforward:
You want to use cartesian.
If you have two RDDs - RDD_1("A", "B") and RDD_2(1, 2, 3) -
RDD_1.cartesian(RDD_2) will generate the cross product between the two
RDDs and you will have
RDD_3(("A", 1), ("A", 2), ("A", 3), ("B", 1), ("B", 2), ("B", 3))
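The same cross product with plain lists, no cluster needed:

```python
# Every element of the left side is paired with every element of the
# right side, so the result has len(left) * len(right) pairs.
left = ["A", "B"]
right = [1, 2, 3]
cross = [(a, b) for a in left for b in right]
# cross == [("A",1), ("A",2), ("A",3), ("B",1), ("B",2), ("B",3)]
```

This multiplicative blow-up is also why cartesian on two large RDDs gets intractable so quickly.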
On 11/3/14, 11:38 AM, david david...@free.fr wrote:
Hi,
I'm a
Hi Jan. I've actually written a function recently to do precisely that using
the RDD.randomSplit function. You just need to calculate how big each element
of your data is, then how many elements can fit in each RDD, to populate the
input to randomSplit. Unfortunately, in my case I wind up
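One way to sketch that sizing arithmetic (all numbers are illustrative; the per-element size and target split size are assumptions, not from the thread):

```python
import math

# Estimate how many splits are needed so each stays under a size target,
# then build equal weights to feed to randomSplit.
total_elements = 1_000_000
bytes_per_element = 256                     # assumed average element size
target_split_bytes = 64 * 1024 * 1024      # aim for ~64 MB per split

elements_per_split = target_split_bytes // bytes_per_element
num_splits = math.ceil(total_elements / elements_per_split)
weights = [1.0 / num_splits] * num_splits  # randomSplit takes relative weights
# num_splits == 4 for these numbers; weights sum to 1.0
```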
Hi Ryan - I've been fighting the exact same issue for well over a month now. I
initially saw the issue in 1.0.2, but it persists in 1.1.
Jerry - I believe you are correct that this happens during a pause on
long-running jobs on a large data set. Are there any parameters that you
suggest tuning
Hey all – not writing to necessarily get a fix, but more to get an
understanding of what’s going on internally here.
I wish to take a cross-product of two very large RDDs (using cartesian), the
product of which is well in excess of what can be stored on disk. Clearly that
is intractable, thus
Hello all, I am attempting to manually build the master branch of Spark against
Cloudera’s 5.2.0 deployment. To do this I am running:
mvn -Pyarn -Dhadoop.version=2.5.0-cdh5.2.0 -DskipTests clean package
The build completes successfully and then I run:
mvn -Pyarn -Phadoop.version=2.5.0-cdh5.2.0
Hi Ami - I suspect that your code is completing because you have nothing
to actually force resolution of your job. Spark executes lazily, so for
example, if you have a bunch of maps in sequence but nothing else, Spark
will not actually execute anything.
Try adding an RDD.count() on the last RDD
You cannot have nested RDD transformations in Scala Spark. The issue is that
when the outer operation is distributed to the cluster and kicks off a new job
(the inner query), the inner job no longer has the context for the outer job.
The way around this is either to do a join on two RDDs or to
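The join alternative can be modeled with plain collections. This Python sketch (dataset names are illustrative) keys both sides on a shared key and joins, instead of looking one dataset up from inside a transformation of the other:

```python
# Instead of a nested lookup per element, key both datasets by user_id
# and join them -- the same restructuring a join on two RDDs performs.
orders = [(1, "apple"), (2, "pear"), (1, "plum")]   # (user_id, item)
users = [(1, "ann"), (2, "bob")]                     # (user_id, name)

user_by_id = dict(users)
joined = [(user_by_id[uid], item)
          for uid, item in orders if uid in user_by_id]
# joined == [("ann", "apple"), ("bob", "pear"), ("ann", "plum")]
```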
Have you checked for any global variables in your scope? Remember that even if
variables are not passed to the function they will be included as part of the
context passed to the nodes. If you can't zen out what is breaking then try to
simplify what you're doing. Set up a simple test call (like
Would RDD.map() do what you need? It will apply a function to every element of
the RDD and return the resulting RDD.
-Original Message-
From: Zhan Zhang [zzh...@hortonworks.com]
Sent: Tuesday, October 28, 2014 11:23 PM Eastern Standard Time
To: Dai, Kevin
Cc:
Hi Xiangrui - I can certainly save the data before ALS - that would be a
great first step. Why would reducing the number of partitions help? I
would very much like to understand what's happening internally. Also, with
regards to Burak's earlier comment, here is the JIRA referencing this
problem.