1. If we add more executors to the cluster and the data is already cached in the
system (the RDDs are already there), will the job run tasks on the new executors
even though the cached RDD partitions are not present there?
If yes, what will the performance be like on the new executors?
2. What is the replication factor?
Hi Cheng,
Is it possible to delete or replicate an RDD?
val rdd1 = sc.textFile("hdfs://...").cache()
val rdd2 = rdd1.filter(userDefinedFunc1).cache()
val rdd3 = rdd1.filter(userDefinedFunc2).cache()
To reframe the question above: if rdd1 is around 50 GB and after filtering it
comes down to around 4 GB,
then to increase
Yes, the second example does that. It transforms all the points of a
partition into a single element, the skyline, so reduce will run on the
skylines of two partitions rather than on individual points.
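In case it helps, here is a minimal sketch of that shape of solution; the Point type, the dominates() test and pointsRDD are illustrative assumptions, not code from this thread:

case class Point(x: Double, y: Double)

// p dominates q if it is no worse in both dimensions and strictly better in one.
def dominates(p: Point, q: Point): Boolean =
  p.x <= q.x && p.y <= q.y && (p.x < q.x || p.y < q.y)

// Naive skyline of a small collection: keep points dominated by no other point.
def skyline(points: Seq[Point]): Seq[Point] =
  points.filter(p => !points.exists(q => dominates(q, p)))

// Each partition collapses to a single element (its local skyline),
// so reduce merges skylines rather than individual points.
val localSkylines = pointsRDD.mapPartitions(it => Iterator(skyline(it.toSeq)))
val globalSkyline = localSkylines.reduce((a, b) => skyline(a ++ b))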
On 16 Apr 2014 06:47, Yanzhe Chen yanzhe...@gmail.com wrote:
Eugen,
Thanks for your tip and I do
1. Spark prefers to run tasks where the data is, but it is able to move
cached data between executors if no cores are available where the data is
initially cached (which is often much faster than recomputing the data from
scratch). The result is that data is automatically spread out across the
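(On question 2, a hedged illustration that is not from the original reply: Spark's replicated storage levels keep the stated number of copies of each cached partition on different executors; rdd below is a placeholder.)

import org.apache.spark.storage.StorageLevel

// MEMORY_ONLY_2 asks for two in-memory copies of every cached partition,
// i.e. a replication factor of 2.
val replicated = rdd.persist(StorageLevel.MEMORY_ONLY_2)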
I'd also say that running for 100 iterations is a waste of resources, as
ALS will typically converge quickly, within 10-20 iterations.
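A hedged sketch of what that looks like with MLlib's ALS; the ratings RDD and the parameter values are illustrative, not taken from this thread:

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// ratings: RDD[Rating] is assumed to exist; 10 iterations is usually
// plenty for convergence, so there is little point in running 100.
val model = ALS.train(ratings, 10 /* rank */, 10 /* iterations */, 0.01 /* lambda */)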
On Wed, Apr 16, 2014 at 3:54 AM, Xiaoli Li lixiaolima...@gmail.com wrote:
Thanks a lot for your information. It really helps me.
On Tue, Apr
I would like to know the following:
What is a partition? How does it work?
How is it different from a Hadoop partition?
For example:
sc.parallelize([1,2,3,4]).map(lambda x: (x,x)).partitionBy(2).glom().collect()
[[(2,2), (4,4)], [(1,1), (3,3)]]
From this we get 2 partitions, but what does that mean? How
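For reference, a hedged Scala equivalent of the example above, making the hash partitioner explicit:

import org.apache.spark.HashPartitioner

// Each element is routed to partition hash(key) % 2; glom() turns each
// partition into an array so the grouping is visible.
sc.parallelize(Seq(1, 2, 3, 4))
  .map(x => (x, x))
  .partitionBy(new HashPartitioner(2))
  .glom()
  .collect()
// Array(Array((2,2), (4,4)), Array((1,1), (3,3)))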
Thanks Cheng, that was helpful.
On Wed, Apr 16, 2014 at 1:29 PM, Cheng Lian lian.cs@gmail.com wrote:
You can remove the cached rdd1 from the cache manager by calling
rdd1.unpersist(). But here come some subtleties: RDD.cache() is *lazy* while
RDD.unpersist() is *eager*. When .cache() is
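A small sketch of that behaviour (paths and names are placeholders):

val rdd1 = sc.textFile("hdfs://...").cache()  // lazy: nothing is materialised yet
rdd1.count()                                  // first action actually populates the cache
val rdd2 = rdd1.filter(userDefinedFunc1)
rdd2.count()                                  // reads rdd1 from the cache
rdd1.unpersist()                              // eager: cached blocks are dropped immediately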
Dear all,
I developed an application in which the size of the messages being
communicated is sometimes greater than 10 MB.
For smaller datasets it works fine, but it fails for larger datasets.
Please see the error message below.
I searched online and many people said
the problem can be
It seems you do not have enough memory on the Spark driver. Hints below:
On 2014-04-15 12:10, Qin Wei wrote:
val resourcesRDD = jsonRDD.map(arg => arg.get(rid).toString.toLong).distinct
// the program crashes at this line of code
val bcResources =
Hi,
I have browsed the online documentation, and it states that PySpark only
reads text files as sources. Is that still the case?
From what I understand, after this first step the RDD can hold any serialized
Python structure, as long as the class definitions are properly distributed.
Is it not possible to read
Howdy all,
I recently saw that the ORC InputFormat/OutputFormat have been exposed
so they are usable outside of Hive
(https://issues.apache.org/jira/browse/HIVE-5728). Does anyone know how
one could use this with saveAsNewAPIHadoopFile to write records in ORC
format?
In particular, I would
Hi,
I am running Spark 0.9.1 on a YARN cluster, and I am wondering what the
correct way is to add external jars when running a Spark shell on a YARN cluster.
Packaging all these dependencies into an assembly whose path is then set in
SPARK_YARN_APP_JAR (as written in the doc:
I am also stuck on the same issue, but on Shark (0.9 with Spark 0.9) on
Hadoop 2.2.0.
On the other Hadoop versions it works perfectly.
Regards,
Arpit Tak
On Wed, Apr 16, 2014 at 11:18 PM, Aureliano Buendia buendia...@gmail.com wrote:
Is this resolved in spark 0.9.1?
On Tue, Apr 15, 2014 at 6:55
Also try this ...
http://docs.sigmoidanalytics.com/index.php/How_to_Install_Spark_on_Ubuntu-12.04
http://docs.sigmoidanalytics.com/index.php/How_to_Install_Spark_on_HortonWorks_VM
Regards,
arpit
On Thu, Apr 10, 2014 at 3:04 AM, Pradeep baji pradeep.chanum...@gmail.com wrote:
Thanks Prabeesh.
It's because there is no slf4j directory there; maybe they are updating it.
https://oss.sonatype.org/content/repositories/snapshots/org/
Hard luck, try again after some time...
Regards,
Arpit
On Thu, Apr 17, 2014 at 12:33 AM, Yiou Li liy...@gmail.com wrote:
Hi all,
I am trying to
Glad to hear you're making progress! Do you have a working version of the
join? Is there anything else you need help with?
On Wed, Apr 16, 2014 at 7:11 PM, Roger Hoover roger.hoo...@gmail.com wrote:
Ah, in case this helps others, looks like RDD.zipPartitions will
accomplish step 4.
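For anyone landing here later, a tiny hedged illustration of RDD.zipPartitions (not the poster's actual step 4); both RDDs must have the same number of partitions:

val nums    = sc.parallelize(1 to 6, 2)
val letters = sc.parallelize(Seq("a", "b", "c", "d", "e", "f"), 2)

// The function receives one iterator per input RDD, partition by partition.
val zipped = nums.zipPartitions(letters) { (ns, ls) => ns.zip(ls) }
zipped.collect() // Array((1,a), (2,b), (3,c), (4,d), (5,e), (6,f))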
On
Thanks for following up. I hope to get some free time this afternoon to
get it working. Will let you know.
On Wed, Apr 16, 2014 at 12:43 PM, Andrew Ash and...@andrewash.com wrote:
Glad to hear you're making progress! Do you have a working version of the
join? Is there anything else you
Hi Bertrand,
We should probably add a SparkContext.pickleFile and RDD.saveAsPickleFile that
will allow saving pickled objects. Unfortunately this is not in yet, but there
is an issue up to track it: https://issues.apache.org/jira/browse/SPARK-1161.
In 1.0, one feature we do have now is the
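(As a hedged aside, not part of the original reply: the Scala/Java API already has an analogous pair, RDD.saveAsObjectFile and SparkContext.objectFile, which round-trip Java-serialized objects; the path below is a placeholder.)

val numbers = sc.parallelize(1 to 100)
numbers.saveAsObjectFile("hdfs://.../numbers")           // write serialized objects
val restored = sc.objectFile[Int]("hdfs://.../numbers")  // read them back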
Hi,
I have a large dataset of elements [RDD] and I want to divide it into two
exactly equal-sized partitions while maintaining the order of the elements.
I tried using RangePartitioner like
var data = partitionedFile.partitionBy(new RangePartitioner(2, partitionedFile))
This doesn't give satisfactory results
Never mind. I'll take it from both Andrew and Syed's comments that the
answer is yes. Dunno why I thought otherwise.
On Wed, Apr 16, 2014 at 5:43 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
I’m running into a similar issue as the OP. I’m running the same job over
and over (with
Hi Sean,
It's true that sbt is trying different links, but ALL of them have
connection issues (which is actually a 404 File Not Found error), and the
build process takes forever connecting to the different links.
I don't think it's a proxy issue, because my other projects (other than
Spark) build fine
From the Spark tuning guide (http://spark.apache.org/docs/latest/tuning.html):
In general, we recommend 2-3 tasks per CPU core in your cluster.
I think you can only get one task per partition to run concurrently for a
given RDD. So if your RDD has 10 partitions, then 10 tasks at most can
operate
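A hedged sketch of acting on that advice; the RDD, the path and the core count are placeholders:

val totalCores = 8                               // assumption: 8 cores available in the cluster
val tuned = rdd.repartition(totalCores * 3)      // aim for roughly 2-3 tasks per core
// or set the partition count when the data is first read:
val lines = sc.textFile("hdfs://...", totalCores * 3)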
When this is implemented, can you load/save an RDD of pickled objects to
HDFS?
On Thu, Apr 17, 2014 at 1:51 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
Hi Bertrand,
We should probably add a SparkContext.pickleFile and RDD.saveAsPickleFile
that will allow saving pickled objects.