Hi Team,
Please let me know how to get a list of all persisted RDDs in the Spark 2.0
shell.
Regards,
Nancy
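For reference, SparkContext exposes getPersistentRDDs, which returns the RDDs currently marked as persisted in the context. A minimal spark-shell sketch (assumes the shell's built-in `sc`; requires a running Spark session, so it is not runnable standalone):

```scala
// List all persisted RDDs in the current SparkContext.
// getPersistentRDDs returns a Map[Int, RDD[_]] keyed by RDD id.
val cached = sc.getPersistentRDDs
cached.foreach { case (id, rdd) =>
  println(s"RDD $id: name=${rdd.name} storage=${rdd.getStorageLevel.description}")
}
```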
What version of Spark are you using?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Number-Of-Partitions-in-RDD-tp28730p28732.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
I'm running into a weird issue with a stateful streaming job I'm running.
(Spark 2.1.0 reading from kafka 0-10 input stream.)
From what I understand from the docs, by default the checkpoint interval
for stateful streaming is 10 * batchInterval. Since I'm running a batch
interval of 10 seconds,
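The checkpoint interval for a stateful DStream can also be set explicitly rather than relying on the default. A sketch, assuming `ssc` is an active StreamingContext and `stateStream` is a stateful DStream (e.g. from mapWithState); the directory path is illustrative:

```scala
import org.apache.spark.streaming.Seconds

// Checkpointing requires a fault-tolerant directory.
ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")

// Override the default checkpoint interval for the stateful stream,
// e.g. 10 x a 10-second batch interval.
stateStream.checkpoint(Seconds(100))
```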
The issue seems to be with the primordial class loader. I cannot load the
drivers onto all the nodes at the same location, but I have loaded the jars to
HDFS. I have tried SPARK_YARN_DIST_FILES as well as SPARK_CLASSPATH on the edge
node with no luck. Is there another way to load these jars through
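One common approach for jars that must be visible to the driver's class loader (e.g. JDBC drivers) is to distribute them via spark-submit rather than environment variables, which are deprecated in Spark 2.x. A sketch with illustrative paths and jar names:

```shell
# Ship a driver jar from HDFS to all containers and put it on both
# driver and executor classpaths (jar name and paths are examples).
spark-submit \
  --master yarn \
  --jars hdfs:///libs/jdbc-driver.jar \
  --conf spark.driver.extraClassPath=jdbc-driver.jar \
  --conf spark.executor.extraClassPath=jdbc-driver.jar \
  my-app.jar
```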
While I'm not sure why you're seeing an increase in partitions with such a
small data file, it's worth noting that the second parameter to textFile is
the *minimum* number of partitions so there's no guarantee you'll get
exactly that number.
--
Michael Mior
mm...@apache.org
2017-06-01 6:28
Hi,
I am creating an RDD from a text file by specifying the number of partitions,
but it gives me a different number of partitions than the one specified.
scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 0)
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[72] at
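To illustrate the point above: the second argument to textFile is minPartitions, a lower bound rather than an exact count. A spark-shell sketch (assumes `sc`; the file path is illustrative, so this needs a running Spark session):

```scala
// minPartitions is a hint: Spark may split the input into more
// partitions than requested, depending on the input format.
val people = sc.textFile("file:///home/pvikash/data/test.txt", 4)
println(people.getNumPartitions)  // usually at least 4, not exactly 4

// To force an exact partition count, repartition afterwards.
val exact = people.repartition(4)
```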
As a matter of interest what is the best way of creating virtualised
clusters all pointing to the same physical data?
thanks
Dr Mich Talebzadeh
LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
If necessary, you can use a local cache like Alluxio.
On 1 Jun 2017 at 10:23 AM, "Mich Talebzadeh" wrote:
Thanks Vincent. I assume by physical data locality you mean you are going
through Isilon and HCFS and not through direct HDFS.
Also I agree with you that shared network could be an issue as well.
However, it allows you to reduce data redundancy (you do not need R3 in
HDFS anymore) and also you
I don't recommend this kind of design because you lose physical data
locality and you will be affected by "bad neighbors" that are also using
the network storage... We have one similar design but restricted to small
clusters (more for experiments than production).
2017-06-01 9:47 GMT+02:00 Mich
Thanks Jorn,
This was a proposal made by someone, as the firm is already using this tool
on other SAN-based storage and wants to extend it to Big Data.
On paper it seems like a good idea; in practice it may be a Wandisco
scenario again. Of course, as ever, one needs to ask EMC for reference calls
and whether
Hi,
I have done this (not Isilon, but another storage system). It can be efficient
for small clusters and depending on how you design the network.
What I have also seen is the microservice approach with object stores (e.g.
S3 in the cloud, Swift on premise), which is somehow also similar.
If
Thanks Sidney for your response,
To check if all the messages are processed, I used an accumulator and also
added a print statement for debugging.
val accum = ssc.sparkContext.accumulator(0, "Debug Accumulator")
...
...
...
val mappedDataStream = dataStream.map(_._2)
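A sketch of how the fragment above might be completed, counting processed records per batch; it assumes an active StreamingContext `ssc` and a Kafka-style key/value DStream `dataStream` (names taken from the fragment), so it needs a running streaming job:

```scala
// Accumulator-based debug counter for a streaming job.
val accum = ssc.sparkContext.accumulator(0, "Debug Accumulator")
val mappedDataStream = dataStream.map(_._2)

mappedDataStream.foreachRDD { rdd =>
  rdd.foreach { _ => accum += 1 }              // incremented on executors
  println(s"records processed so far: ${accum.value}")  // read on the driver
}
```

Note that accumulator values are only reliably read on the driver; reading them inside executor-side code gives undefined results.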