command to get list of all persisted RDDs in spark 2.0 scala shell

2017-06-01 Thread nancy henry
Hi Team, Please let me know how to get a list of all persisted RDDs in the Spark 2.0 shell. Regards, Nancy
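A likely answer (an assumption, not quoted from the thread) is SparkContext.getPersistentRDDs, which is public in Spark 2.0. A minimal spark-shell sketch, assuming an active SparkContext named sc:

```scala
// getPersistentRDDs returns Map[Int, RDD[_]]: every RDD this context has
// marked as persisted, keyed by RDD id.
sc.getPersistentRDDs.foreach { case (id, rdd) =>
  println(s"id=$id name=${rdd.name} storage=${rdd.getStorageLevel.description}")
}
```

Running this in the shell after calling .cache() or .persist() on an RDD should list that RDD; an empty map means nothing is currently persisted.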

Re: Number Of Partitions in RDD

2017-06-01 Thread neil90
What version of Spark are you using? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Number-Of-Partitions-in-RDD-tp28730p28732.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

statefulStreaming checkpointing too often

2017-06-01 Thread David Rosenstrauch
I'm running into a weird issue with a stateful streaming job I'm running. (Spark 2.1.0 reading from kafka 0-10 input stream.) From what I understand from the docs, by default the checkpoint interval for stateful streaming is 10 * batchInterval. Since I'm running a batch interval of 10 seconds,
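For reference, the checkpoint interval can be overridden by calling checkpoint() on the stateful DStream itself rather than relying on the default; a sketch, where the stream and its element type are hypothetical:

```scala
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// With batchInterval = Seconds(10), the default checkpoint interval of a
// stateful DStream is 10 * batchInterval = 100 seconds. To checkpoint at a
// different cadence, set it explicitly on the stateful stream:
def tuneCheckpointing(stateStream: DStream[(String, Long)]): Unit = {
  stateStream.checkpoint(Seconds(50)) // every 5 batches instead of every 10
}
```

The interval must be a multiple of the batch interval; too small an interval means frequent HDFS writes, too large means longer recovery replay.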

Re: Creating Dataframe by querying Impala

2017-06-01 Thread morfious902002
The issue seems to be with primordial class loader. I cannot load the drivers to all the nodes at the same location but have loaded the jars to HDFS. I have tried SPARK_YARN_DIST_FILES as well as SPARK_CLASSPATH on the edge node with no luck. Is there another way to load these jars through
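One commonly suggested alternative (an assumption, not confirmed by this thread) is to hand the jars to Spark through configuration rather than the deprecated SPARK_CLASSPATH; on YARN, jars listed in spark.jars are shipped to each container's working directory. The HDFS path and jar name below are hypothetical:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Ship the JDBC driver jar from HDFS to the driver and all executors.
  .set("spark.jars", "hdfs:///libs/ImpalaJDBC41.jar")
  // On YARN the shipped jar lands in the container's working directory,
  // so a relative name suffices for the executor classpath.
  .set("spark.executor.extraClassPath", "ImpalaJDBC41.jar")
```

The same settings can be passed on the command line as --jars and --conf spark.executor.extraClassPath=... to spark-submit.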

Re: Number Of Partitions in RDD

2017-06-01 Thread Michael Mior
While I'm not sure why you're seeing an increase in partitions with such a small data file, it's worth noting that the second parameter to textFile is the *minimum* number of partitions so there's no guarantee you'll get exactly that number. -- Michael Mior mm...@apache.org 2017-06-01 6:28

Number Of Partitions in RDD

2017-06-01 Thread Vikash Pareek
Hi, I am creating an RDD from a text file by specifying the number of partitions, but it gives me a different number of partitions than the one specified. scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 0) people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[72] at
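The behaviour described in this thread can be illustrated in the spark-shell: the second argument to textFile is only a minimum, and the actual count comes from the Hadoop input splits. The path is from the original post; the printed count depends on the file and block sizes:

```scala
// minPartitions is a lower-bound hint, not an exact request: Spark asks the
// Hadoop InputFormat for at least this many splits, and the resulting
// partition count may come out higher depending on file and block sizes.
val people = sc.textFile("file:///home/pvikash/data/test.txt", 2)
println(people.getNumPartitions) // typically >= 2, not guaranteed exactly 2
```

To force an exact partition count after loading, repartition(n) or coalesce(n) on the resulting RDD can be used instead.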

Re: An Architecture question on the use of virtualised clusters

2017-06-01 Thread Mich Talebzadeh
As a matter of interest, what is the best way of creating virtualised clusters all pointing to the same physical data? thanks Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: An Architecture question on the use of virtualised clusters

2017-06-01 Thread vincent gromakowski
If mandatory, you can use a local cache like Alluxio. On 1 Jun 2017 at 10:23 AM, "Mich Talebzadeh" wrote: > Thanks Vincent. I assume by physical data locality you mean you are going > through Isilon and HCFS and not through direct HDFS. > > Also I agree with you that

Re: An Architecture question on the use of virtualised clusters

2017-06-01 Thread Mich Talebzadeh
Thanks Vincent. I assume by physical data locality you mean you are going through Isilon and HCFS and not through direct HDFS. Also I agree with you that shared network could be an issue as well. However, it allows you to reduce data redundancy (you do not need R3 in HDFS anymore) and also you

Re: An Architecture question on the use of virtualised clusters

2017-06-01 Thread vincent gromakowski
I don't recommend this kind of design because you lose physical data locality and you will be affected by "bad neighbors" that are also using the network storage... We have one similar design but restricted to small clusters (more for experiments than production). 2017-06-01 9:47 GMT+02:00 Mich

Re: An Architecture question on the use of virtualised clusters

2017-06-01 Thread Mich Talebzadeh
Thanks Jorn. This was a proposal made by someone, as the firm is already using this tool on other SAN-based storage and wants to extend it to Big Data. On paper it seems like a good idea; in practice it may be a WANdisco scenario again. Of course, as ever, one needs to ask EMC for reference calls and whether

Re: An Architecture question on the use of virtualised clusters

2017-06-01 Thread Jörn Franke
Hi, I have done this (not with Isilon, but with another storage system). It can be efficient for small clusters, depending on how you design the network. What I have also seen is the microservice approach with object stores (e.g. in the cloud S3, on premise Swift), which is somehow also similar. If

Re: Message getting lost in Kafka + Spark Streaming

2017-06-01 Thread Vikash Pareek
Thanks Sidney for your response. To check whether all the messages are processed, I used an accumulator and also added a print statement for debugging. val accum = ssc.sparkContext.accumulator(0, "Debug Accumulator") ... val mappedDataStream = dataStream.map(_._2);
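The accumulator-based debugging described above can be sketched as follows; the foreachRDD body is an assumption, since the original message is truncated (only the accum and mappedDataStream lines appear in the post):

```scala
// Count every processed record with an accumulator (Spark 1.x-style API,
// as used in the original post) and print the running total per batch.
val accum = ssc.sparkContext.accumulator(0, "Debug Accumulator")
val mappedDataStream = dataStream.map(_._2)
mappedDataStream.foreachRDD { rdd =>
  rdd.foreach(_ => accum += 1)               // incremented on the executors
  println(s"records so far: ${accum.value}") // value is read on the driver
}
```

One caveat worth noting: accumulator updates inside transformations are not exactly-once under task retries, so this count is a debugging aid rather than a reliable message-loss detector.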