Does the DataFrame Spark API write/create a single file instead of a directory as the result of a write operation?
Hi,

There is no DataFrame Spark API that writes/creates a single file instead of a directory as the result of a write operation. Both of the options below create a directory containing a part file with a random name, along with the standard marker files (_SUCCESS, _committed, _started):

df.coalesce(1).write.csv()
df.write.csv()

Instead, I want a single file with the file name I specify.

Thanks
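[Editor's note: there is indeed no single-file writer in the DataFrame API, but a common workaround is to write to a temporary directory with coalesce(1) and then rename the lone part file via the Hadoop FileSystem API. A minimal Scala sketch, assuming an existing SparkSession `spark`, a DataFrame `df`, and hypothetical paths:]

import org.apache.hadoop.fs.{FileSystem, Path}

val tmpDir = "/tmp/out_dir"       // hypothetical temporary directory
val target = "/data/result.csv"   // hypothetical final file name

// coalesce(1) moves all data through one task, producing a single part file.
df.coalesce(1).write.csv(tmpDir)

// Locate the single part-* file and rename it to the desired name.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val partFile = fs.globStatus(new Path(tmpDir + "/part-*"))(0).getPath
fs.rename(partFile, new Path(target))
fs.delete(new Path(tmpDir), true) // removes _SUCCESS and the empty directory

[Note that coalesce(1) forces the whole result through a single task, so this only makes sense for output small enough to fit on one executor.]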
PowerIterationClustering
Hi guys,

I am new to MLlib and am trying out PowerIterationClustering as per the example here: https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/mllib/JavaPowerIterationClusteringExample.java

I am having trouble understanding how the output is produced. For instance, if I change the input as shown below, I would like to understand how the algorithm arrived at grouping 0 and 2 together while keeping the rest in the other cluster (k = 2).

Input:
new Tuple3<>(0L, 1L, 0.9),
new Tuple3<>(1L, 2L, 0.7),
new Tuple3<>(2L, 3L, 0.3),
new Tuple3<>(3L, 4L, 0.5),
new Tuple3<>(4L, 5L, 0.2)));

Output:
4 -> 0
0 -> 1
1 -> 0
3 -> 0
5 -> 0
2 -> 1

Kindly point me to any info on using the algorithm, or to materials on this topic that are suitable for beginners.

Regards.
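[Editor's note: PIC runs a truncated power iteration on the row-normalized affinity matrix and then applies k-means to the entries of the resulting pseudo-eigenvector, so vertices whose iterated values converge to similar numbers land in the same cluster (Lin & Cohen, "Power Iteration Clustering", ICML 2010). A minimal Scala equivalent of the run above, assuming an existing SparkSession `spark`; the maxIterations value is an assumption:]

import org.apache.spark.mllib.clustering.PowerIterationClustering

// Similarity graph: (srcId, dstId, similarity); weights copied from the input above.
val similarities = spark.sparkContext.parallelize(Seq(
  (0L, 1L, 0.9),
  (1L, 2L, 0.7),
  (2L, 3L, 0.3),
  (3L, 4L, 0.5),
  (4L, 5L, 0.2)))

val model = new PowerIterationClustering()
  .setK(2)              // number of clusters, as in the question
  .setMaxIterations(10) // assumed value, as in the linked Java example
  .run(similarities)

// Each Assignment pairs a vertex id with its cluster id.
model.assignments.collect().foreach { a =>
  println(s"${a.id} -> ${a.cluster}")
}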
Re: Serialization error when using Scala kernel with Jupyter
collect() returns the contents of the RDD back to the driver in a local variable. Where is the local variable? Try:

val result = rdd.map(x => x + 1).collect()

regards,
Apostolos

On 21/2/20 21:28, Nikhil Goyal wrote:

Hi all,

I am trying to use the almond Scala kernel to run a Spark session on Jupyter. I am using Scala version 2.12.8 and creating the Spark session with master set to YARN. This is the code:

val rdd = spark.sparkContext.parallelize(Seq(1, 2, 4))
rdd.map(x => x + 1).collect()

Exception:
java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of org.apache.spark.rdd.MapPartitionsRDD

I was wondering if anyone has seen this before.

Thanks
Nikhil

--
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email: papad...@csd.auth.gr
twitter: @papadopoulos_ap
web: http://datalab.csd.auth.gr/~apostol
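[Editor's note: for completeness, a minimal sketch of the suggested change, assuming the same session as in the question; collect() materializes the RDD as a local Array on the driver:]

val rdd = spark.sparkContext.parallelize(Seq(1, 2, 4))
val result: Array[Int] = rdd.map(x => x + 1).collect() // local Array on the driver
result.foreach(println) // prints 2, 3, 5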
Serialization error when using Scala kernel with Jupyter
Hi all,

I am trying to use the almond Scala kernel to run a Spark session on Jupyter. I am using Scala version 2.12.8 and creating the Spark session with master set to YARN. This is the code:

val rdd = spark.sparkContext.parallelize(Seq(1, 2, 4))
rdd.map(x => x + 1).collect()

Exception:
java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of org.apache.spark.rdd.MapPartitionsRDD

I was wondering if anyone has seen this before.

Thanks
Nikhil
Spark RDD output path for data lineage
Hi,

I am trying to do data lineage, so I need to extract the output path from an RDD job (for example someRDD.saveAsTextFile("/path/")) using a SparkListener. How can I do that?
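[Editor's note: one possible starting point, sketched below: register a SparkListener and inspect the job properties at onJobStart. As far as I know the event exposes the call site of the triggering action (e.g. "saveAsTextFile at ...") via the internal "callSite.short" property, but not the literal path argument, so extracting the full path may require another mechanism:]

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// Logs the call site of each job; saveAsTextFile jobs show up here,
// but the path argument itself is not part of the listener event.
class LineageListener extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    val callSite = Option(jobStart.properties)
      .map(_.getProperty("callSite.short", "unknown"))
      .getOrElse("unknown")
    println(s"Job ${jobStart.jobId} started at: $callSite")
  }
}

spark.sparkContext.addSparkListener(new LineageListener)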