How to convert a Dataset to a Dataset<String> in Spark Structured Streaming?

2017-05-30 Thread kant kodali
Hi All, I have a Dataset and I am trying to convert it into a Dataset<String> (JSON strings) using Spark Structured Streaming. I have tried the following: df2.toJSON().writeStream().foreach(new KafkaSink()). This doesn't seem to work, for the following reason: "Queries with streaming sources must be
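The usual fix for that error is that the streaming query is never started. A minimal Java sketch, assuming df2 is a streaming Dataset and KafkaSink is a ForeachWriter<String> implemented elsewhere:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.streaming.StreamingQuery;

// Convert the streaming Dataset to JSON strings, then start the query.
Dataset<String> json = df2.toJSON();
StreamingQuery query = json
        .writeStream()
        .foreach(new KafkaSink())   // assumed: KafkaSink extends ForeachWriter<String>
        .start();                   // the query only runs once start() is called
query.awaitTermination();
```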

Checkpointing for reduceByKeyAndWindow with a window size of 1 hour and 24 hours

2017-05-30 Thread SRK
Hi, What happens if I don't specify checkpointing on a DStream that has reduceByKeyAndWindow with no inverse function? Would it cause memory to overflow? My window sizes are 1 hour and 24 hours. I cannot provide an inverse function for this as it is based on HyperLogLog. My code looks
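For reference, a minimal Java sketch of the pattern being described; the checkpoint path, slide interval, and variable names are hypothetical, and only the 1-hour window is shown (the 24-hour case only changes the window duration):

```java
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;

// Assumed: jssc is a JavaStreamingContext and pairs is a JavaPairDStream<String, Long>.
jssc.checkpoint("hdfs:///tmp/streaming-checkpoints");   // hypothetical directory

JavaPairDStream<String, Long> hourlyCounts = pairs.reduceByKeyAndWindow(
        (a, b) -> a + b,            // reduce function only; no inverse function supplied
        Durations.minutes(60),      // window length: 1 hour
        Durations.minutes(5));      // slide interval (hypothetical)
```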

Re: foreachPartition in Spark Java API

2017-05-30 Thread Anton Kravchenko
//ForEachPartFunction.java: import org.apache.spark.api.java.function.ForeachPartitionFunction; import org.apache.spark.sql.Row; import java.util.ArrayList; import java.util.Iterator; import java.util.List; public class ForEachPartFunction implements ForeachPartitionFunction{ public void
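The preview is cut off; a hedged completion of what such a class typically looks like (the body that collects row strings is an assumption based on the imports shown):

```java
import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Row;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ForEachPartFunction implements ForeachPartitionFunction<Row> {
    @Override
    public void call(Iterator<Row> rows) throws Exception {
        // Collect the rows of this partition as strings, mirroring the Scala
        // version earlier in the thread; further processing is up to the caller.
        List<String> dfRows = new ArrayList<>();
        while (rows.hasNext()) {
            dfRows.add(rows.next().toString());
        }
    }
}
```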

Re: Temp checkpoint directory for EMR (S3 or HDFS)

2017-05-30 Thread Asher Krim
You should actually be able to get to the underlying filesystem from your SparkContext: String originalFs = sparkContext.hadoopConfiguration().get("fs.defaultFS"); and then you could just use that: String checkpointPath = String.format("%s/%s/", originalFs, checkpointDirectory);
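Spelled out as a Java sketch (the relative checkpoint directory is a hypothetical example):

```java
// Derive a checkpoint location on whatever filesystem the cluster defaults to
// (HDFS on EMR, for example).
String originalFs = spark.sparkContext().hadoopConfiguration().get("fs.defaultFS");
String checkpointDirectory = "tmp/spark-checkpoints";   // hypothetical
String checkpointPath = String.format("%s/%s/", originalFs, checkpointDirectory);
spark.sparkContext().setCheckpointDir(checkpointPath);
```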

Re: foreachPartition in Spark Java API

2017-05-30 Thread Anton Kravchenko
Ok, there are at least two ways to do it: Dataset df = spark.read.csv("file:///C:/input_data/*.csv") df.foreachPartition(new ForEachPartFunction()); df.toJavaRDD().foreachPartition(new Void_java_func()); where ForEachPartFunction and Void_java_func are defined below: //

Re: Random Forest hangs without trace of error

2017-05-30 Thread Morten Hornbech
Hi Sumona I’m afraid I never really resolved the issue. Actually I have just had to rollback an upgrade from 2.1.0 to 2.1.1 because it (for reasons unknown) reintroduced the issue in our nightly integration tests (see

foreachPartition in Spark Java API

2017-05-30 Thread Anton Kravchenko
What would be a Java equivalent of the Scala code below? def void_function_in_scala(ipartition: Iterator[Row]): Unit ={ var df_rows=ArrayBuffer[String]() for(irow<-ipartition){ df_rows+=irow.toString } val df = spark.read.csv("file:///C:/input_data/*.csv")
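A hedged Java equivalent via the RDD API, using a VoidFunction over the partition iterator; class and variable names are illustrative, not the poster's actual code:

```java
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.sql.Row;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class Void_java_func implements VoidFunction<Iterator<Row>> {
    @Override
    public void call(Iterator<Row> ipartition) throws Exception {
        // Same idea as the Scala version: accumulate each row's string form.
        List<String> dfRows = new ArrayList<>();
        while (ipartition.hasNext()) {
            dfRows.add(ipartition.next().toString());
        }
    }
}

// Usage sketch:
// Dataset<Row> df = spark.read().csv("file:///C:/input_data/*.csv");
// df.toJavaRDD().foreachPartition(new Void_java_func());
```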

Re: Disable queuing of spark job on Mesos cluster if sufficient resources are not found

2017-05-30 Thread Michael Gummelt
The driver will remain in the queue indefinitely, unless you issue a kill command at /v1/submissions/kill/ https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionServer.scala#L64 On Mon, May 29, 2017 at 1:15 AM, Mevada, Vatsal
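For illustration, a hedged Java snippet that issues such a kill request over HTTP; the dispatcher host, port, and submission id are hypothetical, so check the URL against your dispatcher's actual REST endpoint:

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical dispatcher host/port and submission id; adjust to your deployment.
String submissionId = "driver-20170530123456-0001";
URL url = new URL("http://dispatcher-host:7077/v1/submissions/kill/" + submissionId);
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestMethod("POST");
System.out.println("Kill request returned HTTP " + conn.getResponseCode());
conn.disconnect();
```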

Re: Random Forest hangs without trace of error

2017-05-30 Thread Sumona Routh
Hi Morten, Were you able to resolve your issue with RandomForest? I am having similar issues with a newly trained model (one that has a larger number of trees and a smaller minInstancesPerNode, by design, to produce the best-performing model). I wanted to get some feedback on how you solved

No TypeTag Available for String

2017-05-30 Thread krishmah
I am currently using Spark 2.0.1 with Scala 2.11.8. However, the same code works with Scala 2.10.6. Please advise if I am missing something. import org.apache.spark.sql.functions.udf val getFileName = udf{z:String => z.takeRight(z.length -z.lastIndexOf("/")-1)} and this gives me the following error

Re: Temp checkpoint directory for EMR (S3 or HDFS)

2017-05-30 Thread Everett Anderson
Still haven't found a --conf option. Regarding a temporary HDFS checkpoint directory, it looks like when using --master yarn, spark-submit supplies a SPARK_YARN_STAGING_DIR environment variable. Thus, one could do the following when creating a SparkSession: val checkpointPath = new
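The preview cuts off; a Java rendering of the same idea (the fallback path and subdirectory name are hypothetical choices):

```java
// Use the YARN staging directory (set by spark-submit under --master yarn) as a
// temporary checkpoint location; fall back to /tmp if the variable is absent.
String stagingDir = System.getenv("SPARK_YARN_STAGING_DIR");
String checkpointPath = (stagingDir != null ? stagingDir : "/tmp") + "/checkpoints";
spark.sparkContext().setCheckpointDir(checkpointPath);
```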

user-unsubscr...@spark.apache.org

2017-05-30 Thread williamtellme123
From: Joel D [mailto:games2013@gmail.com] Sent: Monday, May 29, 2017 9:04 PM To: user@spark.apache.org Subject: Schema Evolution Parquet vs Avro Hi, We are trying to come up with the best storage format for handling schema changes in ingested data. We noticed that both avro

Re: Message getting lost in Kafka + Spark Streaming

2017-05-30 Thread Cody Koeninger
First thing I noticed: you should be using a singleton Kafka producer, not recreating one for every partition. It's thread-safe. On Tue, May 30, 2017 at 7:59 AM, Vikash Pareek wrote: > I am facing an issue related to spark streaming with kafka, my use case is as >
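A minimal sketch of the lazily-initialized, per-JVM producer being suggested; the holder class name, config type, and key/value types are illustrative:

```java
import org.apache.kafka.clients.producer.KafkaProducer;

import java.util.Map;

public class KafkaProducerHolder {
    private static KafkaProducer<String, String> producer;

    // One producer per executor JVM, shared across partitions and tasks.
    public static synchronized KafkaProducer<String, String> getInstance(Map<String, Object> config) {
        if (producer == null) {
            producer = new KafkaProducer<>(config);
            // Flush and close the producer when the executor JVM shuts down.
            Runtime.getRuntime().addShutdownHook(new Thread(() -> producer.close()));
        }
        return producer;
    }
}
```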

Message getting lost in Kafka + Spark Streaming

2017-05-30 Thread Vikash Pareek
I am facing an issue related to Spark Streaming with Kafka; my use case is as follows: 1. A Spark Streaming (DirectStream) application reads data/messages from a Kafka topic and processes them. 2. Based on the processed message, the app writes the processed message to different Kafka topics, e.g. if
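Roughly, the write side of such a pipeline looks like the sketch below; the topic names, the routing check, and the producer holder are illustrative (see the singleton-producer advice in the reply above):

```java
// Assumed imports: org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}.
// processedStream is a JavaDStream<String>; producerConfig, KafkaProducerHolder and
// the "valid"/"invalid" topic names are hypothetical.
processedStream.foreachRDD(rdd -> rdd.foreachPartition(records -> {
    KafkaProducer<String, String> producer = KafkaProducerHolder.getInstance(producerConfig);
    while (records.hasNext()) {
        String message = records.next();
        // Route each processed record to a topic chosen from its content.
        String topic = message.contains("ERROR") ? "invalid-topic" : "valid-topic";
        producer.send(new ProducerRecord<>(topic, message));
    }
}));
```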

Help required in converting CassandraRDD to VertexRDD & EdgeRDD

2017-05-30 Thread Tania Khan
Hi, My requirement is to read some edge and vertex data from different Cassandra tables. Now I want to pass them to Spark GraphX as VertexRDD and EdgeRDD and manipulate them further. Could anyone please suggest how to convert the below CassandraRow RDDs to VertexRDD and EdgeRDD so that I can