Hi All,
I have a Dataset<Row> and I am trying to convert it into a Dataset<String>
(JSON string) using Spark Structured Streaming. I have tried the following:
df2.toJSON().writeStream().foreach(new KafkaSink())
This doesn't seem to work, failing with the following error:
"Queries with streaming sources must be executed with writeStream.start()"
Hi,
What happens if I don't specify checkpointing on a DStream that has
reduceByKeyAndWindow with no inverse function? Would it cause memory to
overflow? My window sizes are 1 hour and 24 hours.
I cannot provide an inverse function for this, as the aggregation is based on
HyperLogLog.
My code looks
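For reference, a windowed reduce without an inverse function looks roughly
like this (a sketch, not the poster's code; HLL and its merge method are
placeholders for the HyperLogLog sketch type):

import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;

// Sketch: with no inverse function, Spark re-reduces every batch in the
// window on each slide instead of subtracting the batches that fall out.
JavaPairDStream<String, HLL> hourlyUniques = pairs.reduceByKeyAndWindow(
        (a, b) -> a.merge(b),     // placeholder: HLL union has no inverse
        Durations.minutes(60),    // window length: 1 hour
        Durations.minutes(5));    // slide interval (placeholder)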
//ForEachPartFunction.java:
import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Row;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ForEachPartFunction implements ForeachPartitionFunction<Row> {
    // Collect the rows of this partition as strings.
    public void call(Iterator<Row> ipartition) throws Exception {
        List<String> df_rows = new ArrayList<>();
        while (ipartition.hasNext()) {
            df_rows.add(ipartition.next().toString());
        }
    }
}
You should actually be able to get to the underlying filesystem from your
SparkContext:
String originalFs = sparkContext.hadoopConfiguration().get("fs.defaultFS");
and then you could just use that:
String checkpointPath = String.format("%s/%s/", originalFs, checkpointDirectory);
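Putting the two together (a sketch; the "checkpoints" subdirectory name is
arbitrary):

// Sketch: point checkpointing at the cluster's default filesystem.
String originalFs = sparkContext.hadoopConfiguration().get("fs.defaultFS");
String checkpointPath = String.format("%s/%s/", originalFs, "checkpoints");
sparkContext.setCheckpointDir(checkpointPath);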
OK, there are at least two ways to do it:

Dataset<Row> df = spark.read().csv("file:///C:/input_data/*.csv");
df.foreachPartition(new ForEachPartFunction());
df.toJavaRDD().foreachPartition(new Void_java_func());

where ForEachPartFunction and Void_java_func are defined below:
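ForEachPartFunction is shown earlier in this digest; for Void_java_func, here
is a sketch of the VoidFunction analogue (an illustration under the assumption
that it mirrors ForEachPartFunction, not the original definition):

import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.sql.Row;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch: per-partition function for JavaRDD.foreachPartition.
public class Void_java_func implements VoidFunction<Iterator<Row>> {
    public void call(Iterator<Row> ipartition) throws Exception {
        List<String> rows = new ArrayList<>();
        while (ipartition.hasNext()) {
            rows.add(ipartition.next().toString());
        }
    }
}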
Hi Sumona,
I’m afraid I never really resolved the issue. Actually, I have just had to
roll back an upgrade from 2.1.0 to 2.1.1 because it (for reasons unknown)
reintroduced the issue in our nightly integration tests (see
What would be a Java equivalent of the Scala code below?
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.Row

def void_function_in_scala(ipartition: Iterator[Row]): Unit = {
  var df_rows = ArrayBuffer[String]()
  for (irow <- ipartition) {
    df_rows += irow.toString
  }
}

val df = spark.read.csv("file:///C:/input_data/*.csv")
The driver will remain in the queue indefinitely, unless you issue a kill
command at /v1/submissions/kill/
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionServer.scala#L64
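The kill endpoint can also be hit directly over HTTP (a sketch; host and
submission ID are placeholders, 6066 being the default REST submission port):

curl -X POST http://<master-host>:6066/v1/submissions/kill/<submission-id>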
On Mon, May 29, 2017 at 1:15 AM, Mevada, Vatsal
Hi Morten,
Were you able to resolve your issue with RandomForest? I am having similar
issues with a newly trained model (it has a larger number of trees and a
smaller minInstancesPerNode, which is by design, to produce the
best-performing model).
I wanted to get some feedback on how you solved
I am currently using Spark 2.0.1 with Scala 2.11.8. However, the same code
works with Scala 2.10.6. Please advise if I am missing something.
import org.apache.spark.sql.functions.udf
val getFileName = udf { z: String => z.takeRight(z.length - z.lastIndexOf("/") - 1) }
and this gives me the following error:
Still haven't found a --conf option.
Regarding a temporary HDFS checkpoint directory, it looks like when using
--master yarn, spark-submit supplies a SPARK_YARN_STAGING_DIR environment
variable. Thus, one could do the following when creating a SparkSession:
val checkpointPath = new
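Something along these lines, sketched here in Java (the Path construction,
the "checkpoints" subdirectory, and the choice of the streaming checkpoint
config are assumptions):

import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.SparkSession;

// Sketch: derive a checkpoint location from the YARN staging directory.
String checkpointPath = new Path(
        System.getenv("SPARK_YARN_STAGING_DIR"), "checkpoints").toString();
SparkSession spark = SparkSession.builder()
        .config("spark.sql.streaming.checkpointLocation", checkpointPath)
        .getOrCreate();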
From: Joel D [mailto:games2013@gmail.com]
Sent: Monday, May 29, 2017 9:04 PM
To: user@spark.apache.org
Subject: Schema Evolution Parquet vs Avro
Hi,
We are trying to come up with the best storage format for handling schema
changes in ingested data.
We noticed that both Avro
First thing I noticed: you should be using a singleton Kafka producer, not
recreating one for every partition. It's thread-safe.
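A common pattern is a lazily created, per-JVM producer shared by all tasks on
an executor (a sketch; the broker address and serializers are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

// Sketch: one KafkaProducer per executor JVM. KafkaProducer is
// thread-safe, so every partition/task can share this instance.
public class ProducerHolder {
    private static volatile KafkaProducer<String, String> producer;

    public static KafkaProducer<String, String> get() {
        if (producer == null) {
            synchronized (ProducerHolder.class) {
                if (producer == null) {
                    Properties props = new Properties();
                    props.put("bootstrap.servers", "broker:9092"); // placeholder
                    props.put("key.serializer",
                            "org.apache.kafka.common.serialization.StringSerializer");
                    props.put("value.serializer",
                            "org.apache.kafka.common.serialization.StringSerializer");
                    producer = new KafkaProducer<>(props);
                }
            }
        }
        return producer;
    }
}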
On Tue, May 30, 2017 at 7:59 AM, Vikash Pareek
wrote:
> I am facing an issue related to spark streaming with kafka, my use case is as
>
I am facing an issue related to Spark Streaming with Kafka. My use case is as
follows:
1. A Spark Streaming (DirectStream) application reads data/messages from a
Kafka topic and processes them.
2. Based on the processed message, the app writes the processed message to
different Kafka topics,
e.g. if
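One way the routing part is commonly wired up (a sketch; process() and
chooseTopic() are hypothetical, stream is assumed to be a
JavaInputDStream<ConsumerRecord<String, String>> from the direct stream, and
ProducerHolder is the singleton sketched in the reply above):

import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch: route each processed message to a topic chosen per message.
stream.foreachRDD(rdd -> rdd.foreachPartition(records -> {
    while (records.hasNext()) {
        String value = process(records.next().value()); // hypothetical
        String topic = chooseTopic(value);              // hypothetical
        ProducerHolder.get().send(new ProducerRecord<>(topic, value));
    }
}));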
Hi,
My requirement is to read some edge and vertex data from different Cassandra
tables. Now I want to pass them to Spark GraphX as a VertexRDD and an EdgeRDD
and manipulate them further. Could anyone please suggest how to convert the
below CassandraRow RDDs to a VertexRDD and an EdgeRDD so that I can