Re: Spark AsyncEventQueue doubt

2018-05-28 Thread Yuanjian Li
Hi Askash, the event-dropping problem is also triggered by a slow listener, a large number of events, or both. The simplest fix is to change the config `spark.scheduler.listenerbus.eventqueue.capacity`; its default value is 10000. But if after changing the queue capacity to a larger

Pandas UDF for PySpark error. Big Dataset

2018-05-28 Thread Traku traku
Hi. I'm trying to use the new feature but I can't use it with a big dataset (about 5 million rows). I tried increasing executor memory, driver memory, and the number of partitions, but none of these solved the problem. One of the executor tasks keeps increasing its shuffle memory until it fails. Error is

trying to understand structured streaming aggregation with watermark and append outputmode

2018-05-28 Thread Koert Kuipers
hello all, just playing with structured streaming aggregations for the first time. this is my little program i run inside sbt:

  import org.apache.spark.sql.functions._
  val lines = spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", )

Re: [Spark2.1] SparkStreaming to Cassandra performance problem

2018-05-28 Thread Saulo Sobreiro
Hi, I ran a few more tests and found that even with a lot more operations on the Scala side, Python is outperformed... Dataset stream duration: ~3 minutes (CSV-formatted data messages read from Kafka). Scala process/store time: ~3 minutes (map with split + metrics calculations + store raw +

Error on fetching mass data from Cassandra using SparkSQL

2018-05-28 Thread Soheil Pourbafrani
I tried to fetch some data from Cassandra using SparkSQL. For small tables everything works, but when fetching data from big tables I get the following error: java.lang.NoSuchMethodError:

Name error when writing data as orc

2018-05-28 Thread JF Chen
I am writing a dataset in ORC format to HDFS and hit the following problem: Error: name expected at the position 1473 of 'string:boolean:string:string..zone:struct<$ref:string> ...' but '$' is found. Position 1473 is where "$ref:string" appears. Regards, Junfeng Chen

Re: Spark Structured Streaming is giving error “org.apache.spark.sql.AnalysisException: Inner join between two streaming DataFrames/Datasets is not supported;”

2018-05-28 Thread Jacek Laskowski
Hi, once you leave Spark Structured Streaming by generating RDDs (for your streaming queries), you can do any kind of "joins". You're back in the good old days of RDD programming (with all the bells and whistles). Please note that Spark Structured Streaming != Spark Streaming since

Execution model in Spark

2018-05-28 Thread Esa Heikkinen
Hi, I don't know whether this question is suitable for this forum, but I'll take the risk and ask :) In my understanding, the execution model in Spark is very data (flow) stream oriented and specific. Is it difficult to build control-flow logic (like a state machine) outside of the stream specific