Are you sure that every message gets processed? It could be that some messages
failed to pass the decoder.
And during processing, are you perhaps putting the events into a map? That
way, events with the same key would overwrite each other, leaving you with
fewer final events.
Your data node(s) are going down for some reason; check the datanode logs and
fix the underlying issue that is causing them to go down.
There should be no need to delete any data; just restarting the data nodes
should do the trick for you.
From: kant kodali [mailto:kanth...@gmail.com]
Sent:
It’s because the class in which you have defined the UDF is not serializable.
Declare the UDF in a class and make that class serializable.
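For illustration, a minimal Java sketch of that pattern (the class, UDF name,
and captured field are hypothetical); the point is that the holder class, and
anything the UDF's closure captures, is Serializable:

import java.io.Serializable;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

// Hypothetical holder: implements Serializable so the UDF closure can be
// shipped to the executors.
public class Udfs implements Serializable {
    private final String suffix = "_checked"; // captured state must be serializable too

    public void register(SparkSession spark) {
        spark.udf().register("addSuffix",
            (UDF1<String, String>) s -> s == null ? null : s + suffix,
            DataTypes.StringType);
    }
}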
From: shyla deshpande [mailto:deshpandesh...@gmail.com]
Sent: Thursday, June 01, 2017 10:08 AM
To: user
Subject: Spark sql with Zeppelin, Task not serializable
Hello all,
I am using Zeppelin 0.7.1 with Spark 2.1.0
I am getting an org.apache.spark.SparkException: Task not serializable error
when I try to cache the Spark SQL table. I am using a UDF on a column of the
table and want to cache the resultant table. I can execute the paragraph
successfully when
Hello:
I am training the ALS model for recommendations. I have about 200M ratings
from about 10M users and 3M products. I have a small cluster with 48 cores
and 120 GB of cluster-wide memory.
My code is very similar to the example code
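For reference, a minimal Java sketch of that kind of setup (column names and
parameter values are assumptions, not the poster's actual code):

import org.apache.spark.ml.recommendation.ALS;
import org.apache.spark.ml.recommendation.ALSModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// ratings: Dataset<Row> with columns userId, itemId, rating (assumed schema)
ALS als = new ALS()
    .setRank(20)        // illustrative values; tune for your data
    .setMaxIter(10)
    .setRegParam(0.1)
    .setNumBlocks(48)   // more blocks spreads the factor matrices across cores
    .setUserCol("userId")
    .setItemCol("itemId")
    .setRatingCol("rating");
ALSModel model = als.fit(ratings);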
Hi All,
When my query is streaming, I get the following error once in, say, 10
minutes. A lot of the solutions online seem to suggest just clearing the data
directories under the datanode and namenode and restarting the HDFS cluster,
but I didn't see anything that explains the cause. If it happens so frequently,
what
So one thing I've noticed, if I turn on as much debug logging as I can
find, is that when the reverse proxy is enabled I end up with a lot of
these threads being logged:
17/05/31 21:06:37 DEBUG SelectorManager: Starting Thread[MasterUI-218-selector-ClientSelectorManager@36a67cc9/14,5,main] on
+1 to Ayan's answer. I think this is a common distributed anti-pattern that
trips us all up at some point or another.
You definitely want to (in most cases) yield and create a new
RDD/Dataframe/Dataset and then perform your save operation on that.
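As a hedged Java illustration of that advice (column and path names are
hypothetical): derive a new Dataset rather than trying to mutate one in place,
then run the save on the derived result:

import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// Yield a new Dataset with the derived column...
Dataset<Row> enriched = input.withColumn("total", col("price").multiply(col("qty")));
// ...and perform the save operation on the new Dataset, not the original.
enriched.write().mode(SaveMode.Overwrite).parquet("/data/enriched");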
On 28 May 2017 at 21:09, ayan guha
Hi,
In our application pipeline we need to push the data from Spark Streaming
to an HTTP server.
I would like to have an HTTP client with the below requirements:
1. Synchronous calls
2. HTTP connection pool support
3. Lightweight and easy to use.
Spray and Akka HTTP are mostly suited for async calls.
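One option that fits those three requirements is Apache HttpClient with a
pooling connection manager; a hedged Java sketch (the stream, endpoint, and
pool size are assumptions):

import java.util.Iterator;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

// events: a JavaDStream<String> (hypothetical); one pooled client per partition.
events.foreachRDD(rdd -> rdd.foreachPartition((Iterator<String> it) -> {
    PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
    cm.setMaxTotal(20); // pool size is a guess; tune for your server
    try (CloseableHttpClient client =
             HttpClients.custom().setConnectionManager(cm).build()) {
        while (it.hasNext()) {
            HttpPost post = new HttpPost("http://example.com/ingest"); // hypothetical endpoint
            post.setEntity(new StringEntity(it.next()));
            client.execute(post).close(); // synchronous; close returns the connection to the pool
        }
    }
}));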
Dear all,
I would appreciate it if anyone could explain when mapWithState terminates,
i.e., applies subsequent transformations such as writing the state to an
external sink.
Given a KafkaConsumer instance pulling messages from a Kafka topic, and a
mapWithState transformation updating the state
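For what it's worth, mapWithState does not wait for the stream to "terminate":
it emits a mapped record for each input element every batch, and downstream
transformations run on that per-batch output. A minimal word-count-style Java
sketch (the stream and output path are hypothetical):

import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function3;
import org.apache.spark.streaming.State;
import org.apache.spark.streaming.StateSpec;
import org.apache.spark.streaming.api.java.JavaMapWithStateDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import scala.Tuple2;

// pairs: JavaPairDStream<String, Integer> (hypothetical, e.g. from the Kafka stream)
Function3<String, Optional<Integer>, State<Integer>, Tuple2<String, Integer>> update =
    (key, value, state) -> {
        int sum = value.orElse(0) + (state.exists() ? state.get() : 0);
        state.update(sum);             // keep a running total per key
        return new Tuple2<>(key, sum); // emitted for this batch
    };

JavaMapWithStateDStream<String, Integer, Integer, Tuple2<String, Integer>> stateStream =
    pairs.mapWithState(StateSpec.function(update));

// Writing to an external sink also happens once per batch:
stateStream.foreachRDD(rdd -> rdd.saveAsTextFile("/out/state")); // hypothetical path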
>
> So, my question is the same as stated in the following ticket, which is: do
> we need to create a checkpoint directory for each individual query?
>
Yes. Checkpoints record what data has been processed. Thus two different
queries need their own checkpoints.
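A minimal Java sketch of that (paths and formats are placeholders): give each
query its own checkpointLocation so their progress is tracked independently:

import org.apache.spark.sql.streaming.StreamingQuery;

StreamingQuery q1 = df.writeStream()
    .format("parquet")
    .option("path", "/data/out1")
    .option("checkpointLocation", "/chk/query1") // unique per query
    .start();

StreamingQuery q2 = df.writeStream()
    .format("parquet")
    .option("path", "/data/out2")
    .option("checkpointLocation", "/chk/query2") // a different directory
    .start();

spark.streams().awaitAnyTermination();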
Hi folks,
I'm running a containerized Spark 2.1.0 setup. If I have the reverse proxy
turned on, everything is fine if I have fewer than 10 workers. If I create
a cluster with 10 or more, the master web UI is unavailable.
This is even true for the same cluster if I start with a lower number and
Hi All,
I am using Spark 2.1.1 and the foreach sink to write to Kafka. I call .start
and .awaitTermination for each query; however, I get the following error:
"Cannot start query with id d4b3554d-ee1d-469c-bf9d-19976c8a7b47 as another
query with same id is already active"
So, my question is the same
Hi,
I realize this may not have direct relevance to Spark, but has anyone tried
to create virtualized HDFS clusters using tools like ISILON or similar?
The prime motive behind this approach is to minimize the propagation or
copying of data that has regulatory implications. In short, you want your
Hi,
I am trying to create a DataFrame by querying an Impala table. It works fine
in my local environment, but when I try to run it in the cluster I get either
Error: java.lang.ClassNotFoundException: com.cloudera.impala.jdbc41.Driver
or
No suitable driver found.
Can someone help me or direct me to
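A common fix is to ship the JDBC driver jar with the job and also put it on
the driver's classpath, since --jars alone does not cover the driver. A hedged
sketch (the jar path, main class, and app jar are hypothetical):

spark-submit \
  --jars /path/to/ImpalaJDBC41.jar \
  --driver-class-path /path/to/ImpalaJDBC41.jar \
  --class com.example.MyApp \
  my-app.jar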
https://stackoverflow.com/questions/44280360/how-to-convert-datasetrow-to-dataset-of-json-messages-to-write-to-kafka
Thanks!
On Wed, May 31, 2017 at 1:41 AM, kant kodali wrote:
> small correction.
>
> If I try to convert a Row into a JSON string, it results in something
>
No, it's running in standalone mode as a Docker image on Kubernetes.
The only way I found was to access the "stderr" file created under the "work"
directory in SPARK_HOME, but... is it the right way?
Paolo Patierno
Senior Software Engineer (IoT) @ Red Hat
Microsoft MVP on Windows Embedded & IoT
small correction.
If I try to convert a Row into a JSON string, it results in something like
this: {"key1": "name", "value1": "hello", "key2": "ratio", "value2": 1.56,
"key3": "count", "value3": 34}, but what I need is something like this:
{"result": {"name": "hello", "ratio": 1.56, "count": 34}}
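One way to get that nested shape (a sketch, assuming the columns are name,
ratio, and count) is to wrap all columns in a struct named "result" and
serialize it with to_json:

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Wrap every column in a struct field called "result", then render the
// wrapper struct as a JSON string column (e.g. as a Kafka "value").
Dataset<Row> json = df.select(
    to_json(struct(struct(col("*")).alias("result"))).alias("value"));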
Are you running the code on YARN?
If so, find the application ID through the web UI, then run the next
command:
yarn logs -applicationId your_application_id
Alonso Isidoro Roman
https://about.me/alonso.isidoro.roman
Hi Jules,
I read that blog several times prior to asking this question.
Thanks!
On Wed, May 31, 2017 at 12:12 AM, Jules Damji wrote:
> Hello Kant,
>
> See if the examples in this blog explain how to deal with your particular
> case:
Hi,
I am trying to run a Naive Bayes model using the Spark ML libraries, in Java.
A sample snippet of the dataset is given below:
Raw Data -
But, as the input data needs to be numeric, I am using a one-hot encoder
on the Gender field [m -> 0,1] [f -> 1,0]; and finally the 'features' vector
is
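For reference, a hedged Java sketch of that kind of pipeline (column names are
assumptions): index the Gender string, one-hot encode it, assemble the
features vector, and fit Naive Bayes:

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.NaiveBayes;
import org.apache.spark.ml.feature.OneHotEncoder;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;

StringIndexer indexer = new StringIndexer()
    .setInputCol("gender").setOutputCol("genderIndex");
OneHotEncoder encoder = new OneHotEncoder()
    .setInputCol("genderIndex").setOutputCol("genderVec")
    .setDropLast(false); // keep both columns, matching [m -> 0,1] [f -> 1,0]
VectorAssembler assembler = new VectorAssembler()
    .setInputCols(new String[]{"genderVec" /*, other numeric columns */})
    .setOutputCol("features");
NaiveBayes nb = new NaiveBayes()
    .setLabelCol("label").setFeaturesCol("features");
PipelineModel model = new Pipeline()
    .setStages(new PipelineStage[]{indexer, encoder, assembler, nb})
    .fit(training); // training: Dataset<Row> (hypothetical)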
Hi all,
I have a simple cluster with one master and one worker. On another machine I
launch the driver, where at some point I have the following lines of code:
max.foreachRDD(rdd -> {
    // Runs on the driver, once per batch: this line shows up in the driver's log.
    LOG.info("*** max.foreachRDD");
    rdd.foreach(value -> {
        // Runs on the executors: this line goes to each executor's stderr,
        // not to the driver's output.
        LOG.info("*** rdd.foreach");
    });
});
Hello Kant,
See if the examples in this blog explain how to deal with your particular
case:
https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html
Cheers
Jules
Sent from my iPhone
Pardon the dumb thumb typos :)
> On May 30, 2017, at