The exception is telling you precisely what is wrong. The Kafka source has a
schema of (topic, partition, offset, key, value, timestamp, timestampType).
Nothing about those columns makes sense as a tweet. You need to tell Spark
how to get from bytes to a tweet; it doesn't know how you serialized the data.
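In Spark this is typically done by casting `value` to a string and applying `from_json` with your tweet schema. A minimal plain-Python sketch of the per-record transformation Spark has to perform (the tweet field names here are assumptions for illustration, not from the original thread):

```python
import json

# Sketch of what the deserialization step must do: the Kafka `value`
# column is raw bytes, so it has to be decoded and parsed before any of
# it looks like a tweet. The fields (id, text, user) are assumed.
def bytes_to_tweet(value: bytes) -> dict:
    record = json.loads(value.decode("utf-8"))
    return {"id": record["id"], "text": record["text"], "user": record["user"]}

raw = b'{"id": 1, "text": "hello", "user": "alice"}'
tweet = bytes_to_tweet(raw)
```

In Spark itself the equivalent is selecting `from_json(col("value").cast("string"), schema)` with a schema describing your tweet.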
The default spark.sql.streaming.commitProtocolClass is
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ManifestFileCommitProtocol.scala
which may or may not be best suited to all needs.
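If the default doesn't fit, the class can be swapped via configuration; a sketch (the replacement class name is a hypothetical placeholder you would supply yourself):

```
# spark-defaults.conf (or --conf on spark-submit); com.example.MyCommitProtocol is hypothetical
spark.sql.streaming.commitProtocolClass  com.example.MyCommitProtocol
```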
Code deploys could be improved by ensuring
1. Would it not be more natural to write the processed data to Kafka, and then
sink it from Kafka to S3?
2a. addBatch is the time Sink#addBatch took, as measured by StreamExecution.
2b. getBatch is the time Source#getBatch took, as measured by
StreamExecution.
3. triggerExecution is effectively the end-to-end time for the trigger.
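These timings surface under `durationMs` in a query's progress events. A small sketch of the relationship, with invented numbers for illustration:

```python
# Hypothetical durationMs map from a streaming query progress event
# (all values in milliseconds; the numbers are made up).
duration_ms = {
    "getBatch": 120,           # time Source#getBatch took (2b)
    "addBatch": 850,           # time Sink#addBatch took (2a)
    "triggerExecution": 1060,  # effectively end-to-end for the trigger (3)
}

# triggerExecution covers the whole trigger, so it bounds the component times;
# the remainder is planning, offset management, and other overhead.
slack = duration_ms["triggerExecution"] - (
    duration_ms["getBatch"] + duration_ms["addBatch"]
)
```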
You can leverage dynamic resource allocation with Structured Streaming.
Certainly there's an argument that trivial jobs won't benefit. Certainly
there's an argument that important jobs should have fixed resources for
stable end-to-end latency.
A few scenarios come to mind with benefits:
- I want my
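For reference, a minimal configuration sketch to enable dynamic allocation (the executor bounds are example values, not recommendations):

```
spark.dynamicAllocation.enabled          true
spark.shuffle.service.enabled            true
spark.dynamicAllocation.minExecutors     1
spark.dynamicAllocation.maxExecutors     20
```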
Hello,
I am new to Apache Spark. I need to process different time-series data
(numeric values that depend on time) and react to the following conditions:
1. Data is changing up or down too fast.
2. Data is changing constantly up or down too long.
For example, if the data have changed 30% up or down in the
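Both conditions can be checked with simple sliding comparisons over consecutive points. A minimal plain-Python sketch (the 30% threshold comes from the question; the run length is an example value):

```python
def too_fast(values, pct=0.30):
    """Condition 1: any consecutive change exceeds pct (e.g. 30%) up or down."""
    return any(a != 0 and abs(b - a) / abs(a) > pct
               for a, b in zip(values, values[1:]))

def too_long(values, max_run=3):
    """Condition 2: values move in the same direction for more than max_run steps."""
    run, last = 0, 0
    for a, b in zip(values, values[1:]):
        d = (b > a) - (b < a)  # +1 rising, -1 falling, 0 flat
        run = run + 1 if d != 0 and d == last else (1 if d != 0 else 0)
        last = d
        if run > max_run:
            return True
    return False
```

In Spark this logic would run inside a window function or a streaming aggregation; the sketch only shows the per-series check.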
Hi All,
I am not sure if the users list is the right list for this query, but I am
hoping that if this is the wrong forum, someone will point me to the right one.
I work for a company which uses proprietary analytical ecosystem. I am
evangelising open-source and have been requested by management to
Resolved :)
Hi, just looping back on this (thanks for everyone's help).
In jupyter notebook the following command works and properly loads in the
Kafka jar files.
# Spin up a local Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('my_awesome')\
    .config('spark.jars.packages',
            'org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0')\
    .getOrCreate()  # package version must match your Spark/Scala build
I'm running a Jupyter-Spark setup and I want to benchmark my cluster with
different input parameters. To make sure the environment stays the same, I'm
trying to reset (restart) the SparkContext. Here is some code:
temp_result_parquet = os.path.normpath('/home/spark_tmp_parquet')
Hi, does anyone know how to fix the below error?
java.lang.NoSuchMethodError:
Thanks Michael – this is a great article… very helpful
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Wednesday, August 23, 2017 4:33 PM
To: JG Perrin
Cc: user@spark.apache.org
Subject: Re: Joining 2 dataframes, getting result as nested list/structure in
I have written a unit test which uses multithreading to start and stop a
Spark Streaming job and a Kafka producer. All the dependencies have been
declared in the Maven pom.xml file.
When I run the test, once all the Kafka messages are read and the
threads are stopped, I continue to get the below
Are there any plans to include it in the future releases of Spark ?
Regards,
Aviral Agarwal
On Thu, Aug 24, 2017 at 3:11 PM, Akhil Das wrote:
> How are you reading the data? Its clearly saying
> *java.lang.NumberFormatException:
> For input string: "0645253_0001" *
>
> On
Hi Sea,
Could you let us know which ML algorithm you use? What's the number of
instances and the dimension of your dataset?
AFAIK, Spark MLlib can train a model with several million features if you
configure it correctly.
Thanks
Yanbo
On Thu, Aug 24, 2017 at 7:07 AM, Suzen, Mehmet
Usually analysts will not have access to data stored in the PCI zone; you
could write the data out to a table for the analysts by masking the
sensitive information.
Eg:
> val mask_udf = udf((info: String) => info.patch(0, "*" * 12, 7))
> val df = sc.parallelize(Seq(("user1",
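The same masking idea, sketched in plain Python for clarity. This variant keeps only the last four characters visible (a common convention; the Scala udf above instead overwrites the first seven characters with twelve asterisks), and the card number is a hypothetical example:

```python
def mask(info: str, keep_last: int = 4) -> str:
    """Replace everything except the last keep_last characters with asterisks."""
    if len(info) <= keep_last:
        return info
    return "*" * (len(info) - keep_last) + info[-keep_last:]

masked = mask("4111111111111111")  # hypothetical card number
```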
How many iterations are you doing on the data? Like Jörn said, you don't
necessarily need a billion samples for linear regression.
On Tue, Aug 22, 2017 at 6:28 PM, Sea aj wrote:
> Jorn,
>
> My question is not about the model type but instead, the spark capability
> on reusing
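To make the "you don't necessarily need a billion samples" point concrete, a plain-Python sketch fitting simple least squares on a modest synthetic sample; even with noise, a thousand points recover the coefficients closely (data and seed are invented for illustration):

```python
import random

random.seed(0)
# Synthetic data: y = 3x + 5 plus Gaussian noise.
xs = [random.uniform(0, 10) for _ in range(1000)]
ys = [3 * x + 5 + random.gauss(0, 0.5) for x in xs]

# Closed-form simple linear regression (ordinary least squares).
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx
```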
How are you reading the data? Its clearly saying
*java.lang.NumberFormatException:
For input string: "0645253_0001" *
On Tue, Aug 22, 2017 at 7:40 PM, Aviral Agarwal
wrote:
> Hi,
>
> I am trying to read hive orc transaction table through Spark but I am
> getting the
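The NumberFormatException means a string column contains values like "0645253_0001" that cannot be parsed as numbers. A plain-Python sketch of screening such values before casting (the sample list is illustrative; in Spark you would filter or clean the column instead):

```python
# Sample values; the first mirrors the string from the exception message.
ids = ["0645253_0001", "0645253", "12345"]

# Keep only values made entirely of digits, which are safe to cast to int.
numeric_ids = [s for s in ids if s.isdigit()]
```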
Have you tried setting spark.executor.instances (currently 0) to a positive
non-zero value? Also, since it's a streaming application, set executor cores > 1.
On Wed, Aug 23, 2017 at 3:38 AM, Karthik Palaniappan wrote:
> I ran the HdfsWordCount example using this command:
>
>
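A configuration sketch of that suggestion (the values are examples, not recommendations):

```
spark.executor.instances  2
spark.executor.cores      2
```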