Also, I should mention that the `stream` DStream definition is:
JavaInputDStream<ConsumerRecord<String, String>> stream =
    KafkaUtils.createDirectStream(
        ssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(TOPIC, kafkaParams)
    );
The following code is in Spark Streaming:
JavaDStream<String> results = stream.map(record ->
    SparkTest.getTime(record.value()) + ":"
        + Long.toString(System.currentTimeMillis()) + ":"
        + Arrays.deepToString(SparkTest.finall(record.value()))
        + ":" +
Hi All,
we are consuming messages from Kafka using a Spark DStream. Once the processing
is done, we would like to update/insert the data into the database in bulk.
I was wondering what the best solution for this might be. Our Postgres
database table is not partitioned.
Thank you,
Ali
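One commonly used pattern for this is a batched JDBC upsert inside foreachPartition, so each partition reuses a single connection. This is only a sketch under assumptions: the processed DStream (here `processedStream`) holds (id, payload) pairs, the table, column names, and connection details are placeholders, and ON CONFLICT needs Postgres 9.5+.

import java.sql.DriverManager

processedStream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // One connection and one prepared statement per partition.
    val conn = DriverManager.getConnection(
      "jdbc:postgresql://db-host:5432/mydb", "user", "password")
    conn.setAutoCommit(false)
    val stmt = conn.prepareStatement(
      """INSERT INTO events (id, payload) VALUES (?, ?)
        |ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload""".stripMargin)
    try {
      records.foreach { case (id, payload) =>
        stmt.setLong(1, id)
        stmt.setString(2, payload)
        stmt.addBatch()            // accumulate rows, send them in one round trip
      }
      stmt.executeBatch()
      conn.commit()
    } finally {
      stmt.close()
      conn.close()
    }
  }
}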
I am trying to connect Spark Streaming with Flume in pull mode.
I have three machines, and each one runs Spark and a Flume agent at the same
time; they are master, slave1, and slave2.
I have set the Flume sink to slave1 on port 6689. Telnet to slave1 6689 from the
other two machines works well.
In my code, I
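For reference, a minimal sketch of the pull-based (polling) receiver setup, using the slave1:6689 sink from the message (the app name and batch interval are assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePullExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FlumePullExample")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Pull events from the Flume Spark sink listening on slave1:6689.
    val flumeStream = FlumeUtils.createPollingStream(ssc, "slave1", 6689)
    flumeStream.map(event => new String(event.event.getBody.array())).print()

    ssc.start()
    ssc.awaitTermination()
  }
}

This assumes the spark-streaming-flume artifact is on the classpath and the Flume agent is configured with the org.apache.spark.streaming.flume.sink.SparkSink sink type described in the Spark Flume integration guide.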
Hi all,
Could anyone advise on how to control logging in
com.holdenkarau.spark.testing?
There are loads of Spark logging statements every time I run a test.
I tried to disable Spark logging using the statements below, but with no success:
import org.apache.log4j.Logger
import
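For what it's worth, a commonly suggested approach is to raise the log level for the Spark packages via log4j before the suite starts (a sketch only; whether spark-testing-base re-initializes logging is something to verify):

import org.apache.log4j.{Level, Logger}

// Silence the chattiest loggers before the SparkContext is created.
Logger.getLogger("org").setLevel(Level.WARN)
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Logger.getLogger("akka").setLevel(Level.WARN)

Alternatively, a log4j.properties file under src/test/resources with the same package levels usually achieves the same effect.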
Hi All,
I have messages in a queue that might be coming in with a few different
schemas, like:
msg1 schema1, msg2 schema2, msg3 schema3, msg4 schema1
I want to put all of this in one DataFrame. Is this possible with Structured
Streaming?
I am using Spark 2.2.0.
Thanks!
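One possible approach (a sketch only; the Kafka topic, broker, schemas, and how the message types are told apart are all assumptions) is to read the value as a plain string and apply from_json once per candidate schema; fields that do not match come back null, so downstream logic can pick the non-null struct per row:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("MultiSchemaStream").getOrCreate()

// Hypothetical schemas for two of the message types.
val schema1 = new StructType().add("id", LongType).add("name", StringType)
val schema2 = new StructType().add("id", LongType).add("amount", DoubleType)

val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "mixed-topic")
  .load()
  .selectExpr("CAST(value AS STRING) AS json")

// Parse the same string column against each schema.
val parsed = raw
  .withColumn("as_schema1", from_json(col("json"), schema1))
  .withColumn("as_schema2", from_json(col("json"), schema2))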
The train method is on the companion object:
https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.mllib.clustering.KMeans$
Here is a decent resource on companion object usage:
https://docs.scala-lang.org/tour/singleton-objects.html
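To illustrate (a sketch; the input path and the k / iteration values are placeholders), train is called on the KMeans companion object rather than on an instance:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val sc = new SparkContext(new SparkConf().setAppName("KMeansExample"))

// Each input line is expected to hold space-separated numeric features.
val data = sc.textFile("data/mllib/kmeans_data.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

// `train` lives on the companion object, so no `new KMeans()` is needed.
val model = KMeans.train(data, k = 2, maxIterations = 20)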
On Wed, Dec 13, 2017 at 9:16 AM Michael
It should be within your yarn-site.xml config file. The parameter name is
yarn.resourcemanager.am.max-attempts.
The directory should be /usr/lib/spark/conf/yarn-conf. Try to find this
directory on your gateway node if you are using a Cloudera distribution.
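Relatedly, the limit can also be set per application on the Spark side (a sketch; the value 2 is just an example), and it is capped by the cluster-wide YARN setting above:

import org.apache.spark.SparkConf

// spark.yarn.maxAppAttempts limits attempts for this application only;
// yarn.resourcemanager.am.max-attempts remains the global upper bound.
val conf = new SparkConf()
  .setAppName("MyStreamingApp")
  .set("spark.yarn.maxAppAttempts", "2")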
On Wed, Dec 13, 2017 at 2:33 PM, Subhash Sriram
There are some more properties specifically for YARN here:
http://spark.apache.org/docs/latest/running-on-yarn.html
Thanks,
Subhash
On Wed, Dec 13, 2017 at 2:32 PM, Subhash Sriram
wrote:
> http://spark.apache.org/docs/latest/configuration.html
>
> On Wed, Dec 13,
http://spark.apache.org/docs/latest/configuration.html
On Wed, Dec 13, 2017 at 2:31 PM, Toy wrote:
> Hi,
>
> Can you point me to the config for that please?
>
> On Wed, 13 Dec 2017 at 14:23 Marcelo Vanzin wrote:
>
>> On Wed, Dec 13, 2017 at 11:21 AM,
Hi,
Can you point me to the config for that please?
On Wed, 13 Dec 2017 at 14:23 Marcelo Vanzin wrote:
> On Wed, Dec 13, 2017 at 11:21 AM, Toy wrote:
> > I'm wondering why am I seeing 5 attempts for my Spark application? Does
> Spark application
On Wed, Dec 13, 2017 at 11:21 AM, Toy wrote:
> I'm wondering why am I seeing 5 attempts for my Spark application? Does Spark
> application restart itself?
It restarts itself if it fails (up to a limit that can be configured
either per Spark application or globally in
Hi,
I'm wondering why I'm seeing 5 attempts for my Spark application. Does the
Spark application restart itself?
[image: Screen Shot 2017-12-13 at 2.18.03 PM.png]
I am reading some data into a DataFrame from a DynamoDB table:
val data = spark.read.dynamodb("table")
data.filter($"field1".like("%hello%")).createOrReplaceTempView("temp")
spark.sql("select * from temp").show()
When I run the last statement I get results. If, however, I try to do:
Hi,
Just came across this while looking at the docs on how to use Spark’s KMeans
clustering.
Note: This appears to be true in both 2.1 and 2.2 documentation.
The overview page:
https://spark.apache.org/docs/2.1.0/mllib-clustering.html#k-means
Here, the example contains the following line:
Hi Arkadiusz,
Try 'rooting' your import. It looks like the import is being interpreted as
relative to the previous one.
'Rooting' is done by adding the '_root_' prefix to your import:
import _root_.org.apache.spark.streaming.kafka.KafkaUtils
import
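For context (an illustrative sketch; the package names are hypothetical), the relative interpretation bites when your own package hierarchy contains a member named org:

// File: com/example/streaming/Consumer.scala
package com.example.streaming

// If the project also defines a package such as com.example.org, the bare
// name `org` in an import can resolve to that local package instead of the
// top-level one, and `import org.apache.spark...` fails to compile.
// The _root_ prefix forces resolution from the top of the package hierarchy.
import _root_.org.apache.spark.streaming.kafka.KafkaUtils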
Hi,
I am trying to test Spark Streaming 2.2.0 with Confluent 3.3.0.
I get a lot of errors during compilation; this is my sbt:
lazy val sparkstreaming = (project in file("."))
  .settings(
    name := "sparkstreaming",
    organization := "org.arek",
    version := "0.1-SNAPSHOT",
    scalaVersion
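For comparison, a sketch of build settings that typically compile against Spark 2.2.0 with the Kafka 0.10 integration (the Scala version, the Confluent Maven repository, and the Avro serializer dependency are assumptions to adjust to your setup):

lazy val sparkstreaming = (project in file("."))
  .settings(
    name := "sparkstreaming",
    organization := "org.arek",
    version := "0.1-SNAPSHOT",
    scalaVersion := "2.11.11",
    resolvers += "confluent" at "http://packages.confluent.io/maven/",
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-streaming"            % "2.2.0" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.2.0",
      // Only needed if messages are Avro-encoded via the Confluent Schema Registry.
      "io.confluent"     %  "kafka-avro-serializer"      % "3.3.0"
    )
  )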
Hi,
Would it be possible to determine Cook's distance using Spark?
thanks,
Richard
S3 can work out cheaper than HDFS on Amazon.
As you correctly describe, it does not support data locality; the data is
transferred to the workers.
Depending on your use case it can make sense to have HDFS as a temporary
“cache” for S3 data.
> On 13. Dec 2017, at 09:39, Philip Lee
> When Spark loads data from S3 (sc.textFile('s3://...')), how will all the data
> be spread across the workers?
The data is read by the workers. Just make sure that the data is splittable,
either by using a splittable format or by passing a list of files, e.g.
sc.textFile('s3://.../*.txt')
to achieve full parallelism.
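For instance (paths are placeholders), both a glob and an explicit comma-separated list of files produce many input splits, so every worker gets a share of the read:

// One partition per input split; all workers participate in the read.
val logs = sc.textFile("s3a://my-bucket/logs/2017/12/*.txt")

// An explicit comma-separated list of paths works as well.
val some = sc.textFile("s3a://my-bucket/a.txt,s3a://my-bucket/b.txt")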
Hi
I have a few questions about the structure of HDFS and S3 when Spark loads
data from these two storage systems.
Generally, when Spark loads data from HDFS, HDFS supports data locality and
already owns the distributed files on the datanodes, right? Spark can then just
process the data on the workers.
What about S3?