puts an empty file with the same
filename as the original file, but the extension ".uploaded". This is an
indicator that the file has been fully (!) written to the fs. Otherwise you
risk only reading parts of the file.
Then, you can have a file system listener for this .uploaded file.
Spark Streaming or Kafka are not needed/suitable if the server is a file
server. You can use Oozie (maybe with a simple custom action) to poll for
.uploaded files and transmit them.
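A minimal sketch of that polling pattern (assuming the marker file is named "<datafile>.uploaded" and sits next to the data file in a local or mounted landing directory; all names here are illustrative):

import java.io.File

object UploadedFilePoller {
  // return only data files whose ".uploaded" marker already exists
  def readyFiles(landingDir: String): Seq[File] = {
    val all = Option(new File(landingDir).listFiles()).getOrElse(Array.empty[File]).toSeq
    val markers = all.map(_.getName).filter(_.endsWith(".uploaded")).map(_.stripSuffix(".uploaded")).toSet
    all.filter(f => !f.getName.endsWith(".uploaded") && markers.contains(f.getName))
  }

  def main(args: Array[String]): Unit = {
    while (true) {
      readyFiles("/data/landing").foreach { f =>
        println(s"fully uploaded, safe to read: ${f.getPath}")
        // process the file here, then archive/delete it together with its marker
      }
      Thread.sleep(10000) // poll every 10 seconds
    }
  }
}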
> On 15 Sep 2016, at 19:00, Kappaganthu, Sivaram (ES) wrote:
1) Is it possible for Spark Streaming to trigger a job after a file is placed,
instead of triggering a job at a fixed batch interval?
2) If it is not possible with Spark Streaming, can we control this with
Kafka/Flume?
Thanks,
Sivaram
KafkaRDD
>> respectively if you want to see an example
>>
It's one RDD within a time slot, with multiple partitions.
> Anyway, what you could do is create a directory per hour/day etc. (whatever
Hi everyone,
I'm starting out in Spark Streaming and would like to know some things about how
data arrives.
I know that SS uses micro-batches and they are received by workers and sent
to an RDD. The master, at defined intervals, receives a pointer to the micro-batch
in the RDD and can use it to process the data.
Hi,
An alternative to Spark could be Flume to store data from Kafka to HDFS. It
also provides some reliability mechanisms and has been explicitly designed for
import/export and is tested. Not sure if I would go for Spark Streaming if the
use case is only storing, but I do not have the full picture of your use case.
Hi,
I have a Spark Streaming job that reads messages/prices from Kafka and writes
them as text files to HDFS.
This is pretty efficient. Its only function is to persist the incoming
messages to HDFS.
This is what it does:
dstream.foreachRDD { pricesRDD =>
  val x = pricesRDD.count
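For reference, a minimal self-contained sketch of this persist-only pattern (the broker, topic and output path are assumptions; writing one directory per batch avoids overwriting files):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

val sparkConf = new SparkConf().setAppName("PricesToHdfs")
val ssc = new StreamingContext(sparkConf, Seconds(10))
val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, Map("metadata.broker.list" -> "localhost:9092"), Set("prices"))

dstream.foreachRDD { (pricesRDD, batchTime: Time) =>
  if (!pricesRDD.isEmpty()) {
    pricesRDD.map(_._2) // keep only the message value
      .saveAsTextFile(s"hdfs:///data/prices/batch-${batchTime.milliseconds}")
  }
}
ssc.start()
ssc.awaitTermination()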
Ah, that makes it much clearer, thanks!
It also brings up an additional question: who/what decides on the
partitioning? Does Spark Streaming decide to divide a micro batch/RDD into
more than 1 partition based on size? Or is it something that the "source"
(SocketStream, KafkaStream etc.) decides?
Thanks, but that thread does not answer my questions, which are about the
distributed nature of RDDs vs the small nature of "micro batches" and on
how Spark Streaming distributes work.
On Tue, Sep 13, 2016 at 3:34 PM, Mich Talebzadeh
wrote:
Hi Daan,
You may find this link Re: Is "spark streaming" streaming or mini-batch?
<https://www.mail-archive.com/user@spark.apache.org/msg55914.html>
helpful. This was a thread in this forum not long ago.
HTH
Dr Mich Talebzadeh
Hi all!
When reading about Spark Streaming and its execution model, I see diagrams
like this a lot:
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n27699/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala-31-638.jpg>
It does a fine job explaining
hed without really doing the
work.
Help is hugely appreciated.
- You're not using the latest version; the latest version is for Spark 2.0.
  There are two versions of the connector for Spark 2.0, one for Kafka 0.8 or
  higher, and one for Kafka 0.10 or higher.
- Committing individual messages to Kafka doesn't make any sense; Spark
  Streaming deals with batches. If you're doing any aggregations that involve
  shuffling, there isn't
I am using the latest streaming Kafka connector
org.apache.spark : spark-streaming-kafka_2.11 : 1.6.2
I am facing the problem that a message is delivered two times to my
consumers. These two deliveries are 10+ seconds apart; it looks like this is
caused by my lengthy message processing (took about 60
Hi,
Within spark streaming I have identified the data that I want to persist to
a Hive table. Table is already created.
These are the values for columns extracted
for(line <- pricesRDD.collect.toArray)
{
var index = line._2.split(',').view(0).toInt
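One way to do this (a sketch, not the poster's exact code) is to build a DataFrame from the RDD inside foreachRDD and append it into the existing Hive table; the column layout and table name below are assumptions:

import org.apache.spark.sql.hive.HiveContext

dstream.foreachRDD { pricesRDD =>
  // in a long-running job, reuse one HiveContext instead of creating it per batch
  val hiveContext = new HiveContext(pricesRDD.sparkContext)
  import hiveContext.implicits._

  val df = pricesRDD
    .map(_._2.split(','))
    .map(cols => (cols(0).toInt, cols(1), cols(2).toDouble)) // index, ticker, price (assumed layout)
    .toDF("index", "ticker", "price")

  df.write.mode("append").insertInto("marketdata.prices") // the table already exists
}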
cutors",
>>sparkWorkerUiRetainedExecutorsValue)
>> sparkConf.set("sparkWorkerUiRetainedStages",
>>sparkWorkerUiRetainedStagesValue)
>> sparkConf.set("sparkUiRetainedJobs",
>> sparkUiRetainedJob
sparkConf.set("spark.streaming.stopGracefullyOnShutdown", "true")
sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
sparkConf.set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite", "true")
Hi,
This may not be feasible in Spark Streaming.
I am trying to create a HiveContext in Spark Streaming within the streaming
context:
// Create a local StreamingContext with two working threads and batch
// interval of 2 seconds.
val sparkConf = new SparkConf().setAppName
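The usual workaround (a sketch along the lines of the pattern in the Spark Streaming guide, adapted here to HiveContext; names are illustrative) is to keep a lazily created singleton and fetch it inside foreachRDD from the RDD's SparkContext, rather than building it next to the StreamingContext:

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

object HiveContextSingleton {
  @transient private var instance: HiveContext = _
  def getInstance(sc: SparkContext): HiveContext = synchronized {
    if (instance == null) instance = new HiveContext(sc)
    instance
  }
}

dstream.foreachRDD { rdd =>
  val hiveContext = HiveContextSingleton.getInstance(rdd.sparkContext)
  hiveContext.sql("SHOW TABLES").show() // any Hive query / DataFrame work goes here
}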
Any help on this is warmly appreciated.
On Tuesday, 6 September 2016, 21:31, Ashok Kumar
wrote:
Hello Gurus,
I am creating some figures and feed them into Kafka and then spark streaming.
It works OK but I have the following issue.
For now as a test I sent 5 prices in each batch interval
In general, see the material linked from
https://github.com/koeninger/kafka-exactly-once if you want a better
understanding of the direct stream.
For spark-streaming-kafka-0-8, the direct stream doesn't really care
about consumer group, since it uses the simple consumer. For the 0.10
ve
Hello everybody,
I am trying to understand how Kafka Direct Stream works. I'm interested in
having a production ready Spark Streaming application that consumes a Kafka
topic. But I need to guarantee there's (almost) no downtime, specially
during deploys (and submit) of new versions. Wha
Hello Gurus,
I am creating some figures and feed them into Kafka and then spark streaming.
It works OK but I have the following issue.
For now as a test I sent 5 prices in each batch interval. In the loop code this
is what is happening:
dstream.foreachRDD { rdd => val x = rdd.count
Hi,
Has anyone had experience of using such messaging system like Kafka to
connect Reuters Market Data System to Spark Streaming by any chance.
Thanks
Dr Mich Talebzadeh
Hi Spark community,
I have a Spark streaming application that reads from a Kinesis stream and
processes data. It calls some services which can experience transient
failures. When such a transient failure happens, a retry mechanism kicks
into action.
For the shutdown use case, I have a separate
It has been a while since the last update on this thread. Now that Spark 2.0 is
available, do you guys know if there's any progress on Dynamic Allocation &
Spark Streaming?
On Mon, Oct 19, 2015 at 1:13 PM, robert towne
wrote:
> I have watched a few videos from Databricks/Andrew Or aro
How to attach a ReceiverSupervisor for a Custom receiver in Spark Streaming?
Is it possible to get a sequence number for the current batch (ie. first
batch is 0, second is 1, etc?).
I understand that I cannot use the Spark Streaming window operation without
checkpointing to HDFS, but without window operations I don't think we can do much
with Spark Streaming. So since it is very essential, can I use Cassandra as a
distributed storage? If so, can I see an example of how I can
Hi Everyone,
Does anyone know how to write parquet files after parsing data in Spark
Streaming?
Thanks,
Kevin.
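One common approach (a sketch, assuming the parsed records fit a case class; the parse function and output path are illustrative) is to convert each batch to a DataFrame and append it as parquet:

import org.apache.spark.sql.SQLContext

case class Event(id: String, value: Double, ts: Long)

dstream.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._

  rdd.map(line => parseEvent(line)) // parseEvent: String => Event, your own parsing
    .toDF()
    .write.mode("append").parquet("hdfs:///data/events_parquet")
}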
Hi Akhilesh,
Thanks for your response.
I am using Spark 1.6.1 and what I am trying to do is to ingest parquet
files into Spark Streaming, not in batch operations.
val ssc = new StreamingContext(sc, Seconds(5))
ssc.sparkContext.hadoopConfiguration.set("parquet.read.support.class", "parquet.avro.AvroReadSupport")
Hi Renato,
Which version of Spark are you using?
If the Spark version is 1.3.0 or later, then you can use SQLContext to read the
parquet file, which will give you a DataFrame. Please follow the link below:
https://spark.apache.org/docs/1.5.0/sql-programming-guide.html#loading-data-programmatically
Thanks
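For example, a short sketch of that batch-style read (the path is illustrative):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read.parquet("hdfs:///data/input_parquet")
df.printSchema()
df.show(5)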
Anybody? I think Rory also didn't get an answer from the list ...
https://mail-archives.apache.org/mod_mbox/spark-user/201602.mbox/%3ccac+fre14pv5nvqhtbvqdc+6dkxo73odazfqslbso8f94ozo...@mail.gmail.com%3E
2016-08-26 17:42 GMT+02:00 Renato Marroquín Mogrovejo <
renatoj.marroq...@gmail.com>:
> Hi
Regards,
Jacek Laskowski
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski
On Fri, Aug 26, 2016 at 9:42 PM, kant kodali wrote:
> is there a HTTP2 (v2) endpoint for Spark Streaming?
>
is there a HTTP2 (v2) endpoint for Spark Streaming?
Hi all,
I am trying to use parquet files as input for DStream operations, but I
can't find any documentation or example. The only thing I found was [1] but
I also get the same error as in the post (Class
parquet.avro.AvroReadSupport not found).
Ideally I would like to have something like this:
Hi,
Released the latest version of the receiver-based Kafka Consumer for Spark
Streaming.
The receiver is compatible with Kafka versions 0.8.x, 0.9.x and 0.10.x and all
Spark versions.
Available at Spark Packages :
https://spark-packages.org/package/dibbhatt/kafka-spark-consumer
Also at github : https
Hello,
We have a Spark streaming application (running Spark 1.6.1) that consumes
data from a message queue. The application is running in local[*] mode so
driver and executors are in a single JVM.
The issue that we are dealing with these days is that if any of our lambda
functions throw any
Hi Guys,
I am new to Spark, but I am wondering how I compute the difference given a
bidirectional stream of numbers using Spark Streaming? To put it more concretely,
say Bank A is sending money to Bank B and Bank B is sending money to Bank A
throughout the day, such that at any given time we want
Is "spark streaming" streaming or mini-batch?
I look at something Like Complex Event Processing (CEP) which is a leader
use case for data streaming (and I am experimenting with Spark for it) and
in the realm of CEP there is really no such thing as continuous data
streaming. The point is
Hi Steve,
Thanks a lot for such an elaborate email (though it brought more
questions than answers, but that's because I'm new to YARN in general and
Kerberos/tokens/tickets in particular).
Thanks also for liking my notes. I'm very honoured to hear it from
you. I value your work with Spark/YARN/Had
On 23 Aug 2016, at 17:58, Mich Talebzadeh
wrote:
In general depending what you are doing you can tighten above parameters. For
example if you are using Spark Streaming for Anti-fraud detection, you may
stream data in at 2 seconds batch interval, Kee
> On 23 Aug 2016, at 11:26, Jacek Laskowski wrote:
>
> Hi Steve,
>
> Could you share your opinion on whether the token gets renewed or not?
> Is the token going to expire after 7 days anyway?
There's Hadoop service tokens, and Kerberos tickets. They are similar-ish, but
not quite the same.
Thanks everyone for clarifying.
On Tue, Aug 23, 2016 at 9:11 PM, Aseem Bansal wrote:
I am running a Spark Streaming process where I am getting a batch of data every n
seconds. I am using repartition to scale the application. Since the repartition
size is fixed, we are getting lots of small files when the batch size is very small.
Is there any way I can change the partitioner logic
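One option (a sketch of a possible workaround rather than a true custom partitioner; the records-per-partition target is an assumed tuning knob) is to derive the partition count from the batch size and coalesce before writing, so small batches produce few files:

dstream.foreachRDD { rdd =>
  val targetRecordsPerPartition = 100000L // assumed target, tune for your data
  val count = rdd.count()                 // note: this triggers an extra job per batch
  val numPartitions = math.max(1, (count / targetRecordsPerPartition).toInt)
  rdd.coalesce(numPartitions)
    .saveAsTextFile(s"hdfs:///out/batch-${System.currentTimeMillis()}")
}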
seconds over a previous 10-minute period of time.
In general, depending on what you are doing, you can tighten the above parameters.
For example, if you are using Spark Streaming for anti-fraud detection, you
may stream data in at a 2-second batch interval, keep your window length at
4 seconds and your sliding interval
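As a concrete sketch of that batch/window/slide relationship (the source and key extraction are illustrative; window and slide must be multiples of the batch interval):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("WindowDemo")
val ssc = new StreamingContext(sparkConf, Seconds(2))   // 2-second batch interval
val events = ssc.socketTextStream("localhost", 9999)    // stand-in for the real source
val counts = events
  .map(line => (line.split(',')(0), 1))                 // key by the first field
  .reduceByKeyAndWindow(_ + _, Seconds(4), Seconds(2))  // 4s window, sliding every 2s
counts.print()
ssc.start()
ssc.awaitTermination()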
Spark Streaming does not process one event at a time, which is in general I
think what people call "streaming." It instead processes groups of events.
Each group is a "micro-batch" that gets processed at the same time.
Streaming theoretically always has better latency because th
It's based on the "micro-batching" model.
I was reading this article https://www.inovex.de/blog/storm-in-a-teacup/
and it mentioned that Spark Streaming is actually mini-batch, not actual
streaming.
I have not used streaming and I am not sure what the difference between the two
terms is. Hence I could not make a judgement myself.
files, and logs can be accessed using YARN’s log utility.
You can get more information here:
https://spark.apache.org/docs/latest/running-on-yarn.html#configuration
Hi All,
I am running Java Spark Streaming jobs in yarn-client mode. Is there a way I
can manage log rollover on the edge node? I have a 10-second batch and the log
file volume is huge.
Thanks,
Pradeep
Hi Steve,
Could you share your opinion on whether the token gets renewed or not?
Is the token going to expire after 7 days anyway? Why is the change in
the recent version for token renewal? See
https://github.com/apache/spark/commit/ab648c0004cfb20d53554ab333dd2d198cb94ffa
Regards,
Jacek Laskowski
On 21 Aug 2016, at 20:43, Mich Talebzadeh
wrote:
Hi Kamesh,
The message you are getting after 7 days:
PriviledgedActionException as:sys_bio_replicator (auth:KERBEROS)
cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$
Ohh, that explains the reason.
My use case does not need state management,
so I guess I am better off without checkpointing.
Thank you for the clarification.
Regards,
Chandan
On Sat, Aug 20, 2016 at 8:24 AM, Cody Koeninger wrote:
> Checkpointing is required to be turned on in certain situations (e.g
Checkpointing is required to be turned on in certain situations (e.g.
updateStateByKey), but you're certainly not required to rely on it for
fault tolerance. I try not to.
On Fri, Aug 19, 2016 at 1:51 PM, chandan prakash
wrote:
> Thanks Cody for the pointer.
>
> I am able to do this now. Not us
Thanks
--jars /home/hduser/jars/spark-streaming-kafka-assembly_*2.11*-1.6.1.jar
sorted it out
You seem to be combining Scala 2.10 and 2.11 libraries - your sbt project is
2.11, whereas you are trying to pull in spark-streaming-kafka-assembly_
*2.10*-1.6.1.jar.
On Fri, Aug 19, 2016 at 11:24 AM, Mich Talebzadeh wrote:
> Hi,
>
> My spark streaming app with 1.6.1 used to work.
>
Hi,
My spark streaming app with 1.6.1 used to work.
Now with
scala> sc.version
res0: String = 2.0.0
Compiling with sbt assembly as before, with the following:
version := "1.0",
scalaVersion := "2.11.8",
mainClass in Compile := Some("
Thanks Cody for the pointer.
I am able to do this now. Not using checkpointing. Rather, storing offsets
in ZooKeeper for fault tolerance.
Spark config changes are now getting reflected on code deployment.
Using this API:
KafkaUtils.createDirectStream[String, String, StringDecoder,
StringDecoder, (String, String)]
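For reference, a sketch of that API (Spark 1.6 / Kafka 0.8 direct stream); the ZooKeeper load/save helpers are placeholders for your own offset store, and ssc/kafkaParams are assumed to be set up as usual:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val fromOffsets: Map[TopicAndPartition, Long] = loadOffsetsFromZk()   // your own helper

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // process the batch first, then persist the offsets
  saveOffsetsToZk(offsetRanges)                                       // your own helper
}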
Hi,
Is there a way to specify in createDirectStream to receive only the last 'n'
offsets of a specific topic and partition? I don't want to filter them out in
foreachRDD.
Yes,
I looked into the source code implementation. sparkConf is serialized and
saved during checkpointing and re-created from the checkpoint directory at
the time of restart. So any sparkConf parameter which you load from
application.config and set in the sparkConf object in code cannot be changed
and re
Checkpointing is not kafka-specific. It encompasses metadata about the
application. You can't re-use a checkpoint if your application has changed.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#upgrading-application-code
On Thu, Aug 18, 2016 at 4:39 AM, chandan prakash
wrote:
Hi all,
I am running a spark streaming application that store events into
Secure(Kerborized) HBase cluster. I launched this spark streaming
application by passing --principal and --keytab. Despite this, the Spark
Streaming application is failing after 7 days with a token issue. Can
someone please
Is it possible that I use the checkpoint directory to restart streaming, but
with a modified parameter value in the config file (e.g. username/password for
the db connection)?
Thanks in advance.
Regards,
Chandan
On Thu, Aug 18, 2016 at 1:10 PM, chandan prakash
wrote:
> Hi,
> I am using direct kafka with ch
Hi,
I am using direct kafka with checkpointing of offsets same as :
https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/IdempotentExample.scala
I need to change some parameters like db connection params:
username/password for the db connection.
I stopped streaming gracefully
A few months ago, I started an empirical study of several stream processing
engines, including but not limited to Spark Streaming.
As the benchmark should extend its scope beyond performance metrics
such as throughput and latency, I've focused on fault
Hi folks,
I am using Spark Streaming, and I am not clear if there is a smart way to
restart the app once it fails. Currently we just have one cron job to check
if the job is running every 2 or 5 minutes and restart the app when
necessary.
According to spark streaming guide:
- *YARN* - Yarn
The algorithm update is just broken into 2 steps: trainOn - to learn/update
the cluster centers, and predictOn - to predict cluster assignments on data.
The StreamingKMeansExample you reference breaks up data into training and
test because you might want to score the predictions. If you don't care
about
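A small sketch of that two-step flow with MLlib's StreamingKMeans (the input streams, dimension and k are illustrative; trainingStream and testStream are assumed DStream[String] sources):

import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors

val model = new StreamingKMeans()
  .setK(3)
  .setDecayFactor(1.0)
  .setRandomCenters(2, 0.0)   // dimension, initial weight

val trainingData = trainingStream.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
model.trainOn(trainingData)               // updates the cluster centers on every batch

val testData = testStream.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
model.predictOn(testData).print()         // assigns each incoming point to a cluster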
Dear All,
I was wondering why there is training data and testing data in k-means?
Shouldn't it be unsupervised learning with just access to the stream data?
I found a similar question but couldn't understand the answer.
http://stackoverflow.com/questions/30972057/is-the-streaming-k-means-clustering-pr
> > 16/08/10 18:16:27 INFO DiskBlockManager: Created local directory at
> > /tmp/blockmgr-0f110a7e-1edb-4140-9243-5579a7bc95ee
> > 16/08/10 18:16:27 INFO MemoryStore: MemoryStore started with capacity
> 511.5
> > MB
> > 16/08/10 18:16:27 INFO SparkEnv: Registering Outp
ileServer: HTTP File server directory is
/tmp/spark-b60f692d-f5ea-44c1-aa21-ae132813828c/httpd-2b2a4e68-2952-41b0-a11b-f07860682749
16/08/10 18:16:27 INFO HttpServer: Starting HTTP Server
16/08/10 18:16:27 INFO Utils: Successfully started service 'HTTP file
server' on port 59491.
16/08/10 18
What is the best practice for processing files from an S3 bucket in Spark file
streaming? I keep on getting files in the S3 path and have to process those in a
batch, but while processing some other files might come up. In this streaming
job, should I have to move files after the end of our streaming batch
7:52 INFO ShutdownHookManager: Deleting directory
> /tmp/spark-6df1d6aa-896e-46e1-a2ed-199343dad0e2/httpd-
> 07b9c1b6-01db-45b5-9302-d2f67f7c490e
> 16/08/10 12:27:52 INFO RemoteActorRefProvider$RemotingTerminator:
> Remoting shut down.
> 16/08/10 12:27:52 INFO ShutdownHookManager: D
I am testing with one partition now. I am using Kafka 0.9 and Spark 1.6.1
(Scala 2.11). Just start with one topic first and then add more. I am not
partitioning the topic.
HTH,
Regards,
Sivakumaran
> On 10-Aug-2016, at 5:56 AM, Diwakar Dhanuskodi
> wrote:
>
> Hi Siva,
>
> Does topic has
12:27:52 INFO ShutdownHookManager: Deleting directory
/tmp/spark-6df1d6aa-896e-46e1-a2ed-199343dad0e2
[cloudera@quickstart ~]$ spark-submit --master local[3] --class
com.boa.poc.todelete --jars
/home/cloudera/lib/spark-streaming-kafka-assembly_2.10-1.6.2.jar,/home/cloudera/lib/spark-assembly-1.6.2-h
Hi Siva,
Does the topic have partitions? Which version of Spark are you using?
On Wed, Aug 10, 2016 at 2:38 AM, Sivakumaran S wrote:
> Hi,
>
> Here is a working example I did.
>
> HTH
>
> Regards,
>
> Sivakumaran S
>
> val topics = "test"
> val brokers = "localhost:9092"
> val topicsSet = topics.split(",").toSet
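For completeness, a sketch of how such a working example typically continues (the standard 0.8 direct-stream call; the broker and topic values come from the vals above, and what you do with the messages afterwards depends on your format):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicsSet)
messages.map(_._2).print()   // the message values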
It stops working at sqlContext.read.json(rdd.map(_._2)). Topics without
partitions are working fine. Do I need to set any other configs?
val kafkaParams = Map[String, String](
  "bootstrap.servers" -> "localhost:9093,localhost:9092",
  "group.id" -> "xyz",
  "auto.offset.reset" -> "smallest")
Spark version is 1
No, you don't need a conditional. read.json on an empty rdd will
return an empty dataframe. foreach on an empty dataframe or an empty
rdd won't do anything (a task will still get run, but it won't do
anything).
Leave the conditional out. Add one thing at a time to the working
rdd.foreach example
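A tiny sketch of what that ends up looking like (no emptiness check; messages is the direct stream from above, and the processing inside is illustrative):

import org.apache.spark.sql.SQLContext

messages.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  val df = sqlContext.read.json(rdd.map(_._2))  // an empty rdd just yields an empty DataFrame
  df.foreach(row => println(row))               // a no-op when the batch is empty
}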