Hi,
Suppose I have a job that uses some native libraries. I can launch
executors using a Docker container and everything is fine.
Now suppose I have some other job that uses some other native libraries
(and let's assume they just can't co-exist in the same docker image), but I
want to execute
If you want to do it through the streaming API you have to pay Gnip; it's not
free. You can go through the non-streaming Twitter API and convert it to a
stream yourself, though.
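A rough sketch of that polling approach with Twitter4J (the query term is hypothetical, and OAuth credentials are assumed to be configured in twitter4j.properties):

  import scala.collection.JavaConverters._
  import twitter4j.{Query, TwitterFactory}

  val twitter = TwitterFactory.getSingleton
  var query: Query = new Query("bloomberg")
  while (query != null) {
    val result = twitter.search(query)   // REST search endpoint, not streaming
    result.getTweets.asScala.foreach(t => println(t.getText))
    query = result.nextQuery()           // null once the pages run out
  }

Bear in mind the REST search API is itself rate-limited, so the "stream" you get this way is best-effort, not complete.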
On 4 Aug 2015, at 09:29, Sadaf sa...@platalytics.com wrote:
Hi
Is there any way to get all old tweets since when the
If you start parallel Twitter streams, you will be in breach of their TOS.
They allow a small number of parallel streams in practice, but if you do it
on a massive scale they'll ban you (I'm speaking from experience ;) ).
If you really need that level of data, you need to talk to a company called
You are probably listening to the sample stream, and THEN filtering. This
means you listen to 1% of the Twitter stream and then look for the
tweet by Bloomberg, so there is a very good chance you don't see the
particular tweet.
In order to get all Bloomberg related tweets, you must connect to
, 2015, at 4:17 PM, Enno Shioji eshi...@gmail.com wrote:
You are probably listening to the sample stream, and THEN filtering.
This means you listen to 1% of the Twitter stream and then look for the
tweet by Bloomberg, so there is a very good chance you don't see the
particular tweet
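A minimal sketch of connecting to the filter endpoint instead, so Twitter does the matching server-side (spark-streaming-twitter module; ssc is an existing StreamingContext, and OAuth is assumed to be set via twitter4j system properties):

  import org.apache.spark.streaming.twitter.TwitterUtils

  // passing keywords makes Spark open the *filter* stream, not the 1% sample
  val tweets = TwitterUtils.createStream(ssc, None, Seq("bloomberg"))
  tweets.map(_.getText).print()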
June 2015 03:57
*To:* Enno Shioji
*Cc:* Will Briggs; asoni.le...@gmail.com; ayan guha; user; Sateesh
Kavuri; Spark Enthusiast; Sabarish Sasidharan
*Subject:* Re: Spark or Storm
not being able to read from Kafka using multiple nodes
Kafka is plenty capable of doing this, by clustering
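As a sketch of that parallel consumption (Spark 1.3+ direct Kafka stream; the broker list and topic name are hypothetical, ssc is an existing StreamingContext):

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.KafkaUtils

  // one Spark partition per Kafka partition, consumed in parallel
  // across the executors
  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("events"))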
State guide -
https://storm.apache.org/documentation/Trident-state
In the end, I am totally open to suggestions and PRs on how to make the
programming guide easier to understand. :)
TD
On Thu, Jun 18, 2015 at 11:47 PM, Enno Shioji eshi...@gmail.com wrote:
Tbh I find the doc around
internal state (e.g. reduceByKeyAndWindow)
are exactly-once too.
Matei
On Jun 17, 2015, at 8:26 AM, Enno Shioji eshi...@gmail.com wrote:
The thing is, even with that improvement, you still have to make updates
idempotent or transactional yourself. If you read
http://spark.apache.org/docs/latest
We've evaluated Spark Streaming vs. Storm and ended up sticking with Storm.
Some of the important drawbacks are:
Spark has no back pressure (receiver rate limit can alleviate this to a
certain point, but it's far from ideal)
There is also no exactly-once semantics. (updateStateByKey can achieve
?
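For reference, the receiver rate limit mentioned above is just a configuration knob (a sketch; the value is arbitrary):

  import org.apache.spark.SparkConf

  // cap each receiver at ~10k records/sec; crude, but prevents a flood
  val conf = new SparkConf().set("spark.streaming.receiver.maxRate", "10000")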
On Wednesday, 17 June 2015 11:57 AM, Enno Shioji eshi...@gmail.com
wrote:
We've evaluated Spark Streaming vs. Storm and ended up sticking with Storm.
Some of the important drawbacks are:
Spark has no back pressure (receiver rate limit can alleviate this to a
certain point, but it's far
) so that I know how much I accumulated at any given point, as events
for the same phone can go to any node / executor.
Can someone please tell me how I can achieve this in Spark? In Storm I
can have a bolt which can do this.
Thanks,
On Wed, Jun 17, 2015 at 4:52 AM, Enno Shioji eshi
accumulated at any given point, as events
for the same phone can go to any node / executor.
Can someone please tell me how I can achieve this in Spark? In Storm I
can have a bolt which can do this.
Thanks,
On Wed, Jun 17, 2015 at 4:52 AM, Enno Shioji eshi...@gmail.com wrote:
I guess both
, Enno Shioji eshi...@gmail.com wrote:
So Spark (not streaming) does offer exactly once. Spark Streaming, however,
can only do exactly-once semantics *if the update operation is idempotent*.
updateStateByKey's update operation is idempotent, because it completely
replaces the previous state.
So
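A hedged sketch of that replace-the-state pattern (assuming a words: DStream[String] and a checkpoint directory set via ssc.checkpoint; the counting logic is illustrative):

  // running count per key; the Option we return *replaces* the stored state,
  // so replaying a failed batch converges to the same value
  val counts = words.map((_, 1L)).updateStateByKey[Long] {
    (batchValues: Seq[Long], state: Option[Long]) =>
      Some(state.getOrElse(0L) + batchValues.sum)
  }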
:09 PM, Ashish Soni asoni.le...@gmail.com wrote:
A stream can also be processed in micro-batches, which is the main idea
behind Spark Streaming, so what is the difference?
Ashish
On Wed, Jun 17, 2015 at 9:04 AM, Enno Shioji eshi...@gmail.com wrote:
PS just to elaborate on my first
17, 2015 at 1:50 PM, Ashish Soni asoni.le...@gmail.com wrote:
As per my best understanding, Spark Streaming offers exactly-once processing.
Is this achieved only through updateStateByKey, or is there another way to
do the same?
Ashish
On Wed, Jun 17, 2015 at 8:48 AM, Enno Shioji eshi
or Storm for fault
tolerance and load balancing. Is that a correct approach?
On 17 Jun 2015 23:07, Enno Shioji eshi...@gmail.com wrote:
Hi Ayan,
Admittedly I haven't done much with Kinesis, but if I'm not mistaken you
should be able to use their processor interface for that. In this
example, it's
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
On Wed, Jun 17, 2015 at 10:33 AM, Enno Shioji eshi...@gmail.com wrote:
AFAIK KCL is *supposed* to provide fault tolerance and load balancing
(plus additionally, elastic scaling unlike Storm), Kinesis
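For context, wiring Kinesis into Spark Streaming looks roughly like this (Spark 1.x API; the stream name and endpoint are hypothetical):

  import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.Seconds
  import org.apache.spark.streaming.kinesis.KinesisUtils

  // the KCL underneath handles shard leases, checkpoints and rebalancing
  val kinesis = KinesisUtils.createStream(
    ssc, "myStream", "https://kinesis.us-east-1.amazonaws.com",
    Seconds(10), InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2)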
My employer (adform.com) would like to use the Spark logo in a recruitment
event (to indicate that we are using Spark in our company). I looked in the
Spark repo (https://github.com/apache/spark/tree/master/docs/img) but
couldn't find a vector format.
Is a higher-res or vector format version
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper

// the singleton keeps the non-serializable mapper out of the closure
object Holder extends Serializable {
  lazy val mapper = new ObjectMapper() with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)
}

val jsonStream = myDStream.map(x => {
  Holder.mapper.readValue[Map[String, Any]](x)
})
Thanks
Best Regards
On Sat, Feb 14, 2015 at 7:32 PM, Enno Shioji eshi
know if I use Spark SQL with SchemaRDD
and all it will be much faster, but I need that in Spark Streaming.
Thanks
Best Regards
On Sat, Feb 14, 2015 at 8:04 PM, Enno Shioji eshi...@gmail.com wrote:
I see. I'd really benchmark how the parsing performs outside Spark (in a
tight loop or something
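e.g. a crude tight loop like this (sampleJson being any representative payload, Holder the mapper singleton from the earlier snippet):

  val t0 = System.nanoTime()
  (1 to 100000).foreach { _ =>
    Holder.mapper.readValue[Map[String, Any]](sampleJson)
  }
  println(s"parsed 100000 docs in ${(System.nanoTime() - t0) / 1e6} ms")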
the ObjectMapper; if I take it outside of my map
operation then it throws serialization exceptions (Caused by:
java.io.NotSerializableException:
com.fasterxml.jackson.module.scala.modifiers.SetTypeModifier).
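A common workaround, sketched here as an instance field rather than a singleton (same Jackson imports as in the Holder snippet above):

  // @transient: skip the field when the closure is serialized;
  // lazy: rebuild it on first use on each executor
  class JsonParser extends Serializable {
    @transient lazy val mapper: ObjectMapper with ScalaObjectMapper = {
      val m = new ObjectMapper() with ScalaObjectMapper
      m.registerModule(DefaultScalaModule)
      m
    }
    def parse(s: String): Map[String, Any] = mapper.readValue[Map[String, Any]](s)
  }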
Thanks
Best Regards
On Sat, Feb 14, 2015 at 7:11 PM, Enno Shioji eshi...@gmail.com wrote:
If I
1
You have 4 CPU cores and 34 threads (system wide, you likely have many more,
by the way).
Think of it as having 4 espresso machines and 34 baristas. Does the fact
that you have only 4 espresso machines mean you can only have 4 baristas? Of
course not, there's plenty more work other than making
. Not using gen-idea in sbt.
On Wed, Jan 14, 2015 at 8:52 AM, Jay Vyas jayunit100.apa...@gmail.com
wrote:
I find importing a working SBT project into IntelliJ is the way to
go.
How did you load the project into IntelliJ?
On Jan 13, 2015, at 4:45 PM, Enno Shioji eshi...@gmail.com wrote
Had the same issue. I can't remember what the issue was but this works:
libraryDependencies ++= {
  val sparkVersion = "1.2.0"
  Seq(
    "org.apache.spark" %% "spark-core"      % sparkVersion % "provided",
    "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
    "org.apache.spark" %%
/9b94736c2bad2f4b8e23
On Mon, Jan 5, 2015 at 2:56 PM, Enno Shioji eshi...@gmail.com wrote:
Hi Gerard,
Thanks for the answer! I had a good look at it, but I couldn't figure out
whether one can use it to emit metrics from application code.
Suppose I wanted to monitor the rate of bytes I produce, like so
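A hedged sketch of such instrumentation (assuming a DStream[String] named myDStream and the Metriker registry singleton that appears further down):

  myDStream.foreachRDD { rdd =>
    rdd.foreach { msg =>
      // MetricRegistry#meter is get-or-create, so this resolves a
      // per-executor Meter instead of shipping one in the closure
      Metriker.mr.meter("bytes-produced").mark(msg.getBytes("UTF-8").length)
    }
  }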
Hi,
I extended org.apache.spark.streaming.TestSuiteBase for some testing, and I
was able to run this test fine:
test("Sliding window join with 3 second window duration") {
  val input1 =
    Seq(
      Seq("req1"),
      Seq("req2", "req3"),
      Seq(),
      Seq("req4", "req5", "req6"),
      Seq("req7"),
with the class name so it can be
loaded by the metrics system; for the details you can refer to
http://spark.apache.org/docs/latest/monitoring.html or the source code.
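For example (a sketch along the lines of conf/metrics.properties.template):

  # attach the built-in JVM source to every executor
  executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
  # dump all metrics to CSV files every 10 seconds
  *.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
  *.sink.csv.period=10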
Thanks
Jerry
*From:* Enno Shioji [mailto:eshi...@gmail.com]
*Sent:* Sunday, January 4, 2015 7:47 AM
*To:* user@spark.apache.org
*Subject
I have a hack to gather custom application metrics in a Streaming job, but
I wanted to know if there is any better way of doing this.
My hack consists of this singleton:
object Metriker extends Serializable {
  @transient lazy val mr: MetricRegistry = {
    val metricRegistry = new
Also the job was deployed from the master machine in the cluster.
On Wed, Dec 31, 2014 at 6:35 PM, Enno Shioji eshi...@gmail.com wrote:
Oh sorry, that was an edit mistake. The code is essentially:
val msgStream = kafkaStream
  .map { case (k, v) => v }
  .map
a machine that's not in
your Spark cluster? Then in client mode you're shipping data back to a
less-nearby machine, compared to cluster mode. That could explain
the bottleneck.
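For reference, the deploy mode is chosen at submit time (hypothetical master URL and jar):

  spark-submit --master spark://master:7077 --deploy-mode cluster your-app.jar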
On Wed, Dec 31, 2014 at 4:12 PM, Enno Shioji eshi...@gmail.com wrote:
Hi,
I have a very, very simple streaming
and cluster)? Accordingly what is the number of
executors/cores requested?
TD
On Wed, Dec 31, 2014 at 10:36 AM, Enno Shioji eshi...@gmail.com wrote:
Also the job was deployed from the master machine in the cluster.
On Wed, Dec 31, 2014 at 6:35 PM, Enno Shioji eshi...@gmail.com wrote:
Oh sorry
Hi, I'm facing a weird issue. Any help appreciated.
When I execute the below code and compare input and output, each record
in the output has some extra trailing data appended to it and is hence
corrupted. I'm just reading and writing, so the input and output should be
exactly the same.
I'm using
This poor soul had the exact same problem and solution:
http://stackoverflow.com/questions/24083332/write-and-read-raw-byte-arrays-in-spark-using-sequence-file-sequencefile
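The gist of the fix there, as a sketch (sc and path are hypothetical):

  import org.apache.hadoop.io.{BytesWritable, NullWritable}

  // BytesWritable's backing array is padded beyond getLength, which is
  // exactly the trailing garbage; copy only the valid prefix
  val bytes = sc.sequenceFile(path, classOf[NullWritable], classOf[BytesWritable])
    .map { case (_, bw) => java.util.Arrays.copyOfRange(bw.getBytes, 0, bw.getLength) }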
On Tue, Dec 30, 2014 at 10:58 AM, Enno Shioji eshi...@gmail.com wrote:
Hi, I'm facing a weird issue. Any help
Is anybody experiencing this? It looks like a bug in JetS3t to me, but
thought I'd sanity check before filing an issue.
I'm writing to S3 using ReceiverInputDStream#saveAsTextFiles with an S3 URL
(s3://fake-test/1234).
The code does write to S3, but with double forward slashes
I filed a new issue HADOOP-11444. According to HADOOP-10372, s3 is likely
to be deprecated anyway in favor of s3n.
Also the comment section notes that Amazon has implemented an EmrFileSystem
for S3 which is built using AWS SDK rather than JetS3t.
On Tue, Dec 23, 2014 at 2:06 PM, Enno Shioji