How could one specify a Docker image for each job to be used by executors?

2016-12-05 Thread Enno Shioji
Hi, Suppose I have a job that uses some native libraries. I can launch executors using a Docker container and everything is fine. Now suppose I have some other job that uses some other native libraries (and let's assume they just can't co-exist in the same Docker image), but I want to execute
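One approach, assuming Spark on Mesos: the executor Docker image is a per-application setting, so each job can name its own image in its SparkConf. A minimal sketch (the image name is hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    // This job's executors run in the image that bundles its native libraries.
    val conf = new SparkConf()
      .setAppName("job-with-native-libs-a")
      .set("spark.mesos.executor.docker.image", "native-libs-a:latest")
    val sc = new SparkContext(conf)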

Re: Twitter live Streaming

2015-08-04 Thread Enno Shioji
If you want to do it through the streaming API, you have to pay Gnip; it's not free. You can go through the non-streaming Twitter API and convert it to a stream yourself, though. On 4 Aug 2015, at 09:29, Sadaf sa...@platalytics.com wrote: Hi, Is there any way to get all old tweets since when the
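A minimal sketch of polling the REST search API and turning the results into a stream, assuming Twitter4J with OAuth credentials configured via twitter4j.properties (the search term and sleep interval are placeholders):

    import scala.collection.JavaConverters._
    import twitter4j.{Query, TwitterFactory}

    val twitter = new TwitterFactory().getInstance()
    var sinceId = 1L
    while (true) {
      val query = new Query("bloomberg")
      query.setSinceId(sinceId) // only fetch tweets we haven't seen yet
      val result = twitter.search(query)
      for (status <- result.getTweets.asScala) {
        sinceId = math.max(sinceId, status.getId)
        println(status.getText) // hand each tweet off to your stream here
      }
      Thread.sleep(10000) // stay within the REST API rate limits
    }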

Re: Twitter streaming with apache spark stream only a small amount of tweets

2015-07-29 Thread Enno Shioji
If you start parallel Twitter streams, you will be in breach of their TOS. They allow a small number of parallel streams in practice, but if you do it on a massive scale they'll ban you (I'm speaking from experience ;) ). If you really need that level of data, you need to talk to a company called

Re: Twitter4J streaming question

2015-07-23 Thread Enno Shioji
You are probably listening to the sample stream, and THEN filtering. This means you listen to 1% of the Twitter stream and then look for the tweet by Bloomberg, so there is a very good chance you won't see that particular tweet. In order to get all Bloomberg-related tweets, you must connect to
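A minimal Twitter4J sketch of the difference, assuming OAuth credentials in twitter4j.properties: connect with filter() so the track terms are matched against the full stream, instead of sampling first:

    import twitter4j.{FilterQuery, Status, StatusAdapter, TwitterStreamFactory}

    val stream = new TwitterStreamFactory().getInstance()
    stream.addListener(new StatusAdapter {
      override def onStatus(status: Status): Unit = println(status.getText)
    })
    // sample() returns a ~1% random sample; filter() tracks the given
    // keywords across everything instead.
    stream.filter(new FilterQuery().track("bloomberg"))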

Re: Twitter4J streaming question

2015-07-23 Thread Enno Shioji
, 2015, at 4:17 PM, Enno Shioji eshi...@gmail.com wrote: You are probably listening to the sample stream, and THEN filtering. This means you listen to 1% of the Twitter stream and then look for the tweet by Bloomberg, so there is a very good chance you won't see that particular tweet

Re: RE: Spark or Storm

2015-06-19 Thread Enno Shioji
June 2015 03:57 To: Enno Shioji Cc: Will Briggs; asoni.le...@gmail.com; ayan guha; user; Sateesh Kavuri; Spark Enthusiast; Sabarish Sasidharan Subject: Re: Spark or Storm not being able to read from Kafka using multiple nodes Kafka is plenty capable of doing this, by clustering

Re: RE: Spark or Storm

2015-06-19 Thread Enno Shioji
State guide - https://storm.apache.org/documentation/Trident-state In the end, I am totally open to suggestions and PRs on how to make the programming guide easier to understand. :) TD On Thu, Jun 18, 2015 at 11:47 PM, Enno Shioji eshi...@gmail.com wrote: Tbh I find the doc around

Re: Spark or Storm

2015-06-17 Thread Enno Shioji
internal state (e.g. reduceByKeyAndWindow) are exactly-once too. Matei On Jun 17, 2015, at 8:26 AM, Enno Shioji eshi...@gmail.com wrote: The thing is, even with that improvement, you still have to make updates idempotent or transactional yourself. If you read http://spark.apache.org/docs/latest

Re: Spark or Storm

2015-06-17 Thread Enno Shioji
We've evaluated Spark Streaming vs. Storm and ended up sticking with Storm. Some of the important drawbacks are: Spark has no back pressure (the receiver rate limit can alleviate this to a certain point, but it's far from ideal). There are also no exactly-once semantics. (updateStateByKey can achieve
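For reference, the receiver rate limit mentioned above is a single configuration setting; a sketch (the rate is an arbitrary example):

    import org.apache.spark.SparkConf

    // Caps each receiver at N records per second. This bounds ingestion,
    // but unlike true back pressure it does not adapt to downstream load.
    val conf = new SparkConf()
      .set("spark.streaming.receiver.maxRate", "10000")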

Re: Spark or Storm

2015-06-17 Thread Enno Shioji
? On Wednesday, 17 June 2015 11:57 AM, Enno Shioji eshi...@gmail.com wrote: We've evaluated Spark Streaming vs. Storm and ended up sticking with Storm. Some of the important drawbacks are: Spark has no back pressure (the receiver rate limit can alleviate this to a certain point, but it's far

Re: Spark or Storm

2015-06-17 Thread Enno Shioji
) so that I know how much I have accumulated at any given point, as events for the same phone can go to any node / executor. Can someone please tell me how I can achieve this in Spark, as in Storm I can have a bolt which can do this? Thanks, On Wed, Jun 17, 2015 at 4:52 AM, Enno Shioji eshi

Re: Spark or Storm

2015-06-17 Thread Enno Shioji
accumulated at any given point, as events for the same phone can go to any node / executor. Can someone please tell me how I can achieve this in Spark, as in Storm I can have a bolt which can do this? Thanks, On Wed, Jun 17, 2015 at 4:52 AM, Enno Shioji eshi...@gmail.com wrote: I guess both
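A minimal sketch of this kind of per-key accumulation, assuming a StreamingContext ssc and a hypothetical DStream[(String, Long)] of (phone, usage) pairs called usageStream; updateStateByKey shuffles by key, so all events for the same phone meet in one place regardless of which node received them:

    // Stateful operations require a checkpoint directory.
    ssc.checkpoint("/tmp/checkpoint")

    // Running total per phone number, updated every batch.
    val totals = usageStream.updateStateByKey[Long] {
      (newValues: Seq[Long], state: Option[Long]) =>
        Some(state.getOrElse(0L) + newValues.sum)
    }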

Re: Spark or Storm

2015-06-17 Thread Enno Shioji
, Enno Shioji eshi...@gmail.com wrote: So Spark (not Streaming) does offer exactly once. Spark Streaming, however, can only do exactly-once semantics *if the update operation is idempotent*. updateStateByKey's update operation is idempotent, because it completely replaces the previous state. So

Re: Spark or Storm

2015-06-17 Thread Enno Shioji
:09 PM, Ashish Soni asoni.le...@gmail.com wrote: A stream can also be processed in micro-batches, which is the main reason behind Spark Streaming, so what is the difference? Ashish On Wed, Jun 17, 2015 at 9:04 AM, Enno Shioji eshi...@gmail.com wrote: PS just to elaborate on my first

Re: Spark or Storm

2015-06-17 Thread Enno Shioji
17, 2015 at 1:50 PM, Ashish Soni asoni.le...@gmail.com wrote: As per my best understanding, Spark Streaming offers exactly-once processing; is this achieved only through updateStateByKey, or is there another way to do the same? Ashish On Wed, Jun 17, 2015 at 8:48 AM, Enno Shioji eshi

Re: Spark or Storm

2015-06-17 Thread Enno Shioji
or Storm for fault tolerance and load balancing. Is it a correct approach? On 17 Jun 2015 23:07, Enno Shioji eshi...@gmail.com wrote: Hi Ayan, Admittedly I haven't done much with Kinesis, but if I'm not mistaken you should be able to use their processor interface for that. In this example, it's
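A hedged sketch of that processor interface (KCL 1.x; package names from memory, so verify them against your KCL version). The KCL Worker assigns shards to processors and reassigns them when a worker dies, which is where the fault tolerance and load balancing come from:

    import java.util.{List => JList}
    import com.amazonaws.services.kinesis.clientlibrary.interfaces.{IRecordProcessor, IRecordProcessorCheckpointer}
    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownReason
    import com.amazonaws.services.kinesis.model.Record

    class CountingProcessor extends IRecordProcessor {
      private var count = 0L

      override def initialize(shardId: String): Unit = ()

      override def processRecords(records: JList[Record],
                                  checkpointer: IRecordProcessorCheckpointer): Unit = {
        count += records.size
        checkpointer.checkpoint() // a replacement worker resumes from here
      }

      override def shutdown(checkpointer: IRecordProcessorCheckpointer,
                            reason: ShutdownReason): Unit = {
        if (reason == ShutdownReason.TERMINATE) checkpointer.checkpoint()
      }
    }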

Re: Spark or Storm

2015-06-17 Thread Enno Shioji
://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html On Wed, Jun 17, 2015 at 10:33 AM, Enno Shioji eshi...@gmail.com wrote: AFAIK KCL is *supposed* to provide fault tolerance and load balancing (plus additionally, elastic scaling unlike Storm), Kinesis

Is a higher-res or vector version of Spark logo available?

2015-04-23 Thread Enno Shioji
My employer (adform.com) would like to use the Spark logo in a recruitment event (to indicate that we are using Spark in our company). I looked in the Spark repo (https://github.com/apache/spark/tree/master/docs/img) but couldn't find a vector format. Is a higher-res or vector format version

Re: SparkStreaming Low Performance

2015-02-14 Thread Enno Shioji
lazy val mapper = new ObjectMapper() with ScalaObjectMapper mapper.registerModule(DefaultScalaModule) } val jsonStream = myDStream.map(x => { Holder.mapper.readValue[Map[String,Any]](x) }) Thanks Best Regards On Sat, Feb 14, 2015 at 7:32 PM, Enno Shioji eshi
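Reassembled from the snippet, the full pattern looks roughly like this (the ScalaObjectMapper import moved between jackson-module-scala versions, so adjust it to yours). The @transient lazy val makes each executor build its own mapper rather than serializing one from the driver, which is what avoids the NotSerializableException below:

    import com.fasterxml.jackson.databind.ObjectMapper
    import com.fasterxml.jackson.module.scala.DefaultScalaModule
    import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper

    object Holder extends Serializable {
      @transient lazy val mapper = {
        val m = new ObjectMapper() with ScalaObjectMapper
        m.registerModule(DefaultScalaModule)
        m
      }
    }

    val jsonStream = myDStream.map { x =>
      Holder.mapper.readValue[Map[String, Any]](x)
    }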

Re: SparkStreaming Low Performance

2015-02-14 Thread Enno Shioji
know if I use Spark SQL with SchemaRDD and all it will be much faster, but I need that in Spark Streaming. Thanks Best Regards On Sat, Feb 14, 2015 at 8:04 PM, Enno Shioji eshi...@gmail.com wrote: I see. I'd really benchmark how the parsing performs outside Spark (in a tight loop or something

Re: SparkStreaming Low Performance

2015-02-14 Thread Enno Shioji
the ObjectMapper; if I take it outside of my map operation, then it throws serialization exceptions (Caused by: java.io.NotSerializableException: com.fasterxml.jackson.module.scala.modifiers.SetTypeModifier). Thanks Best Regards On Sat, Feb 14, 2015 at 7:11 PM, Enno Shioji eshi...@gmail.com wrote: If I

Re: Profiling in YourKit

2015-02-07 Thread Enno Shioji
1 You have 4 CPU cores and 34 threads (system wide, you likely have many more, by the way). Think of it as having 4 espresso machines and 34 baristas. Does the fact that you have only 4 espresso machines mean you can only have 4 baristas? Of course not, there's plenty more work other than making

Re: Problems with Spark Core 1.2.0 SBT project in IntelliJ

2015-01-13 Thread Enno Shioji
. Not using gen-idea in sbt. On Wed, Jan 14, 2015 at 8:52 AM, Jay Vyas jayunit100.apa...@gmail.com wrote: I find importing a working SBT project into IntelliJ is the way to go. How did you load the project into IntelliJ? On Jan 13, 2015, at 4:45 PM, Enno Shioji eshi...@gmail.com wrote

Re: Problems with Spark Core 1.2.0 SBT project in IntelliJ

2015-01-13 Thread Enno Shioji
Had the same issue. I can't remember what the issue was but this works: libraryDependencies ++= { val sparkVersion = "1.2.0" Seq( "org.apache.spark" %% "spark-core" % sparkVersion % "provided", "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided", "org.apache.spark" %%
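A complete minimal build.sbt along those lines (the Scala version is an assumption for the Spark 1.2.x era). Marking the Spark artifacts as provided keeps them out of the assembly jar, since the cluster supplies them at runtime:

    scalaVersion := "2.10.4"

    libraryDependencies ++= {
      val sparkVersion = "1.2.0"
      Seq(
        "org.apache.spark" %% "spark-core"      % sparkVersion % "provided",
        "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided"
      )
    }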

Re: Registering custom metrics

2015-01-08 Thread Enno Shioji
/9b94736c2bad2f4b8e23 On Mon, Jan 5, 2015 at 2:56 PM, Enno Shioji eshi...@gmail.com wrote: Hi Gerard, Thanks for the answer! I had a good look at it, but I couldn't figure out whether one can use that to emit metrics from your application code. Suppose I wanted to monitor the rate of bytes I produce, like so

TestSuiteBase based unit test using a sliding window join timesout

2015-01-07 Thread Enno Shioji
Hi, I extended org.apache.spark.streaming.TestSuiteBase for some testing, and I was able to run this test fine: test("Sliding window join with 3 second window duration") { val input1 = Seq( Seq("req1"), Seq("req2", "req3"), Seq(), Seq("req4", "req5", "req6"), Seq("req7"),

Re: Better way of measuring custom application metrics

2015-01-04 Thread Enno Shioji
with the class name to let it be loaded by the metrics system; for the details you can refer to http://spark.apache.org/docs/latest/monitoring.html or the source code. Thanks Jerry From: Enno Shioji [mailto:eshi...@gmail.com] Sent: Sunday, January 4, 2015 7:47 AM To: user@spark.apache.org Subject

Better way of measuring custom application metrics

2015-01-03 Thread Enno Shioji
I have a hack to gather custom application metrics in a Streaming job, but I wanted to know if there is any better way of doing this. My hack consists of this singleton: object Metriker extends Serializable { @transient lazy val mr: MetricRegistry = { val metricRegistry = new
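The snippet is cut off; a sketch of how such a singleton plausibly continues, using the Codahale MetricRegistry named above with an Slf4jReporter standing in for whatever reporter the original hack used:

    import java.util.concurrent.TimeUnit
    import com.codahale.metrics.{MetricRegistry, Slf4jReporter}

    object Metriker extends Serializable {
      @transient lazy val mr: MetricRegistry = {
        val metricRegistry = new MetricRegistry()
        // Reporter choice is a placeholder; the original's is not shown.
        Slf4jReporter.forRegistry(metricRegistry).build().start(1, TimeUnit.MINUTES)
        metricRegistry
      }
    }

    // Hypothetical usage inside a transformation:
    // rdd.map { b => Metriker.mr.meter("bytesProduced").mark(b.length); b }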

Re: Big performance difference between client and cluster deployment mode; is this expected?

2014-12-31 Thread Enno Shioji
Also the job was deployed from the master machine in the cluster. On Wed, Dec 31, 2014 at 6:35 PM, Enno Shioji eshi...@gmail.com wrote: Oh sorry, that was an edit mistake. The code is essentially: val msgStream = kafkaStream .map { case (k, v) => v } .map

Re: Big performance difference between client and cluster deployment mode; is this expected?

2014-12-31 Thread Enno Shioji
a machine that's not in your Spark cluster? Then in client mode you're shipping data back to a less-nearby machine compared with cluster mode. That could explain the bottleneck. On Wed, Dec 31, 2014 at 4:12 PM, Enno Shioji eshi...@gmail.com wrote: Hi, I have a very, very simple streaming

Re: Big performance difference between client and cluster deployment mode; is this expected?

2014-12-31 Thread Enno Shioji
and cluster)? Accordingly, what is the number of executors/cores requested? TD On Wed, Dec 31, 2014 at 10:36 AM, Enno Shioji eshi...@gmail.com wrote: Also the job was deployed from the master machine in the cluster. On Wed, Dec 31, 2014 at 6:35 PM, Enno Shioji eshi...@gmail.com wrote: Oh sorry

Writing and reading sequence file results in trailing extra data

2014-12-30 Thread Enno Shioji
Hi, I'm facing a weird issue; any help appreciated. When I execute the code below and compare the input and output, each record in the output has some extra trailing data appended to it and is hence corrupted. I'm just reading and writing, so the input and output should be exactly the same. I'm using

[SOLVED] Re: Writing and reading sequence file results in trailing extra data

2014-12-30 Thread Enno Shioji
This poor soul had the exact same problem and solution: http://stackoverflow.com/questions/24083332/write-and-read-raw-byte-arrays-in-spark-using-sequence-file-sequencefile On Tue, Dec 30, 2014 at 10:58 AM, Enno Shioji eshi...@gmail.com wrote: Hi, I'm facing a weird issue. Any help
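In short, the fix described there: Hadoop reuses Writable instances, and BytesWritable.getBytes returns the whole backing buffer, which can be longer than the record, so copy exactly getLength bytes. A sketch (the path is hypothetical):

    import java.util.Arrays
    import org.apache.hadoop.io.BytesWritable

    val records = sc
      .sequenceFile("/tmp/input", classOf[BytesWritable], classOf[BytesWritable])
      .map { case (_, v) => Arrays.copyOfRange(v.getBytes, 0, v.getLength) }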

ReceiverInputDStream#saveAsTextFiles with an S3 URL results in double forward slash key names in S3

2014-12-23 Thread Enno Shioji
Is anybody else experiencing this? It looks like a bug in JetS3t to me, but I thought I'd sanity-check before filing an issue. I'm writing to S3 using ReceiverInputDStream#saveAsTextFiles with an S3 URL (s3://fake-test/1234). The code does write to S3, but with double forward slashes

Re: ReceiverInputDStream#saveAsTextFiles with an S3 URL results in double forward slash key names in S3

2014-12-23 Thread Enno Shioji
I filed a new issue, HADOOP-11444. According to HADOOP-10372, s3 is likely to be deprecated anyway in favor of s3n. The comment section also notes that Amazon has implemented an EmrFileSystem for S3, which is built using the AWS SDK rather than JetS3t. On Tue, Dec 23, 2014 at 2:06 PM, Enno Shioji