+1 to what Michael says
Drill is using logback and there has been zero pressure to move to another
framework.
On Mon, Nov 11, 2013 at 10:15 PM, Michael Rose wrote:
> I believe when Jon says "log4j" he refers to log4j2. Log4j2 is yet another
> successor to log4j, which claims to solve issues i
Slope one is one of the few algorithms that performs so uniformly poorly
that it is being removed from Mahout. I wouldn't recommend it for any
applications.
In general, recommendation isn't particularly well suited for on-line
operation since there is part of the computation that substantially
be
You can even add many forms of business logic, such as geographic
constraints, into the same query.
>
>
>
> 2014/1/11 Ted Dunning
>
>>
>> ...
>> Here are the links:
>>
>> [1] http://research.microsoft.com/apps/pubs/default.aspx?id=122779
>
On Sat, Jan 11, 2014 at 1:31 PM, Klausen Schaefersinho <
klaus.schaef...@gmail.com> wrote:
> @Ted: Thanks for your great response. Just one little question. With
>
> > cooccurrence analysis and is focused on sparsification of the
> cooccurrence matrix to produce an indicator matrix
>
> you mean t
Consider also whether you even *want* to pass large objects through your
tuples. If this will cause many copies of the object with no modification
or reference, you might be much better off leaving your object in a static
cache and simply passing around an ID. There are many heuristics for
managi
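The cache-plus-ID pattern above can be sketched in a few lines (all names here are hypothetical illustrations, not Storm APIs; in Storm the lookup would live inside a bolt's execute method, and this only works when every consumer can reach the same cache):

```python
# Sketch of passing IDs through tuples instead of large objects.
# LARGE_OBJECT_CACHE and the field names are hypothetical.

LARGE_OBJECT_CACHE = {}  # static cache shared by all processing steps


def register(obj_id, obj):
    """Put a large object in the cache once; tuples then carry only obj_id."""
    LARGE_OBJECT_CACHE[obj_id] = obj


def process(tuple_fields):
    """A downstream step resolves the ID only when it needs the object."""
    obj = LARGE_OBJECT_CACHE[tuple_fields["obj_id"]]
    return len(obj["payload"])


register("doc-42", {"payload": "x" * 1_000_000})  # ~1 MB object cached once
result = process({"obj_id": "doc-42"})  # the tuple carried a few bytes, not 1 MB
```

The trade-off is visibility: the topology can no longer rebalance or serialize the object for you, so this fits read-mostly data that every worker can load or replicate.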
That is a fine idea if the information is not expected to change very often.
Zookeeper will not keep up with even a small Storm cluster that is trying
to change things at a high rate.
You have to multiply the change rate by the number of clients listening for
changes. If this product exceeds sev
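The back-of-envelope product above is easy to write down; note the budget figure below is a made-up illustration, not a measured ZooKeeper limit:

```python
# Watch-notification load on ZooKeeper grows as change rate times listeners.
# BUDGET is a hypothetical sustainable rate for illustration only.

def notification_load(changes_per_sec, listeners):
    """Every change fans out to every client watching for it."""
    return changes_per_sec * listeners


BUDGET = 10_000  # hypothetical notifications/sec a small ensemble might sustain

load = notification_load(changes_per_sec=50, listeners=500)
print(load, "notifications/sec;", "over budget" if load > BUDGET else "ok")
# 25000 notifications/sec; over budget
```

Even a modest 50 changes/sec becomes 25,000 notifications/sec with 500 listeners, which is why high-rate state belongs somewhere other than ZooKeeper.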
Top what?
Most frequent? Or the top 1% based on some score attached to the tuples?
The latter is trivial; the former less so.
If you have the score problem, you just need to use an approximate quantile
algorithm like t-digest to get a continuous estimate of the 99th percentile.
For the
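t-digest itself is more involved than fits in an email; as a stdlib-only stand-in, a reservoir sample gives a cruder but similarly continuous percentile estimate (t-digest is far more accurate in the tails, which is the whole point of using it):

```python
import random


class ReservoirQuantile:
    """Crude streaming quantile estimate via reservoir sampling.
    A simple stand-in for t-digest, not an implementation of it."""

    def __init__(self, capacity=1000, seed=0):
        self.capacity = capacity
        self.sample = []
        self.count = 0
        self.rng = random.Random(seed)

    def add(self, x):
        self.count += 1
        if len(self.sample) < self.capacity:
            self.sample.append(x)
        else:
            # Keep each seen value with probability capacity/count.
            j = self.rng.randrange(self.count)
            if j < self.capacity:
                self.sample[j] = x

    def quantile(self, q):
        s = sorted(self.sample)
        return s[min(int(q * len(s)), len(s) - 1)]


est = ReservoirQuantile()
for i in range(100_000):
    est.add(i % 1000)  # uniform values 0..999
p99 = est.quantile(0.99)  # roughly 990 for this data
```

Memory stays fixed at the reservoir size no matter how long the stream runs, which is the property you want for a continuous estimate inside a topology.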
On Tue, Jan 21, 2014 at 7:31 AM, wrote:
> You mentioned an approximate algorithm. That's great! I will check it out
> later. But is there a way to calculate it in a precise way?
If you want to select the 1% largest numbers, then you have a few choices.
If you have memory for the full set, you
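For the exact version with limited memory, a bounded min-heap keeps only the current top k (a standard technique; the function name here is just for illustration):

```python
import heapq


def top_fraction(stream, fraction=0.01, total=None):
    """Exact top `fraction` of values using O(k) memory.
    Requires knowing (or bounding) the total count up front."""
    k = max(1, int(total * fraction))
    heap = []  # min-heap holding the k largest values seen so far
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            heapq.heapreplace(heap, x)  # evict the smallest of the top k
    return sorted(heap, reverse=True)


top = top_fraction(range(10_000), fraction=0.01, total=10_000)
# top is the 100 largest values: 9999 down to 9900
```

If the total count is unknown, you either need the full set in memory (sort and slice) or you fall back to the approximate quantile approach.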
On Wed, Jan 29, 2014 at 8:24 AM, Info wrote:
> Our engineering team has been hardly working on implementing our initial
> product and now we are ready to launch and grow a new innovative BigData
> company that will make BigData application development easier and faster.
You should proofread your
Feedback from Drill is that the next version (4.x) of Netty works better
than the version Storm currently uses (3.x).
Drill also does more explicit memory management so this might be a red
herring.
On Tue, Mar 4, 2014 at 2:36 AM, Richards Peter wrote:
> Hi Drew,
>
> Good that you identified the root cause for
In what sense do you mean when you say that reads in ZK are eventually
consistent?
You may get a slightly old value, but you are guaranteed to see a
consistent history. That is, if a value has values (which include version
numbers) v_1 ... v_n, then if you see v_i, you will never see v_j where j < i.
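The guarantee described above can be expressed as a small client-side invariant (illustrative only; real ZooKeeper clients get this from the server session, and the version would come from the znode's Stat, not application code):

```python
class MonotonicReader:
    """Rejects reads that would move backwards in version history.
    A hypothetical sketch of ZooKeeper's single-client ordering guarantee."""

    def __init__(self):
        self.last_version = -1

    def observe(self, value, version):
        if version < self.last_version:
            raise AssertionError(
                f"history violation: saw v{version} after v{self.last_version}")
        self.last_version = version
        return value


r = MonotonicReader()
r.observe("a", 3)
r.observe("a", 3)   # re-reading the same (possibly stale) version is fine
r.observe("b", 7)   # moving forward is fine; moving back would raise
```

"Slightly old" reads are allowed; what is ruled out is ever observing the history out of order.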
bug if
>> ran for a week. Other correlated factors may include that the trident
>> topology has to occasionally fail batches, the zookeeper cluster has to be
>> under significant load from other applications beyond trident. I don't
>> know much detail unfortunately.
>
Spark streaming is a very different animal than Storm in that it does
micro-batching rather than true streaming.
This has positives and negatives. Average latency on record-by-record
processing will appear to be abysmal compared to Storm. Throughput could
well be much higher because of the inher
Anybody who has only ever paid $40K to IBM for anything deserves a
prize. That is just the entry fee.
On Mon, May 12, 2014 at 7:46 AM, Marc Vaillant wrote:
> To play devil's advocate, if you believe the stream performance gains,
> then the 40k will likely pay for itself in needing to de
Regardless of what IBM does, there is clearly a lot that Storm is not doing
for performance. This is largely by design for simplicity. For instance,
storm could use reflection and byte code engineering to merge bolts while
still allowing rearrangement and rebalancing. Likewise, when throughput
gets
On Sun, Jun 8, 2014 at 12:12 PM, joe roberts
wrote:
> Also, it seems Storm uses TCP via ZeroMQ by default - is that right? And
> if so, can it be switched to use UDP or UDT instead, perhaps by replacing
> ZeroMQ with Netty?
>
Why would you want that?
Why do you think that UDP is faster?
On Sun, Jun 8, 2014 at 6:27 PM, joe roberts
wrote:
> To make it faster!
>
>
> On 6/8/2014 8:27 PM, Ted Dunning wrote:
>
>
> On Sun, Jun 8, 2014 at 12:12 PM, joe roberts <
> carl.roberts.zap...@gmail.com> wrote:
>
>
eptable, and some cases where reliable messages are
> needed (UDT), so for my particular use-cases, it is. As I understand it,
> Netty offers UDP, UDT, and TCP classes, therefore, it provides what I need.
>
> On 6/8/2014 11:07 PM, Ted Dunning wrote:
>
>
> Why do you think t
aking measurements. JMH is your friend.
On Sun, Jun 8, 2014 at 9:11 PM, Ted Dunning wrote:
>
> If you read the replies on the SO question, you will find lots of people
> refuting the "UDP is faster" mantra.
>
> If you haven't already benchmarked Storm to determine th
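On the JVM, JMH is the right tool for this. The same measure-first point can be made with a quick loopback round-trip comparison (Python sockets here purely for illustration; the numbers vary wildly by OS and buffer settings, and say nothing about Storm itself):

```python
import socket
import threading
import time

N = 200  # ping-pong exchanges per transport


def udp_echo(sock, n):
    for _ in range(n):
        data, addr = sock.recvfrom(1024)
        sock.sendto(data, addr)


def tcp_echo(server, n):
    conn, _ = server.accept()
    with conn:
        for _ in range(n):
            conn.sendall(conn.recv(1024))


# UDP round trips
usock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
usock.bind(("127.0.0.1", 0))
threading.Thread(target=udp_echo, args=(usock, N), daemon=True).start()
uclient = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
start = time.perf_counter()
for _ in range(N):
    uclient.sendto(b"ping", ("127.0.0.1", usock.getsockname()[1]))
    uclient.recvfrom(1024)
udp_rtt = (time.perf_counter() - start) / N

# TCP round trips (Nagle disabled, as a messaging layer would)
tsock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tsock.bind(("127.0.0.1", 0))
tsock.listen(1)
threading.Thread(target=tcp_echo, args=(tsock, N), daemon=True).start()
tclient = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tclient.connect(("127.0.0.1", tsock.getsockname()[1]))
tclient.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
start = time.perf_counter()
for _ in range(N):
    tclient.sendall(b"ping")
    tclient.recv(1024)
tcp_rtt = (time.perf_counter() - start) / N

print(f"UDP RTT ~{udp_rtt * 1e6:.1f} us, TCP RTT ~{tcp_rtt * 1e6:.1f} us")
```

Run it before assuming either transport wins; on loopback the difference is often far smaller than folklore suggests, and neither number predicts Storm's behavior under real load.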
ly stateless
> set of servers is really the way to go.
>
> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
> mich...@fullcontact.com
>
>
> On Sun, Jun 8, 2014 at 9:07 PM, Ted Dunning wrote:
>
&
mitation - Actions
> Command Interface (Client to Server)
>
>
> - Emotable Actions (/dance)
> - Take blue.sword
> - Give blue.sword to joe
> - Object identifiers for nouns - if red.goblin near a player is ID 30232,
>   then the client sends: kill 30232
>
>
> - Voice -
+5 points for Number 10
On Mon, Jun 9, 2014 at 11:38 AM, P. Taylor Goetz wrote:
> This is a call to vote on selecting the winning Storm logo from the 3
> finalists.
>
> The three candidates are:
>
> * [No. 6 - Alec Bartos](
> http://storm.incubator.apache.org/2014/04/23/logo-abartos.html)
> * [No
I think that this vote is invalid. The points add up to more than 5.
One option is to scale all the points down by 5/8. A better option is for
Binh to vote again with a correct sum, say with 3 and 2 points.
On Mon, Jun 9, 2014 at 12:12 PM, Binh Nguyen Van wrote:
> #9 - 5 pts.
> #10 - 3 pts.
>
>
> On Mon, Jun
They are different.
Storm processes tuples individually, as they arrive. Spark streaming
requires micro-batching (though the batches may cover a very short time).
Spark streaming supports checkpointing of partial results in the stream at
the framework level; Storm says you should roll your own or use Trident.
App
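"Roll your own" for the one-minute order count mentioned below might look like this (a hypothetical hand-rolled helper, not a Storm or Trident API; a real bolt would also need to persist this state for fault tolerance):

```python
from collections import deque


class SlidingWindowCount:
    """Count events in the trailing `window` seconds - the kind of
    windowing state Storm leaves to the application to manage."""

    def __init__(self, window=60.0):
        self.window = window
        self.times = deque()  # event timestamps, oldest first

    def add(self, t):
        self.times.append(t)
        self._expire(t)

    def count(self, now):
        self._expire(now)
        return len(self.times)

    def _expire(self, now):
        while self.times and self.times[0] <= now - self.window:
            self.times.popleft()


w = SlidingWindowCount(window=60.0)
for t in [0, 10, 30, 55, 70]:  # order timestamps in seconds
    w.add(t)
print(w.count(now=70))  # 3 orders fall inside (10, 70]
```

Spark streaming's built-in window operators hide roughly this bookkeeping, which is the trade-off being discussed.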
el. e.g.
>> count of orders in last 1 minute, in Storm I have to write code to for
>> sliding windows and state management, while Spark seems to provide
>> operators to accomplish that. Tuple level operations such as enrichment,
>> filters etc.. seems also doable in both.
&
I love it. This is a real horse race!
On Mon, Jun 9, 2014 at 2:17 PM, Adam Lewis wrote:
> #10 - 5 pts.
>
>
> On Mon, Jun 9, 2014 at 5:02 PM, joe roberts wrote:
>
>> 10 = 5 pts.
>>
>>
>> On 6/9/2014 2:38 PM, P. Taylor Goetz wrote:
>>
>> This is a call to vote on selecting the winning Stor
On Mon, Jun 9, 2014 at 2:27 PM, P. Taylor Goetz wrote:
> There is one study that I’m aware of that claims Spark streaming is
> insanely faster than Storm.
I like your way of describing the two tools as starting from differing
extremes with a common territory around micro-batching.
As such, it
On Mon, Jun 9, 2014 at 3:48 PM, Rajiv Onat wrote:
> a) I have stream of orders (keyed on customerid, source is socket)
> b) I filter for those orders that is from my high value customers (I have
> to make sure I have this list of high value customers available on all bolt
> tasks in memory for fa
te, in Storm I have to write code to for
> sliding windows and state management, while Spark seems to provide
> operators to accomplish that. Tuple level operations such as enrichment,
> filters etc.. seems also doable in both.
>
>
> On Mon, Jun 9, 2014 at 12:24 PM, Ted Dunning
>
If you can afford a bit more time for insertion, consider also t-digest.
Differences relative to the high dynamic range histogram system include:
- HDR histograms assume an exponential distribution. t-digest handles
arbitrary distributions
- t-digest is much more accurate near extreme values.
Coda Hale's Metrics didn't handle extreme skew in measurements well, last
time I looked. For throughput, averages are great. For latency, you need
very high percentiles to understand what is happening.
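A toy illustration of why averages mislead for latency (the numbers are synthetic):

```python
# 990 fast requests and 10 pathologically slow ones: the mean looks
# healthy while the 99th percentile tells the real story.

latencies_ms = [10] * 990 + [2000] * 10

mean = sum(latencies_ms) / len(latencies_ms)
p99 = sorted(latencies_ms)[int(0.99 * len(latencies_ms))]

print(f"mean={mean:.1f} ms, p99={p99} ms")  # mean=29.9 ms, p99=2000 ms
```

One slow request in a hundred barely moves the average, yet every user who hits it waits two seconds; that is exactly the skew a mean-centered metrics library hides.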
On Mon, Jun 16, 2014 at 6:00 PM, Michael Rose
wrote:
> What kind of issues does Metrics have that lead