Re: CapacityScheduler vs. FairScheduler

2016-07-07 Thread Wangda Tan
Hi folks,

I've written a short feature-by-feature comparison of the Fair Scheduler
and the Capacity Scheduler for the latest Hadoop.

https://wangda.live/2016/07/07/an-updated-feature-comparison-between-capacity-scheduler-and-fair-scheduler/

HTH,

- Wangda

On Sun, Jun 19, 2016 at 11:21 PM, sandeep vura wrote:

> Hi,
>
> I have the same question. Please clarify.
>
> Regards,
> sandeep.v
>
> On Fri, Jun 10, 2016 at 6:08 AM, Alvin Chyan wrote:
>
>> I have the same question.
>>
>> Thanks!
>>
>>
>> Alvin Chyan, Lead Software Engineer, Data
>> 901 Marshall St, Suite 200, Redwood City, CA 94063
>>
>> On Fri, Jun 3, 2016 at 11:04 AM, Lars Francke wrote:
>>
>>> Hi,
>>>
>>> I've been using Hadoop for years and have always just taken for granted
>>> that FairScheduler = Cloudera and CapacityScheduler = Hortonworks/Yahoo.
>>> There are some comparisons but all of them are years old and somewhat (if
>>> not entirely) outdated.
>>>
>>> The documentation doesn't really help and neither does the Javadoc. The
>>> code of both is fairly complex.
>>>
>>> So my question is: How do these two Schedulers really differ today? What
>>> are some features that one has that the other doesn't? Are there any
>>> fundamental differences (anymore)?
>>>
>>> Any insight is welcome.
>>>
>>> Thank you!
>>>
>>> Cheers,
>>> Lars
>>>
>>
>>
>


Re: Help designing application architecture

2016-07-07 Thread Ted Yu
For 1) you don't have to introduce external storage.

You can define case classes for the known formats.

FYI
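
As a rough illustration of the suggestion above: case classes are a Scala
feature, so the sketch below uses a Python dataclass as the nearest analogue
for a known format, plus a small validator for user-defined channels. The
names `Coordinate`, `TYPE_CHECKS`, `validate`, and `coordinate_schema` are
hypothetical, and the mapping from type names to Python types is an
assumption based on the "coordinate" example in the quoted message.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Coordinate:
    """A known format, modelled as a typed, immutable record."""
    x: float
    y: float
    instant: datetime


# Assumed mapping from the type names used in channel definitions
# to the Python types used to check incoming values.
TYPE_CHECKS = {"Float": float, "Timestamp": datetime}


def validate(schema, record):
    """Return True if record has exactly the schema's fields with matching types."""
    if set(record) != set(schema):
        return False
    return all(
        isinstance(record[name], TYPE_CHECKS[type_name])
        for name, type_name in schema.items()
    )


# The user-defined "coordinate" channel from the quoted message.
coordinate_schema = {"X": "Float", "Y": "Float", "instant": "Timestamp"}
```

With this shape, formats known ahead of time get a fixed class, while
channels defined at runtime are checked against their schema dictionary
before being stored.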

On Thu, Jul 7, 2016 at 4:40 PM, venito camelas wrote:

> I'm pretty new to this and have a use case I'm not sure how to implement.
> I'll try to explain it, and I'd appreciate it if anyone could point me in
> the right direction.
>
> The case has these requirements:
>  1 - Any user should be able to define the format of the information they
> want to store (channel). For example, user X defines a channel named
> "coordinate":
> coordinate = {
> "X" : "Float",
> "Y" : "Float",
> "instant" : "Timestamp"
> }
>   Every channel has a time value: either an instant (like above) or
> a period of time ("start" : "Timestamp", "end" : "Timestamp").
>
>  2 - Given the previous example, the user should be able to ask the
> following questions:
> 2.1 When was the last time I went near {X : x, Y : y}?  --> Process the
> information in order to get the "near" places and return the newest one.
> 2.2 Where was I on March 6th between 1pm and 2pm?   --> Query by time
>
>
>
> For 1) I was thinking of using a document-oriented store because the
> channels lack a fixed structure, though I'm not sure that's the only thing
> to consider.
>
> For 2.1) I'd use a MapReduce job.
>
> For 2.2) I think it would be better to have the information in the
> document storage and make the queries there.
>
> Is it a good approach to store the information in both HDFS and the
> document-oriented store (for processing and querying respectively)?
>
> As I mentioned in the beginning, I'm really new to this and I'm just
> trying to learn, so sorry if my questions are silly.
>
> Any suggestion or any good reference related to this will be much
> appreciated.
>


Help designing application architecture

2016-07-07 Thread venito camelas
I'm pretty new to this and have a use case I'm not sure how to implement.
I'll try to explain it, and I'd appreciate it if anyone could point me in
the right direction.

The case has these requirements:
 1 - Any user should be able to define the format of the information they
want to store (channel). For example, user X defines a channel named
"coordinate":
coordinate = {
"X" : "Float",
"Y" : "Float",
"instant" : "Timestamp"
}
  Every channel has a time value: either an instant (like above) or a
period of time ("start" : "Timestamp", "end" : "Timestamp").

 2 - Given the previous example, the user should be able to ask the
following questions:
2.1 When was the last time I went near {X : x, Y : y}?  --> Process the
information in order to get the "near" places and return the newest one.
2.2 Where was I on March 6th between 1pm and 2pm?   --> Query by time



For 1) I was thinking of using a document-oriented store because the
channels lack a fixed structure, though I'm not sure that's the only thing
to consider.

For 2.1) I'd use a MapReduce job.

For 2.2) I think it would be better to have the information in the document
storage and make the queries there.

Is it a good approach to store the information in both HDFS and the
document-oriented store (for processing and querying respectively)?

As I mentioned in the beginning, I'm really new to this and I'm just trying
to learn, so sorry if my questions are silly.

Any suggestion or any good reference related to this will be much
appreciated.
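
To make questions 2.1 and 2.2 concrete, here is a minimal in-memory sketch
in Python. It is not a MapReduce job or a document-store query; it only
shows the logic such a job or query would need. The function names, the
`radius` parameter, and the use of Euclidean distance as the meaning of
"near" are all assumptions, since the post does not define them.

```python
from datetime import datetime
from math import hypot


def last_time_near(records, x, y, radius):
    """2.1: newest instant among records within `radius` of (x, y), or None."""
    near = [
        r["instant"]
        for r in records
        if hypot(r["X"] - x, r["Y"] - y) <= radius  # assumed: Euclidean distance
    ]
    return max(near) if near else None


def positions_between(records, start, end):
    """2.2: (X, Y) pairs with start <= instant <= end, oldest first."""
    hits = [r for r in records if start <= r["instant"] <= end]
    return [(r["X"], r["Y"]) for r in sorted(hits, key=lambda r: r["instant"])]


# Made-up sample data for the "coordinate" channel.
records = [
    {"X": 0.0, "Y": 0.0, "instant": datetime(2016, 3, 6, 13, 15)},
    {"X": 5.0, "Y": 5.0, "instant": datetime(2016, 3, 6, 13, 45)},
    {"X": 0.1, "Y": 0.0, "instant": datetime(2016, 3, 7, 9, 0)},
]
```

In a MapReduce setting the distance filter would sit in the map phase and
the `max` in the reduce phase, while the time-range query maps naturally
onto an indexed range scan on the timestamp field.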


YARN application start event

2016-07-07 Thread Alvaro Brandon
Hello everyone:

I was wondering if there is any way to capture the event of an application
starting in YARN. The idea is to implement a listener that, every time a
YARN application starts, queries the REST API to get the memory and cores
currently available in the cluster. Any ideas on this?

Thanks in advance,

Alvaro
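
One possible approach, sketched below in Python, is to poll the
ResourceManager REST API rather than subscribe to an event: list the
running applications, diff against the ids seen so far, and fetch the
cluster metrics when a new id appears. This is polling, not a true start
event, so short-lived applications can be missed between polls. The
address in `RM_URL` and all function names are hypothetical; the paths
`/ws/v1/cluster/apps` and `/ws/v1/cluster/metrics` and the fields
`availableMB` and `availableVirtualCores` come from the ResourceManager
REST API.

```python
import json
from urllib.request import urlopen

# Hypothetical ResourceManager address; replace with your own.
RM_URL = "http://resourcemanager.example.com:8088"


def fetch_json(path):
    """GET a ResourceManager REST path and decode the JSON body."""
    with urlopen(RM_URL + path) as resp:
        return json.load(resp)


def running_app_ids(apps_json):
    """Extract application ids from a /ws/v1/cluster/apps response."""
    apps = apps_json.get("apps") or {}  # "apps" is null when nothing matches
    return {app["id"] for app in apps.get("app", [])}


def available_resources(metrics_json):
    """Extract free memory (MB) and vcores from a /ws/v1/cluster/metrics response."""
    metrics = metrics_json["clusterMetrics"]
    return metrics["availableMB"], metrics["availableVirtualCores"]


def poll_once(seen):
    """Return (newly started app ids, updated seen set, free resources or None)."""
    current = running_app_ids(fetch_json("/ws/v1/cluster/apps?states=RUNNING"))
    started = current - seen
    # Only query the metrics endpoint when at least one new application showed up.
    free = available_resources(fetch_json("/ws/v1/cluster/metrics")) if started else None
    return started, seen | current, free
```

Calling `poll_once` in a loop with a short sleep, and carrying the returned
`seen` set from one call to the next, gives listener-like behaviour without
any changes on the cluster side.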