Re: how to find top N values using map-reduce ?

2013-02-01 Thread Eugene Kirpichov
Hi,

Can you tell more about:
 * How big is N
 * How big is the input dataset
 * How many mappers you have
 * Do input splits correlate with the sorting criterion for top N?

Depending on the answers, very different strategies will be optimal.
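
As a concrete reference point for the per-mapper top-N strategy described in the quoted message below, here is a minimal sketch against the org.apache.hadoop.mapreduce API. It assumes plain numeric input with one value per line; the class names and the constant N are made up for the example.

import java.io.IOException;
import java.util.PriorityQueue;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Each mapper keeps only its N largest values in a min-heap and emits them in
// cleanup(); a single reducer (job.setNumReduceTasks(1)) merges the candidates.
public class TopN {
  static final int N = 10; // illustrative; in practice read this from the job configuration

  public static class TopNMapper
      extends Mapper<LongWritable, Text, NullWritable, LongWritable> {
    private final PriorityQueue<Long> heap = new PriorityQueue<Long>();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx) {
      long v = Long.parseLong(line.toString().trim()); // assumes one number per line
      heap.add(v);
      if (heap.size() > N) {
        heap.poll(); // drop the smallest so only the N largest survive
      }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
      for (Long v : heap) {
        ctx.write(NullWritable.get(), new LongWritable(v));
      }
    }
  }

  public static class TopNReducer
      extends Reducer<NullWritable, LongWritable, NullWritable, LongWritable> {
    @Override
    protected void reduce(NullWritable key, Iterable<LongWritable> values, Context ctx)
        throws IOException, InterruptedException {
      PriorityQueue<Long> heap = new PriorityQueue<Long>();
      for (LongWritable v : values) {
        heap.add(v.get());
        if (heap.size() > N) {
          heap.poll();
        }
      }
      while (!heap.isEmpty()) {
        ctx.write(NullWritable.get(), new LongWritable(heap.poll())); // ascending order
      }
    }
  }
}

Note that the worry in the quoted message is not a correctness problem: a mapper whose split holds fewer than N records simply emits everything it saw, and the single reducer still receives every candidate for the global top N. It only becomes a volume problem when (number of mappers) x N is too much for one reducer.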



On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar praveen...@gmail.com wrote:

 I am looking for a better solution for this.

 One way to do this would be to find the top N values from each mapper and
 then find the top N out of those in a single reducer. I am afraid that
 this won't work effectively if my N is larger than the number of values in
 my input split (or mapper input).

 The other way is to just sort everything in a single reducer and then take
 the top N.

 Wondering if there is any better approach to do this?

 Regards
 Praveenesh




-- 
Eugene Kirpichov
http://www.linkedin.com/in/eugenekirpichov
http://jkff.info/software/timeplotters - my performance visualization tools


Re: Analysis of Log Files

2012-07-03 Thread Eugene Kirpichov
So you want to compute select max(date) from log group by product?
Can you describe how far you have advanced and where precisely you
are stuck?

On Tue, Jul 3, 2012 at 3:23 PM, Shailesh Samudrala
shailesh2...@gmail.com wrote:
 I am writing a sample application to analyze some log files of webpage
 accesses. Basically, the log files record which products were accessed,
 and on what date.
 I want to write a MapReduce program to determine on what date a product
 was most accessed.
 Please share your ideas with me. Thanks!



-- 
Eugene Kirpichov
http://www.linkedin.com/in/eugenekirpichov


Re: Analysis of Log Files

2012-07-03 Thread Eugene Kirpichov
Ok, I see, so you need to 1) group and count everything by date and
product_id => {date, product_id, count} (this is one map+reduce), and then
2) group that by product_id and take the date for which the count is
highest (this is another map+reduce).
Does this sound sensible?

I'm not sure if this can be efficiently done with just 1 stage of map+reduce.

On Tue, Jul 3, 2012 at 3:36 PM, Shailesh Samudrala
shailesh2...@gmail.com wrote:
 I want to find out how many times a product was searched during a day, and
 then select the day on which this count is highest.

 Until now, I have extracted all the required fields from the search string,
 and I am confused about what exactly I should be passing from the mapper to
 the reducer.


-- 
Eugene Kirpichov
http://www.linkedin.com/in/eugenekirpichov


Re: Analysis of Log Files

2012-07-03 Thread Eugene Kirpichov
Well, then you can simply do it like this:
Map: emit key=product_id, value=date
Reduce, for a particular product_id: count the dates in a hashtable and
return the date with the highest count.

Assuming you've started selling products later than computers were
invented, this should be fine w.r.t. performance and memory
consumption :)
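
A minimal sketch of that single job, assuming the log lines have already been reduced to tab-separated product_id and date fields (the field layout and class names are illustrative, not from the thread):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MostAccessedDate {

  // Emit key=product_id, value=date for every access record.
  public static class AccessMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      // Illustrative layout: product_id<TAB>date<TAB>...; adjust to the real log format.
      String[] fields = line.toString().split("\t");
      if (fields.length >= 2) {
        ctx.write(new Text(fields[0]), new Text(fields[1]));
      }
    }
  }

  // For one product, count accesses per date in memory and keep the busiest date.
  // The hashtable holds one entry per distinct date for that product, so it stays small.
  public static class MaxDateReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text productId, Iterable<Text> dates, Context ctx)
        throws IOException, InterruptedException {
      Map<String, Integer> counts = new HashMap<String, Integer>();
      String bestDate = null;
      int bestCount = 0;
      for (Text d : dates) {
        String date = d.toString();
        Integer c = counts.get(date);
        c = (c == null) ? 1 : c + 1;
        counts.put(date, c);
        if (c > bestCount) {
          bestCount = c;
          bestDate = date;
        }
      }
      ctx.write(productId, new Text(bestDate + "\t" + bestCount));
    }
  }
}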

On Tue, Jul 3, 2012 at 3:52 PM, Shailesh Samudrala
shailesh2...@gmail.com wrote:
 Yes, I think that would work, but I'm looking for a single-MapReduce-job
 solution, if possible.




-- 
Eugene Kirpichov
http://www.linkedin.com/in/eugenekirpichov


Re: execute millions of grep

2011-11-03 Thread Eugene Kirpichov
If you really need to do millions of exact text queries against millions of
documents in realtime, a simple grep is not going to be sufficient for you.
You'll need smarter data structures and algorithms.

Please specify how frequently the set of *queries* changes and what you
consider real time.
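
For reference, the side-data pattern mentioned in the quoted message below looks roughly like this. It is a sketch only: the paths and class names are made up, and the inner contains() loop is exactly the naive per-query scan that stops scaling as the query set grows - the automaton discussed later in this thread replaces that loop.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class PhraseMatchJob {

  // Driver side: ship the query file to every task via the DistributedCache.
  public static void configure(Job job) throws Exception {
    DistributedCache.addCacheFile(new URI("/side-data/queries.txt"), // illustrative path
        job.getConfiguration());
  }

  public static class MatchMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final List<String> queries = new ArrayList<String>();

    // Load the cached query file once per task, not once per record.
    @Override
    protected void setup(Context ctx) throws IOException {
      Path[] cached = DistributedCache.getLocalCacheFiles(ctx.getConfiguration());
      BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
      String line;
      while ((line = in.readLine()) != null) {
        queries.add(line.toLowerCase());
      }
      in.close();
    }

    // One input record = one document; emit (query, offset) for every naive substring hit,
    // using the record's byte offset as a stand-in document id.
    @Override
    protected void map(LongWritable key, Text doc, Context ctx)
        throws IOException, InterruptedException {
      String text = doc.toString().toLowerCase();
      for (String q : queries) {
        if (text.contains(q)) {
          ctx.write(new Text(q), new Text("doc@" + key.get()));
        }
      }
    }
  }
}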

On Thu, Nov 3, 2011 at 2:46 PM, Oliver Krohne oliver.kro...@yieldkit.com wrote:

 Hi,

 I am evaluating different solutions for massive phrase query execution. I
 need to execute millions of greps, or more precisely phrase queries consisting
 of 1-4 terms, against millions of documents. I saw the Hadoop grep example,
 but that executes grep with a single regex.

 I also saw the side data distribution / DistributedCache facility
 of Hadoop, so I could pass the queries to the mappers and execute each query
 against the input line. The input line would be the entire text of a document
 (usually 50-500 words).

 As I am aiming to have this information almost in realtime, another
 question arises about ad-hoc map/reduce jobs. Is there a limit on running a
 lot of jobs in parallel, say if I fired a new job whenever a new
 document arrives? The job would only process that particular document. Or I
 would batch 100-1000 documents and then fire the job.

 Can anyone advise on an approach to doing this with Hadoop?

 Thanks in advance,
 Oliver

-- 
Eugene Kirpichov
Principal Engineer, Mirantis Inc. http://www.mirantis.com/
Editor, http://fprog.ru/


Re: execute millions of grep

2011-11-03 Thread Eugene Kirpichov
Hi Oliver,
I have solved a similar problem before, and it seems to me that the best
solution is to build an automaton (with whole words, not letters, on the edges)
over the set of queries (basically, a trie) that will allow you to find all
queries matching a document in a single pass over the document.
For 100k queries, the trie will be very fast and very easy to build, and it
will not consume much memory.
The document scan is as follows, for each word:
 - add the root of the trie to the active node set
 - for each node in the active set, replace the node with its child
corresponding to the scanned word (dropping nodes that have no such child).

Then whenever you add a terminal (end-of-query) node to the active set, you are
seeing a match of some query that ends at the current word.

Please tell if this description needs more detail.
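
A minimal single-machine sketch of that scan (plain Java, no Hadoop; whitespace tokenization and lowercasing are simplifying assumptions, and the class names are made up):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Word-level trie over the query phrases; a single pass over a document
// reports every query phrase that occurs in it as a contiguous word sequence.
public class PhraseTrie {

  private static class Node {
    final Map<String, Node> children = new HashMap<String, Node>();
    String query; // non-null if a query phrase ends at this node
  }

  private final Node root = new Node();

  public void addQuery(String phrase) {
    Node n = root;
    for (String word : phrase.toLowerCase().split("\\s+")) {
      Node child = n.children.get(word);
      if (child == null) {
        child = new Node();
        n.children.put(word, child);
      }
      n = child;
    }
    n.query = phrase;
  }

  public List<String> match(String document) {
    List<String> hits = new ArrayList<String>();
    List<Node> active = new ArrayList<Node>();
    for (String word : document.toLowerCase().split("\\s+")) {
      active.add(root);                        // a new phrase may start at this word
      List<Node> next = new ArrayList<Node>();
      for (Node n : active) {
        Node child = n.children.get(word);     // advance each partial match by one word
        if (child == null) {
          continue;                            // this partial match dies here
        }
        next.add(child);
        if (child.query != null) {
          hits.add(child.query);               // a query phrase ends at the current word
        }
      }
      active = next;
    }
    return hits;
  }
}

Since the queries are only 1-4 words long, the active set never holds more than a handful of nodes, so the per-word work is tiny. Inside a MapReduce job, the trie would typically be built in the mapper's setup() from a query file shipped via the DistributedCache, with match() called once per document.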

On 03.11.2011, at 15:20, Oliver Krohne oliver.kro...@yieldkit.com wrote:

 Hi Eugene,
 
 thanks for the quick reply.
 
 Not only the queries are changing but also the documents:
 
 a) The long-term goal would be to be able to process a document or a batch of
 documents after a certain event, which is what I mean by realtime. There will
 be different types of documents which are updated differently. One set of
 documents will be updated every 1-2 hours; other documents will be updated
 once a week. So an event means that the queries have to run against the set of
 documents which was updated. So probably I need to run the queries every hour
 against one subset, and once a week the other documents need to be searched.

 b) The queries change a couple of times a day. At this stage we have
 around 100k queries, but they will grow every day. A huge set of the queries
 remains the same, but I expect that after 3 months 10% of the queries will
 change every day. Either they are deleted or new ones are added.

 One of the main questions I am still thinking about is where to search:
 search for the queries in the documents, or search for the document terms in
 the queries.

 I will need, at first, exact matching of a phrase query, and in a next step
 phrase matching given a precision/slop.
 
 Thanks for your help,
 Oliver
 
 


Re: execute millions of grep

2011-11-03 Thread Eugene Kirpichov
Hi,

Hm - I was not assuming that you'll have 20mln queries.
However, you can partition the query space into as many parts as needed to
fit each part in memory, and accordingly increase the number of passes over
a document - or just do it in parallel with mapreduce.

I am not sure I understand how your inverted index solution works. Can you
elaborate?

On Thu, Nov 3, 2011 at 3:55 PM, Oliver Krohne oliver.kro...@yieldkit.com wrote:

 Hi Eugene,

 thanks for the hint. I looked at the automaton approach, and of course 100k
 queries do not consume a lot of memory, and a single pass through the document
 is perfect. But how will it work with 20 million and more queries?
 In addition, there will be different sets of queries for each language, so
 we can expect to have 20 million+ queries per language. Of course
 they would be different automatons.

 Another approach I am thinking of is to build an inverted index of the
 queries; during map I would look up the queries for each word, and during
 reduce do normal query processing. What do you think about that?

 regards,
 Oliver


Re: data locality

2011-10-26 Thread Eugene Kirpichov
Thanks!

2011/10/26 Steve Loughran ste...@apache.org:
 On 26/10/11 05:22, Eugene Kirpichov wrote:

 But I guess it isn't always possible to achieve optimal scheduling, right?
 What is done then? Is network topology taken into account, perhaps?

 I'd recommend this paper if you are curious; it explains the Fair Scheduler:

 http://www.cs.berkeley.edu/~matei/papers/2010/eurosys_delay_scheduling.pdf

 You can plug in different schedulers with different policies if you want.




-- 
Eugene Kirpichov
Principal Engineer, Mirantis Inc. http://www.mirantis.com/
Editor, http://fprog.ru/


Re: data locality

2011-10-25 Thread Eugene Kirpichov
But I guess it isn't always possible to achieve optimal scheduling, right?
What is done then? Is network topology taken into account, perhaps?



On 26.10.2011, at 4:42, Mapred Learn mapred.le...@gmail.com wrote:

 Yes, that's right!
 
 Sent from my iPhone
 
 On Oct 25, 2011, at 5:36 PM, ivan.nov...@emc.com wrote:
 
 So I guess the job tracker is the one reading the HDFS meta-data and then
 optimizing the scheduling of map jobs based on that?
 
 
 On 10/25/11 3:13 PM, Shevek she...@karmasphere.com wrote:
 
 We pray to $deity that the mapreduce block size is about the same as (or
 smaller than) the hdfs block size. We also pray that file format
 synchronization points are frequent when compared to block boundaries.
 
 The JobClient finds the location of each block of each file. It splits the
 job into FileSplit(s), with one per block.
 
 Each FileSplit is processed by a task. The Split contains the locations in
 which the task should best be run.
 
 The last block may be very short. It is then subsumed into the preceding
 block.
 
 Some data is transferred between nodes when the synchronization point for
 the file format is not at a block boundary. (It basically never is, but we
 hope it's close, or the purpose of MR locality is defeated.)
 
 Specifically to your questions: Most of the data should be read from the
 local hdfs node under the above assumptions. The communication layer
 between
 mapreduce and hdfs is not special.
 
 S.
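
To see the mechanism Shevek describes from the client side, you can ask the input format for its splits and print each split's preferred hosts - the same location information the scheduler uses to place map tasks. A small sketch against the 0.20-era org.apache.hadoop.mapreduce API; the input path is passed as a program argument and nothing here comes from the original thread:

import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Ask the input format for its splits and print each split's preferred hosts.
public class ShowSplitLocations {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration());                 // 0.20-era constructor
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS directory of logs
    List<InputSplit> splits = new TextInputFormat().getSplits(job);
    for (InputSplit split : splits) {
      System.out.println(split + " -> " + Arrays.toString(split.getLocations()));
    }
  }
}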
 
 On 25 October 2011 11:49, ivan.nov...@emc.com wrote:
 
 Hello,
 
 I am trying to understand how data locality works in hadoop.
 
 If you run a map reduce job do the mappers only read data from the host
 on
 which they are running?
 
 Is there a communication protocol between the map reduce layer and HDFS
 layer so that the mapper gets optimized to read data locally?
 
 Any pointers on which layer of the stack handles this?
 
 Cheers,
 Ivan
 
 


Re: Hbase + mapreduce -- operational design question

2011-09-10 Thread Eugene Kirpichov
I believe HBase has some kind of TTL (timeout-based expiry) for
records and it can clean them up on its own.
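
For reference, the TTL is configured per column family when the table is created (or altered). A minimal sketch against the HBase-0.90-era client API, with the table and family names made up for the example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Create an 'events' table whose cells expire automatically after 7 days;
// compactions remove expired cells without any explicit delete.
public class CreateEventsTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor table = new HTableDescriptor("events");   // example table name
    HColumnDescriptor family = new HColumnDescriptor("e");     // example column family
    family.setTimeToLive(7 * 24 * 60 * 60);                    // TTL is given in seconds
    table.addFamily(family);

    admin.createTable(table);
    admin.close();
  }
}

Note that a TTL expires cells purely by age, not by whether a job has processed them, so it only fits this use case if "older than some window" is an acceptable stand-in for "already processed".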

On Sat, Sep 10, 2011 at 1:54 AM, Dhodapkar, Chinmay
chinm...@qualcomm.com wrote:
 Hello,
 I have a setup where a bunch of clients store 'events' in an HBase table.
 Also, periodically (once a day), I run a mapreduce job that goes over the
 table and computes some reports.

 Now my issue is that the next time, I don't want the mapreduce job to process
 the 'events' that it has already processed previously. I know that I can mark
 processed events in the HBase table and the mapper can filter them out
 during the next run. But what I would really like is for previously
 processed events to not even hit the mapper.

 One solution I can think of is to back up the HBase table after running the
 job and then clear the table. But this has a lot of problems:
 1) Clients may have inserted events while the job was running.
 2) I could disable and drop the table and then create it again... but then the
 clients would complain about this short window of unavailability.


 What do people using HBase (live) + mapreduce typically do?

 Thanks!
 Chinmay





-- 
Eugene Kirpichov
Principal Engineer, Mirantis Inc. http://www.mirantis.com/
Editor, http://fprog.ru/


HBase and HDFS sync support in the main branch

2011-09-09 Thread Eugene Kirpichov
Hello,

It appears from http://wiki.apache.org/hadoop/Hbase/HdfsSyncSupport
that for at least the last year, durable edits haven't been supported
by the main branches of HBase & HDFS - only by a non-main branch, an
unreleased branch and a third-party distribution.

This seems strange to me, as
http://hbase.apache.org/acid-semantics.html seems to claim that
durability is present, without indicating that by default it actually
is not (part of the page reads like a plan, not like a claim about the
current state of things, which is also confusing).

Is this because durable edits are not that much of a needed feature?

[Disclaimer: I've not done much with HBase and Hadoop in general in a
while, so I may be asking questions completely stupid in the current
context]

-- 
Eugene Kirpichov
Principal Engineer, Mirantis Inc. http://www.mirantis.com/
Editor, http://fprog.ru/


Re: HBase and HDFS sync support in the main branch

2011-09-09 Thread Eugene Kirpichov
Hello Arun,

Thanks for the clarification. Do you mean 0.20.205 will be a default
(main, trunk, you name it) release and have durable edits?

On Fri, Sep 9, 2011 at 2:22 PM, Arun C Murthy a...@hortonworks.com wrote:
 Eugene,

  Currently we are close to getting 0.20.205 frozen, which will be the first
 Apache Hadoop release with proper support for HBase.

 hth,
 Arun






-- 
Eugene Kirpichov
Principal Engineer, Mirantis Inc. http://www.mirantis.com/
Editor, http://fprog.ru/