Re: how to find top N values using map-reduce ?
Hi,

Can you tell more about:
* How big is N?
* How big is the input dataset?
* How many mappers do you have?
* Do input splits correlate with the sorting criterion for top N?

Depending on the answers, very different strategies will be optimal.

On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar <praveen...@gmail.com> wrote:
> I am looking for a better solution for this. One way to do it would be to find the top N values from each mapper and then find the top N among those in a single reducer. I am afraid that this won't work effectively if my N is larger than the number of values in my input split (i.e. one mapper's input). The other way is to just sort all of them in one reducer and then take the top N. Wondering if there is any better approach to do this?
>
> Regards,
> Praveenesh

--
Eugene Kirpichov
http://www.linkedin.com/in/eugenekirpichov
http://jkff.info/software/timeplotters - my performance visualization tools
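The per-mapper strategy Praveenesh describes (each mapper emits its local top N, one reducer merges the candidates) can be sketched in plain Python with heapq. This is only an illustration of the logic, not Hadoop code; the input splits are made up. Note that it also handles the case that worries him: a mapper whose split has fewer than N values simply emits everything it has.

```python
import heapq

def mapper_top_n(split, n):
    """Each mapper emits only its local top N values (or all of them if the split is small)."""
    return heapq.nlargest(n, split)

def reducer_top_n(mapper_outputs, n):
    """A single reducer merges the per-mapper candidate lists."""
    merged = [v for out in mapper_outputs for v in out]
    return heapq.nlargest(n, merged)

splits = [[5, 1, 9, 3], [8, 2, 7], [6, 4]]  # made-up input splits
n = 3
candidates = [mapper_top_n(s, n) for s in splits]
print(reducer_top_n(candidates, n))  # -> [9, 8, 7]
```

Since each mapper forwards at most N values, the single reducer only ever sees N times the number of mappers values, which is what makes this cheaper than the sort-everything-in-one-reducer alternative.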
Re: Analysis of Log Files
So you want to compute select max(date) from log group by product? Can you describe how far you have advanced and where precisely you are stuck?

On Tue, Jul 3, 2012 at 3:23 PM, Shailesh Samudrala <shailesh2...@gmail.com> wrote:
> I am writing a sample application to analyze some log files of webpage accesses. Basically, the log files record which products were accessed, and on what date. I want to write a MapReduce program to determine on what date a product was most accessed. Please share your ideas with me. Thanks!

--
Eugene Kirpichov
http://www.linkedin.com/in/eugenekirpichov
Re: Analysis of Log Files
Ok, I see, so you need to:
1) group by date and product_id, and count: {date, product_id, count} (this is one map+reduce);
2) group this by product_id and take the date for which the count is highest (this is another map+reduce).

Does this sound sensible? I'm not sure this can be done efficiently with just one stage of map+reduce.

On Tue, Jul 3, 2012 at 3:36 PM, Shailesh Samudrala <shailesh2...@gmail.com> wrote:
> I want to find out how many times a product was searched during a day, and then select the day when this is highest. Until now, I have extracted all the required fields from the search string, and I am confused about what exactly I should be passing from the mapper to the reducer.

--
Eugene Kirpichov
http://www.linkedin.com/in/eugenekirpichov
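The two-stage pipeline above can be sketched in plain Python (not Hadoop code; the log records and product ids are made up). Stage 1 produces the {date, product_id, count} triples; stage 2 groups them by product and keeps the date with the highest count.

```python
from collections import Counter

# Made-up log of (product_id, date) access records.
log = [("p1", "2012-07-01"), ("p1", "2012-07-01"),
       ("p1", "2012-07-02"), ("p2", "2012-07-02")]

# Stage 1: group by (date, product_id) and count -> {(date, product_id): count}.
stage1 = Counter((date, product) for product, date in log)

# Stage 2: group by product_id, keep the date with the highest count.
best = {}
for (date, product), cnt in stage1.items():
    if product not in best or cnt > best[product][1]:
        best[product] = (date, cnt)

print(best)  # -> {'p1': ('2012-07-01', 2), 'p2': ('2012-07-02', 1)}
```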
Re: Analysis of Log Files
Well, then you can simply do it like this:

Map: emit key=product_id, value=date.
Reduce, for a particular product_id: count the dates manually (in a hashtable) and return the date with the highest count.

Assuming you've started selling products later than computers were invented, this should be fine w.r.t. performance and memory consumption :)

On Tue, Jul 3, 2012 at 3:52 PM, Shailesh Samudrala <shailesh2...@gmail.com> wrote:
> Yes, I think that is possible, but I'm looking for a one-MapReduce-job solution, if possible.
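The single-job version above, sketched in plain Python rather than Hadoop code (the data is made up): the "map" phase groups dates under each product_id, and each "reduce" call counts dates in a hashtable and returns the most frequent one.

```python
from collections import Counter, defaultdict

# Made-up log of (product_id, date) access records.
log = [("p1", "2012-07-01"), ("p1", "2012-07-01"),
       ("p1", "2012-07-02"), ("p2", "2012-07-02")]

# Map: emit key=product_id, value=date (here: the shuffle's grouping, simulated).
groups = defaultdict(list)
for product, date in log:
    groups[product].append(date)

# Reduce (one call per product_id): count dates in a hashtable,
# return the date with the highest count.
def reduce_product(dates):
    return Counter(dates).most_common(1)[0][0]

result = {product: reduce_product(dates) for product, dates in groups.items()}
print(result)  # -> {'p1': '2012-07-01', 'p2': '2012-07-02'}
```

The memory remark in the mail holds here too: each reducer's hashtable has at most one entry per distinct date, which stays tiny.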
--
Eugene Kirpichov
http://www.linkedin.com/in/eugenekirpichov
Re: execute millions of grep
If you really need to do millions of exact text queries against millions of documents in realtime, a simple grep is not going to be sufficient for you. You'll need smarter data structures and algorithms. Please specify how frequently the set of *queries* changes, and what you consider real time.

On Thu, Nov 3, 2011 at 2:46 PM, Oliver Krohne <oliver.kro...@yieldkit.com> wrote:
> Hi,
>
> I am evaluating different solutions for massive phrase query execution. I need to execute millions of greps - more precisely, phrase queries consisting of 1-4 terms - against millions of documents. I saw the Hadoop grep example, but that executes grep with a single regex. I also saw Hadoop's side-data distribution / DistributedCache facility, so I could pass the queries to the mapper and execute each query against the input line. The input line would be the entire text of a document (usually 50-500 words).
>
> As I am aiming to have this information almost in realtime, another question arises about ad-hoc map/reduce jobs. Is there a limit on running many jobs in parallel, let's say if I fired a new job whenever a new document arrives? That job would only process that particular document. Or I would batch 100-1000 documents and then fire the job.
>
> Can anyone advise an approach for doing this with Hadoop?
>
> Thanks in advance,
> Oliver

--
Eugene Kirpichov
Principal Engineer, Mirantis Inc. http://www.mirantis.com/
Editor, http://fprog.ru/
Re: execute millions of grep
Hi Oliver,

I have solved a similar problem before, and it seems to me that the best solution is to build an automaton (with whole words, not letters, on the edges) over the set of queries - basically, a trie - that will allow you to find all queries matching a document in a single pass over the document. For 100k queries, the trie will be very fast and very easy to build, and it will not consume much memory.

The document scan is as follows, for each word:
- add the root of the trie to the active node set;
- for each node in the active set, replace the node with its child corresponding to the scanned word.

Whenever you add a leaf node to the active set, you are seeing a match of some query that ends at the current word. Please tell me if this description needs more detail.

03.11.2011, at 15:20, Oliver Krohne <oliver.kro...@yieldkit.com> wrote:
> Hi Eugene,
>
> thanks for the quick reply. Not only the queries are changing, but also the documents:
>
> a) The long-term goal would be to be able to process a document, or a batch of documents, after a certain event - this is what I mean by realtime. There will be different types of documents which are updated differently: one set of documents will be updated every 1-2 hours, other documents will be updated once a week. An event means that the queries have to run against the set of documents that was updated. So I probably need to run the queries every hour against one subset, and once a week the other documents need to be searched.
>
> b) The queries change a couple of times a day. At this stage we have around 100k queries, but they will grow every day. A large part of the query set remains the same, but I expect that after 3 months, 10% of the queries will change every day - either deleted, or new ones added.
>
> One of the main questions I am still thinking about is where to search: search for the queries in the documents, or search for the document terms in the queries.
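The word-level trie scan described above can be sketched in a few lines of Python (an illustration under assumptions, not production code; the query set and document are made up). Each scanned word advances every active node, and reaching a leaf marker reports the query that ends at the current word.

```python
# Hypothetical phrase queries; edges of the trie are whole words, not letters.
queries = ["red shoes", "blue shirt", "shoes"]

END = object()  # leaf marker: a query ends at this node
trie = {}
for q in queries:
    node = trie
    for word in q.split():
        node = node.setdefault(word, {})
    node[END] = q

def scan(document):
    matches = []
    active = []  # set of trie nodes reached by suffixes of the words seen so far
    for word in document.split():
        next_active = []
        # Advance every active node, plus the root (a query may start here).
        for node in active + [trie]:
            child = node.get(word)
            if child is not None:
                if END in child:           # a leaf: some query ends at this word
                    matches.append(child[END])
                next_active.append(child)
        active = next_active
    return matches

print(scan("big red shoes and a blue shirt"))  # -> ['red shoes', 'shoes', 'blue shirt']
```

The active set is bounded by the maximum query length (here, at most 4 nodes for 1-4 term queries), so the scan stays a single cheap pass regardless of how many queries are in the trie.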
> I will need at first exact match of a phrase query, and in a next step phrase match given a precision/slop.
>
> Thanks for your help,
> Oliver
>
> ---
> Oliver Krohne
> Founder / CTO
> YieldKit UG (haftungsbeschränkt)
> Mittelweg 161, 20148 Hamburg
> T +49 40 209 349 771
> F +49 40 209 349 779
> E oliver.kro...@yieldkit.com
> http://www.yieldkit.com
> Sitz der Gesellschaft: Hamburg
> Geschäftsführer: Sandra Tiemann, Oliver Krohne
> Handelsregister: Amtsgericht Hamburg HRB 109104
Re: execute millions of grep
Hi,

Hm - I was not assuming that you'd have 20 million queries. However, you can partition the query space into as many parts as needed to fit each part in memory, and accordingly increase the number of passes over a document - or just do the passes in parallel with mapreduce.

I am not sure I understand how your inverted index solution works. Can you elaborate?

On Thu, Nov 3, 2011 at 3:55 PM, Oliver Krohne <oliver.kro...@yieldkit.com> wrote:
> Hi Eugene,
>
> thanks for the hint. I looked at the automaton approach, and of course 100k queries don't consume a lot of memory, and a single pass through the document is perfect. But how will it work with 20 million and more queries? In addition, there will be a different set of queries for each language, so we can expect to have 20 million+ queries per language. Of course they would be separate automatons.
>
> Another approach I am thinking of is to build an inverted index of the queries: during map I would look up the queries for each word, and during reduce do normal query processing. What do you think about that?
>
> regards,
> Oliver
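One plausible reading of Oliver's inverted-index idea can be sketched as follows (this is a guess at his design, with made-up names and data): index each query under its first word, then at every word position of the document, look up the candidate queries starting with that word and verify each one in place.

```python
from collections import defaultdict

# Hypothetical query set; each query is indexed under its first word.
queries = ["red shoes", "blue shirt", "shoes"]

index = defaultdict(list)
for q in queries:
    words = q.split()
    index[words[0]].append(words)

def scan(document):
    words = document.split()
    matches = []
    for i, word in enumerate(words):
        for q in index.get(word, []):
            if words[i:i + len(q)] == q:  # verify the full phrase in place
                matches.append(" ".join(q))
    return matches

print(scan("big red shoes and a blue shirt"))  # -> ['red shoes', 'shoes', 'blue shirt']
```

Compared to the trie, the index lookup happens once per word rather than once per active node, but each candidate still costs a phrase verification; with 1-4 term queries both approaches do comparable work per word.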
Re: data locality
Thanks!

2011/10/26 Steve Loughran <ste...@apache.org>:
> On 26/10/11 05:22, Eugene Kirpichov wrote:
>> But I guess it isn't always possible to achieve optimal scheduling, right? What's done then; any account for network topology, perhaps?
>
> I'd recommend this paper if you are curious; it explains the Fair Scheduler: http://www.cs.berkeley.edu/~matei/papers/2010/eurosys_delay_scheduling.pdf
>
> You can plug in different schedulers with different policies if you want.

--
Eugene Kirpichov
Principal Engineer, Mirantis Inc. http://www.mirantis.com/
Editor, http://fprog.ru/
Re: data locality
But I guess it isn't always possible to achieve optimal scheduling, right? What's done then; any account for network topology, perhaps?

26.10.2011, at 4:42, Mapred Learn <mapred.le...@gmail.com> wrote:
> Yes, that's right!
>
> On Oct 25, 2011, at 5:36 PM, ivan.nov...@emc.com wrote:
>> So I guess the job tracker is the one reading the HDFS metadata and then optimizing the scheduling of map jobs based on that?
>>
>> On 10/25/11 3:13 PM, Shevek <she...@karmasphere.com> wrote:
>>> We pray to $deity that the mapreduce block size is about the same as (or smaller than) the hdfs block size. We also pray that file format synchronization points are frequent compared to block boundaries.
>>>
>>> The JobClient finds the location of each block of each file. It splits the job into FileSplit(s), one per block. Each FileSplit is processed by a task. The split contains the locations in which the task should best be run. The last block may be very short; it is then subsumed into the preceding block. Some data is transferred between nodes when the synchronization point for the file format is not at a block boundary. (It basically never is, but we hope it's close, or the purpose of MR locality is defeated.)
>>>
>>> Specifically to your questions: most of the data should be read from the local hdfs node under the above assumptions. The communication layer between mapreduce and hdfs is not special.
>>>
>>> S.
>>>
>>> On 25 October 2011 11:49, ivan.nov...@emc.com wrote:
>>>> Hello,
>>>> I am trying to understand how data locality works in hadoop. If you run a map reduce job, do the mappers only read data from the host on which they are running? Is there a communication protocol between the map reduce layer and the HDFS layer so that the mapper is optimized to read data locally? Any pointers on which layer of the stack handles this?
>>>> Cheers,
>>>> Ivan
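Shevek's description of split computation - one FileSplit per block, with a short trailing block subsumed into the preceding one - can be sketched roughly like this. The function name, the (offset, length) representation, and the 10% threshold are illustrative assumptions, not Hadoop's actual JobClient code.

```python
def compute_splits(file_length, block_size):
    """One split per block; a short trailing block is merged into the previous split."""
    splits = []
    offset = 0
    while offset < file_length:
        length = min(block_size, file_length - offset)
        splits.append((offset, length))
        offset += length
    # Subsume a very short last block into the preceding split
    # (threshold of 10% of a block is a made-up illustration).
    if len(splits) > 1 and splits[-1][1] < block_size // 10:
        last_off, last_len = splits.pop()
        prev_off, prev_len = splits.pop()
        splits.append((prev_off, prev_len + last_len))
    return splits

# A 265-byte file with 128-byte blocks: the 9-byte tail is merged into split 2.
print(compute_splits(file_length=265, block_size=128))  # -> [(0, 128), (128, 137)]
```

In real Hadoop, each split additionally carries the hostnames of the nodes holding its block's replicas, which is what the scheduler uses to place the task near its data.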
Re: Hbase + mapreduce -- operational design question
I believe HBase has some kind of TTL (timeout-based expiry) for records, and it can clean them up on its own.

On Sat, Sep 10, 2011 at 1:54 AM, Dhodapkar, Chinmay <chinm...@qualcomm.com> wrote:
> Hello,
>
> I have a setup where a bunch of clients store 'events' in an HBase table. Also, periodically (once a day), I run a mapreduce job that goes over the table and computes some reports. Now my issue is that the next time around, I don't want the mapreduce job to process the 'events' that it has already processed previously.
>
> I know that I can mark processed events in the hbase table and the mapper can filter them out during the next run. But what I would really like is that previously processed events don't even hit the mapper. One solution I can think of is to back up the hbase table after running the job and then clear the table. But this has a lot of problems:
> 1) Clients may have inserted events while the job was running.
> 2) I could disable and drop the table and then create it again... but then the clients would complain about this short window of unavailability.
>
> What do people using HBase (live) + mapreduce typically do?
>
> Thanks!
> Chinmay

--
Eugene Kirpichov
Principal Engineer, Mirantis Inc. http://www.mirantis.com/
Editor, http://fprog.ru/
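For reference, the TTL Eugene mentions is a per-column-family attribute, settable from the HBase shell roughly like this (the table and family names are made up; check the HBase reference guide for your version's exact syntax):

```
# Expire cells in family 'ev' of table 'events' after 7 days (TTL is in seconds)
disable 'events'
alter 'events', {NAME => 'ev', TTL => 604800}
enable 'events'
```

Note that expired cells are physically removed only during compactions, so space reclamation is not immediate, and a fixed TTL does not by itself distinguish "processed" from "unprocessed" events the way Chinmay wants.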
HBase and HDFS sync support in the main branch
Hello,

It appears from http://wiki.apache.org/hadoop/Hbase/HdfsSyncSupport that for at least the last year, durable edits have not been supported by the main branch of HBase/HDFS - only by a non-main branch, an unreleased branch, and a third-party distribution.

This seems strange to me, as http://hbase.apache.org/acid-semantics.html seems to claim that durability is present, without indicating that by default it actually is not (part of the page reads like a plan, not like a claim about the current state of things, which is also confusing).

Is this because durable edits are not that much of a needed feature?

[Disclaimer: I've not done much with HBase and Hadoop in general in a while, so I may be asking questions that are completely stupid in the current context.]

--
Eugene Kirpichov
Principal Engineer, Mirantis Inc. http://www.mirantis.com/
Editor, http://fprog.ru/
Re: HBase and HDFS sync support in the main branch
Hello Arun,

Thanks for the clarification. Do you mean 0.20.205 will be a default (main, trunk, you name it) release, and will have durable edits?

On Fri, Sep 9, 2011 at 2:22 PM, Arun C Murthy <a...@hortonworks.com> wrote:
> Eugene,
>
> Currently we are close to getting 0.20.205 frozen, which will be the first Apache Hadoop release with proper support for HBase.
>
> hth,
> Arun

--
Eugene Kirpichov
Principal Engineer, Mirantis Inc. http://www.mirantis.com/
Editor, http://fprog.ru/