Re: Neural Network in hadoop

2015-02-12 Thread Ted Dunning
That is a really old paper that basically pre-dates all of the recent important work in neural networks. You should look at work on rectified linear units (ReLU), dropout regularization, parameter servers (Downpour SGD) and deep learning. Map-reduce as you have used it will not produce interes

Re: How to partition a file to smaller size for performing KNN in hadoop mapreduce

2015-01-14 Thread Ted Dunning
Have you considered implementing this using something like Spark? That could be much easier than raw map-reduce. On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni wrote: > In KNN like algorithm we need to load model Data into cache for predicting > the records. > > Here is the example for KNN. > > >

Re: Wrapping around BitSet with the Writable interface

2013-05-12 Thread Ted Dunning
Another interesting alternative is EWAH, a Java bitset implementation that allows efficient compressed bitsets with very fast OR operations. https://github.com/lemire/javaewah See also https://code.google.com/p/sparsebitmap/ by the same authors. On Sun, May 12, 2013 at 1:11 PM, Bertrand Dec
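
A rough sketch of how the JavaEWAH library is typically used (class and method names are from the library as I recall them; treat this as an illustration rather than a tested snippet):

    import com.googlecode.javaewah.EWAHCompressedBitmap;

    public class EwahOrExample {
        public static void main(String[] args) {
            // build two compressed bitmaps; bit positions are given in sorted order
            EWAHCompressedBitmap a = EWAHCompressedBitmap.bitmapOf(1, 5, 100000);
            EWAHCompressedBitmap b = EWAHCompressedBitmap.bitmapOf(5, 7, 100001);

            // OR is computed directly on the compressed form, which is what makes it fast
            EWAHCompressedBitmap union = a.or(b);
            System.out.println(union); // prints the set bits of the union
        }
    }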

Re: issues with decrease the default.block.size

2013-05-12 Thread Ted Dunning
The block size controls lots of things in Hadoop. It affects read parallelism, scalability, block allocation and other aspects of operations either directly or indirectly. On Sun, May 12, 2013 at 10:38 AM, shashwat shriparv < dwivedishash...@gmail.com> wrote: > The block size is for allocation

Re: What's the best disk configuration for hadoop? SSD's Raid levels, etc?

2013-05-11 Thread Ted Dunning
This sounds (with no real evidence) like you are a bit light on memory for that number of cores. That could cause you to be spilling map outputs early and very much slowing things down. On Fri, May 10, 2013 at 11:30 PM, David Parks wrote: > We’ve got a cluster of 10x 8core/24gb nodes, currently

Re: MapReduce - FileInputFormat and Locality

2013-05-08 Thread Ted Dunning
I think that you just said what the OP said. Your two cases reduce to the same single case that they had. Whether this matters is another question, but it seems like it could in cases where splits != blocks, especially if a split starts near the end of a block which could give an illusion of loca

Re: Hardware Selection for Hadoop

2013-05-07 Thread Ted Dunning
On Tue, May 7, 2013 at 5:53 AM, Michael Segel wrote: > While we have a rough metric on spindles to cores, you end up putting a > stress on the disk controllers. YMMV. > This is an important comment. Some controllers fold when you start pushing too much data. Testing nodes independently before i

Re: Hardware Selection for Hadoop

2013-05-05 Thread Ted Dunning
Data nodes normally are also task nodes. With 8 physical cores it isn't that unreasonable to have 64GB whereas 24GB really is going to pinch. Achieving highest performance requires that you match the capabilities of your nodes including CPU, memory, disk and networking. The standard wisdom is 4-

Re: Hardware Selection for Hadoop

2013-04-29 Thread Ted Dunning
I think that having more than 6 drives is better. More memory never hurts. If you have too little, you may have to run with fewer slots than optimal. 10GbE networking is good. If not, having more than two 1GbE ports is good, at least on distributions that can deal with them properly. On Mon, Apr

Re: M/R job optimization

2013-04-26 Thread Ted Dunning
Have you checked the logs? Is there a task that is taking a long time? What is that task doing? There are two basic possibilities: a) you have a skewed join like the other Ted mentioned. In this case, the straggler will be seen to be working on data. b) you have a hung process. This can be m

Re: Cartesian product in hadoop

2013-04-18 Thread Ted Dunning
It is rarely practical to do exhaustive comparisons on datasets of this size. The usual method is to heuristically prune the Cartesian product set and only examine pairs that have a high likelihood of being near. This can be done in many ways. Your suggestion of doing a map-side join is a reasona
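
One simple way to prune, sketched below under the assumption of 2-D points read as "x,y" lines (class name and cell size invented for illustration): the mapper emits a coarse grid cell as the key, so the reducer only compares points that land in the same cell instead of the full Cartesian product.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Buckets points by coarse grid cell so only nearby candidates meet in the reducer.
    public class GridBucketMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final double CELL = 10.0;   // cell size, tune to the expected "near" distance

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] xy = value.toString().split(",");
            long cx = (long) Math.floor(Double.parseDouble(xy[0]) / CELL);
            long cy = (long) Math.floor(Double.parseDouble(xy[1]) / CELL);
            context.write(new Text(cx + ":" + cy), value);
        }
    }

Points near a cell border would also need to be emitted to the neighboring cells; that detail is omitted here.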

Re: Physically moving HDFS cluster to new

2013-04-17 Thread Ted Dunning
It may or may not help you in your current distress, but MapR's distribution could handle this pretty easily. One method is direct distcp between clusters, but you could also use MapR's mirroring capabilities to migrate data. You can also carry a MapR cluster, change the IP addresses and relight

Re: Copy Vs DistCP

2013-04-14 Thread Ted Dunning
On Sun, Apr 14, 2013 at 10:33 AM, Mathias Herberts < mathias.herbe...@gmail.com> wrote: > > > > > This is absolutely true. Distcp dominates cp for large copies. On the > other hand cp dominates distcp for convenience. > > > > In my own experience, I love cp when copying relatively small amounts

Re: Copy Vs DistCP

2013-04-14 Thread Ted Dunning
's of GB and up), the startup time of distcp doesn't matter because once it gets going, it moves data much faster. > On Apr 14, 2013 6:15 AM, "Ted Dunning" wrote: > >> >> Lance, >> >> Never say never. >> >> Linux programs can read from t

Re: Copy Vs DistCP

2013-04-13 Thread Ted Dunning
Lance, Never say never. Linux programs can read from the right kind of Hadoop cluster without using FUSE. On Fri, Apr 12, 2013 at 10:15 AM, Lance Norskog wrote: > Shell 'cp' only works if you use 'fuse', which makes the HDFS file system > visible as a Unix mounted file system. Otherwise, U

Re: Bloom Filter analogy in SQL

2013-03-29 Thread Ted Dunning
This isn't really a Hadoop question. A Bloom filter is a very low level data structure that doesn't really have any correlate in SQL. It allows you to find duplicates quickly and probabilistically. In return for a small probability of a false positive, it uses less memory. On Fri, Mar 29, 2013 at 5:3
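
For a concrete feel of that trade-off, here is a small sketch using Guava's BloomFilter (this assumes Guava is on the classpath; the sizing numbers are arbitrary):

    import java.nio.charset.StandardCharsets;
    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;

    public class BloomExample {
        public static void main(String[] args) {
            // expect ~1M keys, accept ~1% false positives; memory use is far below a full hash set
            BloomFilter<CharSequence> seen =
                BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1000000, 0.01);

            seen.put("user-42");
            System.out.println(seen.mightContain("user-42")); // true
            System.out.println(seen.mightContain("user-43")); // usually false, occasionally a false positive
        }
    }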

Re: Million docs and word count scenario

2013-03-29 Thread Ted Dunning
Putting each document into a separate file is not likely to be a great thing to do. On the other hand, putting them all into one file may not be what you want either. It is probably best to find a middle ground and create files each with many documents and each a few gigabytes in size. On Fri,
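
A sketch of that middle ground using a Hadoop SequenceFile to pack many small documents into one large container file (Hadoop 1.x-style API; the file name and key scheme are invented for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackDocuments {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path out = new Path("/data/docs/part-00000.seq");

            // key = document id, value = document body; many documents per container file
            SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
            try {
                writer.append(new Text("doc-0001"), new Text("full text of document 1 ..."));
                writer.append(new Text("doc-0002"), new Text("full text of document 2 ..."));
            } finally {
                writer.close();
            }
        }
    }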

Re: Which hadoop installation should I use on ubuntu server?

2013-03-28 Thread Ted Dunning
Also, Canonical just announced that MapR is available in the Partner repos. On Thu, Mar 28, 2013 at 7:22 AM, Nitin Pawar wrote: > apache bigtop has builds done for ubuntu > > you can check them at jenkins mentioned on bigtop.apache.org > > > On Thu, Mar 28, 2013 at 11:37 AM, David Parks wrote: >

Re: Static class vs Normal Class when to use

2013-03-28 Thread Ted Dunning
Another Ted piping in. For Hadoop use, it is dangerous to use anything but a static class for your mapper and reducer functions since you may accidentally think that you can access a variable captured from the enclosing class. A static class cannot reference those values so you know that you haven't made th
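
The shape being described looks roughly like this (a minimal sketch; the job class name is made up). Because the nested Mapper is static, the compiler rejects any accidental reference to fields of the enclosing job class, which would not be available in the task JVMs anyway.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MyJob {
        private int notAvailableInTasks = 42;   // lives only in the client JVM

        // static nested class: cannot touch MyJob's instance fields, by construction
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }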

Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

2013-03-28 Thread Ted Dunning
The EMR distributions have special versions of the s3 file system. They might be helpful here. Of course, you likely aren't running those if you are seeing 5MB/s. An extreme alternative would be to light up an EMR cluster, copy to it, then to S3. On Thu, Mar 28, 2013 at 4:54 AM, Himanish Kusha

Re: Naïve k-means using hadoop

2013-03-27 Thread Ted Dunning
Spark would be an excellent choice for the iterative sort of k-means. It could be good for sketch-based algorithms as well, but the difference would be much less pronounced. On Wed, Mar 27, 2013 at 3:39 PM, Charles Earl wrote: > I would think also that starting with centers in some in-memory

Re: Naïve k-means using hadoop

2013-03-27 Thread Ted Dunning
And, of course, due credit should be given here. The advanced clustering algorithms in Crunch were lifted from the new stuff in Mahout pretty much step for step. The Mahout group would have loved to have contributions from the Cloudera guys instead of re-implementation, but you can't legislate ta

Re:

2013-03-25 Thread Ted Dunning
I would agree with David that this is not normally a good idea. There are situations, however, where you do need to control location of data and where the computation occurs. These requirements, however, normally only come up in real-time or low-latency situations. Ordinary Hadoop does not addre

Re: copytolocal vs distcp

2013-03-09 Thread Ted Dunning
Try file:///fs4/outdir. Symbolic links can also help. Note that this file system has to be visible with the same path on all hosts. You may also be bandwidth limited by whatever is serving that file system. There are cases where you won't be limited by the file system. MapR, for instance, has a

Re: Accumulo and Mapreduce

2013-03-04 Thread Ted Dunning
Chaining the jobs is a fantastically inefficient solution. If you use Pig or Cascading, the optimizer will glue all of your map functions into a single mapper. The result is something like: (mapper1 -> mapper2 -> mapper3) => reducer. Here the parentheses indicate that all of the map function

Re: Encryption in HDFS

2013-02-25 Thread Ted Dunning
Most recent crypto libraries use the special instructions on Intel processors. See for instance: http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-aes-instructions-set On Mon, Feb 25, 2013 at 9:10 PM, Seonyeong Bak wrote: > Hello, I'm a university student. > > I imple

Re: mapr videos question

2013-02-24 Thread Ted Dunning
The MapR videos on programming and map-reduce are all general videos. The videos that cover capabilities like NFS, snapshots and mirrors are all MapR specific since ordinary Hadoop distributions like Cloudera, Hortonworks and Apache can't support those capabilities. The videos that cover MapR adm

Re: product recommendations engine

2013-02-17 Thread Ted Dunning
Yeah... you can make this work. First, if your setup is relatively small, then you won't need Hadoop. Second, having lots of kinds of actions is a very reasonable thing to have. My own suggestion is that you analyze these each for their predictive power independently and then combine them at rec

Re: Epic cause and company.

2013-02-16 Thread Ted Dunning
Please use other channels for recruiting. If you specifically have jobs for Apache contributors, try j...@apache.org This mailing list is for other purposes. On Fri, Feb 15, 2013 at 2:38 PM, Skye King Laskin wrote: > I have been retained by the VP of Engineering, with Progreso > Financiero - a

Re: Correlation between replication factor and read/write performance survey?

2013-02-11 Thread Ted Dunning
The delay due to replication is rarely a large problem in traditional map-reduce programs since many writes are occurring at once. The real problem comes because you are consuming 3x the total disk bandwidth so that the theoretical maximum equilibrium write bandwidth is limited to the lesser of ha

Re: Question related to Decompressor interface

2013-02-10 Thread Ted Dunning
All of these suggestions tend to founder on the problem of key management. What you need to do is: 1) define your threats, 2) define your architecture including key management, and 3) demonstrate how the architecture defends against the threat environment. I haven't seen more than a cursory comment

Re: Mutiple dfs.data.dir vs RAID0

2013-02-10 Thread Ted Dunning
wrote: > We have seen in several of our Hadoop clusters that LVM degrades > performance of our M/R jobs, and I remembered a message where > Ted Dunning was explaining something about this, and since > that time, we don't use LVM for Hadoop data directories. > > About RAID v

Re: How can I limit reducers to one-per-node?

2013-02-10 Thread Ted Dunning
For crawler type apps, typically you direct all of the URLs to crawl from a single domain to a single reducer. Typically, you also have many reducers so that you can get decent bandwidth. It is also common to consider the normal web politeness standards with a grain of salt, particularly by taki
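
A sketch of the partitioning described, assuming the map output key is the URL itself (class name invented): the host name decides the reducer, so all URLs for one domain land on the same reducer while many reducers keep aggregate bandwidth up.

    import java.net.URI;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Routes all URLs from the same host to the same reducer.
    public class DomainPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text url, Text value, int numPartitions) {
            String host;
            try {
                host = URI.create(url.toString()).getHost();
            } catch (IllegalArgumentException e) {
                host = null;
            }
            if (host == null) {
                host = url.toString();   // fall back to the raw key on malformed URLs
            }
            return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }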

Re: Interested in learning hadoop

2013-02-02 Thread Ted Dunning
As does MapR http://academy.mapr.com/ On Sat, Feb 2, 2013 at 7:53 AM, Chris Embree wrote: > Just to maintain some balance on the list, Hortonworks has similar > training vidos and a sandbox appliance. > > http://hortonworks.com/community/ > > Enjoy. > > > On Sat, Feb 2, 2013 at 10:02 AM, YouPe

Re: Hi,can u please help me how to retrieve the videos from hdfs

2013-02-02 Thread Ted Dunning
Works with a real-time version of Hadoop such as MapR. But you are right that HDFS and MapReduce were never intended for real-time use. On Fri, Feb 1, 2013 at 1:40 AM, Mohammad Tariq wrote: > How are going to store videos in HDFS?By 'playing video on the browser' I > assume it's gonna be realti

Re: Dell Hardware

2013-01-31 Thread Ted Dunning
We have tested both machines in our labs at MapR and both work well. Both run pretty hot so you need to keep a good eye on that. The R720 will have higher wattage per unit of storage due to the smaller number of drives per chassis. That may be a good match for ordinary Hadoop due to the lower I/

Re: Suggestions for Change Management System for Hadoop projects

2013-01-27 Thread Ted Dunning
Are you asking about change management for configurations and such? If so, there are good tools out there for managing that including puppet, chef and ansible. Or are you asking about something else? Both Cloudera and MapR have tools that help with centralized configuration management of cluster

Re: How to Backup HDFS data ?

2013-01-24 Thread Ted Dunning
Incremental backups are nice to avoid copying all your data again. You can code these at the application layer if you have nice partitioning and keep track correctly. You can also use platform level capabilities such as provided for by the MapR distribution. On Fri, Jan 25, 2013 at 3:23 PM, Hars

Re: Estimating disk space requirements

2013-01-18 Thread Ted Dunning
Jeff makes some good points here. On Fri, Jan 18, 2013 at 5:01 PM, Jeffrey Buell wrote: > I disagree. There are some significant advantages to using "many small > nodes" instead of "few big nodes". As Ted points out, there are some > disadvantages as well, so you have to look at the trade-offs

Re: Estimating disk space requirements

2013-01-18 Thread Ted Dunning
creating 20 individual servers on the cloud, and not create one > big server and make several virtual nodes inside it. > I will be paying for 20 different nodes.. all configured with hadoop and > connected to the cluster. > > Thanx for the intel :) > > > On Fri, Jan 18,

Re: Estimating disk space requirements

2013-01-18 Thread Ted Dunning
local folder. > And I am pretty sure it does not have a separate partition for root. > > Please help me explain what u meant and what else precautions should I > take. > > Thanks, > > Regards, > Ouch Whisper > 01010101010 > On Jan 18, 2013 11:11 PM, "Ted

Re: On a lighter note

2013-01-18 Thread Ted Dunning
Well, I think the actual name was "untergang". Same meaning. Sent from my iPhone On Jan 17, 2013, at 8:09 PM, Mohammad Tariq wrote: > You are right Michael, as always :) > > Warm Regards, > Tariq > https://mtariq.jux.com/ > cloudfront.blogspot.com > > > On Fri, Jan 18, 2013 at 6:33 AM, Mic

Re: Estimating disk space requirements

2013-01-18 Thread Ted Dunning
Where do you find 40GB disks nowadays? Normally your performance is going to be better with more space but your network may be your limiting factor for some computations. That could give you some paradoxical scaling. HBase will rarely show this behavior. Keep in mind you also want to allow

Re: Hadoop Scalability

2013-01-18 Thread Ted Dunning
Also, you may have to adjust your algorithms. For instance, the conventional standard algorithm for SVD is a Lanczos iterative algorithm. Iteration in Hadoop is death because of job invocation time ... what you wind up with is an algorithm that will handle big data but with a slow-down factor tha

Re: Does mapred.local.dir is important factor in reducer side?

2012-12-31 Thread Ted Dunning
that the local directory configs accept URIs in 2.x releases, > allowing users to plug alternative filesystems if they wanted to. > > On Tue, Jan 1, 2013 at 12:47 AM, Ted Dunning wrote: >> Hadoop, The Definitive Guide is only talking about Apache, CDH and >> Hortonworks here. &g

Re: Does mapred.local.dir is important factor in reducer side?

2012-12-31 Thread Ted Dunning
Hadoop, The Definitive Guide is only talking about Apache, CDH and Hortonworks here. The MapR distribution does not have this limitation and thus is one solution for this problem. Another solution is to do partial aggregates such as with a combiner. On Mon, Dec 31, 2012 at 8:14 AM, Majid Azimi w

Re: What is the preferred way to pass a small number of configuration parameters to a mapper or reducer

2012-12-28 Thread Ted Dunning
Answer B sounds pathologically bad to me. A or C are the only viable options. Neither B nor D works. B fails because it would be extremely hard to get the right records to the right components and because it pollutes data input with configuration data. D fails because statics don't work in paral
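
The usual shape of the Configuration route looks roughly like this (the parameter name and class names are invented for illustration): the value is set on the client before submission and read back in setup() inside every task.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ConfiguredJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("myapp.threshold", "0.75");   // set on the client before submission
            Job job = Job.getInstance(conf, "configured job");
            // ... set input/output paths, mapper class, etc., then job.waitForCompletion(true)
        }

        public static class ThresholdMapper extends Mapper<LongWritable, Text, Text, Text> {
            private double threshold;

            @Override
            protected void setup(Context context) {
                // read it back inside the task; every mapper and reducer sees the same value
                threshold = Double.parseDouble(
                    context.getConfiguration().get("myapp.threshold", "0.0"));
            }
        }
    }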

Re: hadoop -put command

2012-12-26 Thread Ted Dunning
The colon is a reserved character in a URI according to RFC 3986[1]. You should be able to percent encode those colons as %3A. [1] http://tools.ietf.org/html/rfc3986 On Wed, Dec 26, 2012 at 1:00 PM, Mohit Anchlia wrote: > It looks like hadoop fs -put command doesn't like ":" in the file names.
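
A tiny illustration of that encoding (only the reserved colon is rewritten; everything else is left alone; the file name is made up):

    public class EncodeColons {
        public static void main(String[] args) {
            String name = "logs/2012-12-26T13:00:00.txt";
            // percent-encode ':' so the name is acceptable in a URI path
            String encoded = name.replace(":", "%3A");
            System.out.println(encoded);   // logs/2012-12-26T13%3A00%3A00.txt
        }
    }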

Re: What should I do with a 48-node cluster

2012-12-22 Thread Ted Dunning
sually 20 to 30 amp fully loaded though so if your >>> crushing word count at home your power bill is gonna get $. >>> >>> >>> >>> On Friday, December 21, 2012, Mark Kerzner >>> wrote: >>> > True! >>> > >>&g

Re: Alerting

2012-12-22 Thread Ted Dunning
Also, I think that Oozie allows for timeouts in job submission. That might answer your need. On Sat, Dec 22, 2012 at 2:08 PM, Ted Dunning wrote: > You can write a script to parse the Hadoop job list and send an alert. > > The trick of putting a retry into your workflow system is a

Re: Alerting

2012-12-22 Thread Ted Dunning
You can write a script to parse the Hadoop job list and send an alert. The trick of putting a retry into your workflow system is a nice one. If your program won't allow multiple copies to run at the same time and you re-invoke the program every, say, hour, then 5 retries implies that the pre

Re: Merging files

2012-12-22 Thread Ted Dunning
.hadoop.util.ToolRunner.run(ToolRunner.java:65) > > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) > > at org.apache.hadoop.tools.DistCp.main(DistCp.java:937) > > > On Sat, Dec 22, 2012 at 11:24 AM, Ted Dunning wrote: > >> The technical term for this is "c

Re: Merging files

2012-12-22 Thread Ted Dunning
The technical term for this is "copying". You may have heard of it. It is a subject of such long technical standing that many do not consider it worthy of detailed documentation. Distcp performs a similar process and can be modified to combine the input files into a single file. http://hadoop.ap

Re: What should I do with a 48-node cluster

2012-12-20 Thread Ted Dunning
On Thu, Dec 20, 2012 at 7:38 AM, Michael Segel wrote: > While Ted ignores that the world is going to end before X-Mas, he does hit > the crux of the matter head on. > > If you don't have a place to put it, the cost of setting it up would kill > you, not to mention that you can get newer hardware w

Re: What should I do with a 48-node cluster

2012-12-20 Thread Ted Dunning
Depending on the node characteristics, these might actually not be all that much use. Blades are usually designed assuming external storage like a SAN. That means that they usually don't have much disk which makes them only OK for Hadoop work. Also, what about the installation cost? Do you have

Re: Sane max storage size for DN

2012-12-12 Thread Ted Dunning
Yes it does make sense, depending on how much compute each byte of data will require on average. With ordinary Hadoop, it is reasonable to have half a dozen 2TB drives. With specialized versions of Hadoop considerably more can be supported. From what you say, it sounds like you are suggesting t

Re: bounce message

2012-11-28 Thread Ted Dunning
Also, the moderators don't seem to read anything that goes by. On Wed, Nov 28, 2012 at 4:12 AM, sathyavageeswaran wrote: > In this group once anyone subscribes there is no exit route. > > -Original Message- > From: Tony Burton [mailto:tbur...@sportingindex.com] > Sent: 28 November 2012 1

Re: unsubscribe

2012-11-26 Thread Ted Dunning
You know, at some point, it becomes a simpler hypothesis that the mailing list is mal-configured and the moderators for the list are asleep at the switch. Assuming that the people subscribed to the Hadoop mailing lists are stupider than the average Apache mailing list subscriber is probably not a

Re: Mapping MySQL schema to Avro

2012-11-24 Thread Ted Dunning
On Sat, Nov 24, 2012 at 5:19 AM, Bart Verwilst wrote: > ** > > ... I'm not sure that i understand your comment about repeating values in > fmsswitchvalues, since they are different from the ones in fmssession? > I was just pointing out that there were fields in the fmssession record that didn't

Re: Mapping MySQL schema to Avro

2012-11-23 Thread Ted Dunning
This is probably the wrong list for your question. And, no, I don't think that your conversion is correct. To me it looks like you have lots of values in a fmsession. In the avro version, those values appear to be repeated in the fmsswitchvalues. That seems wrong. On Fri, Nov 23, 2012 at 5:12

Re: a question on NameNode

2012-11-19 Thread Ted Dunning
It sounds like you could benefit from reading the basic papers on map-reduce in general. Hadoop is a reasonable facsimile of the original Google systems. Try looking at this: http://research.google.com/archive/mapreduce.html On Mon, Nov 19, 2012 at 7:14 AM, Kartashov, Andy wrote: > Thank you K

Re: HDFS block size

2012-11-16 Thread Ted Dunning
Andy's points are reasonable but there are a few omissions:
- modern file systems are pretty good at writing large files into contiguous blocks if they have a reasonable amount of space available.
- the seeks in question are likely to be more to do with checking directories for block locations th

Re: backup of hdfs data

2012-11-05 Thread Ted Dunning
Conventional enterprise backup systems are rarely scaled for Hadoop needs. Both bandwidth and size are typically lacking. My employer, MapR, offers a Hadoop-derived distribution that includes both point-in-time snapshots and remote mirrors. Contact me off line for more info. Sent from my iP

Re: Set the number of maps

2012-11-01 Thread Ted Dunning
Is the spelling of the option correct? On Thu, Nov 1, 2012 at 6:43 AM, Cogan, Peter (Peter) < peter.co...@alcatel-lucent.com> wrote: > Hi > > I understand that the maximum number of concurrent map tasks is set > by mapred.tasktracker.map.tasks.maximum - however I wish to run with a > smaller num

Re: ClientProtocol create、mkdirs 、rename and delete methods are not Idempotent

2012-10-28 Thread Ted Dunning
I can better understand the problem. > > 2012/10/29 Ted Dunning > >> Create cannot be idempotent because of the problem of watches and >> sequential files. >> >> Similarly, mkdirs, rename and delete cannot generally be idempotent. In >> particular applicat

Re: Cluster wide atomic operations

2012-10-28 Thread Ted Dunning
On Sun, Oct 28, 2012 at 9:15 PM, David Parks wrote: > I need a unique & permanent ID assigned to new item encountered, which has > a constraint that it is in the range of, let’s say for simple discussion, > one to one million. > Having such a limited range may require that you have a central ser

Re: ClientProtocol create、mkdirs 、rename and delete methods are not Idempotent

2012-10-28 Thread Ted Dunning
Create cannot be idempotent because of the problem of watches and sequential files. Similarly, mkdirs, rename and delete cannot generally be idempotent. In particular applications, you might find it is OK to treat them as such, but there are definitely applications where they are not idempotent.

Re: Cluster wide atomic operations

2012-10-26 Thread Ted Dunning
This is better asked on the ZooKeeper lists. The first answer is that global atomic operations are a generally bad idea. The second answer is that if you can batch these operations up then you can cut the evilness of global atomicity by a substantial factor. Are you sure you need a global counter

Re: rules engine with Hadoop

2012-10-20 Thread Ted Dunning
That probably means that your problem is pretty easy. Just code up a standard rules engine into a mapper. You can also build a user defined function (UDF) in Pig or Hive and Hadoop will handle the parallelism for you. On Sat, Oct 20, 2012 at 6:48 AM, Luangsay Sourygna wrote: > My problem would
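
A minimal sketch of the Hive UDF route (the rule itself and the class name are made up; this assumes the classic org.apache.hadoop.hive.ql.exec.UDF base class that Hive shipped at the time):

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.BooleanWritable;
    import org.apache.hadoop.io.Text;

    // Wraps one simple rule so Hive can apply it to every row in parallel.
    public class SuspiciousOrderUDF extends UDF {
        public BooleanWritable evaluate(Text country, Text paymentType) {
            if (country == null || paymentType == null) {
                return new BooleanWritable(false);
            }
            boolean flagged = !"US".equals(country.toString())
                    && "gift_card".equals(paymentType.toString());
            return new BooleanWritable(flagged);
        }
    }

After ADD JAR and CREATE TEMPORARY FUNCTION, the rule can be used in ordinary queries and Hadoop handles the parallelism.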

Re: rules engine with Hadoop

2012-10-19 Thread Ted Dunning
Unification in a parallel cluster is a difficult problem. Writing very large scale unification programs is an even harder problem. What problem are you trying to solve? One option would be that you need to evaluate a conventionally-sized rulebase against many inputs. Map-reduce should be trivia

Re: DFS respond very slow

2012-10-15 Thread Ted Dunning
Uhhh... Alexey, did you really mean that you are running 100 megabit per second network links? That is going to make Hadoop run *really* slowly. Also, putting RAID under any DFS, be it Hadoop or MapR, is not a good recipe for performance. Not that it matters if you only have 10 megabytes per sec

Re: Suitability of HDFS for live file store

2012-10-15 Thread Ted Dunning
If you are going to mention commercial distros, you should include MapR as well. Hadoop compatible, very scalable and handles very large numbers of files in a Posix-ish environment. On Mon, Oct 15, 2012 at 1:35 PM, Brian Bockelman wrote: > Hi, > > We use HDFS to process data for the LHC - somewh

Re: Spindle per Cores

2012-10-12 Thread Ted Dunning
I think that this rule of thumb is to prevent people configuring 2 disk clusters with 16 cores or 48 disk machines with 4 cores. Both configurations could make sense in narrow applications, but both would most probably be sub-optimal. Within narrow bands, I doubt you will see huge changes. I lik

Re: Spindle per Cores

2012-10-12 Thread Ted Dunning
It depends on your distribution. Some distributions are more efficient at driving spindles than others. Ratios as high as 2 spindles per core are sometimes quite reasonable. On Fri, Oct 12, 2012 at 10:46 AM, Patai Sangbutsarakum < silvianhad...@gmail.com> wrote: > I have read around about the h

Re: Logistic regression package on Hadoop

2012-10-12 Thread Ted Dunning
Harsh, Thanks for the plug. Rajesh has been talking to us. On Fri, Oct 12, 2012 at 8:36 AM, Harsh J wrote: > Hi Rajesh, > > Please head over to the Apache Mahout project. See > https://cwiki.apache.org/MAHOUT/logistic-regression.html > > Apache Mahout is homed at http://mahout.apache.org and w
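
For reference, Mahout's SGD-based logistic regression of that era runs sequentially on one machine and looks roughly like this (feature encoding is omitted and the method names are from memory, so check them against the Mahout javadoc):

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    public class SgdSketch {
        public static void main(String[] args) {
            // 2 categories, 3 features, L1 prior
            OnlineLogisticRegression learner = new OnlineLogisticRegression(2, 3, new L1());

            Vector x = new DenseVector(new double[]{1.0, 0.5, -0.2});
            learner.train(1, x);                   // one labelled example, label in {0, 1}
            double p = learner.classifyScalar(x);  // probability of category 1
            System.out.println(p);
        }
    }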

Re: Hadoop/Lucene + Solr architecture suggestions?

2012-10-11 Thread Ted Dunning
hough a combiner in the final m/r pass would really speed > up the hadoop shuffle. > > Lance > > From: "Ted Dunning" > To: user@hadoop.apache.org > Sent: Wednesday, October 10, 2012 11:13:36 PM > Subject: Re: Hadoop/Lucene + Solr architecture suggesti

Re: Why they recommend this (CPU) ?

2012-10-11 Thread Ted Dunning
a >> category in some way where this sweet spot for faster cores occurs? >> >> Russell Jurney http://datasyndrome.com >> >> On Oct 11, 2012, at 11:39 AM, Ted Dunning wrote: >> >> You should measure your workload. Your experience will vary >> dra

Re: Why they recommend this (CPU) ?

2012-10-11 Thread Ted Dunning
You should measure your workload. Your experience will vary dramatically with different computations. On Thu, Oct 11, 2012 at 10:56 AM, Russell Jurney wrote: > Anyone got data on this? This is interesting, and somewhat > counter-intuitive. > > Russell Jurney http://datasyndrome.com > > On Oct 11

Re: Hadoop/Lucene + Solr architecture suggestions?

2012-10-10 Thread Ted Dunning
loud cluster. This way the index files are not >> managed by Hadoop. >> >> Hi Lance, >> I'm curious if you've gotten that to work with a decent-sized (e.g. > >> 250 node) cluster? Even a trivial cluster seems to crush SolrCloud >> from a few

Re: Hadoop/Lucene + Solr architecture suggestions?

2012-10-10 Thread Ted Dunning
I prefer to create indexes in the reducer personally. Also you can avoid the copies if you use an advanced Hadoop-derived distro. Email me off list for details. Sent from my iPhone On Oct 9, 2012, at 7:47 PM, Mark Kerzner wrote: > Hi, > > if I create a Lucene index in each mapper, locally,

Re: How to change topology

2012-10-09 Thread Ted Dunning
On Tue, Oct 9, 2012 at 12:17 PM, Steve Loughran wrote: > > > On 9 October 2012 16:51, Shinichi Yamashita wrote: > >> Hi Steve, >> >> Thank you for your reply. >> >> >> > no, it's the Namenode and JobTracker that needs to be restarted; >> > they are the bits that care where the boxes are. >> >> I

Re: Cumulative value using mapreduce

2012-10-04 Thread Ted Dunning
The answer is really the same. Your problem is just using a goofy representation for negative numbers (after all, negative numbers are a relatively new concept in accounting). You still need to use the account number as the key and the date as a sort key. Many financial institutions also process
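
The reducer side of that design looks roughly like this, assuming a secondary sort has already arranged each account's transactions in date order (class name and value layout invented for illustration; the composite-key plumbing that makes the secondary sort work is omitted):

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Key = account number; values arrive sorted by date via secondary sort.
    // Emits one record per transaction carrying the running balance.
    public class RunningBalanceReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text account, Iterable<DoubleWritable> amounts, Context context)
                throws IOException, InterruptedException {
            double balance = 0.0;
            for (DoubleWritable amount : amounts) {
                balance += amount.get();                       // credits positive, debits negative
                context.write(account, new DoubleWritable(balance));
            }
        }
    }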

Re: Cumulative value using mapreduce

2012-10-04 Thread Ted Dunning
Bertrand is almost right. The only difference is that the original poster asked about cumulative sum. This can be done in the reducer exactly as Bertrand described except for two points that make it different from word count: a) you can't use a combiner, b) the output of the program is as large as t

Re: HADOOP in Production

2012-10-02 Thread Ted Dunning
On Tue, Oct 2, 2012 at 7:05 PM, Hank Cohen wrote: > There is an important difference between real time and real fast > > Real time means that system response must meet a fixed schedule. > Real fast just means sooner is better. > Good thought, but real-time can also include a fixed schedule and a

Re: splitting jobtracker and namenode

2012-09-26 Thread Ted Dunning
Why are you changing the TTL on DNS if you aren't moving the name? If you are just changing the config to a new name, then caching won't matter. On Wed, Sep 26, 2012 at 1:46 PM, Patai Sangbutsarakum < silvianhad...@gmail.com> wrote: > Hi Hadoopers, > > My production Hadoop 0.20.2 cluster has bee

Re: why hadoop does not provide a round robin partitioner

2012-09-20 Thread Ted Dunning
The simplest solution for the situation as stated is to use an identity hash function. Of course, you can't split things any finer than the number of keys with this approach. If you can process different time periods independently, you may be able to add a small number of bits to your key to get
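
An "identity hash" here can be as simple as using the numeric key itself as the partition index (a sketch assuming an IntWritable key; class name invented):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Spreads keys round-robin style: key 0 -> reducer 0, key 1 -> reducer 1, and so on.
    public class IdentityPartitioner extends Partitioner<IntWritable, Text> {
        @Override
        public int getPartition(IntWritable key, Text value, int numPartitions) {
            return (key.get() & Integer.MAX_VALUE) % numPartitions;
        }
    }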

Re: IBM big insights distribution

2012-09-20 Thread Ted Dunning
On Thu, Sep 20, 2012 at 5:24 AM, Michael Segel wrote: > Why is it that when anyone asks a question about IBM Tom wants to take it > off line? > > Not very social. > Well, given the mesquite broiling that anyone gets if they even mention any product competitive with certain Hadoop distributions,

Re: best way to join?

2012-09-03 Thread Ted Dunning
On Sun, Sep 2, 2012 at 12:26 PM, dexter morgan wrote: > ... Either way, any clustering process requires calculating the distance > of all points (not between all the points, but of all of them to some > relative point). Because i'll need a clustering MR job, ill probably use > it, despite as you s

Re: best way to join?

2012-08-31 Thread Ted Dunning
the 5000th nearest points to [300,200]. > > In the end my goal is to pre-process (as i wrote at the begining) this > list of N nearest points for every point in the file. Where N is a > parameter given to the job. Let's say 10 points. That's it. > No calculation after-wards,

Re: best way to join?

2012-08-30 Thread Ted Dunning
; > and calculating the distance as i go > > On Tue, Aug 28, 2012 at 11:07 PM, Ted Dunning wrote: > >> I don't mean that. >> >> I mean that a k-means clustering with pretty large clusters is a useful >> auxiliary data structure for finding nearest neighbors. Th

Re: How to unsubscribe (was Re: unsubscribe)

2012-08-29 Thread Ted Dunning
g and enjoy!!! > till all users of hadoop start blocking my email ID and organizers of > hadoop also block me specially io their mailing list. > > ** ** > > ** ** > > ** ** > > user-unsubscr...@hadoop.apache.org > > ** ** > > *From:* Ted Dunning [mailto:tdunn...@maprtech.com] > *Sent:

Re: How to unsubscribe (was Re: unsubscribe)

2012-08-29 Thread Ted Dunning
sathyavageeswaran wrote: > Of course have sent emails to all permutations and combinations of emails > listed with appropriate subject matter. > > ** ** > > *From:* Ted Dunning [mailto:tdunn...@maprtech.com] > *Sent:* 30 August 2012 10:12 > *To:* user@hadoop.apach

Re: How to unsubscribe (was Re: unsubscribe)

2012-08-29 Thread Ted Dunning
That was a stupid joke. It wasn't real advice. Have you sent email to the specific email address listed? On Thu, Aug 30, 2012 at 12:35 AM, sathyavageeswaran wrote: > I have tried every trick to get self unsubscribed. Yesterday I got a mail > saying you can't unsubscribe once subscribed. > > --

Re: best way to join?

2012-08-28 Thread Ted Dunning
sk with out running mahout / knn > algo? just by calculating distance between points? > join of a file with it self. > > Thanks > > On Tue, Aug 28, 2012 at 6:32 PM, Ted Dunning wrote: > >> >> >> On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan >> wrote:

Re: best way to join?

2012-08-28 Thread Ted Dunning
On Tue, Aug 28, 2012 at 9:48 AM, dexter morgan wrote: > > I understand your solution ( i think) , didn't think of that, in that > particular way. > I think that lets say i have 1M data-points, and running knn , that the > k=1M and n=10 (each point is a cluster that requires up to 10 points) > is a

Re: best way to join?

2012-08-27 Thread Ted Dunning
Mahout is getting some very fast knn code in version 0.8. The basic work flow is that you would first do a large-scale clustering of the data. Then you would make a second pass using the clustering to facilitate fast search for nearby points. The clustering will require two map-reduce jobs, one

Re: Can Hadoop replace the use of MQ b/w processes?

2012-08-19 Thread Ted Dunning
There is another much more active fork of Azkaban. See https://github.com/rbpark/azkaban On Sun, Aug 19, 2012 at 6:57 PM, Lance Norskog wrote: > Cool. I'm on the sidelines of a project trying to use Oozie in a large > Hadoop-ecology app. Oozie is the one thing marked 'to be replaced'. > > On

Re: medical diagnosis project

2012-08-16 Thread Ted Dunning
It would be more accurate to say that the participants who help people on this list are volunteers. There is absolutely no constraint on the topic material relative to whether it involves a profit motive. Topicality matters, but not profitability. Many of the participants are engaged in for-prof

Re: Hadoop hardware failure recovery

2012-08-10 Thread Ted Dunning
Hadoop's file system was (mostly) copied from the concepts of Google's old file system. The original paper is probably the best way to learn about that. http://research.google.com/archive/gfs.html On Fri, Aug 10, 2012 at 11:38 AM, Aji Janis wrote: > I am very new to Hadoop. I am considering