Re: Finding small subset in very large dataset
Hi,

The Bloom filter solution works great, but I still have to copy the data around sometimes. I'm still wondering if I can reduce the data associated with the keys to a reference or something else small (the 100 KB values are very big), which I could then use to fetch the data in the reduce step. In the past I used HBase to store the associated data (but unfortunately HBase proved to be very unreliable in my case).

I will probably also start to compress the data in the value store, which should also increase sorting speed (as the data there is probably uncompressed). Is there anything else I could do to speed this process up?

Thanks,
Thibaut
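For the compression idea above: intermediate (map output) compression can be switched on in the job configuration. A minimal sketch, assuming the old mapred API and a gzip codec (both are illustrative assumptions, not details from this thread):

    // Sketch: enable compression of the intermediate map output so that
    // less data is spilled and shuffled. Job wiring omitted; only the
    // compression-related settings are shown.
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.JobConf;

    public class CompressedJobSetup {
        public static JobConf configure(JobConf conf) {
            // Compress map output before it is spilled and shuffled.
            conf.setCompressMapOutput(true);
            conf.setMapOutputCompressorClass(GzipCodec.class);
            return conf;
        }
    }

Whether this speeds up the sort itself depends on how compressible the 100 KB values are; it mainly cuts disk and network I/O during the shuffle.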
Re: Finding small subset in very large dataset
Hi Miles,

I'm not following you. If I'm saving an associated hash or bit vector, how can I then quickly access the elements afterwards (the file with the data might be 100 GB and lives on the DFS)?

I could also directly save the offset of the data in the data file as the reference, and then have each reducer read that big file only once. As all the keys are sorted, I can fetch all the needed values in one big sequential read, skipping the entries I don't need.

Thibaut

Miles Osborne wrote:
> Just re-represent the associated data as a bit vector and a set of hash
> functions. You then copy this around, rather than the raw items themselves.
>
> Miles
>
> 2009/2/18 Thibaut_ <tbr...@blue.lu>:
>> [original message quoted above, snipped]
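A minimal sketch of the offset-as-reference idea described above. The class name and the fixed offset/length record layout are illustrative assumptions; the point is that each reducer opens the big file once and only ever seeks forward:

    // Sketch: the map side emits (key, offset, length) instead of the
    // 100 KB value; the reducer resolves offsets against the big file.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class OffsetFetcher {
        private final FSDataInputStream in;

        public OffsetFetcher(Configuration conf, Path dataFile) throws IOException {
            // One open stream per reducer, reused for all lookups.
            in = FileSystem.get(conf).open(dataFile);
        }

        // Offsets arrive in ascending order (the keys are sorted), so
        // each call seeks forward only, giving one big sequential pass
        // over the file while skipping unneeded entries.
        public byte[] fetch(long offset, int length) throws IOException {
            byte[] value = new byte[length];
            in.seek(offset);
            in.readFully(value);
            return value;
        }
    }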
Re: AlreadyBeingCreatedException after upgrade to 0.19.0
Hello Rasi,

https://issues.apache.org/jira/browse/HADOOP-5268 is my bug report.

Thibaut
Re: AlreadyBeingCreatedException after upgrade to 0.19.0
I have the same problem. Is there any solution to this?

Thibaut
Re: Finding small subset in very large dataset
Thanks, I hadn't thought about the Bloom filter variant. That's the solution I was looking for :-)

Thibaut
Finding small subset in very large dataset
Hi,

Let's say the smaller subset is named A. It is a relatively small collection of 100,000 entries (could also be only 100), with nearly no payload as the value. Collection B is a big collection with 10,000,000 entries (each key of A also exists in B), where the value for each key is relatively big (100 KB).

For all the keys in A, I need to get the corresponding value from B and collect it in the output.

I can do this by reading in both files and, in the reduce step, doing my computations and collecting only those keys which are in both A and B. The map phase, however, will take very long, as all the key/value pairs of collection B need to be sorted at the end of the map phase (and each key's value is 100 KB), which is overkill if A is very small.

What I would need is an option to somehow compute the intersection first (a mapper over the keys only, then a reduce function based only on the keys and not the corresponding values, which collects the keys I want to keep), and then run the map over the full input, filtering the output collector or the input based on the results of that reduce phase. Or is there another, faster way?

Collection A could be so big that it doesn't fit into memory. I could split collection A up into multiple smaller collections, but that would make things more complicated, so I want to avoid that route. (This is similar to the approach I described above, just done manually.)

Thanks,
Thibaut
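The replies earlier in this digest converge on a Bloom filter: build a compact filter over A's keys, ship it to every mapper, and drop B records whose keys cannot be in A before the sort ever sees their 100 KB values. A rough sketch of that approach; the sizing constants, class name, and the org.apache.hadoop.util.bloom classes (which ship with later Hadoop releases) are assumptions:

    // Sketch: a map-side filter over B, keyed by a Bloom filter of A.
    import org.apache.hadoop.util.bloom.BloomFilter;
    import org.apache.hadoop.util.bloom.Key;
    import org.apache.hadoop.util.hash.Hash;

    public class SubsetFilter {
        private final BloomFilter filter;

        public SubsetFilter(int expectedKeys) {
            // Sizing is a judgment call; roughly 10 bits per key with
            // 7 hash functions gives a false-positive rate around 1%.
            filter = new BloomFilter(expectedKeys * 10, 7, Hash.MURMUR_HASH);
        }

        // Called once per key of A while building the filter.
        public void addKeyOfA(byte[] key) {
            filter.add(new Key(key));
        }

        // Called in the mapper over B: only emit the (huge) value when
        // the key might be in A.
        public boolean mightBeInA(byte[] key) {
            return filter.membershipTest(new Key(key));
        }
    }

False positives only cost a few extra records in the shuffle, and the reducer can discard them, so the filter never has to be exact; that is what keeps it small enough to distribute even when A itself would not fit in memory.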
Re: Version Mismatch when accessing HDFS through a non-Hadoop Java application?
Jason Venner-2 wrote:
> When you compile from svn, the svn state number becomes part of the
> required version for hdfs - the last time I looked at it was 0.15.3,
> but it may still be happening.

Hi Jason,

Client and server are using the same library file (I checked it again: hadoop-0.17.1-core.jar), so this shouldn't be a problem (both should be using it). I also had the same problem with earlier versions.

This is the startup message of the datanode:

STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = bluelu-PC/192.168.1.130
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.17.1
STARTUP_MSG:   build = http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.17 -r 669344; compiled by 'hadoopqa' on Thu Jun 19 01:18:25 UTC 2008

Thibaut
Re: Version Mismatch when accessing HDFS through a non-Hadoop Java application?
Hi,

It's pretty clear that the two versions differ. I just can't make out any reason for it (except that maybe the transfer version of the build is higher than the one I use; I triple-checked that I always use the same Hadoop version!). Unfortunately, compiling Hadoop fails with an error on my machine (it must be Windows-related), so I have difficulties building a custom hadoop-core to see which version each side reports.

Also, I'm unable to post a bug report; I always get redirected to the list page. It would be very helpful if someone else could look into this, or at least confirm the bug. The code is all in my first email.

Thanks,
Thibaut

Shengkai Zhu wrote:
> I've checked the code in DataNode.java, exactly where you get the error:
>
>     DataInputStream in = null;
>     in = new DataInputStream(
>         new BufferedInputStream(s.getInputStream(), BUFFER_SIZE));
>     short version = in.readShort();
>     if (version != DATA_TRANFER_VERSION) {
>       throw new IOException("Version Mismatch");
>     }
>
> May be useful for you.
>
> On 7/11/08, Thibaut_ <[EMAIL PROTECTED]> wrote:
>> [original question and datanode log quoted below, snipped]
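One detail worth checking in the code quoted below: port 50010 in the URI is the datanode's data transfer port (the datanode log even shows "Opened server at 50010" and the Version Mismatch in DataXceiver), while a DistributedFileSystem client talks to the namenode's RPC address, i.e. the value of fs.default.name. A minimal sketch of pointing a client at the namenode instead; localhost:9000 is an assumed address, not one taken from this thread:

    // Sketch: initialize a DFS client against the namenode RPC address
    // (fs.default.name), not the datanode data-transfer port.
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // FileSystem.get resolves the right implementation from the
            // URI scheme, so DistributedFileSystem need not be
            // constructed directly.
            FileSystem fs = FileSystem.get(new URI("hdfs://localhost:9000"), conf);
            System.out.println(fs.exists(new Path("/")));
        }
    }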
Version Mismatch when accessing HDFS through a non-Hadoop Java application?
Hi,

I'm trying to access the HDFS of my Hadoop cluster from a non-Hadoop application. Hadoop 0.17.1 is running on standard ports.

This is the code I use:

    FileSystem fileSystem = null;
    String hdfsurl = "hdfs://localhost:50010";
    fileSystem = new DistributedFileSystem();
    try {
        fileSystem.initialize(new URI(hdfsurl), new Configuration());
    } catch (Exception e) {
        e.printStackTrace();
        System.out.println("init error:");
        System.exit(1);
    }

which fails with the exception:

java.net.SocketTimeoutException: timed out waiting for rpc response
    at org.apache.hadoop.ipc.Client.call(Client.java:559)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
    at org.apache.hadoop.dfs.$Proxy0.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:313)
    at org.apache.hadoop.dfs.DFSClient.createRPCNamenode(DFSClient.java:102)
    at org.apache.hadoop.dfs.DFSClient.<init>(DFSClient.java:178)
    at org.apache.hadoop.dfs.DistributedFileSystem.initialize(DistributedFileSystem.java:68)
    at com.iterend.spider.conf.Config.getRemoteFileSystem(Config.java:72)
    at tests.RemoteFileSystemTest.main(RemoteFileSystemTest.java:22)
init error:

The Hadoop logfile contains the following error:

2008-07-10 23:05:47,840 INFO org.apache.hadoop.dfs.Storage: Storage directory \hadoop\tmp\hadoop-sshd_server\dfs\data is not formatted.
2008-07-10 23:05:47,840 INFO org.apache.hadoop.dfs.Storage: Formatting ...
2008-07-10 23:05:47,928 INFO org.apache.hadoop.dfs.DataNode: Registered FSDatasetStatusMBean
2008-07-10 23:05:47,929 INFO org.apache.hadoop.dfs.DataNode: Opened server at 50010
2008-07-10 23:05:47,933 INFO org.apache.hadoop.dfs.DataNode: Balancing bandwith is 1048576 bytes/s
2008-07-10 23:05:48,128 INFO org.mortbay.util.Credential: Checking Resource aliases
2008-07-10 23:05:48,344 INFO org.mortbay.http.HttpServer: Version Jetty/5.1.4
2008-07-10 23:05:48,346 INFO org.mortbay.util.Container: Started HttpContext[/static,/static]
2008-07-10 23:05:48,346 INFO org.mortbay.util.Container: Started HttpContext[/logs,/logs]
2008-07-10 23:05:49,047 INFO org.mortbay.util.Container: Started [EMAIL PROTECTED]
2008-07-10 23:05:49,244 INFO org.mortbay.util.Container: Started WebApplicationContext[/,/]
2008-07-10 23:05:49,247 INFO org.mortbay.http.SocketListener: Started SocketListener on 0.0.0.0:50075
2008-07-10 23:05:49,247 INFO org.mortbay.util.Container: Started [EMAIL PROTECTED]
2008-07-10 23:05:49,257 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=DataNode, sessionId=null
2008-07-10 23:05:49,535 INFO org.apache.hadoop.dfs.DataNode: New storage id DS-2117780943-192.168.1.130-50010-1215723949510 is assigned to data-node 127.0.0.1:50010
2008-07-10 23:05:49,586 INFO org.apache.hadoop.dfs.DataNode: 127.0.0.1:50010 In DataNode.run, data = FSDataset{dirpath='c:\hadoop\tmp\hadoop-sshd_server\dfs\data\current'}
2008-07-10 23:05:49,586 INFO org.apache.hadoop.dfs.DataNode: using BLOCKREPORT_INTERVAL of 360msec Initial delay: 6msec
2008-07-10 23:06:04,636 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 0 blocks got processed in 11 msecs
2008-07-10 23:19:54,512 ERROR org.apache.hadoop.dfs.DataNode: 127.0.0.1:50010:DataXceiver: java.io.IOException: Version Mismatch
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:961)
    at java.lang.Thread.run(Thread.java:619)

Any ideas how I can fix this? The Hadoop cluster and my application are both using the same Hadoop jar!

Thanks for your help,
Thibaut