Re: Finding small subset in very large dataset

2009-02-18 Thread Thibaut_

Hi,

The Bloom filter solution works great, but I still have to copy the data
around sometimes.

I'm still wondering whether I can reduce the data associated with each key to a
reference or something similarly small (the roughly 100 KB values are quite
big), which I could then use to fetch the actual data in the reduce step.
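
Roughly, what I have in mind is something like the following (only an untested
sketch: the "path <TAB> offset <TAB> length" reference format and all class
names are made up, and the offsets would have to be recorded when the big value
file is written):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sketch: the map output values are small "path<TAB>offset<TAB>length"
// references instead of the real ~100 KB payload; the reducer resolves each
// reference against the big value file on the DFS.
public class FetchByReferenceReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, BytesWritable> {

  private FileSystem fs;

  public void configure(JobConf job) {
    try {
      fs = FileSystem.get(job);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void reduce(Text key, Iterator<Text> refs,
                     OutputCollector<Text, BytesWritable> out, Reporter reporter)
      throws IOException {
    while (refs.hasNext()) {
      String[] parts = refs.next().toString().split("\t");
      Path file = new Path(parts[0]);
      long offset = Long.parseLong(parts[1]);
      int length = Integer.parseInt(parts[2]);

      byte[] buf = new byte[length];
      FSDataInputStream in = fs.open(file);   // could be cached per path
      try {
        in.seek(offset);
        in.readFully(buf, 0, length);
      } finally {
        in.close();
      }
      out.collect(key, new BytesWritable(buf));
    }
  }
}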

In the past I used HBase to store the associated data (but unfortunately
HBase proved to be very unreliable in my case). I will probably also start
compressing the data in the value store, which should speed up sorting (as
the data there is probably uncompressed).
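
Concretely, the compression part would be something like this (sketch only,
with the 0.19-era JobConf settings as far as I understand them, and assuming
the value store is written as SequenceFiles):

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class CompressionSettings {
  // Sketch: compress the intermediate map output (this is what should help the
  // sort/shuffle of the big values) and block-compress the SequenceFile store.
  public static void apply(JobConf conf) {
    conf.setCompressMapOutput(true);
    conf.setMapOutputCompressorClass(DefaultCodec.class);

    FileOutputFormat.setCompressOutput(conf, true);
    FileOutputFormat.setOutputCompressorClass(conf, DefaultCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(
        conf, SequenceFile.CompressionType.BLOCK);
  }
}
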
Is there something else I could do to speed this process up?

Thanks,
Thibaut
-- 
View this message in context: 
http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p22081608.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Finding small subset in very large dataset

2009-02-18 Thread Thibaut_

Hi Miles,

I'm not following you.
If I'm saving an associated hash or bit vector, how can I then quickly
access the elements afterwards (the file with the data might be 100 GB in
size and sits on the DFS)?

I could also directly save the offset of the data in the data file as the
reference, and then have each reducer read that big file only once. As all
the keys are sorted, I can get all the needed values in one big sequential
read (skipping the entries I don't need).
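
Something like the following is what I mean by one big read step (untested
sketch; it assumes I know the offset and length of each wanted record and that
the offsets arrive already sorted in ascending order, which they are if the
big file was written in key order):

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: fetch all wanted values from the big file in a single forward pass,
// seeking past the entries that are not needed.
public class SinglePassReader {

  public static class Ref {
    public final long offset;
    public final int length;
    public Ref(long offset, int length) { this.offset = offset; this.length = length; }
  }

  public interface ValueHandler {
    void handle(Ref ref, byte[] value);
  }

  public static void readAll(Configuration conf, Path bigFile,
                             List<Ref> sortedRefs, ValueHandler handler)
      throws IOException {
    FileSystem fs = bigFile.getFileSystem(conf);
    FSDataInputStream in = fs.open(bigFile);
    try {
      for (Ref ref : sortedRefs) {
        in.seek(ref.offset);                 // forward-only seek, skips unwanted entries
        byte[] buf = new byte[ref.length];
        in.readFully(buf, 0, buf.length);    // read exactly one record
        handler.handle(ref, buf);
      }
    } finally {
      in.close();
    }
  }
}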


Thibaut



Miles Osborne wrote:
 
 just re-represent the associated data as a bit vector and set of hash
 functions.  you then just copy this around, rather than the raw items
 themselves.
 
 Miles
 
 2009/2/18 Thibaut_ tbr...@blue.lu:

 Hi,

 The Bloom filter solution works great, but I still have to copy the data
 around sometimes.

 I'm still wondering whether I can reduce the data associated with each key
 to a reference or something similarly small (the roughly 100 KB values are
 quite big), which I could then use to fetch the actual data in the reduce
 step.

 In the past I used HBase to store the associated data (but unfortunately
 HBase proved to be very unreliable in my case). I will probably also start
 compressing the data in the value store, which should speed up sorting (as
 the data there is probably uncompressed).
 Is there something else I could do to speed this process up?

 Thanks,
 Thibaut
 --
 View this message in context:
 http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p22081608.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.


 
 
 
 -- 
 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.
 
 

-- 
View this message in context: 
http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p22082598.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: AlredyBeingCreatedExceptions after upgrade to 0.19.0

2009-02-17 Thread Thibaut_

Hello Rasi,

https://issues.apache.org/jira/browse/HADOOP-5268 is my bug report.

Thibaut

-- 
View this message in context: 
http://www.nabble.com/AlredyBeingCreatedExceptions-after-upgrade-to-0.19.0-tp21631077p22060926.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: AlredyBeingCreatedExceptions after upgrade to 0.19.0

2009-02-16 Thread Thibaut_

I have the same problem.

Is there any solution to this?

Thibaut


-- 
View this message in context: 
http://www.nabble.com/AlredyBeingCreatedExceptions-after-upgrade-to-0.19.0-tp21631077p22043484.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Finding small subset in very large dataset

2009-02-12 Thread Thibaut_

Thanks,

I didn't think about the bloom filter variant. That's the solution I was
looking for :-)
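
For anyone finding this thread later, the way I understand the suggestion is:
build a small bit-vector filter over the keys of A, ship only that filter to
the job over B, and drop every record of B whose key cannot be in A. A minimal
hand-rolled sketch of such a filter (untested, not taken from any particular
library):

import java.util.Arrays;
import java.util.BitSet;

// Minimal Bloom filter sketch: k hash positions in an m-bit vector. The
// membership test can return false positives but never false negatives, so it
// is safe for discarding keys of B that cannot possibly be in A.
public class SimpleBloomFilter {

  private final BitSet bits;
  private final int numBits;
  private final int numHashes;

  public SimpleBloomFilter(int numBits, int numHashes) {
    this.bits = new BitSet(numBits);
    this.numBits = numBits;
    this.numHashes = numHashes;
  }

  public void add(byte[] key) {
    for (int i = 0; i < numHashes; i++) {
      bits.set(indexFor(key, i));
    }
  }

  public boolean mightContain(byte[] key) {
    for (int i = 0; i < numHashes; i++) {
      if (!bits.get(indexFor(key, i))) {
        return false;   // definitely not in A
      }
    }
    return true;        // probably in A (may be a false positive)
  }

  // Derive the i-th position from two base hashes (double hashing).
  private int indexFor(byte[] key, int i) {
    int h1 = Arrays.hashCode(key);
    int h2 = (h1 >>> 16) | (h1 << 16);
    return Math.abs((h1 + i * h2) % numBits);
  }
}

Each mapper over B would load such a filter (for example via the
DistributedCache), call mightContain() on the key and emit the record only
when it returns true; the few false positives can then be removed exactly in
the reduce step.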

Thibaut
-- 
View this message in context: 
http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p21977132.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Finding small subset in very large dataset

2009-02-11 Thread Thibaut_

Hi,

Let's say the smaller subset is called A. It is a relatively small collection,
up to about 100 000 entries (could also be only 100), with nearly no payload as
the value. Collection B is a big collection with around 10 000 000 entries
(each key of A also exists in collection B), where the value for each key is
relatively big (around 100 KB).

For all the keys in A, I need to get the corresponding value from B and
collect it in the output.


- I can do this by reading in both files and, in the reduce step, doing my
computations and collecting only the keys that appear in both A and B. The map
phase however will take very long, as all the key/value pairs of collection B
need to be sorted at the end of the map phase (and each key's value is around
100 KB), which is overkill if A is very small.

What I would need is a way to somehow compute the intersection first (a mapper
over the keys only, then a reduce function based only on the keys and not the
corresponding values, which collects the keys I want to keep), and then to run
over the map input again, filtering the output collector or the input based on
the results of that reduce phase.

Or is there another, faster way? Collection A could be so big that it doesn't
fit into memory. I could split collection A up into multiple smaller
collections, but that would make things more complicated, so I want to avoid
that route. (This is similar to the approach described above, just done
manually.)
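
To make that first step a bit more concrete, the keys-only intersection job I
have in mind would look roughly like this (completely untested sketch against
the old mapred API; the input types, the tagging via the input path and all
class names are just assumptions):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Pass 1 of the two-pass idea: shuffle only the keys plus a one-character tag,
// never the ~100 KB values, and keep the keys that occur in both A and B.
public class IntersectKeys {

  public static class TagMapper extends MapReduceBase
      implements Mapper<Text, BytesWritable, Text, Text> {

    private Text tag;

    public void configure(JobConf job) {
      // "map.input.file" is set by the framework to the path of the current
      // split; tag the record by the collection it comes from.
      String input = job.get("map.input.file", "");
      tag = new Text(input.contains("/collectionA/") ? "A" : "B");
    }

    public void map(Text key, BytesWritable ignoredValue,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      out.collect(key, tag);   // the big value is dropped here
    }
  }

  public static class IntersectReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, NullWritable> {

    public void reduce(Text key, Iterator<Text> tags,
                       OutputCollector<Text, NullWritable> out, Reporter reporter)
        throws IOException {
      boolean inA = false, inB = false;
      while (tags.hasNext()) {
        String t = tags.next().toString();
        inA |= "A".equals(t);
        inB |= "B".equals(t);
      }
      if (inA && inB) {
        out.collect(key, NullWritable.get());   // key belongs to the wanted subset
      }
    }
  }
}

The second pass would then read B again and collect only the records whose key
appears in this (hopefully much smaller) intersection output.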

Thanks,
Thibaut
-- 
View this message in context: 
http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p21964853.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Version Mismatch when accessing hdfs through a nonhadoop java application?

2008-07-16 Thread Thibaut_



Jason Venner-2 wrote:
 
 When you compile from svn, the svn state number becomes part of the 
 required version for hdfs - the last time I looked at it was 0.15.3 but 
 it may still be happening.
 
 
Hi Jason,

Client and server are using the same library file (I checked it again:
hadoop-0.17.1-core.jar), so this shouldn't be the problem, as both should be
using it. I also had the same problem with earlier versions.


This is the startup message of the datanode

/
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = bluelu-PC/192.168.1.130
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.17.1
STARTUP_MSG:   build =
http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.17 -r 669344;
compiled by 'hadoopqa' on Thu Jun 19 01:18:25 UTC 2008


Thibaut
-- 
View this message in context: 
http://www.nabble.com/Version-Mismatch-when-accessing-hdfs-through-a-nonhadoop-java-application--tp18392343p18482013.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Version Mismatch when accessing hdfs through a nonhadoop java application?

2008-07-15 Thread Thibaut_

Hi,

It's pretty clear that the two versions differ. I just can't make out a reason
for it, except that maybe the data transfer version of the build I'm running is
higher than the one my client uses (and I triple-checked that I always use the
same Hadoop version!).

Unfortunately, compiling Hadoop fails with an error on my machine (it must be
Windows related), so I have difficulties building a custom hadoop-core to see
which version each side actually reports.

Also, I'm unable to post a bug report; I always get redirected to the list
page. It would be very helpful if someone else could look into it, or at
least confirm the bug. The code is all in my first email.

Thanks,
Thibaut



Shengkai Zhu wrote:
 
 I've checked the code in DataNode.java, exactly where you get the error:
 
 ...
 DataInputStream in = null;
 in = new DataInputStream(
     new BufferedInputStream(s.getInputStream(), BUFFER_SIZE));
 short version = in.readShort();
 if ( version != DATA_TRANFER_VERSION ) {
   throw new IOException( "Version Mismatch" );
 }
 ...
 
 May be useful for you.
 
 On 7/11/08, Thibaut_ [EMAIL PROTECTED] wrote:


 Hi, I'm trying to access the HDFS of my Hadoop cluster from a non-Hadoop
 application. Hadoop 0.17.1 is running on standard ports.

 This is the code I use:

 FileSystem fileSystem = null;
 String hdfsurl = "hdfs://localhost:50010";
 fileSystem = new DistributedFileSystem();

 try {
     fileSystem.initialize(new URI(hdfsurl), new Configuration());
 } catch (Exception e) {
     e.printStackTrace();
     System.out.println("init error:");
     System.exit(1);
 }


 which fails with the exception:


 java.net.SocketTimeoutException: timed out waiting for rpc response
at org.apache.hadoop.ipc.Client.call(Client.java:559)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
at org.apache.hadoop.dfs.$Proxy0.getProtocolVersion(Unknown
 Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:313)
at
 org.apache.hadoop.dfs.DFSClient.createRPCNamenode(DFSClient.java:102)
        at org.apache.hadoop.dfs.DFSClient.<init>(DFSClient.java:178)
at

 org.apache.hadoop.dfs.DistributedFileSystem.initialize(DistributedFileSystem.java:68)
at
 com.iterend.spider.conf.Config.getRemoteFileSystem(Config.java:72)
at tests.RemoteFileSystemTest.main(RemoteFileSystemTest.java:22)
 init error:


 The Hadoop logfile contains the following error:

 2008-07-10 23:05:47,840 INFO org.apache.hadoop.dfs.Storage: Storage
 directory \hadoop\tmp\hadoop-sshd_server\dfs\data is not formatted.
 2008-07-10 23:05:47,840 INFO org.apache.hadoop.dfs.Storage: Formatting
 ...
 2008-07-10 23:05:47,928 INFO org.apache.hadoop.dfs.DataNode: Registered
 FSDatasetStatusMBean
 2008-07-10 23:05:47,929 INFO org.apache.hadoop.dfs.DataNode: Opened
 server
 at 50010
 2008-07-10 23:05:47,933 INFO org.apache.hadoop.dfs.DataNode: Balancing
 bandwith is 1048576 bytes/s
 2008-07-10 23:05:48,128 INFO org.mortbay.util.Credential: Checking
 Resource
 aliases
 2008-07-10 23:05:48,344 INFO org.mortbay.http.HttpServer: Version
 Jetty/5.1.4
 2008-07-10 23:05:48,346 INFO org.mortbay.util.Container: Started
 HttpContext[/static,/static]
 2008-07-10 23:05:48,346 INFO org.mortbay.util.Container: Started
 HttpContext[/logs,/logs]
 2008-07-10 23:05:49,047 INFO org.mortbay.util.Container: Started
 [EMAIL PROTECTED]
 2008-07-10 23:05:49,244 INFO org.mortbay.util.Container: Started
 WebApplicationContext[/,/]
 2008-07-10 23:05:49,247 INFO org.mortbay.http.SocketListener: Started
 SocketListener on 0.0.0.0:50075
 2008-07-10 23:05:49,247 INFO org.mortbay.util.Container: Started
 [EMAIL PROTECTED]
 2008-07-10 23:05:49,257 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
 Initializing JVM Metrics with processName=DataNode, sessionId=null
 2008-07-10 23:05:49,535 INFO org.apache.hadoop.dfs.DataNode: New storage
 id
 DS-2117780943-192.168.1.130-50010-1215723949510 is assigned to data-node
 127.0.0.1:50010
 2008-07-10 23:05:49,586 INFO org.apache.hadoop.dfs.DataNode:
 127.0.0.1:50010In DataNode.run, data =
 FSDataset{dirpath='c:\hadoop\tmp\hadoop-sshd_server\dfs\data\current'}
 2008-07-10 23:05:49,586 INFO org.apache.hadoop.dfs.DataNode: using
 BLOCKREPORT_INTERVAL of 360msec Initial delay: 6msec
 2008-07-10 23:06:04,636 INFO org.apache.hadoop.dfs.DataNode: BlockReport
 of
 0 blocks got processed in 11 msecs
 2008-07-10 23:19:54,512 ERROR org.apache.hadoop.dfs.DataNode:
 127.0.0.1:50010:DataXceiver: java.io.IOException: Version Mismatch
at
 org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:961)
at java.lang.Thread.run(Thread.java:619)


 Any ideas how I can fix this? The Hadoop cluster and my application are
 both using the same hadoop jar!

 Thanks for your help,
 Thibaut
 --
 View this message in context:
 http://www.nabble.com/Version-Mismatch-when-accessing-hdfs-through-a-nonhadoop

Version Mismatch when accessing hdfs through a nonhadoop java application?

2008-07-10 Thread Thibaut_

Hi, I'm trying to access the HDFS of my Hadoop cluster from a non-Hadoop
application. Hadoop 0.17.1 is running on standard ports.

This is the code I use:

FileSystem fileSystem = null;
String hdfsurl = "hdfs://localhost:50010";
fileSystem = new DistributedFileSystem();

try {
    fileSystem.initialize(new URI(hdfsurl), new Configuration());
} catch (Exception e) {
    e.printStackTrace();
    System.out.println("init error:");
    System.exit(1);
}


which fails with the exception:


java.net.SocketTimeoutException: timed out waiting for rpc response
at org.apache.hadoop.ipc.Client.call(Client.java:559)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
at org.apache.hadoop.dfs.$Proxy0.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:313)
at org.apache.hadoop.dfs.DFSClient.createRPCNamenode(DFSClient.java:102)
at org.apache.hadoop.dfs.DFSClient.<init>(DFSClient.java:178)
at
org.apache.hadoop.dfs.DistributedFileSystem.initialize(DistributedFileSystem.java:68)
at com.iterend.spider.conf.Config.getRemoteFileSystem(Config.java:72)
at tests.RemoteFileSystemTest.main(RemoteFileSystemTest.java:22)
init error:


The Hadoop logfile contains the following error:

2008-07-10 23:05:47,840 INFO org.apache.hadoop.dfs.Storage: Storage
directory \hadoop\tmp\hadoop-sshd_server\dfs\data is not formatted.
2008-07-10 23:05:47,840 INFO org.apache.hadoop.dfs.Storage: Formatting ...
2008-07-10 23:05:47,928 INFO org.apache.hadoop.dfs.DataNode: Registered
FSDatasetStatusMBean
2008-07-10 23:05:47,929 INFO org.apache.hadoop.dfs.DataNode: Opened server
at 50010
2008-07-10 23:05:47,933 INFO org.apache.hadoop.dfs.DataNode: Balancing
bandwith is 1048576 bytes/s
2008-07-10 23:05:48,128 INFO org.mortbay.util.Credential: Checking Resource
aliases
2008-07-10 23:05:48,344 INFO org.mortbay.http.HttpServer: Version
Jetty/5.1.4
2008-07-10 23:05:48,346 INFO org.mortbay.util.Container: Started
HttpContext[/static,/static]
2008-07-10 23:05:48,346 INFO org.mortbay.util.Container: Started
HttpContext[/logs,/logs]
2008-07-10 23:05:49,047 INFO org.mortbay.util.Container: Started
[EMAIL PROTECTED]
2008-07-10 23:05:49,244 INFO org.mortbay.util.Container: Started
WebApplicationContext[/,/]
2008-07-10 23:05:49,247 INFO org.mortbay.http.SocketListener: Started
SocketListener on 0.0.0.0:50075
2008-07-10 23:05:49,247 INFO org.mortbay.util.Container: Started
[EMAIL PROTECTED]
2008-07-10 23:05:49,257 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=DataNode, sessionId=null
2008-07-10 23:05:49,535 INFO org.apache.hadoop.dfs.DataNode: New storage id
DS-2117780943-192.168.1.130-50010-1215723949510 is assigned to data-node
127.0.0.1:50010
2008-07-10 23:05:49,586 INFO org.apache.hadoop.dfs.DataNode:
127.0.0.1:50010In DataNode.run, data =
FSDataset{dirpath='c:\hadoop\tmp\hadoop-sshd_server\dfs\data\current'}
2008-07-10 23:05:49,586 INFO org.apache.hadoop.dfs.DataNode: using
BLOCKREPORT_INTERVAL of 360msec Initial delay: 6msec
2008-07-10 23:06:04,636 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
0 blocks got processed in 11 msecs
2008-07-10 23:19:54,512 ERROR org.apache.hadoop.dfs.DataNode:
127.0.0.1:50010:DataXceiver: java.io.IOException: Version Mismatch
at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:961)
at java.lang.Thread.run(Thread.java:619)


Any ideas how I can fix this? The Hadoop cluster and my application are both
using the same hadoop jar!
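
For completeness, the other way I could initialize the client is to go through
FileSystem.get() with fs.default.name set to the cluster's namenode address,
instead of constructing the DistributedFileSystem myself. Just a sketch, the
address below is a placeholder and would have to match my hadoop-site.xml:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Placeholder address: must match fs.default.name from the cluster's
    // hadoop-site.xml (the namenode address).
    conf.set("fs.default.name", "hdfs://localhost:9000");

    FileSystem fs = FileSystem.get(conf);
    System.out.println("Connected to: " + fs.getUri());

    Path p = new Path("/some/test/path");   // made-up path, just for a quick check
    System.out.println(p + " exists: " + fs.exists(p));

    fs.close();
  }
}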

Thanks for your help,
Thibaut
-- 
View this message in context: 
http://www.nabble.com/Version-Mismatch-when-accessing-hdfs-through-a-nonhadoop-java-application--tp18392343p18392343.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.