To add to what Bobby said, you can get a file's block locations with fs.getFileBlockLocations() if you want to decide where to read from based on locality.
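
Something like this (a rough sketch from memory, not compiled; the path is a placeholder, same as in Bobby's snippet below):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Path p = new Path("URI OF FILE TO OPEN");
FileSystem fs = p.getFileSystem(new Configuration());
FileStatus stat = fs.getFileStatus(p);

// One BlockLocation per block, spanning the whole file.
BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
for (BlockLocation block : blocks) {
  // Each block reports the datanodes that hold a replica of it.
  // You could tally these and run the query on the host that
  // holds the largest share of the file.
  for (String host : block.getHosts()) {
    System.out.println(block.getOffset() + " -> " + host);
  }
}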
-Joey

On Mon, Jul 25, 2011 at 3:00 PM, Robert Evans <ev...@yahoo-inc.com> wrote:
> Sofia,
>
> You can access any HDFS file from a normal Java application so long as
> your classpath and some configuration is set up correctly. That is all
> the hadoop jar command does: it is a shell script that sets up the
> environment for Java to work with Hadoop. Look at the example for the
> Tool class:
>
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/Tool.html
>
> If you delete the JobConf stuff you can then just talk to the FileSystem
> by doing the following:
>
> Configuration conf = new Configuration();
> Path p = new Path("URI OF FILE TO OPEN");
> FileSystem fs = p.getFileSystem(conf);
> InputStream in = fs.open(p);
>
> Now you can use in to read your data. Just be sure to close it when you
> are done.
>
> --Bobby Evans
>
>
> On 7/25/11 4:40 PM, "Sofia Georgiakaki" <geosofie_...@yahoo.com> wrote:
>
> Good evening,
>
> I have built an R-tree on HDFS in order to improve the query performance
> of high-selectivity spatial queries.
> The R-tree is composed of a number of HDFS files (each one created by one
> reducer, so the number of files equals the number of reducers), where
> each file is a subtree of the root of the R-tree.
> I am investigating how to use the R-tree efficiently with respect to the
> locality of each file on HDFS (data placement).
>
> I would like to ask whether it is possible to read a file that is on HDFS
> from a Java application (not MapReduce).
> In case this is not possible (as I believe), I should either download the
> files to the local filesystem (which is not a solution, since the files
> could be very large) or run the queries using Hadoop.
> To maximise the gain, I should probably process a batch of queries during
> each job, and run each query on a node that is "near" the files involved
> in handling that specific query.
>
> Can I find the node where each file is located (or at least most of its
> blocks), and run on that node a reducer that handles these queries? Could
> the function DFSClient.getBlockLocations() help?
>
> Thank you in advance,
> Sofia

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434