Hi Ashvin:

We are using tools like Spark (and Hive for metadata) to process files in HDFS. We're interested in both the GemFire RDD and the GemFire HDFS integration as ways to access the data we have in GemFire using Spark, and potentially Drill or Impala.
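To make our use case a bit more concrete, below is a rough sketch of the kind of access we have in mind from Spark: scan the archived files directly, with no live cluster. The path layout and the BytesWritable key/value classes are guesses on our side, not the actual on-disk types; only the plain SequenceFile read is real Spark API.

// Rough sketch only: scan the sequence files the HDFS store writes, without
// a live Geode cluster. The path and the BytesWritable key/value classes are
// assumptions; the actual writables depend on the file format details.
import org.apache.hadoop.io.BytesWritable
import org.apache.spark.{SparkConf, SparkContext}

object HdfsStoreScan {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hdfs-store-scan"))

    // Each record's value would be a serialized Geode event
    // (metadata plus the user-provided value).
    val raw = sc.sequenceFile(
      "hdfs://namenode:8020/geode/myRegion/*",   // placeholder path
      classOf[BytesWritable], classOf[BytesWritable])

    // Deserializing the event bytes would need the Geode classes (or the
    // user-level API once it is published); here we only count records.
    println(s"archived records: ${raw.count()}")
    sc.stop()
  }
}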
Thanks,

-- Eric

On Tue, Jul 21, 2015 at 1:35 AM, Ashvin A <[email protected]> wrote:
> Hi Eric,
>
> Currently the HDFS store writes data in sequence file format and HFile
> format. Each value is a serialized event which contains metadata and the
> value provided by the user. The value can be deserialized using Geode
> classes. Each file can be deserialized independently and does not depend
> on a live Geode cluster. A user-level API to construct this data will be
> added soon (see GFInputFormat as an example).
>
> HDFS can be used as an archive by means of write-only regions. These
> regions do not follow the LSM-tree structure; the LSM structure is used
> for read-write regions.
>
> I am planning to create a JIRA and provide more details. Meanwhile, can
> you help us understand your use case? In your opinion, what could this
> interface look like? What about old versions of a key? Do you care about
> accessing HDFS files directly, or is the HDFS region interface better?
> Any other information that could be relevant to the HDFS region data
> access pattern would help.
>
> Thanks
> Ashvin
>
> On Mon, Jul 20, 2015 at 12:57 PM, Eric Pederson <[email protected]> wrote:
>
>> In the spec for HDFS integration it says that data events are archived
>> on HDFS for offline analysis. How do you do offline analysis? Is there
>> an API for the file format so third-party tools can read it? Or do you
>> go through an HDFS region?
>>
>> Also, just curious, are you using an LSM-tree to structure the data?
>>
>> Thanks,
>>
>> -- Eric
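Regarding what the interface could look like: something along the lines below would drop straight into our Spark jobs. GFInputFormatStub, GFKey, and PersistedEvent are placeholder names of our own, stubbed out only so the sketch compiles, standing in for whatever the user-level API ends up exposing; newAPIHadoopFile is the only existing Spark API used here.

import java.util.Collections

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.{InputFormat, InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder stand-ins, stubbed out only so the sketch compiles; the real
// input format and event classes would come from the future user-level API.
class GFKey
class PersistedEvent
class GFInputFormatStub extends InputFormat[GFKey, PersistedEvent] {
  override def getSplits(ctx: JobContext): java.util.List[InputSplit] =
    Collections.emptyList[InputSplit]()
  override def createRecordReader(split: InputSplit,
                                  ctx: TaskAttemptContext): RecordReader[GFKey, PersistedEvent] =
    throw new UnsupportedOperationException("placeholder only")
}

object GfInputFormatSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("gf-input-format-sketch"))

    // With a real input format this would yield one (key, event) pair per
    // archived entry, including old versions of a key if those are kept.
    val events = sc.newAPIHadoopFile(
      "hdfs://namenode:8020/geode/myRegion",   // placeholder path
      classOf[GFInputFormatStub], classOf[GFKey], classOf[PersistedEvent],
      new Configuration())

    events.take(10).foreach { case (_, event) => println(event) }
    sc.stop()
  }
}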
