If you intend to use files on HDFS, I would recommend Parquet. It is a fast
columnar format, so a query only has to read the columns it actually needs.
I believe a Spark DataFrame will take care of saving all the columns to a
Parquet file, so you could extract the data from Cassandra via the Spark
connector and save it to Parquet.
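
As a rough sketch of that extract-and-save step in PySpark (not a tested
recipe): the keyspace, table, host and output path are placeholders, the
function names are made up for illustration, and it assumes the
spark-cassandra-connector package is available to Spark.

```python
def export_table_to_parquet(spark, keyspace, table, output_path):
    """Read one Cassandra table via the Spark connector and write it out as Parquet."""
    df = (
        spark.read
        .format("org.apache.spark.sql.cassandra")  # data source provided by the connector
        .options(keyspace=keyspace, table=table)
        .load()
    )
    # The DataFrame schema (all columns) is carried over into the Parquet output.
    df.write.mode("overwrite").parquet(output_path)
    return df


def main():
    # Imported here so the helper above can be read and tested without a Spark install.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("cassandra-to-parquet")
        .config("spark.cassandra.connection.host", "127.0.0.1")  # placeholder host
        .getOrCreate()
    )
    try:
        # Placeholder keyspace, table and HDFS path.
        export_table_to_parquet(spark, "my_keyspace", "my_table", "hdfs:///data/my_table")
    finally:
        spark.stop()
```

Calling main() from a script submitted to Spark would run one export. With
mode("overwrite") each run replaces the previous snapshot, so how often you
run it decides how stale the Parquet copy can get.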

Alternatively, you can query the Cassandra data directly from Spark, but it
won't be as fast as querying Parquet.
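
For comparison, a minimal sketch of the direct route in PySpark, again
assuming the spark-cassandra-connector package is available. The function
name and the example keyspace, table and query are hypothetical.

```python
def query_cassandra_directly(spark, keyspace, table, sql):
    """Expose a Cassandra table to Spark SQL and run a query against live data."""
    df = (
        spark.read
        .format("org.apache.spark.sql.cassandra")  # data source provided by the connector
        .options(keyspace=keyspace, table=table)
        .load()
    )
    # Register the table under its own name so the SQL text can refer to it.
    df.createOrReplaceTempView(table)
    return spark.sql(sql)
```

With an existing SparkSession this could be called as
query_cassandra_directly(spark, "my_keyspace", "events", "SELECT count(*) FROM events").
Nothing is copied to HDFS here, so the results are never stale, but every
query reads from Cassandra.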

The trade-off depends on how much data you save to Parquet and how often,
how many queries you run, what format you need, and whether you can tolerate
some stale data.


On Sun, Oct 23, 2016 at 7:18 PM, Welly Tambunan <if05...@gmail.com> wrote:

> Another thing is,
>
> Let's say that we already have structured data. Is the way to load it into
> HDFS to turn it into files?
>
> Cheers
>
> On Sun, Oct 23, 2016 at 6:18 PM, Welly Tambunan <if05...@gmail.com> wrote:
>
>> So basically you will store those files in HDFS and use Spark to process
>> them?
>>
>> On Sun, Oct 23, 2016 at 6:03 PM, Joaquin Alzola <
>> joaquin.alz...@lebara.com> wrote:
>>
>>>
>>>
>>> I think what Ali mentions is correct:
>>>
>>> If you need a lot of queries that require joins, or complex analytics of
>>> the kind that Cassandra isn't suited for, then HDFS / HBase may be better.
>>>
>>>
>>>
>>> We have files in which one line contains 500 fields (separated by pipes),
>>> and each of these fields is particularly important.
>>>
>>> Cassandra will not manage that since you will need 500 indexes. HDFS is
>>> the proper way.
>>>
>>>
>>>
>>>
>>>
>>> *From:* Welly Tambunan [mailto:if05...@gmail.com]
>>> *Sent:* 23 October 2016 10:19
>>> *To:* user@cassandra.apache.org
>>> *Subject:* Re: Hadoop vs Cassandra
>>>
>>>
>>>
>>> I like multi data centre resilience in Cassandra.
>>>
>>> I think that's a plus one for Cassandra.
>>>
>>> Ali, complex analytics can be done in Spark, right?
>>>
>>> On 23 Oct 2016 4:08 p.m., "Ali Akhtar" <ali.rac...@gmail.com> wrote:
>>>
>>> >
>>>
>>> > I would say it depends on your use case.
>>> >
>>> > If you need a lot of queries that require joins, or complex analytics
>>> > of the kind that Cassandra isn't suited for, then HDFS / HBase may be
>>> > better.
>>> >
>>> > If you can work with the Cassandra way of doing things (creating new
>>> > tables for each query you'll need to run, duplicating data, doing extra
>>> > writes for faster reads), then Cassandra should work for you. It is easier
>>> > to set up and do dev ops with, in my experience.
>>> >
>>> > On Sun, Oct 23, 2016 at 2:05 PM, Welly Tambunan <if05...@gmail.com>
>>> wrote:
>>>
>>> >>
>>>
>>> >> I mean HDFS and HBase.
>>> >>
>>> >> On Sun, Oct 23, 2016 at 4:00 PM, Ali Akhtar <ali.rac...@gmail.com>
>>> wrote:
>>>
>>> >>>
>>>
>>> >>> By Hadoop do you mean HDFS?
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Sun, Oct 23, 2016 at 1:56 PM, Welly Tambunan <if05...@gmail.com>
>>> wrote:
>>>
>>> >>>>
>>>
>>> >>>> Hi All,
>>> >>>>
>>> >>>> I read the following comparison between Hadoop and Cassandra. The
>>> >>>> conclusion seems to be that we use Hadoop for the data lake (cold
>>> >>>> data) and Cassandra for hot data (real-time data).
>>> >>>>
>>> >>>> http://www.datastax.com/nosql-databases/nosql-cassandra-and-hadoop
>>> >>>>
>>> >>>> My question is, can we just use Cassandra to rule them all?
>>> >>>>
>>> >>>> What we are trying to achieve is to minimize the moving parts in
>>> >>>> our system.
>>> >>>>
>>> >>>> Any response would be really appreciated.
>>> >>>>
>>> >>>>
>>> >>>> Cheers
>>> >>>>
>>> >>>> --
>>> >>>> Welly Tambunan
>>> >>>> Triplelands
>>> >>>>
>>> >>>> http://weltam.wordpress.com <http://weltam.wordpress.com>
>>> >>>> http://www.triplelands.com <http://www.triplelands.com/blog/>
>>> >>>
>>> >>>
>>> >>
>>> >>
>>> >>
>>> >
>>> >
>>>
>>
>>
>>
>>
>
>
>
>



-- 


Stefania Alborghetti

|+852 6114 9265| stefania.alborghe...@datastax.com
