Re: Storing large files for later processing through hadoop

2015-01-02 Thread Jacob Rhoden
If it's for auditing, I'd recommend pushing the files out somewhere reasonably
external. Amazon S3 works well for this type of thing, and you don't have to
worry too much about backups and the like.

__
Sent from iPhone

> On 3 Jan 2015, at 5:07 pm, Srinivasa T N  wrote:
> 
> Hi Wilm,
> The reason is that for auditing purposes, I want to store the original
> files also.
> 
> Regards,
> Seenu.
> 
>> On Fri, Jan 2, 2015 at 11:09 PM, Wilm Schumacher  
>> wrote:
>> Hi,
>> 
>> perhaps I totally misunderstood your problem, but why "bother" with
>> cassandra for storing in the first place?
>> 
>> If your MR for hadoop is only run once for each file (as you wrote
>> above), why not copy the data directly to hdfs, run your MR job and use
>> cassandra as sink?
>> 
>> As hdfs and yarn are more or less completely independent, you could
>> perhaps use the "master" as ResourceManager (yarn) AND as NameNode and
>> DataNode (hdfs), launch your MR job directly, and, as mentioned, use
>> Cassandra as the sink for the reduced data. This way you won't need
>> dedicated hardware, as you only need hdfs once: process the files and
>> delete them afterwards.
>> 
>> Best wishes,
>> 
>> Wilm
> 


Re: Storing large files for later processing through hadoop

2015-01-02 Thread Srinivasa T N
Hi Wilm,
   The reason is that for auditing purposes, I want to store the
original files also.

Regards,
Seenu.

On Fri, Jan 2, 2015 at 11:09 PM, Wilm Schumacher 
wrote:

> Hi,
>
> perhaps I totally misunderstood your problem, but why "bother" with
> cassandra for storing in the first place?
>
> If your MR for hadoop is only run once for each file (as you wrote
> above), why not copy the data directly to hdfs, run your MR job and use
> cassandra as sink?
>
> As hdfs and yarn are more or less completely independent, you could
> perhaps use the "master" as ResourceManager (yarn) AND as NameNode and
> DataNode (hdfs), launch your MR job directly, and, as mentioned, use
> Cassandra as the sink for the reduced data. This way you won't need
> dedicated hardware, as you only need hdfs once: process the files and
> delete them afterwards.
>
> Best wishes,
>
> Wilm
>


Re: STCS limitation with JBOD?

2015-01-02 Thread Robert Coli
On Fri, Jan 2, 2015 at 11:28 AM, Colin  wrote:

> Forcing a major compaction is usually a bad idea.  What is your reason for
> doing that?
>

I'd say "often" and not "usually". Lots of people have schemas where they
create way too much garbage, and major compaction can be a good response,
the docs' historic incoherent FUD notwithstanding.

=Rob


Re: is primary key( foo, bar) the same as primary key ( foo ) with a ‘set' of bars?

2015-01-02 Thread Robert Coli
On Fri, Jan 2, 2015 at 11:35 AM, Tyler Hobbs  wrote:

>
> This is not true (with one minor exception).  All operations on sets and
> maps require no reads.  The same is true for appends and prepends on lists,
> but delete and set operations on lists with (non-zero) indexes require the
> list to be read first.  However, the entire list does not need to be
> re-written to disk.
>

Thank you guys for the correction; a case where I am glad to be wrong. I
must have been thinking about the delete/set operations and have drawn an
erroneous inference. :)

=Rob


Re: Tombstones without DELETE

2015-01-02 Thread Tyler Hobbs
No worries!  They're a data type that was introduced in 1.2:
http://www.datastax.com/dev/blog/cql3_collections
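
For what it's worth, a minimal sketch with the DataStax Java driver of the
overwrite-vs-add difference described in this thread (the contact point,
keyspace, and table names are made up purely for illustration):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CollectionTombstoneSketch {
    public static void main(String[] args) {
        // Hypothetical single-node setup, for illustration only.
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
        session.execute("CREATE TABLE IF NOT EXISTS demo.events "
                + "(id int PRIMARY KEY, tags set<text>)");

        // Overwriting the whole collection: a tombstone covering the old set
        // contents is written before the new elements, even with no DELETE.
        session.execute("UPDATE demo.events SET tags = {'a', 'b'} WHERE id = 1");

        // Adding elements to the existing set: no tombstone is generated.
        session.execute("UPDATE demo.events SET tags = tags + {'c'} WHERE id = 1");

        cluster.close();
    }
}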

On Fri, Jan 2, 2015 at 12:07 PM, Nikolay Mihaylov  wrote:

> Hi Tyler,
>
> sorry for the very stupid question - what is a collection?
>
> Nick
>
> On Wed, Dec 31, 2014 at 6:27 PM, Tyler Hobbs  wrote:
>
>> Overwriting an entire collection also results in a tombstone being
>> inserted.
>>
>> On Wed, Dec 24, 2014 at 7:09 AM, Ryan Svihla 
>> wrote:
>>
>>> You should probably ask on the Cassandra user mailing list.
>>>
>>> However, TTL is the only other case I can think of.
>>>
>>> On Tue, Dec 23, 2014 at 1:36 PM, Davide D'Agostino 
>>> wrote:
>>>
 Hi there,

 Following this:
 https://groups.google.com/a/lists.datastax.com/forum/#!searchin/java-driver-user/tombstone/java-driver-user/cHE3OOSIXBU/moLXcif1zQwJ

 Under what conditions Cassandra generates a tombstone?

 Basically I have a not-even-big table in Cassandra (90M rows); in my code
 there is no delete, and I use prepared statements (binding all necessary
 values).

 I'm aware that a tombstone gets created when:

 1. You delete the row
 2. You set a column to null while previously it had a value
 3. When you use prepared statements and you don't bind all the values

 Anything else that I should be aware of?

 Thanks!

 To unsubscribe from this group and stop receiving emails from it, send
 an email to java-driver-user+unsubscr...@lists.datastax.com.

>>>
>>>
>>>
>>> --
>>>
>>> Ryan Svihla
>>>
>>> Solution Architect, DataStax
>>>
>>
>>
>> --
>> Tyler Hobbs
>> DataStax 
>>
>
>


-- 
Tyler Hobbs
DataStax 


Re: is primary key( foo, bar) the same as primary key ( foo ) with a ‘set' of bars?

2015-01-02 Thread Tyler Hobbs
On Fri, Jan 2, 2015 at 1:13 PM, Eric Stevens  wrote:

> > And also stored entirely for each UPDATE. Change one element,
> re-serialize the whole thing to disk.
>
> Is this true?  I thought updates (adds, removes, but not overwrites)
> affected just the indicated columns.  Isn't it just the reads that involve
> reading the entire collection?


This is not true (with one minor exception).  All operations on sets and
maps require no reads.  The same is true for appends and prepends on lists,
but delete and set operations on lists with (non-zero) indexes require the
list to be read first.  However, the entire list does not need to be
re-written to disk.
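
As a concrete illustration of the list cases above, a short DataStax Java
driver sketch (the keyspace and table are hypothetical, created here just so
the example stands on its own):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class ListOperationsSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
        session.execute("CREATE TABLE IF NOT EXISTS demo.readings "
                + "(id int PRIMARY KEY, samples list<int>)");

        // Appends and prepends: no internal read, same as set/map operations.
        session.execute("UPDATE demo.readings SET samples = samples + [42] WHERE id = 1");
        session.execute("UPDATE demo.readings SET samples = [7] + samples WHERE id = 1");

        // Set or delete by index: the list is read internally first to resolve
        // the index, but the whole list is not rewritten to disk.
        session.execute("UPDATE demo.readings SET samples[1] = 99 WHERE id = 1");
        session.execute("DELETE samples[0] FROM demo.readings WHERE id = 1");

        cluster.close();
    }
}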

-- 
Tyler Hobbs
DataStax 


Re: STCS limitation with JBOD?

2015-01-02 Thread Colin
Forcing a major compaction is usually a bad idea.  What is your reason for 
doing that?

--
Colin Clark 
+1-320-221-9531
 

> On Jan 2, 2015, at 1:17 PM, Dan Kinder  wrote:
> 
> Hi,
> 
> Forcing a major compaction (using nodetool compact) with STCS will result in 
> a single sstable (ignoring repair data). However this seems like it could be 
> a problem for large JBOD setups. For example if I have 12 disks, 1T each, 
> then it seems like on this node I cannot have one column family store more 
> than 1T worth of data (more or less), because all the data will end up in a 
> single sstable that can exist only on one disk. Is this accurate? The 
> compaction write path docs give a bit of hope that cassandra could split the 
> one final sstable across the disks, but I doubt it is able to and want to 
> confirm.
> 
> I imagine that RAID/LVM, using LCS, or multiple cassandra instances not in 
> JBOD mode could be solutions to this (with their own problems), but want to 
> verify that this actually is a problem.
> 
> -dan


STCS limitation with JBOD?

2015-01-02 Thread Dan Kinder
Hi,

Forcing a major compaction (using nodetool compact) with STCS will result in
a single sstable (ignoring repair data). However
this seems like it could be a problem for large JBOD setups. For example if
I have 12 disks, 1T each, then it seems like on this node I cannot have one
column family store more than 1T worth of data (more or less), because all
the data will end up in a single sstable that can exist only on one disk.
Is this accurate? The compaction write path docs give a bit of hope that
cassandra could split the one final sstable across the disks, but I doubt it
is able to and want to confirm.

I imagine that RAID/LVM, using LCS, or multiple cassandra instances not in
JBOD mode could be solutions to this (with their own problems), but want to
verify that this actually is a problem.

-dan


Re: is primary key( foo, bar) the same as primary key ( foo ) with a ‘set' of bars?

2015-01-02 Thread Eric Stevens
> And also stored entirely for each UPDATE. Change one element,
re-serialize the whole thing to disk.

Is this true?  I thought updates (adds, removes, but not overwrites)
affected just the indicated columns.  Isn't it just the reads that involve
reading the entire collection?

DS docs talk about reading whole collections, but I don't see anything
about having to overwrite the entire collection each time.  That would
indicate a read-then-write style operation, which would be an antipattern.

> When you query a table containing a collection, Cassandra retrieves the
collection in its entirety
http://www.datastax.com/documentation/cql/3.0/cql/cql_using/use_set_t.html



On Fri, Jan 2, 2015 at 11:48 AM, Robert Coli  wrote:

> On Thu, Jan 1, 2015 at 11:04 AM, DuyHai Doan  wrote:
>
>> 2) collections and maps are loaded entirely by Cassandra for each query,
>> whereas with clustering columns you can select a slice of columns
>>
>
> And also stored entirely for each UPDATE. Change one element, re-serialize
> the whole thing to disk.
>
> =Rob
>


Re: Best Time Series insert strategy

2015-01-02 Thread Robert Coli
On Tue, Dec 16, 2014 at 1:16 PM, Arne Claassen  wrote:

> 3) Go to consistency ANY.
>

Consistency level ANY should probably be renamed to NEVER and removed from
the software.

It is almost never the correct solution to any problem.

=Rob


Re: Number of SSTables grows after repair

2015-01-02 Thread Robert Coli
On Mon, Dec 15, 2014 at 1:51 AM, Michał Łowicki  wrote:

> We've noticed that number of SSTables grows radically after running
> *repair*. What we did today is to compact everything so for each node
> number of SStables < 10. After repair it jumped to ~1600 on each node. What
> is interesting is that size of many is very small. The smallest ones are
> ~60 bytes in size (http://paste.ofcode.org/6yyH2X52emPNrKdw3WXW3d)
>

This is semi-expected if using vnodes. There are various tickets open to
address aspects of this issue.


> Table information - http://paste.ofcode.org/32RijfxQkNeb9cx9GAAnM45
> We're using Cassandra 2.1.2.
>

https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/

=Rob


Re: is primary key( foo, bar) the same as primary key ( foo ) with a ‘set' of bars?

2015-01-02 Thread Robert Coli
On Thu, Jan 1, 2015 at 11:04 AM, DuyHai Doan  wrote:

> 2) collections and maps are loaded entirely by Cassandra for each query,
> whereas with clustering columns you can select a slice of columns
>

And also stored entirely for each UPDATE. Change one element, re-serialize
the whole thing to disk.

=Rob


sstable structure

2015-01-02 Thread Nikolay Mihaylov
Hi

for some time I have been trying to find the structure of an sstable. Is it
documented somewhere, or can anyone explain it to me?

I am speaking about the "hex dump" of the bytes stored on the disk.

Nick.


Re: Tombstones without DELETE

2015-01-02 Thread Nikolay Mihaylov
Hi Tyler,

sorry for the very stupid question - what is a collection?

Nick

On Wed, Dec 31, 2014 at 6:27 PM, Tyler Hobbs  wrote:

> Overwriting an entire collection also results in a tombstone being
> inserted.
>
> On Wed, Dec 24, 2014 at 7:09 AM, Ryan Svihla  wrote:
>
>> You should probably ask on the Cassandra user mailing list.
>>
>> However, TTL is the only other case I can think of.
>>
>> On Tue, Dec 23, 2014 at 1:36 PM, Davide D'Agostino 
>> wrote:
>>
>>> Hi there,
>>>
>>> Following this:
>>> https://groups.google.com/a/lists.datastax.com/forum/#!searchin/java-driver-user/tombstone/java-driver-user/cHE3OOSIXBU/moLXcif1zQwJ
>>>
>>> Under what conditions Cassandra generates a tombstone?
>>>
>>> Basically I have a not-even-big table in Cassandra (90M rows); in my code
>>> there is no delete, and I use prepared statements (binding all necessary
>>> values).
>>>
>>> I'm aware that a tombstone gets created when:
>>>
>>> 1. You delete the row
>>> 2. You set a column to null while previously it had a value
>>> 3. When you use prepared statements and you don't bind all the values
>>>
>>> Anything else that I should be aware of?
>>>
>>> Thanks!
>>>
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to java-driver-user+unsubscr...@lists.datastax.com.
>>>
>>
>>
>>
>> --
>>
>> Ryan Svihla
>>
>> Solution Architect, DataStax
>>
>>
>
>
> --
> Tyler Hobbs
> DataStax 
>


Re: Storing large files for later processing through hadoop

2015-01-02 Thread Wilm Schumacher
Hi,

perhaps I totally misunderstood your problem, but why "bother" with
cassandra for storing in the first place?

If your MR for hadoop is only run once for each file (as you wrote
above), why not copy the data directly to hdfs, run your MR job and use
cassandra as sink?

As hdfs and yarn are more or less completely independent you could
perhaps use the "master" as ResourceManager (yarn) AND NameNode and
DataNode (hdfs) and launch your MR job directly and as mentioned use
Cassandra as sink for the reduced data. By this you won't need dedicated
hardware, as you only need the hdfs once, process and delete the files
afterwards.

Best wishes,

Wilm


Re: Storing large files for later processing through hadoop

2015-01-02 Thread mck
> Since the hadoop MR streaming job requires the file to be processed to be
> present in HDFS, I was thinking whether it can get it directly from mongodb
> instead of me manually fetching it and placing it in a directory before
> submitting the hadoop job?


Hadoop M/R can get data directly from Cassandra. See CqlInputFormat.
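
For reference, a rough job-setup sketch of that route (class and method names
as found in the Cassandra 2.x org.apache.cassandra.hadoop packages; the
keyspace, table, and address are placeholders and may need adjusting for the
version you actually run):

import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CassandraInputJobSketch {
    public static Job createJob() throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "read-directly-from-cassandra");
        job.setJarByClass(CassandraInputJobSketch.class);

        // Take input splits straight from Cassandra instead of HDFS.
        job.setInputFormatClass(CqlInputFormat.class);
        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "127.0.0.1");
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "demo", "raw_files");
        ConfigHelper.setInputPartitioner(job.getConfiguration(),
                "org.apache.cassandra.dht.Murmur3Partitioner");

        // Mapper/reducer classes and the output side (HDFS, or Cassandra as
        // the sink) still have to be configured by the caller.
        return job;
    }
}

The same package also ships a CqlOutputFormat if Cassandra is to be used as
the sink on the write side.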

~mck


Re: Storing large files for later processing through hadoop

2015-01-02 Thread Srinivasa T N
I agree that cassandra is a columnar store.  Storing the raw xml file,
parsing it using hadoop, and then storing the extracted values happens only
once.  The extracted data, on which further operations will be done, suits
the time-series storage provided by cassandra well, and that is why I am
trying to get cassandra to do things for which it was not designed.

Regards,
Seenu.



On Fri, Jan 2, 2015 at 10:42 PM, Eric Stevens  wrote:

> > Can this split and combine be done automatically by cassandra when
> inserting/fetching the file without application being bothered about it?
>
> There are client libraries which offer recipes for this, but in general,
> no.
>
> You're trying to do something with Cassandra that it's not designed to
> do.  You can get there from here, but you're not going to have a good
> time.  If you need a document store, you should use a NoSQL solution
> designed with that in mind (Cassandra is a columnar store).  If you need a
> distributed filesystem, you should use one of those.
>
> If you do want to continue forward and do this with Cassandra, then you
> should definitely not do this on the same cluster as handles normal clients
> as the kind of workload you'd be subjecting this cluster to is going to
> cause all sorts of troubles for normal clients, particularly with respect
> to GC pressure, compaction and streaming problems, and many other
> consequences of vastly exceeding recommended limits.
>
> On Fri, Jan 2, 2015 at 9:53 AM, Srinivasa T N  wrote:
>
>>
>>
>> On Fri, Jan 2, 2015 at 5:54 PM, mck  wrote:
>>
>>>
>>> You could manually chunk them down to 64Mb pieces.
>>>
>>> Can this split and combine be done automatically by cassandra when
>> inserting/fetching the file without application being bothered about it?
>>
>>
>>>
>>> > 2) Can I replace HDFS with Cassandra so that I don't have to sync/fetch
>>> > the file from cassandra to HDFS when I want to process it in hadoop
>>> cluster?
>>>
>>>
>>> We keep HDFS as a volatile filesystem simply for hadoop internals. No
>>> need for backups of it, no need to upgrade data, and we're free to wipe
>>> it whenever hadoop has been stopped.
>>> ~mck
>>>
>>
>> Since the hadoop MR streaming job requires the file to be processed to be
>> present in HDFS, I was thinking whether it can get it directly from mongodb
>> instead of me manually fetching it and placing it in a directory before
>> submitting the hadoop job?
>>
>>
>> >> There was a datastax project a while back that could replace HDFS with
>> >> Cassandra, but I don't think it's alive anymore.
>>
>> I think you are referring to Brisk project (
>> http://blog.octo.com/en/introduction-to-datastax-brisk-an-hadoop-and-cassandra-distribution/)
>> but I don't know its current status.
>>
>> Can I use http://gerrymcnicol.azurewebsites.net/ for my task in hand?
>>
>> Regards,
>> Seenu.
>>
>
>


Re: Storing large files for later processing through hadoop

2015-01-02 Thread Eric Stevens
> Can this split and combine be done automatically by cassandra when
inserting/fetching the file without application being bothered about it?

There are client libraries which offer recipes for this, but in general,
no.

You're trying to do something with Cassandra that it's not designed to do.
You can get there from here, but you're not going to have a good time.  If
you need a document store, you should use a NoSQL solution designed with
that in mind (Cassandra is a columnar store).  If you need a distributed
filesystem, you should use one of those.

If you do want to continue forward and do this with Cassandra, then you
should definitely not do this on the same cluster as handles normal clients
as the kind of workload you'd be subjecting this cluster to is going to
cause all sorts of troubles for normal clients, particularly with respect
to GC pressure, compaction and streaming problems, and many other
consequences of vastly exceeding recommended limits.

On Fri, Jan 2, 2015 at 9:53 AM, Srinivasa T N  wrote:

>
>
> On Fri, Jan 2, 2015 at 5:54 PM, mck  wrote:
>
>>
>> You could manually chunk them down to 64Mb pieces.
>>
>> Can this split and combine be done automatically by cassandra when
> inserting/fetching the file without application being bothered about it?
>
>
>>
>> > 2) Can I replace HDFS with Cassandra so that I don't have to sync/fetch
>> > the file from cassandra to HDFS when I want to process it in hadoop
>> cluster?
>>
>>
>> We keep HDFS as a volatile filesystem simply for hadoop internals. No
>> need for backups of it, no need to upgrade data, and we're free to wipe
>> it whenever hadoop has been stopped.
>> ~mck
>>
>
> Since the hadoop MR streaming job requires the file to be processed to be
> present in HDFS, I was thinking whether it can get it directly from mongodb
> instead of me manually fetching it and placing it in a directory before
> submitting the hadoop job?
>
>
> >> There was a datastax project a while back that could replace HDFS with
> >> Cassandra, but I don't think it's alive anymore.
>
> I think you are referring to Brisk project (
> http://blog.octo.com/en/introduction-to-datastax-brisk-an-hadoop-and-cassandra-distribution/)
> but I don't know its current status.
>
> Can I use http://gerrymcnicol.azurewebsites.net/ for my task in hand?
>
> Regards,
> Seenu.
>


Re: Storing large files for later processing through hadoop

2015-01-02 Thread Srinivasa T N
On Fri, Jan 2, 2015 at 5:54 PM, mck  wrote:

>
> You could manually chunk them down to 64Mb pieces.
>
> Can this split and combine be done automatically by cassandra when
inserting/fetching the file without application being bothered about it?


>
> > 2) Can I replace HDFS with Cassandra so that I don't have to sync/fetch
> > the file from cassandra to HDFS when I want to process it in hadoop
> cluster?
>
>
> We keep HDFS as a volatile filesystem simply for hadoop internals. No
> need for backups of it, no need to upgrade data, and we're free to wipe
> it whenever hadoop has been stopped.
> ~mck
>

Since the hadoop MR streaming job requires the file to be processed to be
present in HDFS, I was thinking whether it can get it directly from mongodb
instead of me manually fetching it and placing it in a directory before
submitting the hadoop job?


>> There was a datastax project a while back that could replace HDFS with
>> Cassandra, but I don't think it's alive anymore.

I think you are referring to Brisk project (
http://blog.octo.com/en/introduction-to-datastax-brisk-an-hadoop-and-cassandra-distribution/)
but I don't know its current status.

Can I use http://gerrymcnicol.azurewebsites.net/ for my task in hand?

Regards,
Seenu.


Re: Storing large files for later processing through hadoop

2015-01-02 Thread mck
> 1) The FAQ … informs that I can have only files of around 64 MB …

See http://wiki.apache.org/cassandra/CassandraLimitations
 "A single column value may not be larger than 2GB; in practice, "single
 digits of MB" is a more reasonable limit, since there is no streaming
 or random access of blob values."

CASSANDRA-16  only covers pushing those objects through compaction.
Getting the objects in and out of the heap during normal requests is
still a problem.

You could manually chunk them down to 64Mb pieces.
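
For anyone who does go down that road, a rough chunking sketch with the
DataStax Java driver (the schema, chunk size, and names are made up for
illustration; re-assembly on read and error handling are left out):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

public class BlobChunkerSketch {
    // Well under the "single digits of MB" guidance quoted above.
    private static final int CHUNK_SIZE = 4 * 1024 * 1024;

    public static void main(String[] args) throws Exception {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
        session.execute("CREATE TABLE IF NOT EXISTS demo.file_chunks ("
                + "file_id text, chunk_no int, data blob, "
                + "PRIMARY KEY (file_id, chunk_no))");

        PreparedStatement insert = session.prepare(
                "INSERT INTO demo.file_chunks (file_id, chunk_no, data) VALUES (?, ?, ?)");

        try (InputStream in = new FileInputStream(args[0])) {
            byte[] buf = new byte[CHUNK_SIZE];
            int chunkNo = 0;
            int read;
            while ((read = in.read(buf)) > 0) {
                // One row per chunk, all chunks under the same partition key.
                ByteBuffer data = ByteBuffer.wrap(Arrays.copyOf(buf, read));
                session.execute(insert.bind(args[0], chunkNo++, data));
            }
        }
        cluster.close();
    }
}

Reading the file back would page through the chunks in chunk_no order, and a
very large file may be worth spreading over several partitions to keep the
partition size sane.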


> 2) Can I replace HDFS with Cassandra so that I don't have to sync/fetch
> the file from cassandra to HDFS when I want to process it in hadoop cluster?


We keep HDFS as a volatile filesystem simply for hadoop internals. No
need for backups of it, no need to upgrade data, and we're free to wipe
it whenever hadoop has been stopped. 

Otherwise all our hadoop jobs still read from and write to Cassandra.
Cassandra is our "big data" platform, with hadoop/spark just providing
additional aggregation abilities. I think this is the effective way,
rather than trying to completely gut out HDFS. 

There was a datastax project a while back that could replace HDFS with
Cassandra, but I don't think it's alive anymore.

~mck


Storing large files for later processing through hadoop

2015-01-02 Thread Srinivasa T N
Hi All,
   The problem I am trying to address is:  Store the raw files (the files are
in xml format and around 700MB in size) in cassandra, later fetch them and
process them in a hadoop cluster, and populate the processed data back into
cassandra.  Regarding this, I wanted a few clarifications:

1) The FAQ (
https://wiki.apache.org/cassandra/FAQ#large_file_and_blob_storage) informs
that I can have only files of around 64 MB, but at the same time it talks
about a JIRA issue, https://issues.apache.org/jira/browse/CASSANDRA-16,
which was solved back in version 0.6.  So, in the present version of
cassandra (2.0.11), is there any limit on the size of a file in a column
and, if so, what is it?
2) Can I replace HDFS with Cassandra so that I don't have to sync/fetch the
file from cassandra to HDFS when I want to process it in hadoop cluster?

Regards,
Seenu.