Not necessarily.

It depends on the use case and what you intend to do with the data. 

4-6 TB will easily fit on an SMP box and can be efficiently searched by an 
RDBMS. 
Again, it depends on what you want to do and how you want to do it. 

Informix’s IDS engine, with its extensibility, could still outperform Spark in 
some use cases, given the proper use of indexes and enough parallelism. 

There is a lot of crossover… now, had you said 100 TB+ of unstructured data, 
things might be different. 

Please understand that what would make Spark more compelling is the TCO of the 
solution when compared to SMP hardware and software licensing. 

It’s not that I disagree with your statements; moving from MS SQL Server or any 
small RDBMS to Spark doesn’t make a whole lot of sense. 
I just wanted to add that the decision isn’t as cut and dried as some think… 

> On Jul 11, 2015, at 8:47 AM, Mohammed Guller <moham...@glassbeam.com> wrote:
> 
> Hi Roman,
> Yes, Spark SQL will be a better solution than a standard RDBMS for 
> querying 4-6 TB of data. You can pair Spark SQL with HDFS+Parquet to build a 
> powerful analytics solution.
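> 
> For example, a minimal sketch of that pairing (the HDFS path, table, and 
> column names below are hypothetical) could look like this in Scala:
> 
>     import org.apache.spark.sql.SQLContext
> 
>     // Assumes an existing SparkContext `sc`, e.g., from spark-shell.
>     val sqlContext = new SQLContext(sc)
> 
>     // Load a Parquet dataset stored on HDFS.
>     val events = sqlContext.read.parquet("hdfs:///data/events.parquet")
>     events.registerTempTable("events")
> 
>     // Run a standard SQL query over the distributed dataset.
>     val topUsers = sqlContext.sql(
>       """SELECT userId, COUNT(*) AS cnt
>          FROM events
>          GROUP BY userId
>          ORDER BY cnt DESC
>          LIMIT 10""")
>     topUsers.show()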
>  
> Mohammed
>  
> From: David Mitchell [mailto:jdavidmitch...@gmail.com] 
> Sent: Saturday, July 11, 2015 7:10 AM
> To: Roman Sokolov
> Cc: Mohammed Guller; user; Ravisankar Mani
> Subject: Re: Spark performance
>  
> You can certainly query over 4 TB of data with Spark.  However, you will get 
> an answer in minutes or hours, not in milliseconds or seconds.  OLTP 
> databases are used for web applications, and typically return responses in 
> milliseconds.  Analytic databases tend to operate on large data sets, and 
> return responses in seconds, minutes or hours.  When running batch jobs over 
> large data sets, Spark can be a replacement for analytic databases like 
> Greenplum or Netezza.  
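> 
> As a rough illustration (the object, path, and column names below are 
> hypothetical), such a batch job could be a small standalone Scala application 
> submitted to the cluster:
> 
>     import org.apache.spark.{SparkConf, SparkContext}
>     import org.apache.spark.sql.SQLContext
> 
>     object DailyRollup {
>       def main(args: Array[String]): Unit = {
>         val sc = new SparkContext(new SparkConf().setAppName("DailyRollup"))
>         val sqlContext = new SQLContext(sc)
> 
>         // Read a large Parquet dataset and compute a daily aggregate,
>         // the kind of query an analytic database would otherwise run.
>         val sales = sqlContext.read.parquet("hdfs:///warehouse/sales")
>         sales.registerTempTable("sales")
>         val rollup = sqlContext.sql(
>           "SELECT day, SUM(amount) AS total FROM sales GROUP BY day")
> 
>         // Persist the results back to HDFS for downstream consumers.
>         rollup.write.parquet("hdfs:///warehouse/sales_daily")
>         sc.stop()
>       }
>     }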
>  
>  
>  
> On Sat, Jul 11, 2015 at 8:53 AM, Roman Sokolov <ole...@gmail.com 
> <mailto:ole...@gmail.com>> wrote:
> Hello. Had the same question. What if I need to store 4-6 TB and run queries? 
> I can't find any clue in the documentation.
> 
> On 11.07.2015 at 03:28, "Mohammed Guller" <moham...@glassbeam.com 
> <mailto:moham...@glassbeam.com>> wrote:
> Hi Ravi,
> First, neither Spark nor Spark SQL is a database. Both are compute engines, 
> which need to be paired with a storage system. Second, they are designed for 
> processing large distributed datasets. If you have only 100,000 records, or 
> even a million records, you don’t need Spark. An RDBMS will perform much 
> better for that volume of data.
>  
> Mohammed
>  
> From: Ravisankar Mani [mailto:rrav...@gmail.com <mailto:rrav...@gmail.com>] 
> Sent: Friday, July 10, 2015 3:50 AM
> To: user@spark.apache.org <mailto:user@spark.apache.org>
> Subject: Spark performance
>  
> Hi everyone,
> 
> I am planning to move from MS SQL Server to Spark. I am working with around 
> 50,000 to 100,000 (1 lakh) records, and Spark's performance is slow compared 
> to MS SQL Server.
>  
> Which is the best database (Spark or SQL Server) to store and retrieve 
> around 50,000 to 100,000 records?
> 
> regards,
> Ravi
>  
> 
> 
>  

