Re: Data Processing speed SQL Vs SPARK

2015-07-13 Thread Ashish Mukherjee
MySQL and PgSQL scale to millions of rows. Spark, or any
distributed/clustered computing environment, would be inefficient for the
kind of data size you mention, because of the overhead of coordinating
processes, moving data around, etc.
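
If you want to check this quickly, a local-mode timing harness along these
lines is prototype enough at this scale (a rough sketch; the file path and
row format are assumptions):

import org.apache.spark.{SparkConf, SparkContext}

object QuickPrototype {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("proto").setMaster("local[4]"))
    val start = System.nanoTime()
    // ~2L rows of hypothetical "id,value" CSV
    val total = sc.textFile("/tmp/sample.csv")
      .map(_.split(",")(1).toDouble)
      .sum()
    println(s"sum = $total, took ${(System.nanoTime() - start) / 1e6} ms")
    sc.stop()
  }
}

Compare the wall-clock time against the equivalent MySQL query over the
same data loaded into a table.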

On Mon, Jul 13, 2015 at 5:34 PM, Sandeep Giri sand...@knowbigdata.com
wrote:

 Even for 2L (200,000) records, MySQL will be better.

 Regards,
 Sandeep Giri,
 +1-253-397-1945 (US)
 +91-953-899-8962 (IN)
 www.KnowBigData.com
 LinkedIn: https://linkedin.com/company/knowbigdata | Facebook:
 https://facebook.com/knowbigdata | Twitter: https://twitter.com/IKnowBigData


 On Fri, Jul 10, 2015 at 9:54 AM, vinod kumar vinodsachin...@gmail.com
 wrote:

 For records below 50,000, SQL is better, right?


 On Fri, Jul 10, 2015 at 12:18 AM, ayan guha guha.a...@gmail.com wrote:

 With your load, either should be fine.

 I would suggest you run a couple of quick prototypes.

 Best
 Ayan

 On Fri, Jul 10, 2015 at 2:06 PM, vinod kumar vinodsachin...@gmail.com
 wrote:

 Ayan,

 I want to process data ranging from roughly 5 records to 2L (200,000)
 records (in flat files).

 Is there any scale threshold for deciding which technology is best, SQL
 or Spark?



 On Thu, Jul 9, 2015 at 9:40 AM, ayan guha guha.a...@gmail.com wrote:

 It depends on the workload. How much data would you want to process?
 On 9 Jul 2015 22:28, vinod kumar vinodsachin...@gmail.com wrote:

 Hi Everyone,

 I am new to spark.

 I am using SQL in my application to handle data. I am now thinking of
 moving to Spark.

 Is the data processing speed of Spark better than SQL Server?

 Thanks,
 Vinod





 --
 Best Regards,
 Ayan Guha






RDD staleness

2015-05-31 Thread Ashish Mukherjee
Hello,

Since RDDs are created from data in Hive tables or HDFS, how do we ensure
they are invalidated when the source data is updated?
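
The only approach I can think of is to rebuild the lineage myself, roughly
like this (a sketch; the path and names are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

object Refresh {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("refresh"))

    // An RDD is an immutable snapshot of its source as of compute time,
    // so "invalidation" is manual: drop the cached copy and re-read.
    var data = sc.textFile("hdfs:///data/events").cache()
    println(data.count())

    // ... source data is updated externally ...

    data.unpersist()                                  // drop stale cached blocks
    data = sc.textFile("hdfs:///data/events").cache() // rebuild from source
    println(data.count())

    sc.stop()
  }
}

Is there anything more automatic than this?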

Regards,
Ashish


Re: Spark SQL v MemSQL/Voltdb

2015-05-28 Thread Ashish Mukherjee
Hi Mohit,

Thanks for your reply.

If my use case is purely querying read-only data (no transaction
scenarios), at what scale is one of them a better option than the other? I
am aware that at a scale which can be supported on a single node, VoltDB is
a better choice. However, when the scale grows to a clustered scenario,
which is the right engine at various degrees of scale?

Regards,
Ashish

On Fri, May 29, 2015 at 6:57 AM, Mohit Jaggi mohitja...@gmail.com wrote:

 I have used VoltDB and Spark. The use cases for the two are quite
 different. VoltDB is intended for transactions and also supports queries
 on the same (custom-to-VoltDB) store. Spark (SQL) is NOT suitable for
 transactions; it is designed for querying immutable data (which may exist
 in several different forms of stores).

  On May 28, 2015, at 7:48 AM, Ashish Mukherjee 
 ashish.mukher...@gmail.com wrote:
 
  Hello,
 
  I was wondering if there is any documented comparison of SparkSQL with
 in-memory SQL databases like MemSQL/VoltDB. MemSQL etc. also allow
 queries to be run in a clustered environment. What is the major
 differentiator?
 
  Regards,
  Ashish




Spark SQL v MemSQL/Voltdb

2015-05-28 Thread Ashish Mukherjee
Hello,

I was wondering if there is any documented comparison of SparkSQL with
in-memory SQL databases like MemSQL/VoltDB. MemSQL etc. also allow queries
to be run in a clustered environment. What is the major differentiator?

Regards,
Ashish


Spark SQL and DataSources API roadmap

2015-03-27 Thread Ashish Mukherjee
Hello,

Is there any published community roadmap for SparkSQL and the DataSources
API?

Regards,
Ashish


Spark as a service

2015-03-24 Thread Ashish Mukherjee
Hello,

As of now, if I have to execute a Spark job, I need to create a jar and
deploy it. If I need to run dynamically formed SQL from a Web application,
is there any way of using SparkSQL in this manner? Perhaps through a Web
Service or something similar.
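
What I have in mind is roughly a long-lived SparkContext behind a small
HTTP endpoint (a sketch only; the port, path, and data set are made up):

import java.net.InetSocketAddress
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SqlService {
  def main(args: Array[String]): Unit = {
    // One long-lived context shared by all requests.
    val sc = new SparkContext(new SparkConf().setAppName("sql-service").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    sqlContext.jsonFile("/tmp/events.json").registerTempTable("events") // hypothetical data

    val server = HttpServer.create(new InetSocketAddress(8090), 0)
    server.createContext("/sql", new HttpHandler {
      def handle(exchange: HttpExchange): Unit = {
        // The request body carries the dynamically formed SQL string.
        val query = scala.io.Source.fromInputStream(exchange.getRequestBody).mkString
        val bytes = sqlContext.sql(query).collect().mkString("\n").getBytes("UTF-8")
        exchange.sendResponseHeaders(200, bytes.length)
        exchange.getResponseBody.write(bytes)
        exchange.getResponseBody.close()
      }
    })
    server.start()
  }
}

Is there an existing, supported way to do this instead of rolling my own?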

Regards,
Ashish


Re: Question about Data Sources API

2015-03-24 Thread Ashish Mukherjee
Hello Michael,

Thanks for your quick reply.

My question w.r.t. Java/Scala was about extending the classes to support
new custom data sources; I was wondering whether those could be written in
Java, since our company is a Java shop.

The additional push downs I am looking for are aggregations with grouping
and sorting.

Essentially, I am trying to evaluate if this API can give me much of what
is possible with the Apache MetaModel project.
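
To be concrete, this is the shape I understand a minimal custom source
takes in Scala (a sketch against the 1.3 sources API; the class names are
made up). My question is whether these same pieces can be Java classes:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new DummyRelation(sqlContext)
}

class DummyRelation(val sqlContext: SQLContext) extends BaseRelation with TableScan {
  // One integer column; a real source would derive this from its metadata.
  override def schema: StructType = StructType(StructField("id", IntegerType) :: Nil)

  // Full scan only; pushed-down filters/projections would come via
  // PrunedFilteredScan instead of TableScan.
  override def buildScan(): RDD[Row] = sqlContext.sparkContext.parallelize(1 to 10).map(Row(_))
}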

Regards,
Ashish

On Tue, Mar 24, 2015 at 1:57 PM, Michael Armbrust mich...@databricks.com
wrote:

 On Tue, Mar 24, 2015 at 12:57 AM, Ashish Mukherjee 
 ashish.mukher...@gmail.com wrote:

 1. Is the Data Source API stable as of Spark 1.3.0?


 It is marked DeveloperApi, but in general we do not plan to change even
 these APIs unless there is a very compelling reason to.


 2. The Data Source API seems to be available only in Scala. Is there any
 plan to make it available for Java too?


 We tried to make all the suggested interfaces (other than CatalystScan
 which exposes internals and is only for experimentation) usable from Java.
 Is there something in particular you are having trouble with?


 3. Are only filters and projections pushed down to the data source and
 all the data pulled into Spark for other processing?


 For now, this is all that is provided by the public stable API.  We left a
 hook for more powerful push downs
 (sqlContext.experimental.extraStrategies), and would be interested in
 feedback on other operations we should push down as we expand the API.



Question about Data Sources API

2015-03-24 Thread Ashish Mukherjee
Hello,

I have some questions related to the Data Sources API -

1. Is the Data Source API stable as of Spark 1.3.0?

2. The Data Source API seems to be available only in Scala. Is there any
plan to make it available for Java too?

3. Are only filters and projections pushed down to the data source and all
the data pulled into Spark for other processing?

Regards,
Ashish


Spark with data on NFS v HDFS

2015-03-05 Thread Ashish Mukherjee
Hello,

I understand Spark can be used with Hadoop or standalone. I have a few
questions about choosing the right filesystem for Spark data.

What is the efficiency trade-off in feeding data to Spark from NFS vs. HDFS?

If one is not using Hadoop, is it still usual to house data in HDFS for
Spark to read from because of better reliability compared to NFS?

Should data be stored on the local FS (not NFS) only for Spark jobs which
run on a single machine?
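
For context, what I mean concretely is just the URI scheme on the read
path (paths hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

object FsChoice {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("fs-choice"))
    // Shared POSIX mount visible on every worker at the same path.
    val fromNfs = sc.textFile("file:///mnt/nfs/data/part-*")
    // Block-replicated, locality-aware store.
    val fromHdfs = sc.textFile("hdfs://namenode:8020/data/part-*")
    println(fromNfs.count() + fromHdfs.count())
    sc.stop()
  }
}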

Regards,
Ashish


SparkSQL production readiness

2015-02-28 Thread Ashish Mukherjee
Hi,

I am exploring SparkSQL for performing large relational operations across
a cluster. However, it seems to be in alpha right now. Is there any
indication of when it will be considered production-ready? I don't see any
info on the site.

Regards,
Ashish


Running in-memory SQL on streamed relational data

2015-02-28 Thread Ashish Mukherjee
Hi,

I have been looking at Spark Streaming, which seems to be aimed at the use
case of live streams processed one record at a time, generally in real
time.

Since SparkSQL reads data from some filesystem, I was wondering if there is
something which connects SparkSQL with Spark Streaming, so I can send live
relational tuples in a stream (rather than read filesystem data) for SQL
operations.

Also, at present, doing it with Spark Streaming would involve the
complexity of handling multiple DStreams, etc., since I may want to run
multiple ad-hoc queries of this kind on ad-hoc data I stream through.
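
To make it concrete, what I am imagining is roughly this (a sketch; the
socket source and the two-column schema are made up):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

case class Rec(k: String, v: Int)

object StreamingSql {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("streaming-sql").setMaster("local[2]"))
    val ssc = new StreamingContext(sc, Seconds(5))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Each line is a relational tuple "key,value" arriving over a socket.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.foreachRDD { rdd =>
      // Turn the micro-batch into a table and run plain SQL against it.
      val df = rdd.map(_.split(",")).map(a => Rec(a(0), a(1).toInt)).toDF()
      df.registerTempTable("tuples")
      sqlContext.sql("SELECT k, SUM(v) FROM tuples GROUP BY k").show()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}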

Has anyone done this kind of thing with Spark before, i.e. combining
SparkSQL with Streaming?

Regards,
Ashish


Spark Distributed Join

2015-02-13 Thread Ashish Mukherjee
Hello,

I have the following scenario and was wondering if I can use Spark to
address it.

I want to query two different data stores (say, ElasticSearch and MySQL)
and then merge the two result sets based on a join key between the two. Is
it appropriate to use Spark to do this join, if the intermediate data sets
are large? (This is a No-ETL scenario)

I was thinking of two possibilities -

1) Send the intermediate data sets to Spark through a stream and get Spark
to do the join (a rough sketch of the join step follows this list). The
complexity here is that there would be multiple concurrent streams to deal
with. If I don't use streams, there would be intermediate disk writes and
data transfer to the Spark master.

2) Don't use Spark and do the same with some in-memory distributed engine
like MemSQL or Redis.
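
To sketch possibility (1), leaving the streaming part aside (the inputs
here are placeholders standing in for the two result sets):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TwoStoreJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("two-store-join"))

    // Stand-ins for the ElasticSearch and MySQL result sets, already keyed
    // on the join key; in practice these might come from elasticsearch-hadoop
    // and a JDBC read.
    val esRows: RDD[(Long, String)]    = sc.parallelize(Seq((1L, "es-doc-1"), (2L, "es-doc-2")))
    val mysqlRows: RDD[(Long, String)] = sc.parallelize(Seq((2L, "row-2"), (3L, "row-3")))

    // Shuffle join on the shared key: both sides are partitioned by key, so
    // the intermediate sets never have to fit on a single node.
    val joined = esRows.join(mysqlRows)
    joined.collect().foreach(println)

    sc.stop()
  }
}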

What's the experts' view on this?

Regards,
Ashish