Can you share your approximate data size? All of these should be valid use 
cases for Spark; I am wondering whether you are giving the job enough 
resources.

Also, do you have concrete performance expectations? What does "slow down" 
mean in numbers?

For this use case I would personally favor Parquet over a DB, and 
SQL/DataFrames over plain Spark RDDs, since you get benefits such as 
predicate pushdown.
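
To make that concrete, here is a minimal sketch (Spark 1.x-style API; the 
path and column names are invented for illustration):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("ratings-poc"))
    val sqlContext = new SQLContext(sc)

    // Reading Parquet through the DataFrame API lets Spark push the filter
    // down into the Parquet reader, so row groups whose min/max statistics
    // rule out the predicate are skipped instead of being deserialized.
    val events = sqlContext.read.parquet("/data/events")
    val oneDay = events.filter(events("eventDate") === "2015-10-20")

    // With a plain RDD you would deserialize every record first and filter
    // afterwards, which is exactly the full scan you want to avoid.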


> On 21 Oct 2015, at 00:29, Aliaksei Tsyvunchyk <atsyvunc...@exadel.com> wrote:
> 
> Hello all community members,
> 
> I need the opinion of people who have used Spark before and can share 
> their experience, to help me choose a technical approach.
> We have a project in the proof-of-concept phase, in which we are 
> evaluating whether Spark fits our use case.
> Here is a brief description of the task.
> We need to process a large amount of raw data to calculate ratings. The 
> source data is textual: plain text lines, each representing one of 
> several input record types (we call them type 20, type 24, type 26, 
> type 33, etc.).
> To perform the calculations we must join different types of raw data: 
> event records (representing actual user actions) with user description 
> records (representing the person who performs the action), and 
> sometimes with userGroup records (users grouped by some criteria).
> All ratings are calculated on a daily basis, so our dataset can be 
> partitioned by date (except, probably, the reference data).
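> 
> (For concreteness, a minimal sketch of the partitioning we have in mind; 
> the path and column name are invented for illustration:)
> 
>     // Writing the parsed records partitioned by day means a query that
>     // filters on eventDate reads only that day's directory (partition
>     // pruning) instead of scanning the whole dataset.
>     parsedEvents.write
>       .partitionBy("eventDate")
>       .parquet("/data/events")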
> 
> 
> We have tried to implement this in what is probably the most obvious 
> way: we parse the text files, store the data in Parquet format, and use 
> Spark SQL to query the data and perform the calculations.
> Experimenting with Spark SQL, I've noticed that query time grows in 
> proportion to data size. Based on this I assume that Spark SQL performs 
> a full scan of the records while servicing my queries.
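> 
> (As an illustration of how one can check this, assuming the Parquet data 
> is registered as a temporary table; the table and column names are made 
> up:)
> 
>     // The physical plan shows whether filters are pushed into the Parquet
>     // scan and whether partitions are pruned; if neither appears, the
>     // query really is a full scan.
>     sqlContext.read.parquet("/data/events").registerTempTable("events")
>     sqlContext.sql(
>       "SELECT userId FROM events WHERE eventDate = '2015-10-20'"
>     ).explain()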
> 
> So here are the questions I'm trying to answer:
> 1.  Is the Parquet format appropriate for storing our data so that it 
> can be queried efficiently? Or would it be more suitable to use some DB 
> as storage, one that can filter data efficiently before it reaches the 
> Spark processing engine?
> 2.  For now we assume that the joins we perform in the calculations are 
> what slows down execution. As an alternative we are considering 
> denormalizing the data and joining it during the parsing phase, but 
> this increases the data volume Spark must handle (due to the duplicates 
> we would get). Is that a valid approach? Would it be better to create 
> two RDDs from the Parquet files, filter them, and then join them 
> without involving Spark SQL (see the first sketch after this list)? Or 
> are joins in Spark SQL fine, and should we look for the performance 
> bottleneck elsewhere?
> 3.  Should we take a closer look at Cloudera Impala? As far as I know 
> it works over the same Parquet files, and I'm wondering whether it 
> gives better performance for querying the data.
> 4.  90% of the results we need could be pre-calculated, since they do 
> not change once a day's data has been loaded. So I think it makes sense 
> to keep this pre-calculated data in some DB storage that gives the best 
> performance for queries by key. I am currently considering Cassandra 
> for this purpose, due to its scalability and performance (see the 
> second sketch after this list). Could somebody suggest any other 
> options we should consider?
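> 
> (Re question 2, a sketch of the filter-before-join variant we are 
> weighing, expressed with DataFrames; paths and column names are made up:)
> 
>     val users  = sqlContext.read.parquet("/data/users")
>     val events = sqlContext.read.parquet("/data/events")
> 
>     // Restricting each side to a single day before the join keeps the
>     // shuffle small; Catalyst would push such filters below the join
>     // anyway when they are expressed on the DataFrames.
>     val dayEvents = events.filter(events("eventDate") === "2015-10-20")
>     val joined    = dayEvents.join(users, dayEvents("userId") === users("id"))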
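> 
> (Re question 4, a hypothetical sketch using the DataStax 
> spark-cassandra-connector; the keyspace and table names are invented:)
> 
>     import org.apache.spark.sql.SaveMode
> 
>     // The daily results never change once a day's data is loaded, so we
>     // would write them once and then serve them by key from Cassandra.
>     dailyRatings.write
>       .format("org.apache.spark.sql.cassandra")
>       .options(Map("keyspace" -> "ratings", "table" -> "daily_ratings"))
>       .mode(SaveMode.Append)
>       .save()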
> 
> Thanks in advance; any opinion will be helpful and greatly appreciated.
