Hello all community members,

I need the opinion of people who have used Spark before and can share their 
experience, to help me choose a technical approach.
I have a project in the Proof of Concept phase, where we are evaluating whether 
Spark is a good fit for our use case. 
Here is a brief task description.
We need to process a large amount of raw data to calculate ratings. The source 
data is textual: plain text lines that represent different types of input 
records (we call them type 20, type 24, type 26, type 33, etc.).
To perform the calculations we have to join different types of raw data: event 
records (which represent actual user actions), user description records (which 
describe the person performing the action), and sometimes userGroup records (we 
group all users by some criteria).
All ratings are calculated on a daily basis, and our dataset could be 
partitioned by date (except probably the reference data).


We have tried to implement it in what is probably the most obvious way: we 
parse the text files, store the data in Parquet format, and use Spark SQL to 
query the data and perform the calculations.
Experimenting with Spark SQL, I've noticed that query time grows roughly in 
proportion to the data size. Based on this I assume that Spark SQL performs a 
full scan of the records while servicing my SQL queries.
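A simplified version of the kind of daily query we run is below (again with 
made-up column names); my expectation was that the date predicate would let 
Spark prune down to a single partition instead of reading everything:

    val events = sqlContext.read.parquet("hdfs:///warehouse/events")
    events.registerTempTable("events")

    // Daily aggregation; the date filter should restrict the read to one partition.
    val daily = sqlContext.sql(
      """SELECT userId, COUNT(*) AS actions
        |FROM events
        |WHERE date = '2015-06-01' AND recordType = 20
        |GROUP BY userId""".stripMargin)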

So here are the questions I'm trying to answer:
1.  Is the Parquet format appropriate for storing the data in our case (so 
that it can be queried efficiently)? Or would it be more suitable to use some 
database as storage that could filter the data efficiently before it reaches 
the Spark processing engine?
2.  For now we assume that the joins we do for the calculations are slowing 
down execution. As an alternative we are considering denormalizing the data and 
joining it during the parsing phase, but this increases the data volume Spark 
has to handle (because of the duplicates we will get). Is that a valid 
approach? Would it be better to create two RDDs from the Parquet files, filter 
them first, and then join them without Spark SQL involvement (see the first 
sketch after the questions)? Or are joins in Spark SQL fine, and should we look 
for the performance bottleneck elsewhere?
3.  Should we take a closer look at Cloudera Impala? As far as I know it works 
over the same Parquet files, and I'm wondering whether it gives better 
performance for querying the data.
4.  90% of the results we need could be pre-calculated, since they do not 
change once a day's data has been loaded. So I think it makes sense to keep 
this pre-calculated data in some DB storage that gives the best performance for 
lookups by key. At the moment I'm considering Cassandra for this purpose 
because of its scalability and performance (see the second sketch after the 
questions). Could somebody suggest any other options we should consider?
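To make question 2 concrete, here is the variant I have in mind, written with 
the DataFrame API rather than raw RDDs and with made-up column names: restrict 
the big side to a single day before joining, and broadcast the small reference 
data so the events are not shuffled (broadcast() needs Spark 1.5+):

    import org.apache.spark.sql.functions.broadcast

    // Filter the large events table down to one day *before* the join.
    val dayEvents = sqlContext.read.parquet("hdfs:///warehouse/events")
      .filter("date = '2015-06-01'")
    val users = sqlContext.read.parquet("hdfs:///warehouse/users")

    // Broadcast the (small) user reference data to avoid shuffling the events.
    val joined = dayEvents.join(broadcast(users), "userId")
    val ratings = joined.groupBy("userGroupId").count()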
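And for question 4, this is roughly how I picture pushing the pre-calculated 
daily results out, taking the ratings DataFrame from the sketch above. It 
assumes the DataStax spark-cassandra-connector is on the classpath, that 
spark.cassandra.connection.host is configured, and that the keyspace and table 
names are invented for the example:

    // Append the day's pre-calculated ratings to a Cassandra table keyed for fast lookups.
    ratings.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ratings", "table" -> "daily_ratings"))
      .mode("append")
      .save()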

Thanks in advance. Any opinion will be helpful and greatly appreciated.
