Hello all community members,

I need the opinion of people who have used Spark before and can share their experience, to help me select a technical approach. I have a project in the proof-of-concept phase, where we are evaluating whether Spark fits our use case. Here is a brief description of the task.

We need to process a large amount of raw data to calculate ratings. We have different types of textual source data: plain text lines that represent different types of input records (we call them type 20, type 24, type 26, type 33, etc.). To perform the calculations we have to join the different types of raw data: event records (which represent an actual user action) with user description records (which describe the person performing the action), and sometimes also with userGroup records (we group all users by some criteria). All ratings are calculated on a daily basis, and our dataset can be partitioned by date (except, probably, the reference data). A simplified sketch of the input and how we parse it follows below.
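To make this concrete, here is roughly what our parsing and storage step looks like in Scala (heavily simplified: the record layouts, paths, and field names are made up for illustration, and it assumes Spark 1.4+ for partitioned Parquet output):

    // sc is the SparkContext from spark-shell
    case class Event(date: String, userId: Long, action: String)  // e.g. "type 20" lines
    case class User(userId: Long, groupId: Long)                  // e.g. "type 24" lines

    val rawLines = sc.textFile("hdfs:///raw/2015-06-01/*.txt")

    // each line starts with its record-type code; fields are tab-separated
    val events = rawLines.map(_.split("\t")).collect {
      case Array("20", date, userId, action) => Event(date, userId.toLong, action)
    }

    // store as Parquet, partitioned by date, so daily jobs can read one partition
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    events.toDF().write.partitionBy("date").parquet("hdfs:///data/events")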
So we tried to implement it in probably the most obvious way: we parse the text files, store the data in Parquet format, and use Spark SQL to query the data and perform the calculations. Experimenting with Spark SQL, I noticed that query time grows proportionally to data size. Based on this I assume that Spark SQL performs a full scan of the records while serving my queries.

So here are the questions I am trying to answer:

1. Is the Parquet format appropriate for storing our data (so that it can be queried efficiently)? Or would it be more suitable to use some database as storage, one that can filter the data efficiently before it reaches the Spark processing engine?

2. For now we assume that the joins we do for the calculations are what slows down execution. As an alternative we are considering denormalizing the data and joining it during the parsing phase, but this increases the data volume Spark has to handle (because of the duplicates we would get). Is that a valid approach? Would it be better to create two RDDs from the Parquet files, filter them, and then join them without Spark SQL involvement? Or are joins in Spark SQL fine, and should we look for the performance bottleneck elsewhere? (See the P.S. below for a sketch of the filter-first join I have in mind.)

3. Should we take a closer look at Cloudera Impala? As far as I know it works over the same Parquet files, and I wonder whether it gives better performance for querying the data.

4. 90% of the results we need can be pre-calculated, since they do not change once a day's data has been loaded. So I think it makes sense to keep this pre-calculated data in some database storage that gives the best performance when querying by key. Right now I am considering Cassandra for this purpose, because of its scalability and performance. Could somebody suggest any other options we should consider?

Thanks in advance. Any opinion will be helpful and greatly appreciated.
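P.S. For questions 1 and 2, this is roughly the shape I am considering instead of a plain SQL query over everything (a sketch only; it assumes Spark 1.5+ for the broadcast hint, and the paths and column names are made up):

    import org.apache.spark.sql.functions.broadcast

    // events were written partitioned by date (e.g. .../date=2015-06-01/)
    val events = sqlContext.read.parquet("hdfs:///data/events")
    val users  = sqlContext.read.parquet("hdfs:///data/users")

    // filter on the partition column first, so Spark can prune directories
    // instead of scanning the whole history
    val dayEvents = events.filter(events("date") === "2015-06-01")

    // the users table is small reference data, so hint a broadcast (map-side)
    // join and avoid shuffling the big events side
    val joined = dayEvents.join(broadcast(users),
      dayEvents("userId") === users("userId"))

    joined.groupBy("groupId").count()  // e.g. one daily rating aggregate

The intent is that the date filter reduces I/O before the join even starts, and broadcasting the small reference table keeps the join shuffle-free; I would like to know whether that is the right direction.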