Can you share your approximate data size? All of these should be valid use cases for Spark; I'm wondering if you are providing enough resources.
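For reference, a rough sketch of how executor resources are typically set; the values below are placeholders, not recommendations, and the right numbers depend on your cluster and data volume:

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder values: tune to your cluster and data size.
val conf = new SparkConf()
  .setAppName("ratings-poc")
  .set("spark.executor.memory", "8g")          // heap per executor
  .set("spark.executor.cores", "4")            // concurrent tasks per executor
  .set("spark.executor.instances", "10")       // number of executors (YARN)
  .set("spark.sql.shuffle.partitions", "200")  // partitions after joins/aggregations
val sc = new SparkContext(conf)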
Also, do you have some expectations in terms of performance? What does "slow down" mean? For this use case I would personally favor Parquet over a DB, and SQL/DataFrames over regular Spark RDDs, as you get benefits such as predicate pushdown (see the sketch below the quoted message).

Sent from my iPhone

> On 21 Oct 2015, at 00:29, Aliaksei Tsyvunchyk <atsyvunc...@exadel.com> wrote:
>
> Hello all community members,
>
> I need the opinion of people who have used Spark before and can share their experience, to help me select a technical approach.
> I have a project in the Proof of Concept phase, where we are evaluating the possibility of using Spark for our use case.
> Here is a brief task description.
> We need to process a large amount of raw data to calculate ratings. We have different types of textual source data. These are just text lines which represent different types of input data (we call them type 20, type 24, type 26, type 33, etc.).
> To perform the calculations we need to join different types of raw data: event records (which represent actual user actions) with user description records (which represent the person performing the action), and sometimes with userGroup records (we group all users by some criteria).
> All ratings are calculated on a daily basis, and our dataset can be partitioned by date (except, probably, the reference data).
>
> So we have tried to implement it in possibly the most obvious way: we parse the text files, store the data in Parquet format, and try to use Spark SQL to query the data and perform the calculations.
> Experimenting with Spark SQL, I've noticed that SQL query speed decreases proportionally to data size growth. Based on this I assume that Spark SQL performs a full record scan while servicing my SQL queries.
>
> So here are the questions I'm trying to find answers to:
> 1. Is the Parquet format appropriate for storing data in our case (to query data efficiently)? Would it be more suitable to have some DB as storage which could filter data efficiently before it gets to the Spark processing engine?
> 2. For now we assume that the joins we are doing for the calculations are slowing down execution. As an alternative we are considering denormalizing the data and joining it during the parsing phase, but this increases the data volume Spark has to handle (due to the duplicates we will get). Is this a valid approach? Would it be better to create 2 RDDs from the Parquet files, filter them, and then join them without Spark SQL involvement? Or are joins in Spark SQL fine, and should we look for performance bottlenecks elsewhere?
> 3. Should we look closer at Cloudera Impala? As far as I know it works over the same Parquet files, and I'm wondering whether it gives better performance for querying data.
> 4. 90% of the results we need could be pre-calculated, since they do not change after one day of data is loaded. So I think it makes sense to keep this pre-calculated data in some DB storage which gives the best performance when querying by key. Right now I'm considering Cassandra for this purpose due to its excellent scalability and performance. Could somebody suggest any other options we could consider?
>
> Thanks in advance,
> Any opinion will be helpful and greatly appreciated.
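To make the Parquet/pushdown point above concrete, here is a rough sketch against the Spark 1.x DataFrame API. The paths, column names, and the events DataFrame are made up for illustration:

import org.apache.spark.sql.SQLContext

// sc is an existing SparkContext; events is an already-parsed DataFrame.
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Write events partitioned by date, so per-day queries touch only one directory.
events.write.partitionBy("date").parquet("/data/events")

// On read, the date filter prunes whole partitions, and the eventType filter
// is pushed down into the Parquet reader instead of scanning full records.
val dayEvents = sqlContext.read.parquet("/data/events")
  .filter($"date" === "2015-10-20" && $"eventType" === 24)

// Joins through the DataFrame API (or SQL over registered temp tables) go
// through the same Catalyst optimizer, so both paths get the same plan.
val users = sqlContext.read.parquet("/data/users")
val rated = dayEvents.join(users, dayEvents("userId") === users("userId"))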
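On question 2 (join strategy): if the user/userGroup reference data is small, a broadcast join avoids shuffling the large event table entirely. Spark 1.5 exposes an explicit hint for this; a sketch continuing from the example above, with made-up names:

import org.apache.spark.sql.functions.broadcast

// Ships the small users table to every executor, so the large dayEvents
// table is joined locally, without a shuffle.
val joined = dayEvents.join(broadcast(users), "userId")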