Cluster sizing for recommendations

2015-07-06 Thread Danny Yates
Hi, I'm having trouble building a recommender and would appreciate a few pointers. I have 350,000,000 events which are stored in roughly 500,000 S3 files and are formatted as semi-structured JSON. These events are not all relevant to making recommendations. My code is (roughly): case class

ETL process design

2015-01-28 Thread Danny Yates
Hi, My apologies for what has ended up as quite a long email with a lot of open-ended questions, but, as you can see, I'm really struggling to get started and would appreciate some guidance from people with more experience. I'm new to Spark and big data in general, and I'm struggling with what I

Can Spark benefit from Hive-like partitions?

2015-01-26 Thread Danny Yates
Hi, I've got a bunch of data stored in S3 under directories like this: s3n://blah/y=2015/m=01/d=25/lots-of-files.csv In Hive, if I issue a query WHERE y=2015 AND m=01, I get the benefit that it only scans the necessary directories for files to read. As far as I can tell from searching and

Re: Can Spark benefit from Hive-like partitions?

2015-01-26 Thread Danny Yates
Thanks Michael. I'm not actually using Hive at the moment - in fact, I'm trying to avoid it if I can. I'm just wondering whether Spark has anything similar I can leverage? Thanks
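
One Hive-free approach for the y=/m=/d= layout in the original question is to do the pruning by hand: build only the directory paths that match the date filter and read just those. The sketch below is pure string logic in Python; the helper name `partition_paths` is hypothetical, and the base path is the placeholder from the question. It assumes the common behaviour that SparkContext.textFile accepts glob patterns and a comma-separated list of paths, which is worth verifying against the Spark version in use.

```python
from datetime import date, timedelta


def partition_paths(base, start, end):
    """Return one glob per day in [start, end] for a y=/m=/d= directory layout.

    Hypothetical helper for illustration; not part of Spark.
    """
    if end < start:
        raise ValueError("end must not be before start")
    days = (end - start).days + 1
    return [
        "{}/y={:04d}/m={:02d}/d={:02d}/*".format(base, d.year, d.month, d.day)
        for d in (start + timedelta(days=i) for i in range(days))
    ]


# Only the two requested days are listed, so no other directories are scanned.
paths = partition_paths("s3n://blah", date(2015, 1, 24), date(2015, 1, 25))

# Assumption: the joined string can be handed to sc.textFile(joined),
# which is expected to accept comma-separated paths and globs.
joined = ",".join(paths)
```

The trade-off is that the date values are no longer columns in the data; if they are needed downstream, they have to be parsed back out of the path or added per directory.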

Re: Can Spark benefit from Hive-like partitions?

2015-01-26 Thread Danny Yates
Ah, well that is interesting. I'll experiment further tomorrow. Thank you for the info!