Hi Kaspar,

This is definitely doable, but in my opinion it's important to remember that, at its core, Spark is built around a functional programming paradigm: you take input datasets and, by applying various transformations, end up with a dataset that represents your "answer". Without knowing more about your use case, and keeping in mind that I'm very new to Spark, here are a few things I would want to think about if I were writing this as a non-Streaming Spark application:

1. What is your starting dataset? Do you have an initial set of parameters or a data source that defines each of the millions of requests? If so, that should comprise your first RDD, and you can perform subsequent transformations to prepare your HTTP requests (e.g., start with the information that drives the generation of the requests, and use map/flatMap to create an RDD that holds the full list of requests you want to run).

2. Are the HTTP requests read-only and/or idempotent (are you only looking up data, or are you performing requests that cause some sort of side effect)? Spark operations against RDDs work by defining a lineage graph, and transformations will be re-run if a partition in the lineage needs to be recalculated for any reason. If your HTTP requests cause side effects that should not be repeated, then Spark may not be the best fit for that portion of the job, and you might want to use something else, pipe the results into HDFS, and then analyze those using Spark.

3. If your web service requests are lookups or are idempotent, then we're on the right track. Keep in mind that your web service probably will not scale as well as the Spark job - a naive first-pass implementation could easily overwhelm many services, particularly if/when partitions need to be recalculated. There are a few mechanisms you can use to mitigate this. One is to use mapPartitions rather than map when transforming the set of requests into the set of results: initialize an HTTP connection once per partition, and transform the data that defines each request into your desired dataset by invoking the web service. Using mapPartitions allows you to limit the number of concurrent HTTP connections to one per partition being processed, i.e. one per running task (although this may be too slow if your service is slow... there is obviously a bit of analysis, testing and profiling that would need to be done on the entire job).

   Another consideration would be to look at persisting or caching the intermediate results after you've successfully retrieved them from the service, to reduce the likelihood of hitting the web service more than necessary.

4. Just realized you might be looking for help invoking an HTTP service programmatically from Scala / Spark - if so, you might want to look at the spray-client <http://spray.io/documentation/1.2.3/spray-client/> library.

5. With millions of web service requests, it's highly likely some will fail, for a variety of reasons. Look into using Scala's Try <http://www.scala-lang.org/api/2.11.5/index.html#scala.util.Try> or Either <http://www.scala-lang.org/api/2.11.5/index.html#scala.util.Either> monads to encode success / failure, and treat failed requests as first-class citizens in your RDD of results (by retrying them, filtering them, logging them, etc., based on your specific needs and use case). Make sure you are setting reasonable timeouts on your service calls to prevent the Spark job from getting stuck if the service turns into a black hole.

As I said above, I'm pretty new to Spark, so others may have some better advice, or even tell you to ignore mine completely (no hard feelings, I promise - this is all very new to me).

Good luck!

Regards,
Will

On Wed, Jun 3, 2015 at 3:49 AM, kasparfischer <kaspar.fisc...@dreizak.com> wrote:

> Hi everybody,
>
> I'm new to Spark, apologies if my question is very basic.
>
> I have a need to send millions of requests to a web service and analyse and
> store the responses in an RDD. I can easy express the analysing part using
> Spark's filter/map/etc. primitives but I don't know how to make the
> requests. Is that something I can do from within Spark? Or Spark Streaming?
> Or does it conflict with the way Spark works?
>
> I've found a similar question but am not sure whether the answer applies
> here:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-Spark-Streaming-from-an-HTTP-api-tp12330.html
>
> Any clarifications or pointers would be super helpful!
>
> Thanks,
> Kaspar
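
P.S. In case a concrete shape helps, here is a rough sketch of points 3 and 5 from the reply: a function of the form Iterator => Iterator (the form that mapPartitions takes) that opens one client per partition and wraps each call in Try so that failures stay in the result set. Everything named here (Request, Response, HttpClient, FakeClient, FetchSketch) is made up for illustration, and the client is a stand-in trait rather than spray-client's real API. It deliberately has no Spark dependency so it can be tried on a plain iterator first.

```scala
import scala.util.Try

// Hypothetical request/response types, not taken from the thread.
final case class Request(id: Int, url: String)
final case class Response(id: Int, body: String)

// Stand-in for a real HTTP client; a real one should be configured with timeouts.
trait HttpClient {
  def get(url: String): String // assumed to throw on timeout or error
  def close(): Unit
}

object FetchSketch {
  // One client per partition: created once, reused for every request in the
  // partition, and closed when the partition is done.
  def fetchPartition(newClient: () => HttpClient)(
      requests: Iterator[Request]): Iterator[(Request, Try[Response])] = {
    val client = newClient()
    val results =
      try {
        // toList forces all calls to run before the client is closed.
        // This holds one partition's results in memory at a time.
        requests.map(r => r -> Try(Response(r.id, client.get(r.url)))).toList
      } finally client.close()
    results.iterator
  }
}

// Fake client used only for the local demo: fails for URLs containing "bad".
final class FakeClient extends HttpClient {
  def get(url: String): String =
    if (url.contains("bad")) throw new RuntimeException(s"timeout: $url")
    else s"ok:$url"
  def close(): Unit = ()
}

object Demo {
  def main(args: Array[String]): Unit = {
    val partition = Iterator(
      Request(1, "http://example.invalid/a"),
      Request(2, "http://example.invalid/bad"))
    val results = FetchSketch.fetchPartition(() => new FakeClient)(partition).toList
    // Failed requests are kept as data, so they can be retried, logged or filtered.
    val (ok, failed) = results.partition(_._2.isSuccess)
    println(s"succeeded=${ok.size} failed=${failed.size}")
  }
}
```

In an actual job, the same function would presumably be used as something like requestsRdd.mapPartitions(it => FetchSketch.fetchPartition(makeClient)(it)), where requestsRdd and makeClient are again hypothetical and makeClient would have to be serializable. Persisting the resulting RDD before further analysis is what keeps a recomputation from hitting the service again.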