Try a custom Hadoop InputFormat; I can give you some samples. I have not tried this myself, but an input split has only a length (which can be ignored if the format treats its input as non-splittable) and a String for a location. If the location is a URL into Wikipedia, the whole thing should work. Hadoop InputFormats seem to be the best way to get large (say, multi-gigabyte) files into RDDs.
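To make the idea concrete, here is a minimal sketch in plain Python of the split-plus-reader pattern described above. It is only a model of the concept, not Hadoop's API: `UrlSplit`, `read_records`, and the injectable `opener` parameter are hypothetical names made up for illustration, and a real InputFormat would be written against the Hadoop classes on the JVM side.

```python
import io
import urllib.request
from dataclasses import dataclass
from typing import BinaryIO, Callable, Iterator


@dataclass(frozen=True)
class UrlSplit:
    """Hypothetical stand-in for an input split: a location string plus a length."""
    location: str    # a URL; the only thing the reader actually needs
    length: int = 0  # ignored here, since the source is treated as non-splittable


def read_records(split: UrlSplit,
                 opener: Callable[..., BinaryIO] = urllib.request.urlopen
                 ) -> Iterator[str]:
    """Yield one decoded line at a time, so the whole resource is never held in memory."""
    with opener(split.location) as stream:
        for raw_line in stream:
            yield raw_line.decode("utf-8", errors="replace").rstrip("\r\n")


if __name__ == "__main__":
    # Assumption for the demo: an in-memory stream replaces the network call.
    def fake_opener(url: str) -> BinaryIO:
        return io.BytesIO(b"first\nsecond\n")

    demo_split = UrlSplit("http://example.invalid/page")
    print(list(read_records(demo_split, opener=fake_opener)))
```

Because the reader is a generator, records are pulled one at a time rather than materialized up front. On the PySpark side, an InputFormat of this kind would presumably be consumed through something like `SparkContext.newAPIHadoopRDD`, though I have not tested that combination.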
- Spark and Python using generator of data bigger than RAM as in... jan.zikes