FYI, newAPIHadoopFile and newAPIHadoopRDD both use the NewHadoopRDD class underneath, and that doesn't mean they can only read from HDFS. Give it a shot if you haven't tried it already (it's just the InputFormat and the record reader that differ from your approach).
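Something along these lines should do it for reading a BSON dump (untested sketch, adapted from the snippet further down this thread; the dump path is a placeholder). Note that BSONFileInputFormat reads .bson files produced by mongodump from HDFS or a local path, not a live mongodb:// URI, which is most likely where your "No FileSystem for scheme: mongodb" comes from:

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.bson.BSONObject;
import com.mongodb.hadoop.BSONFileInputFormat;

public class BsonDumpRead {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setMaster("local[4]").setAppName("BsonDumpRead"));

        Configuration bsonDataConfig = new Configuration();
        bsonDataConfig.set("mongo.job.input.format",
                "com.mongodb.hadoop.BSONFileInputFormat");

        // Hypothetical path: point this at a mongodump .bson file on HDFS
        // (or file:///...), not at a mongodb:// URI.
        JavaPairRDD<Object, BSONObject> bsonData = sc.newAPIHadoopFile(
                "hdfs:///data/dump/mydb/ratings.bson",
                BSONFileInputFormat.class, Object.class,
                BSONObject.class, bsonDataConfig);

        System.out.println("records: " + bsonData.count());
        sc.stop();
    }
}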
Thanks
Best Regards

On Mon, Aug 31, 2015 at 1:14 PM, Deepesh Maheshwari <
deepesh.maheshwar...@gmail.com> wrote:

> Hi Akhil,
>
> This code snippet is from the link below:
>
> https://github.com/crcsmnky/mongodb-spark-demo/blob/master/src/main/java/com/mongodb/spark/demo/Recommender.java
>
> There it reads data from the HDFS file system, but in our case I need to
> read from MongoDB.
>
> I tried it earlier and have now tried it again, but it gives the error
> below, which is self-explanatory:
>
> Exception in thread "main" java.io.IOException: No FileSystem for scheme:
> mongodb
>
> On Mon, Aug 31, 2015 at 1:03 PM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> Here's a piece of code which works well for us (Spark 1.4.1):
>>
>> Configuration bsonDataConfig = new Configuration();
>> bsonDataConfig.set("mongo.job.input.format",
>>         "com.mongodb.hadoop.BSONFileInputFormat");
>>
>> Configuration predictionsConfig = new Configuration();
>> predictionsConfig.set("mongo.output.uri", mongodbUri);
>>
>> JavaPairRDD<Object, BSONObject> bsonRatingsData = sc.newAPIHadoopFile(
>>         ratingsUri, BSONFileInputFormat.class, Object.class,
>>         BSONObject.class, bsonDataConfig);
>>
>> Thanks
>> Best Regards
>>
>> On Mon, Aug 31, 2015 at 12:59 PM, Deepesh Maheshwari <
>> deepesh.maheshwar...@gmail.com> wrote:
>>
>>> Hi, I am using <spark.version>1.3.0</spark.version>.
>>>
>>> I am not getting a constructor for the above values.
>>>
>>> [image: Inline image 1]
>>>
>>> So I tried to shuffle the values in the constructor:
>>>
>>> [image: Inline image 2]
>>>
>>> But it gives this error. Please suggest.
>>>
>>> [image: Inline image 3]
>>>
>>> Best Regards
>>>
>>> On Mon, Aug 31, 2015 at 12:43 PM, Akhil Das <ak...@sigmoidanalytics.com>
>>> wrote:
>>>
>>>> Can you try with these key and value classes and see the performance?
>>>>
>>>> inputFormatClassName = "com.mongodb.hadoop.MongoInputFormat"
>>>>
>>>> keyClassName = "org.apache.hadoop.io.Text"
>>>> valueClassName = "org.apache.hadoop.io.MapWritable"
>>>>
>>>> Taken from the Databricks blog
>>>> <https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html>
>>>>
>>>> Thanks
>>>> Best Regards
>>>>
>>>> On Mon, Aug 31, 2015 at 12:26 PM, Deepesh Maheshwari <
>>>> deepesh.maheshwar...@gmail.com> wrote:
>>>>
>>>>> Hi, I am trying to read MongoDB in Spark via newAPIHadoopRDD.
>>>>>
>>>>> /**** Code *****/
>>>>>
>>>>> config.set("mongo.job.input.format",
>>>>>         "com.mongodb.hadoop.MongoInputFormat");
>>>>> config.set("mongo.input.uri", SparkProperties.MONGO_OUTPUT_URI);
>>>>> config.set("mongo.input.query", "{host: 'abc.com'}");
>>>>>
>>>>> JavaSparkContext sc = new JavaSparkContext("local", "MongoOps");
>>>>>
>>>>> JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config,
>>>>>         com.mongodb.hadoop.MongoInputFormat.class,
>>>>>         Object.class,
>>>>>         BSONObject.class);
>>>>>
>>>>> long count = mongoRDD.count();
>>>>>
>>>>> There are about 1.5 million records.
>>>>> I am getting the data, but the read operation took around 15 minutes
>>>>> for the whole set.
>>>>>
>>>>> Is this API really that slow, or am I missing something?
>>>>> Please suggest an alternate approach if there is a faster way to read
>>>>> data from Mongo.
>>>>>
>>>>> Thanks,
>>>>> Deepesh
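P.S. If you stick with the live newAPIHadoopRDD read, two things are worth checking for that 15-minute count: new JavaSparkContext("local", ...) runs on a single core, so all input splits are read one after another, and the mongo-hadoop split size decides how many partitions you get in the first place. A rough sketch (untested; the URI is a placeholder, and mongo.input.split_size is the mongo-hadoop knob as far as I remember it, sized in MB with a default of 8):

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.bson.BSONObject;

public class MongoLiveRead {
    public static void main(String[] args) {
        // local[4] instead of plain "local" so splits are read in parallel.
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setMaster("local[4]").setAppName("MongoLiveRead"));

        Configuration config = new Configuration();
        config.set("mongo.job.input.format",
                "com.mongodb.hadoop.MongoInputFormat");
        // Placeholder URI -- substitute your own database and collection.
        config.set("mongo.input.uri", "mongodb://localhost:27017/mydb.events");
        config.set("mongo.input.query", "{host: 'abc.com'}");
        // Smaller splits -> more partitions -> more concurrent readers.
        config.set("mongo.input.split_size", "8");

        JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(
                config,
                com.mongodb.hadoop.MongoInputFormat.class,
                Object.class,
                BSONObject.class);

        System.out.println("count: " + mongoRDD.count());
        sc.stop();
    }
}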