You might consider another storage layer instead of MongoDB (HDFS + ORC + compression, or HDFS + Parquet + compression) to improve performance.
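A minimal sketch of that idea (untested; the LogRecord bean, field names, URIs, and HDFS paths are assumptions): do the slow Mongo read once, export to Parquet, and serve later reads from HDFS. It uses the Spark 1.3-era DataFrame API (createDataFrame, saveAsParquetFile, parquetFile) to match the versions discussed in the thread below.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;
    import org.bson.BSONObject;

    public class MongoToParquet {

        // Hypothetical bean describing the fields worth keeping; adjust to your schema.
        public static class LogRecord implements java.io.Serializable {
            private String host;
            public String getHost() { return host; }
            public void setHost(String host) { this.host = host; }
        }

        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local[*]", "MongoToParquet");
            SQLContext sqlContext = new SQLContext(sc);

            Configuration config = new Configuration();
            config.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat");
            config.set("mongo.input.uri", "mongodb://localhost:27017/db.collection"); // assumption

            // One-time (slow) read from Mongo, same shape as in the thread below.
            JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config,
                    com.mongodb.hadoop.MongoInputFormat.class,
                    Object.class, BSONObject.class);

            // Project the BSON documents onto the bean.
            JavaRDD<LogRecord> records = mongoRDD.values().map(
                    new Function<BSONObject, LogRecord>() {
                        public LogRecord call(BSONObject doc) {
                            LogRecord r = new LogRecord();
                            r.setHost((String) doc.get("host"));
                            return r;
                        }
                    });

            // Export once to columnar, compressed storage (Parquet defaults to snappy).
            DataFrame df = sqlContext.createDataFrame(records, LogRecord.class);
            df.saveAsParquetFile("hdfs:///data/records.parquet");

            // Subsequent jobs read the Parquet copy instead of hitting Mongo.
            DataFrame fromParquet = sqlContext.parquetFile("hdfs:///data/records.parquet");
            System.out.println(fromParquet.count());

            sc.stop();
        }
    }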
On Thu, Sep 3, 2015 at 9:15 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

> On SSD you will get around 30-40 MB/s on a single machine (on 4 cores).
>
> Thanks
> Best Regards

On Mon, Aug 31, 2015 at 3:13 PM, Deepesh Maheshwari <deepesh.maheshwar...@gmail.com> wrote:

> Tried it; it gives the same exception as above:
>
>     Exception in thread "main" java.io.IOException: No FileSystem for scheme: mongodb
>
> In your case, did you use the code above? What read throughput do you get?

On Mon, Aug 31, 2015 at 2:04 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

> FYI, newAPIHadoopFile and newAPIHadoopRDD both use the NewHadoopRDD class
> underneath, and that doesn't mean they will only read from HDFS. Give it
> a shot if you haven't tried it already (it's just the input format and
> the reader that differ from your approach).
>
> Thanks
> Best Regards

On Mon, Aug 31, 2015 at 1:14 PM, Deepesh Maheshwari <deepesh.maheshwar...@gmail.com> wrote:

> Hi Akhil,
>
> This code snippet is from the link below:
>
> https://github.com/crcsmnky/mongodb-spark-demo/blob/master/src/main/java/com/mongodb/spark/demo/Recommender.java
>
> There it reads data from the HDFS file system, but in our case I need to
> read from MongoDB.
>
> I tried it earlier and have now tried it again, but it gives the error
> below, which is self-explanatory:
>
>     Exception in thread "main" java.io.IOException: No FileSystem for scheme: mongodb
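For context on that exception: newAPIHadoopFile resolves its path through Hadoop's FileSystem API, so the URI scheme must be one Hadoop recognizes (hdfs://, file://, and so on), and BSONFileInputFormat reads static .bson dump files (e.g. mongodump output) rather than a live mongodb:// endpoint. A minimal sketch of the working shape, assuming a dump has been copied to HDFS (the path is hypothetical and sc is an existing JavaSparkContext):

    import com.mongodb.hadoop.BSONFileInputFormat;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.bson.BSONObject;

    // BSONFileInputFormat expects a .bson dump on a Hadoop-readable
    // filesystem; handing it a mongodb:// URI is what triggers
    // "No FileSystem for scheme: mongodb".
    Configuration bsonDataConfig = new Configuration();
    JavaPairRDD<Object, BSONObject> bsonData = sc.newAPIHadoopFile(
            "hdfs:///dumps/ratings.bson",  // assumption: mongodump output on HDFS
            BSONFileInputFormat.class, Object.class,
            BSONObject.class, bsonDataConfig);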
On Mon, Aug 31, 2015 at 1:03 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

> Here's a piece of code which works well for us (Spark 1.4.1):
>
>     Configuration bsonDataConfig = new Configuration();
>     bsonDataConfig.set("mongo.job.input.format",
>             "com.mongodb.hadoop.BSONFileInputFormat");
>
>     Configuration predictionsConfig = new Configuration();
>     predictionsConfig.set("mongo.output.uri", mongodbUri);
>
>     JavaPairRDD<Object, BSONObject> bsonRatingsData = sc.newAPIHadoopFile(
>             ratingsUri, BSONFileInputFormat.class, Object.class,
>             BSONObject.class, bsonDataConfig);
>
> Thanks
> Best Regards

On Mon, Aug 31, 2015 at 12:59 PM, Deepesh Maheshwari <deepesh.maheshwar...@gmail.com> wrote:

> Hi, I am using <spark.version>1.3.0</spark.version>.
>
> I am not getting a constructor for the above values.
>
> [image: Inline image 1]
>
> So I tried to shuffle the values in the constructor.
>
> [image: Inline image 2]
>
> But it gives this error. Please suggest.
>
> [image: Inline image 3]
>
> Best Regards

On Mon, Aug 31, 2015 at 12:43 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

> Can you try with these key and value classes and see the performance?
>
>     inputFormatClassName = "com.mongodb.hadoop.MongoInputFormat"
>
>     keyClassName = "org.apache.hadoop.io.Text"
>     valueClassName = "org.apache.hadoop.io.MapWritable"
>
> Taken from the Databricks blog
> <https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html>
>
> Thanks
> Best Regards

On Mon, Aug 31, 2015 at 12:26 PM, Deepesh Maheshwari <deepesh.maheshwar...@gmail.com> wrote:

> Hi, I am trying to read MongoDB in Spark via newAPIHadoopRDD.
>
> /**** Code *****/
>
>     config.set("mongo.job.input.format",
>             "com.mongodb.hadoop.MongoInputFormat");
>     config.set("mongo.input.uri", SparkProperties.MONGO_OUTPUT_URI);
>     config.set("mongo.input.query", "{host: 'abc.com'}");
>
>     JavaSparkContext sc = new JavaSparkContext("local", "MongoOps");
>
>     JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config,
>             com.mongodb.hadoop.MongoInputFormat.class,
>             Object.class,
>             BSONObject.class);
>
>     long count = mongoRDD.count();
>
> There are about 1.5 million records. I am getting the data, but the read
> took around 15 minutes for the whole collection.
>
> Is this API really that slow, or am I missing something? Please suggest
> an alternate approach if there is a faster way to read data from Mongo.
>
> Thanks,
> Deepesh
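On the original question itself, two knobs are worth checking before blaming the connector. A master of "local" gives Spark a single worker thread, so all input splits are read sequentially; "local[*]" uses one thread per core. And the mongo-hadoop connector's split size controls how many partitions (and thus parallel readers) the read produces. A minimal sketch (untested; the mongo.input.split_size option, its value, and the URI are assumptions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.bson.BSONObject;

    public class MongoReadTuned {
        public static void main(String[] args) {
            Configuration config = new Configuration();
            config.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat");
            config.set("mongo.input.uri", "mongodb://localhost:27017/db.collection"); // assumption
            config.set("mongo.input.query", "{host: 'abc.com'}");
            // Assumption: smaller splits mean more partitions, hence more
            // concurrent readers; mongo-hadoop treats this value as MB.
            config.set("mongo.input.split_size", "8");

            // "local" = one worker thread; "local[*]" = one per core, so the
            // splits above can actually be read in parallel.
            JavaSparkContext sc = new JavaSparkContext("local[*]", "MongoOps");

            JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config,
                    com.mongodb.hadoop.MongoInputFormat.class,
                    Object.class, BSONObject.class);

            System.out.println("count = " + mongoRDD.count());
            sc.stop();
        }
    }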