Re: Slow Mongo Read from Spark

2015-09-03 Thread Jörn Franke
You might think about a storage layer other than MongoDB (hdfs+orc+compression or hdfs+parquet+compression) to improve performance. On Thu, Sep 3, 2015 at 9:15, Akhil Das wrote: > On SSD you will get around 30-40MB/s on a single machine (on 4 cores).
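For reference, a minimal sketch (not part of the original message) of what the suggested HDFS+Parquet route could look like with the Spark 1.4-era Java API. The HDFS path is a hypothetical placeholder, and the "host" column simply echoes the mongo.input.query used elsewhere in the thread.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class ParquetReadSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ParquetReadSketch");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Parquet files carry their own schema and compression, so unlike
        // the mongo-hadoop path there is no input-format configuration.
        // The path below is a placeholder.
        DataFrame events = sqlContext.read().parquet("hdfs:///data/events.parquet");

        // Filter on a column, analogous to the mongo.input.query in the thread.
        events.filter(events.col("host").equalTo("abc.com")).show();
    }
}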

Re: Slow Mongo Read from Spark

2015-09-03 Thread Deepesh Maheshwari
Because of the existing architecture, I am bound to use MongoDB. Please suggest something for this case. On Thu, Sep 3, 2015 at 9:10 PM, Jörn Franke wrote: > You might think about a storage layer other than MongoDB > (hdfs+orc+compression or hdfs+parquet+compression) to improve

Re: Slow Mongo Read from Spark

2015-09-03 Thread Akhil Das
On SSD you will get around 30-40MB/s on a single machine (on 4 cores). Thanks Best Regards On Mon, Aug 31, 2015 at 3:13 PM, Deepesh Maheshwari < deepesh.maheshwar...@gmail.com> wrote: > Tried it; it gives the same exception as above: > > Exception in thread "main" java.io.IOException: No FileSystem

Slow Mongo Read from Spark

2015-08-31 Thread Deepesh Maheshwari
Hi, I am trying to read MongoDB in Spark using newAPIHadoopRDD.

/* Code */
config.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat");
config.set("mongo.input.uri", SparkProperties.MONGO_OUTPUT_URI);
config.set("mongo.input.query", "{host: 'abc.com'}");
JavaSparkContext sc = new
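For context, a self-contained sketch of the read described above, assuming the mongo-hadoop connector is on the classpath. MongoInputFormat yields (Object, BSONObject) pairs, one per document; the URI below is a placeholder standing in for SparkProperties.MONGO_OUTPUT_URI.

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.bson.BSONObject;
import com.mongodb.hadoop.MongoInputFormat;

public class MongoReadSketch {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("MongoReadSketch");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        Configuration config = new Configuration();
        // Same settings as in the message; the URI is a placeholder.
        config.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat");
        config.set("mongo.input.uri", "mongodb://localhost:27017/db.collection");
        config.set("mongo.input.query", "{host: 'abc.com'}");

        // MongoInputFormat is declared with Object keys and BSONObject values.
        JavaPairRDD<Object, BSONObject> docs = sc.newAPIHadoopRDD(
                config, MongoInputFormat.class, Object.class, BSONObject.class);

        System.out.println("Documents read: " + docs.count());
    }
}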

Re: Slow Mongo Read from Spark

2015-08-31 Thread Akhil Das
Can you try with these key/value classes and see the performance?

inputFormatClassName = "com.mongodb.hadoop.MongoInputFormat"
keyClassName = "org.apache.hadoop.io.Text"
valueClassName = "org.apache.hadoop.io.MapWritable"

Taken from the Databricks blog

Re: Slow Mongo Read from Spark

2015-08-31 Thread Akhil Das
Here's a piece of code which works well for us (Spark 1.4.1):

Configuration bsonDataConfig = new Configuration();
bsonDataConfig.set("mongo.job.input.format", "com.mongodb.hadoop.BSONFileInputFormat");
Configuration predictionsConfig = new Configuration();
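A hedged reconstruction (not the full original code) of how such a BSONFileInputFormat read is typically wired up, following the pattern in the Recommender.java demo linked later in the thread. BSONFileInputFormat reads mongodump-style .bson files rather than a live MongoDB; the dump path is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.bson.BSONObject;
import com.mongodb.hadoop.BSONFileInputFormat;

public class BsonFileReadSketch {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("BsonFileReadSketch");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        Configuration bsonDataConfig = new Configuration();
        bsonDataConfig.set("mongo.job.input.format",
                "com.mongodb.hadoop.BSONFileInputFormat");

        // Because the input is a .bson dump file, it can live on HDFS,
        // which sidesteps reading from a live MongoDB entirely.
        // The path below is a placeholder.
        JavaPairRDD<Object, BSONObject> bsonData = sc.newAPIHadoopFile(
                "hdfs:///dumps/collection.bson",
                BSONFileInputFormat.class, Object.class, BSONObject.class,
                bsonDataConfig);

        System.out.println("Records: " + bsonData.count());
    }
}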

Re: Slow Mongo Read from Spark

2015-08-31 Thread Akhil Das
FYI, newAPIHadoopFile and newAPIHadoopRDD both use the NewHadoopRDD class underneath, and that does not mean they can only read from HDFS. Give it a shot if you haven't tried it already (it's just the InputFormat and the reader that differ from your approach). Thanks Best Regards On Mon, Aug

Re: Slow Mongo Read from Spark

2015-08-31 Thread Deepesh Maheshwari
Hi Akhil, This code snippet is from the link below: https://github.com/crcsmnky/mongodb-spark-demo/blob/master/src/main/java/com/mongodb/spark/demo/Recommender.java Here it reads data from the HDFS file system, but in our case I need to read from MongoDB. I tried it earlier and have now tried it