You might consider another storage layer instead of MongoDB (HDFS + ORC + compression, or HDFS + Parquet + compression) to improve performance.
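A minimal sketch of that idea (untested; the LogRecord bean, field names, URIs, and HDFS paths are assumptions): do the slow Mongo read once, export to Parquet, and serve later reads from HDFS. It uses the Spark 1.3-era DataFrame API (createDataFrame, saveAsParquetFile, parquetFile) to match the versions discussed in the thread below.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;
    import org.bson.BSONObject;

    public class MongoToParquet {

        // Hypothetical bean describing the fields worth keeping; adjust to your schema.
        public static class LogRecord implements java.io.Serializable {
            private String host;
            public String getHost() { return host; }
            public void setHost(String host) { this.host = host; }
        }

        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local[*]", "MongoToParquet");
            SQLContext sqlContext = new SQLContext(sc);

            Configuration config = new Configuration();
            config.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat");
            config.set("mongo.input.uri", "mongodb://localhost:27017/db.collection"); // assumption

            // One-time (slow) read from Mongo, same shape as in the thread below.
            JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config,
                    com.mongodb.hadoop.MongoInputFormat.class,
                    Object.class, BSONObject.class);

            // Project the BSON documents onto the bean.
            JavaRDD<LogRecord> records = mongoRDD.values().map(
                    new Function<BSONObject, LogRecord>() {
                        public LogRecord call(BSONObject doc) {
                            LogRecord r = new LogRecord();
                            r.setHost((String) doc.get("host"));
                            return r;
                        }
                    });

            // Export once to columnar, compressed storage (Parquet defaults to snappy).
            DataFrame df = sqlContext.createDataFrame(records, LogRecord.class);
            df.saveAsParquetFile("hdfs:///data/records.parquet");

            // Subsequent jobs read the Parquet copy instead of hitting Mongo.
            DataFrame fromParquet = sqlContext.parquetFile("hdfs:///data/records.parquet");
            System.out.println(fromParquet.count());

            sc.stop();
        }
    }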
On Thu, Sep 3, 2015 at 9:15 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

> On SSD you will get around 30-40 MB/s on a single machine (on 4 cores).
>
> Thanks
> Best Regards

On Mon, Aug 31, 2015 at 3:13 PM, Deepesh Maheshwari <deepesh.maheshwar...@gmail.com> wrote:

> Tried it; it gives the same exception as above:
>
>     Exception in thread "main" java.io.IOException: No FileSystem for scheme: mongodb
>
> In your case, did you use the code above? What read throughput do you get?

On Mon, Aug 31, 2015 at 2:04 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

> FYI, newAPIHadoopFile and newAPIHadoopRDD both use the NewHadoopRDD class
> underneath, and that doesn't mean they will only read from HDFS. Give it
> a shot if you haven't tried it already (it's just the input format and
> the reader that differ from your approach).
>
> Thanks
> Best Regards

On Mon, Aug 31, 2015 at 1:14 PM, Deepesh Maheshwari <deepesh.maheshwar...@gmail.com> wrote:

> Hi Akhil,
>
> This code snippet is from the link below:
>
> https://github.com/crcsmnky/mongodb-spark-demo/blob/master/src/main/java/com/mongodb/spark/demo/Recommender.java
>
> There it reads data from the HDFS file system, but in our case I need to
> read from MongoDB.
>
> I tried it earlier and have now tried it again, but it gives the error
> below, which is self-explanatory:
>
>     Exception in thread "main" java.io.IOException: No FileSystem for scheme: mongodb
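For context on that exception: newAPIHadoopFile resolves its path through Hadoop's FileSystem API, so the URI scheme must be one Hadoop recognizes (hdfs://, file://, and so on), and BSONFileInputFormat reads static .bson dump files (e.g. mongodump output) rather than a live mongodb:// endpoint. A minimal sketch of the working shape, assuming a dump has been copied to HDFS (the path is hypothetical and sc is an existing JavaSparkContext):

    import com.mongodb.hadoop.BSONFileInputFormat;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.bson.BSONObject;

    // BSONFileInputFormat expects a .bson dump on a Hadoop-readable
    // filesystem; handing it a mongodb:// URI is what triggers
    // "No FileSystem for scheme: mongodb".
    Configuration bsonDataConfig = new Configuration();
    JavaPairRDD<Object, BSONObject> bsonData = sc.newAPIHadoopFile(
            "hdfs:///dumps/ratings.bson",  // assumption: mongodump output on HDFS
            BSONFileInputFormat.class, Object.class,
            BSONObject.class, bsonDataConfig);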
On Mon, Aug 31, 2015 at 1:03 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

> Here's a piece of code which works well for us (Spark 1.4.1):
>
>     Configuration bsonDataConfig = new Configuration();
>     bsonDataConfig.set("mongo.job.input.format",
>             "com.mongodb.hadoop.BSONFileInputFormat");
>
>     Configuration predictionsConfig = new Configuration();
>     predictionsConfig.set("mongo.output.uri", mongodbUri);
>
>     JavaPairRDD<Object, BSONObject> bsonRatingsData = sc.newAPIHadoopFile(
>             ratingsUri, BSONFileInputFormat.class, Object.class,
>             BSONObject.class, bsonDataConfig);
>
> Thanks
> Best Regards

On Mon, Aug 31, 2015 at 12:59 PM, Deepesh Maheshwari <deepesh.maheshwar...@gmail.com> wrote:

> Hi, I am using <spark.version>1.3.0</spark.version>.
>
> I am not getting a constructor for the above values.
>
> [image: Inline image 1]
>
> So I tried to shuffle the values in the constructor.
>
> [image: Inline image 2]
>
> But it gives this error. Please suggest.
>
> [image: Inline image 3]
>
> Best Regards

On Mon, Aug 31, 2015 at 12:43 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

> Can you try with these key and value classes and see the performance?
>
>     inputFormatClassName = "com.mongodb.hadoop.MongoInputFormat"
>
>     keyClassName = "org.apache.hadoop.io.Text"
>     valueClassName = "org.apache.hadoop.io.MapWritable"
>
> Taken from the Databricks blog
> <https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html>
>
> Thanks
> Best Regards

On Mon, Aug 31, 2015 at 12:26 PM, Deepesh Maheshwari <deepesh.maheshwar...@gmail.com> wrote:

> Hi, I am trying to read MongoDB in Spark via newAPIHadoopRDD.
>
> /**** Code *****/
>
>     config.set("mongo.job.input.format",
>             "com.mongodb.hadoop.MongoInputFormat");
>     config.set("mongo.input.uri", SparkProperties.MONGO_OUTPUT_URI);
>     config.set("mongo.input.query", "{host: 'abc.com'}");
>
>     JavaSparkContext sc = new JavaSparkContext("local", "MongoOps");
>
>     JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config,
>             com.mongodb.hadoop.MongoInputFormat.class,
>             Object.class,
>             BSONObject.class);
>
>     long count = mongoRDD.count();
>
> There are about 1.5 million records. I am getting the data, but the read
> took around 15 minutes for the whole collection.
>
> Is this API really that slow, or am I missing something? Please suggest
> an alternate approach if there is a faster way to read data from Mongo.
>
> Thanks,
> Deepesh
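On the original question itself, two knobs are worth checking before blaming the connector. A master of "local" gives Spark a single worker thread, so all input splits are read sequentially; "local[*]" uses one thread per core. And the mongo-hadoop connector's split size controls how many partitions (and thus parallel readers) the read produces. A minimal sketch (untested; the mongo.input.split_size option, its value, and the URI are assumptions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.bson.BSONObject;

    public class MongoReadTuned {
        public static void main(String[] args) {
            Configuration config = new Configuration();
            config.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat");
            config.set("mongo.input.uri", "mongodb://localhost:27017/db.collection"); // assumption
            config.set("mongo.input.query", "{host: 'abc.com'}");
            // Assumption: smaller splits mean more partitions, hence more
            // concurrent readers; mongo-hadoop treats this value as MB.
            config.set("mongo.input.split_size", "8");

            // "local" = one worker thread; "local[*]" = one per core, so the
            // splits above can actually be read in parallel.
            JavaSparkContext sc = new JavaSparkContext("local[*]", "MongoOps");

            JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config,
                    com.mongodb.hadoop.MongoInputFormat.class,
                    Object.class, BSONObject.class);

            System.out.println("count = " + mongoRDD.count());
            sc.stop();
        }
    }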