Because of the existing architecture, I am bound to use MongoDB. Please suggest an approach for this.
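One direction while staying on MongoDB is to make the scan itself cheaper on the same newAPIHadoopRDD path used later in this thread: push the filter down with mongo.input.query, project only the needed fields, and lower the split size so more partitions read in parallel. Below is a minimal, untested sketch; the URI, field names, and split size are placeholders, and the mongo.input.fields / mongo.input.split_size options are assumed to behave as described in the mongo-hadoop connector documentation.

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.bson.BSONObject;

public class MongoReadTuned {
    public static void main(String[] args) {
        Configuration config = new Configuration();
        // Hypothetical URI -- replace with your own database.collection.
        config.set("mongo.input.uri", "mongodb://localhost:27017/db.events");
        // Push the filter down to MongoDB instead of scanning everything.
        config.set("mongo.input.query", "{host: 'abc.com'}");
        // Assumed option: project only needed fields to shrink each document.
        config.set("mongo.input.fields", "{host: 1, time: 1}");
        // Assumed option: split size in MB; smaller splits -> more partitions.
        config.set("mongo.input.split_size", "4");

        // "local" runs a single task thread; "local[*]" uses all cores.
        JavaSparkContext sc = new JavaSparkContext("local[*]", "MongoOps");

        JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(
                config,
                com.mongodb.hadoop.MongoInputFormat.class,
                Object.class,
                BSONObject.class);

        System.out.println("count = " + mongoRDD.count());
        sc.stop();
    }
}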
On Thu, Sep 3, 2015 at 9:10 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> You might think about another storage layer instead of MongoDB
> (HDFS+ORC+compression or HDFS+Parquet+compression) to improve
> performance.
>
> On Thu, Sep 3, 2015 at 9:15 AM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> On SSD you will get around 30-40 MB/s on a single machine (on 4 cores).
>>
>> Thanks
>> Best Regards
>>
>> On Mon, Aug 31, 2015 at 3:13 PM, Deepesh Maheshwari <
>> deepesh.maheshwar...@gmail.com> wrote:
>>
>>> Tried it; it gives the same exception as above:
>>>
>>> Exception in thread "main" java.io.IOException: No FileSystem for
>>> scheme: mongodb
>>>
>>> In your case, did you use the above code? What read throughput do
>>> you get?
>>>
>>> On Mon, Aug 31, 2015 at 2:04 PM, Akhil Das <ak...@sigmoidanalytics.com>
>>> wrote:
>>>
>>>> FYI, newAPIHadoopFile and newAPIHadoopRDD both use the NewHadoopRDD
>>>> class underneath, and that does not mean they can only read from
>>>> HDFS. Give it a shot if you haven't tried it already (it is just the
>>>> input format and the reader that differ from your approach).
>>>>
>>>> Thanks
>>>> Best Regards
>>>>
>>>> On Mon, Aug 31, 2015 at 1:14 PM, Deepesh Maheshwari <
>>>> deepesh.maheshwar...@gmail.com> wrote:
>>>>
>>>>> Hi Akhil,
>>>>>
>>>>> This code snippet is from the link below:
>>>>>
>>>>> https://github.com/crcsmnky/mongodb-spark-demo/blob/master/src/main/java/com/mongodb/spark/demo/Recommender.java
>>>>>
>>>>> There it reads data from the HDFS file system, but in our case I
>>>>> need to read from MongoDB.
>>>>>
>>>>> I tried it earlier and have now tried it again, but it gives the
>>>>> error below, which is self-explanatory:
>>>>>
>>>>> Exception in thread "main" java.io.IOException: No FileSystem for
>>>>> scheme: mongodb
>>>>>
>>>>> On Mon, Aug 31, 2015 at 1:03 PM, Akhil Das
>>>>> <ak...@sigmoidanalytics.com> wrote:
>>>>>
>>>>>> Here's a piece of code which works well for us (Spark 1.4.1):
>>>>>>
>>>>>> Configuration bsonDataConfig = new Configuration();
>>>>>> bsonDataConfig.set("mongo.job.input.format",
>>>>>>         "com.mongodb.hadoop.BSONFileInputFormat");
>>>>>>
>>>>>> Configuration predictionsConfig = new Configuration();
>>>>>> predictionsConfig.set("mongo.output.uri", mongodbUri);
>>>>>>
>>>>>> JavaPairRDD<Object, BSONObject> bsonRatingsData =
>>>>>>     sc.newAPIHadoopFile(
>>>>>>         ratingsUri, BSONFileInputFormat.class, Object.class,
>>>>>>         BSONObject.class, bsonDataConfig);
>>>>>>
>>>>>> Thanks
>>>>>> Best Regards
>>>>>>
>>>>>> On Mon, Aug 31, 2015 at 12:59 PM, Deepesh Maheshwari <
>>>>>> deepesh.maheshwar...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, I am using <spark.version>1.3.0</spark.version>.
>>>>>>>
>>>>>>> I am not getting a constructor for the above values.
>>>>>>>
>>>>>>> [image: Inline image 1]
>>>>>>>
>>>>>>> So I tried to shuffle the values in the constructor:
>>>>>>>
>>>>>>> [image: Inline image 2]
>>>>>>>
>>>>>>> But it gives this error. Please suggest.
>>>>>>>
>>>>>>> [image: Inline image 3]
>>>>>>>
>>>>>>> Best Regards
>>>>>>>
>>>>>>> On Mon, Aug 31, 2015 at 12:43 PM, Akhil Das <
>>>>>>> ak...@sigmoidanalytics.com> wrote:
>>>>>>>
>>>>>>>> Can you try with these key value classes and see the performance?
>>>>>>>>
>>>>>>>> inputFormatClassName = "com.mongodb.hadoop.MongoInputFormat"
>>>>>>>>
>>>>>>>> keyClassName = "org.apache.hadoop.io.Text"
>>>>>>>> valueClassName = "org.apache.hadoop.io.MapWritable"
>>>>>>>>
>>>>>>>> Taken from the Databricks blog:
>>>>>>>> https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Best Regards
>>>>>>>>
>>>>>>>> On Mon, Aug 31, 2015 at 12:26 PM, Deepesh Maheshwari <
>>>>>>>> deepesh.maheshwar...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi, I am trying to read MongoDB in Spark via newAPIHadoopRDD.
>>>>>>>>>
>>>>>>>>> /**** Code *****/
>>>>>>>>>
>>>>>>>>> config.set("mongo.job.input.format",
>>>>>>>>>         "com.mongodb.hadoop.MongoInputFormat");
>>>>>>>>> config.set("mongo.input.uri", SparkProperties.MONGO_OUTPUT_URI);
>>>>>>>>> config.set("mongo.input.query", "{host: 'abc.com'}");
>>>>>>>>>
>>>>>>>>> JavaSparkContext sc = new JavaSparkContext("local", "MongoOps");
>>>>>>>>>
>>>>>>>>> JavaPairRDD<Object, BSONObject> mongoRDD =
>>>>>>>>>     sc.newAPIHadoopRDD(config,
>>>>>>>>>         com.mongodb.hadoop.MongoInputFormat.class,
>>>>>>>>>         Object.class,
>>>>>>>>>         BSONObject.class);
>>>>>>>>>
>>>>>>>>> long count = mongoRDD.count();
>>>>>>>>>
>>>>>>>>> There are about 1.5 million records. I am getting the data, but
>>>>>>>>> the read operation took around 15 minutes for the whole
>>>>>>>>> collection.
>>>>>>>>>
>>>>>>>>> Is this API really that slow, or am I missing something?
>>>>>>>>> Please suggest if there is a faster approach to read data from
>>>>>>>>> MongoDB.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Deepesh
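A note on the recurring error in this thread: "No FileSystem for scheme: mongodb" is raised because newAPIHadoopFile resolves its path through Hadoop's FileSystem API, and no FileSystem is registered for the mongodb:// scheme. BSONFileInputFormat reads static .bson dump files (mongodump output) from a Hadoop-visible filesystem such as hdfs:// or file://, not a live server. Below is a minimal, untested sketch mirroring the snippet quoted above; the dump path and class name are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.bson.BSONObject;

import com.mongodb.hadoop.BSONFileInputFormat;

public class BsonDumpRead {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "BsonDumpRead");

        Configuration bsonDataConfig = new Configuration();
        bsonDataConfig.set("mongo.job.input.format",
                "com.mongodb.hadoop.BSONFileInputFormat");

        // Hypothetical path: must be a real filesystem URI (hdfs:// or
        // file://) pointing at mongodump output, never mongodb://.
        String dumpUri = "hdfs:///dumps/mydb/events.bson";

        JavaPairRDD<Object, BSONObject> bsonData = sc.newAPIHadoopFile(
                dumpUri,
                BSONFileInputFormat.class,
                Object.class,
                BSONObject.class,
                bsonDataConfig);

        System.out.println("count = " + bsonData.count());
        sc.stop();
    }
}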