Thanks Aaron. I will test it out today and let you know the result.

Regards,
Dibyendu

On Wed, Oct 29, 2014 at 6:59 AM, Aaron McCurry <[email protected]> wrote:

> Dibyendu, you were right. After looking into the problem (and hitting it
> myself) the maybeMerge call was missing. Here's the commit that fixes the
> issue.
>
> https://github.com/apache/incubator-blur/commit/ce2179ed5cb5a534275678ad0982fef4e99fbb49
>
> Also we have made a few fixes to resource management over the past few days
> so when you get a chance I would recommend that you upgrade to the latest
> version on master. Thanks!
>
> Aaron
>
> On Sun, Oct 26, 2014 at 9:29 PM, Aaron McCurry <[email protected]> wrote:
>
> > On Sun, Oct 26, 2014 at 1:48 PM, Dibyendu Bhattacharya <
> > [email protected]> wrote:
> >
> >> I am using the following setting when indexing via Spark:
> >>
> >> BlurOutputFormat.setIndexLocally(conf, false);
> >>
> >> Is the merge not happening because of this? I can see that the
> >> GenericBlurRecordWriter flush method calls maybeMerge only when it is
> >> using the local temp index.
> >>
> >
> > The index locally setting is used to index directly in HDFS from MR. The
> > normal behavior is to index the update locally through the
> > BlurOutputFormat. Once complete, the index is then copied and optimized in
> > flight to HDFS where the Blur table is being served. So when Index Local =
> > false one or more segments are added to the index. When Index Local = true
> > only one segment is added to the index regardless of size.
> >
> > Aaron
> >
> >>
> >> Dibyendu
> >>
> >> On Sun, Oct 26, 2014 at 9:47 PM, Aaron McCurry <[email protected]>
> >> wrote:
> >>
> >> > On Sun, Oct 26, 2014 at 1:21 AM, Dibyendu Bhattacharya <
> >> > [email protected]> wrote:
> >> >
> >> > > Some more observations.
> >> > >
> >> > > As I said, when I index using the Spark RDD saveAsHadoop* API, a
> >> > > bunch of .lnk files and inuse folders get created which never get
> >> > > merged/deleted.
> >> > > But when I stopped the Spark job which uses the Hadoop API and
> >> > > started another Spark job for the same table which uses the Blur Thrift
> >> > > enqueue mutate call, I can see all the previous .lnk files and inuse
> >> > > folders are eventually merged and deleted. The index count is fine and
> >> > > new documents also keep getting added to the index.
> >> > >
> >> >
> >> > Ok, we are in the process of testing this issue. We will let you know what
> >> > we find.
> >> >
> >> > >
> >> > > So I do not think there is any issue with permissions. Probably the
> >> > > merge logic is not getting started when indexing happens through
> >> > > BlurOutputFormat.
> >> > >
> >> >
> >> > Not sure when the maybeMerge call was integrated into the external index
> >> > loading, but it should be merging the segments.
> >> >
> >> > Aaron
> >> >
> >> > >
> >> > > Regards,
> >> > > Dibyendu
> >> > >
> >> > > On Sat, Oct 25, 2014 at 4:28 AM, Tim Williams <[email protected]>
> >> > > wrote:
> >> > >
> >> > >> On Fri, Oct 24, 2014 at 3:48 PM, Aaron McCurry <[email protected]>
> >> > >> wrote:
> >> > >>
> >> > >> > Didn't reply to all.
> >> > >> >
> >> > >> > ---------- Forwarded message ----------
> >> > >> > From: Aaron McCurry <[email protected]>
> >> > >> > Date: Fri, Oct 24, 2014 at 3:47 PM
> >> > >> > Subject: Re: Some Performance number of Spark Blur Connector
> >> > >> > To: Dibyendu Bhattacharya <[email protected]>
> >> > >> >
> >> > >> > On Fri, Oct 24, 2014 at 2:19 PM, Dibyendu Bhattacharya <
> >> > >> > [email protected]> wrote:
> >> > >> >
> >> > >> >> Hi Aaron,
> >> > >> >>
> >> > >> >> Here are some performance numbers comparing enqueue mutate and RDD
> >> > >> >> saveAsHadoopFile, both using Spark Streaming.
> >> > >> >>
> >> > >> >> The setup I used is not a very optimized one, but it can give an
> >> > >> >> idea about both methods of indexing via Spark Streaming.
> >> > >> >>
> >> > >> >> I used a 4-node EMR M1.Xlarge cluster, and installed Blur as 1
> >> > >> >> Controller and 3 Shard Servers. My Blur table has 9 partitions.
> >> > >> >>
> >> > >> >> On the same cluster, I was running Spark with 1 Master and 3 Workers.
> >> > >> >> This is not a good setup but anyway, here are the numbers.
> >> > >> >>
> >> > >> >> The enqueueMutate index rate is around 800 messages/second.
> >> > >> >>
> >> > >> >> The RDD saveAsHadoopFile index rate is around 12,000 messages/second.
> >> > >> >>
> >> > >> >> That is about 15 times faster, roughly one order of magnitude.
> >> > >> >>
> >> > >> >
> >> > >> > That's awesome, thank you for sharing!
> >> > >> >
> >> > >> >>
> >> > >> >> Not sure if this is an issue with the saveAsHadoopFile approach, but
> >> > >> >> I can see that in the Shard folder in HDFS lots of small Lucene *.lnk
> >> > >> >> files are getting created (probably one for each saveAsHadoopFile
> >> > >> >> call) and there are that many "inuse" folders, as you see in the
> >> > >> >> screenshot.
> >> > >> >>
> >> > >> >> And these entries keep increasing to a huge number if this Spark
> >> > >> >> Streaming job keeps running for some time. Not sure if this has any
> >> > >> >> impact on indexing and search performance?
> >> > >> >>
> >> > >> >
> >> > >> > They should be merged and removed over time; however, if there is a
> >> > >> > permission problem Blur might not be able to remove the inuse folders.
> >> > >> >
> >> > >>
> >> > >> In a situation where permissions are the problem the .lnk files are
> >> > >> properly cleaned while the .inuse dirs hang around. If both are hanging
> >> > >> around I suspect it's not permissions. A lesson freshly learned over
> >> > >> here:)
> >> > >>
> >> > >> --tim
