Streaming K Means exception without any reason

2014-10-09 Thread Marko Dinić
Hello everyone, I'm using Mahout Streaming K Means multiple times in a loop, every time with the same input data and a different output path each run. Concretely, I'm increasing the number of clusters in each iteration. Currently it runs on a single machine. A couple of times (maybe 3 of 20 runs)
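
A minimal sketch of the loop being described, assuming the 0.9-era streamingkmeans driver and CLI flags; paths and cluster counts are placeholders, and other required options (distance measure, searcher class) are omitted:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.util.ToolRunner
    import org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver

    object LoopedStreamingKMeans extends App {
      val conf = new Configuration()
      for (k <- Seq(5, 10, 20, 40)) {            // more clusters each iteration
        ToolRunner.run(conf, new StreamingKMeansDriver, Array(
          "-i", "/data/vectors",                 // same input every run
          "-o", s"/out/skm-k$k",                 // distinct output path per run
          "-k", k.toString,                      // final cluster count
          "-km", (3 * k * math.log(942)).ceil.toInt.toString)) // streaming-phase centroid budget
      }
    }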

Re: Streaming K Means exception without any reason

2014-10-09 Thread Suneel Marthi
I've seen this issue happen a few times before; there are a few edge conditions that need to be fixed in the Streaming KMeans code, and you are right that the generated clusters differ on successive runs given the same input. IIRC this stack trace is due to BallKMeans failing to read any input

Re: Streaming K Means exception without any reason

2014-10-09 Thread Marko Dinić
Suneel, thank you for your answer, this was rather strange to me. The number of points is 942. I have multiple runs; in each run I have a loop in which the number of clusters is increased in each iteration, and I multiply that number by 3, since I'm expecting log(n) initial centroids, before Ball
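
A worked check of that sizing, using the thread's numbers: n = 942 points, the k*log(n) heuristic for streaming-phase centroids, and the 3x factor mentioned above (k here is an assumed example value):

    val n = 942
    val k = 20
    val streamingCentroids = (3 * k * math.log(n)).ceil.toInt // 3 * 20 * ln(942) ≈ 411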

Re: Streaming K Means exception without any reason

2014-10-09 Thread Suneel Marthi
Heh, your data size is tiny indeed. One of the edge conditions I was alluding to was the failure of this implementation on tiny datasets. Do you see any output clusters? If so, how many points? Is it possible to share your dataset to troubleshoot? On Thu, Oct 9, 2014 at 9:18 AM, Marko Dinić

Re: Streaming K Means exception without any reason

2014-10-09 Thread Marko Dinić
Yes, it is small, but it is just a sample, so the dataset will probably be much bigger. So you think that this was the problem? Will this problem be avoided in the case of a larger dataset? I think that there were no output clusters, as I remember. I'm sending the dataset, if you want to take a

Re: Streaming K Means exception without any reason

2014-10-09 Thread Marko Dinić
Here is the dataset. On Thursday, 09 October 2014 16:53:25 CEST, Marko Dinić wrote: Yes, it is small, but it is just a sample, so the dataset will probably be much bigger. So you think that this was the problem? Will this problem be avoided in the case of a larger dataset? I think that there were

Re: Streaming K Means exception without any reason

2014-10-09 Thread Marko Dinić
Here is the dataset; I've just checked to be sure it is the right one. On 09.10.2014. 15:34, Suneel Marthi wrote: Heh, your data size is tiny indeed. One of the edge conditions I was alluding to was the failure of this implementation on tiny datasets. Do you see any output clusters? If so, how

Re: SSVD: lease conflict due to 2 attempts using the same dir

2014-10-09 Thread Yang
Yes, that's what I'm saying: I disabled speculative execution and it works for now (kind of a hack). Also, yes, this is Hadoop 2.0 with YARN. This has nothing to do with overwrite mode; the two attempts are run simultaneously because they are speculative runs. On Wed, Oct 8, 2014 at 12:07 AM, Serega Sheypak
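
A minimal sketch of that workaround, with the Hadoop 2/YARN property names; disabling speculative execution means two attempts of the same task never race on one output directory:

    import org.apache.hadoop.conf.Configuration

    val conf = new Configuration()
    conf.setBoolean("mapreduce.map.speculative", false)
    conf.setBoolean("mapreduce.reduce.speculative", false)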

SSVD Q-Job taking very long even after 100% ?

2014-10-09 Thread Yang
My Q-Job MR job shows as 100% mapper complete (it's a map-only job) very quickly, but the job itself does not finish until about 10 minutes later. This is rather surprising. My input is a sparse matrix of 37,000 rows, and the column count is 8,000, with each row usually having 10 elements set to
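
A back-of-envelope size of that input, which shows why a ten-minute tail after 100% map completion is surprising:

    val rows = 37000
    val nnzPerRow = 10                      // elements set per row, per the post
    val entries = rows * nnzPerRow          // 370,000 non-zeros
    val approxMB = entries * (4 + 8) / 1e6  // int index + double value ≈ 4.4 MB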

Re: SSVD: lease conflict due to 2 attempts using the same dir

2014-10-09 Thread Dmitriy Lyubimov
Wow, good to know. However, it would seem to me like a bug in MultipleOutputs? Either way, it doesn't seem to have anything to do with the Mahout code itself. On Thu, Oct 9, 2014 at 10:32 AM, Yang tedd...@gmail.com wrote: Yes, that's what I'm saying: I disabled speculative execution and it works for now (kind

Re: SSVD Q-Job taking very long even after 100% ?

2014-10-09 Thread Yang
It's possible that they are compressing the output. I'm now rebuilding the code after commenting out the setOutputCompress(true) call, and I will also run with the compression param set to false. Still, it's quite surprising that compression should take so long (8-10 minutes). On Thu, Oct 9, 2014
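
For reference, a sketch of the programmatic equivalent of that toggle; FileOutputFormat.setCompressOutput drives the same property (mapreduce.output.fileoutputformat.compress) being discussed:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

    val job = Job.getInstance(new Configuration())
    FileOutputFormat.setCompressOutput(job, false)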

RE: SSVD: lease conflict due to 2 attempts using the same dir

2014-10-09 Thread Ken Krugler
The BtJob's BtMapper has some interesting logic in its setup routine, where it looks like it's creating a side-channel: /* actually this is kind of dangerous because this routine thinks we need to create file name for our current job and this will use -m- so it's

Re: SSVD Q-Job taking very long even after 100% ?

2014-10-09 Thread Yang
I commented out the code about compression, but the actual job console still shows mapreduce.output.fileoutputformat.compress as true. On Thu, Oct 9, 2014 at 11:40 AM, Yang tedd...@gmail.com wrote: It's possible that they are compressing the output. I'm now rebuilding the code after
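
If a rebuilt driver still reports compress=true, the value is probably coming from the cluster's default Configuration rather than the code. Since Mahout drivers run through ToolRunner, a generic-options override should win (a hedged suggestion, not a verified fix for this job):

    // on the command line:
    //   hadoop jar mahout-job.jar <driver> -Dmapreduce.output.fileoutputformat.compress=false ...
    // or forced in code before job submission:
    import org.apache.hadoop.conf.Configuration
    val conf = new Configuration()
    conf.setBoolean("mapreduce.output.fileoutputformat.compress", false)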

Re: Mahout 1.0: is DRM too file-bound?

2014-10-09 Thread Andrew Butkus
Correct me if I'm wrong, but this is done for distributed processing on large data sets, using the map-reduce principle and a common file type for distributed processing. Sent from my iPhone. On 9 Oct 2014, at 20:56, Reinis Vicups mah...@orbit-x.de wrote: Hello, I am currently looking into

RE: SSVD: lease conflict due to 2 attempts using the same dir

2014-10-09 Thread Dmitriy Lyubimov
This is using side input, yes, but it is standard practice; for example, map-side joins do basically the same. Specifically w.r.t. opportunistic (speculative) execution this should be fine: HDFS does not disallow opening the same file for reading by multiple tasks, IIRC. Sent from my phone. On Oct 9, 2014 1:02
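
A sketch of the side-input pattern under discussion: the HDFS lease applies to writers, so any number of concurrent tasks may open the same file for reading, e.g. in a mapper's setup(). The path is a placeholder:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    def openSideInput(conf: Configuration): Unit = {
      val fs = FileSystem.get(conf)
      val in = fs.open(new Path("/jobs/ssvd/side-data")) // many tasks may read this concurrently
      try { /* consume the side data */ } finally { in.close() }
    }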

Re: SSVD Q-Job taking very long even after 100% ?

2014-10-09 Thread Dmitriy Lyubimov
I don't remember the code well enough anymore to give you details, but a lot of the jobs are actually reduce-bound. Sent from my phone. On Oct 9, 2014 11:07 AM, Yang tedd...@gmail.com wrote: My Q-Job MR job shows as 100% mapper complete (it's a map-only job) very quickly, but the job itself does

Re: Mahout 1.0: is DRM too file-bound?

2014-10-09 Thread Dmitriy Lyubimov
A matrix defines structure, not necessarily where it can be imported from. You're right in the sense that the framework itself avoids defining APIs for custom partition formation, but you're wrong in implying that you cannot do it if you wanted to, or that you'd have to do anything as complex as you say. As
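
A hedged sketch of that point, assuming the 2014-era Spark bindings: a DRM can be formed from any RDD of (key, Mahout Vector) pairs via drmWrap, so custom partition formation needs no file at all (the rows below are illustrative):

    import org.apache.mahout.math.{RandomAccessSparseVector, Vector}
    import org.apache.mahout.sparkbindings._
    import org.apache.spark.SparkContext

    def customDrm(sc: SparkContext) = {
      val rows = sc.parallelize(0 until 100).map { i =>
        val v: Vector = new RandomAccessSparseVector(1000)
        v.setQuick(i % 1000, 1.0)              // one non-zero per row, for illustration
        i -> v
      }
      drmWrap(rows, ncol = 1000)               // usable in the algebra DSL from here
    }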

Re: Mahout 1.0: is DRM too file-bound?

2014-10-09 Thread Dmitriy Lyubimov
Bottom line, some very smart people decided to do all that work in Spark and give it to us for free. Not sure why, but they did. If the capability is already found in Spark, there's no need for us to replicate it. W.r.t. NoSQL specifically, Spark can read HBase trivially. I also did a bit more advanced
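
A minimal sketch of "Spark can read HBase trivially," using the stock TableInputFormat; the table name is a placeholder:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.spark.SparkContext

    def hbaseRows(sc: SparkContext) = {
      val conf = HBaseConfiguration.create()
      conf.set(TableInputFormat.INPUT_TABLE, "my_table")
      sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
        classOf[ImmutableBytesWritable], classOf[Result]) // RDD of (row key, Result)
    }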

Re: Mahout 1.0: is DRM too file-bound?

2014-10-09 Thread Pat Ferrel
There are also the Mahout Reader and Writer traits and classes that currently work with text-delimited file I/O. These were imagined as a general framework to support parallelized reads/writes to any format and store, using whatever method is expedient, including the ones Dmitriy mentions. I
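
Not the actual Mahout traits, just the shape of the abstraction being described: a format-agnostic reader/writer pair that each store (text-delimited HDFS, NoSQL, ...) implements however is expedient:

    trait Reader[T] { def readFrom(source: String): T }
    trait Writer[T] { def writeTo(dest: String, value: T): Unit }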

Re: Mahout 1.0: is DRM too file-bound?

2014-10-09 Thread Reinis Vicups
Guys, thank you very much for your feedback. I already have my own vanilla Spark-based implementation of row similarity that reads from and writes to NoSQL (in my case HBase). My intention is to profit from your effort to abstract the algebraic layer from the physical backend, because I find it a great