Guys, thank you very much for your feedback.
I already have my own vanilla Spark-based implementation of row
similarity that reads from and writes to NoSQL (in my case HBase).
My intention is to profit from your effort to abstract the algebraic layer
from the physical backend, because I find it a great i
Hi,
You want to upgrade your Hadoop to version 2.4.
Peng Zhang
On Oct 10, 2014, at 11:01 AM, slee wrote:
> Hi all:
>
> When I try to run the 20news-group
> example (http://mahout.apache.org/users/classification/twenty-newsgroups.html), I
> get the error below:
>
>
> mahout seqdirectory -i $W
Hi all:
When I try to run the 20news-group
example (http://mahout.apache.org/users/classification/twenty-newsgroups.html), I
get the error below:
mahout seqdirectory -i $WORK_DIR/20news-all -o $WORK_DIR/20news-seq
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Error: Could not find
There are also the Mahout Reader and Writer traits and classes, which currently
work with text-delimited file I/O. These were imagined as a general framework
to support parallelized read/write to any format and store, using whatever
method is expedient, including the ones Dmitriy mentions. I persona
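Purely as an illustration of that idea (these trait and object names are hypothetical, not the actual Mahout API), a text-delimited reader/writer pair might be sketched along these lines:

// Hypothetical names for illustration only; not the real Mahout traits.
trait ElementReader[T] {
  // Parse one backing-store record (here a delimited text line) into an element.
  def readElement(line: String): T
}

trait ElementWriter[T] {
  // Serialize one element back into the store's record format.
  def writeElement(element: T): String
}

// One expedient concrete scheme: tab-separated (user, item) pairs.
object TsvPairScheme extends ElementReader[(String, String)]
    with ElementWriter[(String, String)] {
  def readElement(line: String): (String, String) = {
    val Array(user, item) = line.split("\t", 2)
    (user, item)
  }
  def writeElement(element: (String, String)): String = s"${element._1}\t${element._2}"
}

The point is only that the record format is pluggable; the parallelized read/write machinery stays behind the traits.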
Bottom line: some very smart people decided to do all that work in Spark
and give it to us for free. Not sure why, but they did. If the capability is
already found in Spark, there's no need for us to replicate it.
WRT NoSQL specifically, Spark can read HBase trivially. I also did a bit
more advanced thing
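For reference, a minimal sketch of that trivial case, reading an HBase table into a Spark RDD through TableInputFormat (the table name, app name and master are placeholders):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("hbase-read").setMaster("local"))

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "similarity_rows")  // placeholder table name

// Each record is (row key, full Result); map out whatever cells you need downstream.
val rows = sc.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

println(s"read ${rows.count()} rows from HBase")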
Matrix defines structure, not necessarily where it can be imported from.
You're right in the sense that the framework itself avoids defining APIs for
custom partition formation. But you're wrong in implying that you cannot do it
if you wanted, or that you'd have to do anything as complex as you say.
As
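As a rough sketch of what that could look like with the Spark bindings (drmWrap and SparkDistributedContext are assumed here; treat this as an outline, not a verified recipe): build your row vectors however you like, e.g. from HBase as above, and hand the keyed RDD to the algebraic layer.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.mahout.math.{RandomAccessSparseVector, Vector}
import org.apache.mahout.sparkbindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

val sc = new SparkContext(new SparkConf().setAppName("custom-drm").setMaster("local"))
implicit val dc = new SparkDistributedContext(sc)

// Custom partition formation: any RDD of (row key, mahout Vector) will do.
val ncol = 8000
val rows = sc.parallelize(0 until 1000).map { i =>
  val v: Vector = new RandomAccessSparseVector(ncol)
  v.setQuick(i % ncol, 1.0)
  i -> v
}

val drmA = drmWrap(rows)          // wrap the custom-partitioned RDD as a DRM
val drmAtA = drmA.t %*% drmA      // engine-agnostic algebra from here on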
I don't remember the code well enough anymore to give you details, but a lot
of the jobs are actually reduce-bound.
Sent from my phone.
On Oct 9, 2014 11:07 AM, "Yang" wrote:
> my Q-Job MR job shows as 100% mapper complete (it's a map-only job) very
> quickly, but the job itself does not finish, until
This is using a side input, yes. But it is standard practice; for example,
map-side joins do basically the same.
Specifically wrt opportunistic execution, this should be fine. HDFS does not
disallow opening the same file for reading by multiple tasks, IIRC.
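For what it's worth, a hypothetical sketch of that side-input pattern (the helper name and path are made up): each task opens the same small HDFS file read-only, typically from a mapper's setup method.

import java.io.{BufferedReader, InputStreamReader}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: read a small side file once per task.
def readSideInput(conf: Configuration, path: String): List[String] = {
  val fs = FileSystem.get(conf)
  val in = new BufferedReader(new InputStreamReader(fs.open(new Path(path))))
  try Iterator.continually(in.readLine()).takeWhile(_ != null).toList
  finally in.close()
}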
Sent from my phone.
On Oct 9, 2014 1:02 PM,
Correct me if I'm wrong, but this is done for distributed processing on large
data sets, using the map-reduce principle and a common file format for the
distributed processing.
Sent from my iPhone
> On 9 Oct 2014, at 20:56, Reinis Vicups wrote:
>
> Hello,
>
> I am currently looking into the new (DRM)
I commented out the code about compression, but the actual job console
still shows mapreduce.output.fileoutputformat.compress
as true
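One thing worth trying (a sketch, assuming a plain Hadoop 2 Job; the job name is a placeholder): explicitly override the property on the job configuration instead of only removing the enabling call, since a cluster-side default can still flip it back to true.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

val conf = new Configuration()
conf.setBoolean("mapreduce.output.fileoutputformat.compress", false)

val job = Job.getInstance(conf, "q-job-uncompressed")  // placeholder job name
FileOutputFormat.setCompressOutput(job, false)         // belt and braces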
On Thu, Oct 9, 2014 at 11:40 AM, Yang wrote:
> it's possible that they are compressing the output, I'm now rebuilding
> the code after commenting out the setOut
The BtJob's BtMapper has some "interesting" logic in its setup routine, where
it looks like it's creating a side-channel:
/*
* actually this is kind of dangerous because this routine thinks we need
* to create file name for our current job and this will use -m- so it's
Hello,
I am currently looking into the new (DRM) Mahout framework.
I find myself wondering why it is that, on the one hand, a lot
of thought, effort and design complexity is being invested into abstracting
engines, contexts and algebraic operations,
but on the other hand, even abstract in
it's possible that they are compressing the output; I'm now rebuilding the
code after commenting out the setOutputCompress(true) call in the code.
I will also run with the compression param set to false,
but still it's quite surprising why compression should take so long
(8-10 minutes).
On Thu, Oct 9, 2014
wow, good to know. However, it would seem to me like a bug in
MultipleOutputs? Either way, it doesn't seem to have anything to do with the
Mahout code itself.
On Thu, Oct 9, 2014 at 10:32 AM, Yang wrote:
> yes, that's what I'm saying: I disabled speculative execution, and it works
> for now (it's kind of a hack)
>
>
>
my Q-Job MR job shows as 100% mapper complete (it's a map-only job) very
quickly, but the job itself does not finish until about 10 minutes later.
This is rather surprising. My input is a sparse matrix of 37000 rows, and
the column count is 8000, with each row usually having < 10 elements set to
n
yes, that's what I'm saying: I disabled speculative execution, and it works
for now (it's kind of a hack).
Also yes, this is Hadoop 2.0 with YARN.
This has nothing to do with overwrite mode; the 2 attempts are run
simultaneously because they are speculative runs.
On Wed, Oct 8, 2014 at 12:07 AM, Serega Sheypak
wrote:
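For the record, the less hacky way to express the same thing (a sketch, assuming you control the job's Configuration) is to disable speculative attempts explicitly:

import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
// Hadoop 2.x / YARN property names; the old mapred.*.tasks.speculative.execution
// keys are the deprecated aliases.
conf.setBoolean("mapreduce.map.speculative", false)
conf.setBoolean("mapreduce.reduce.speculative", false)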
Here is the dataset; I've just checked to be sure it is the right one.
On 09.10.2014. 15:34, Suneel Marthi wrote:
Heh, your data size is tiny indeed. One of the edge conditions I was
alluding to was the failure of this implementation on tiny datasets.
Do you see any output clusters? If so, how
Here is the dataset.
On Thursday, 09 October 2014 16:53:25 CEST, Marko Dinić wrote:
Yes, it is small, but it is just a sample, so the dataset will probably
be much bigger. So you think that this was the problem? Will this
problem be avoided in the case of a larger dataset?
I think that there were no
Yes, it is small, but it is just a sample, so the dataset will probably
be much bigger. So you think that this was the problem? Will this
problem be avoided in the case of a larger dataset?
I think there were no output clusters, as far as I remember. I'm sending
the dataset, if you want to take a look.
Heh, your data size is tiny indeed. One of the edge conditions I was
alluding to was the failure of this implementation on tiny datasets.
Do you see any output clusters? If so, how many points?
Is it possible to share your dataset to troubleshoot?
On Thu, Oct 9, 2014 at 9:18 AM, Marko Dinić
wrote:
>
Suneel,
Thank you for your answer; this was rather strange to me.
The number of points is 942. I have multiple runs; in each run I have a
loop in which the number of clusters is increased in each iteration, and I
multiply that number by 3, since I'm expecting log(n) initial
centroids, before Ball
I've seen this issue happen a few times before; there are a few edge conditions
that need to be fixed in the Streaming KMeans code, and you are right that
the generated clusters are different on successive runs given the same
input.
IIRC this stacktrace is due to BallKMeans failing to read any input
centr
Hello everyone,
I'm using Mahout Streaming K-Means multiple times in a loop, every time
with the same input data, and the output path is always different. Concretely,
I'm increasing the number of clusters in each iteration. Currently it is run
on a single machine.
A couple of times (maybe 3 of 20 runs) I
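A rough sketch of how such a loop could be driven programmatically (the paths are placeholders, and the -i/-o/-k/-km option names are assumptions to be checked against the streamingkmeans usage; other required options are omitted):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.util.ToolRunner
import org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver

val conf = new Configuration()
val input = "/data/vectors"                       // placeholder input path
for (k <- Seq(5, 10, 20, 40)) {                   // growing cluster count per run
  val args = Array(
    "-i", input,
    "-o", s"/out/streaming-kmeans-k$k",           // distinct output path per run
    "-k", k.toString,
    "-km", (3 * k).toString)                      // ~ k * log(n) sketch clusters, as in the thread
  ToolRunner.run(conf, new StreamingKMeansDriver(), args)
}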