BTW if you are getting a failure that you don’t understand, can you share it?
On Apr 14, 2015, at 11:55 AM, Pat Ferrel wrote:
>
> On Apr 14, 2015, at 11:44 AM, Michael Kelly wrote:
>
> Hi Pat,
>
> Had some work to do on the input side for the item-similarity, so just
> getting around to running it on a cluster now.
> I'm building from source from the 0.10 release against version 2.4.0 hadoop.
> The version of spark
Hi Pat,
Had some work to do on the input side for the item-similarity, so just
getting around to running it on a cluster now.
I'm building from source from the 0.10 release against version 2.4.0 hadoop.
The version of spark we are using is 1.1, we use the emr install-spark
script to install spark.
Michael,
There is a fix in the latest source on Github.
If you’d like to try it:
mvn clean install -DskipTests # there is a failing test at present, so skip them
Add your version of hadoop if needed; consult here:
http://mahout.apache.org/developers/buildingmahout.html
From my side, spark-
Michael,
The problem is in partitioning the data: if you start with one file, the
partitions are created fine. With a bunch of small files, the optimizer trips
up by not catching a range of size=0. This will be fixed in 0.10.1, but for now
(0.10.0) you can:
1) concat the files into one (a sketch follows below)
2) I can r
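For option 1, a minimal sketch (paths and app name are placeholders, not from this thread) of merging the part files with a small Spark job before handing them to spark-itemsimilarity:

import org.apache.spark.{SparkConf, SparkContext}

object ConcatParts {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("concat-parts"))
    sc.textFile("hdfs:///input/interactions/part-*")        // read all the small part files
      .coalesce(1)                                          // collapse to a single partition
      .saveAsTextFile("hdfs:///input/interactions-merged")  // writes a single part-00000
    sc.stop()
  }
}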
No, I don't think there's a workaround; it needs a fix. However, in the public
version there are many more fixes needed, so I think this part will be
refactored completely in 0.10.1.
On Fri, Apr 3, 2015 at 12:38 PM, Pat Ferrel wrote:
> OK, it was. Is there a workaround I can try?
>
>
> On Apr 3, 2015,
nrow and ncol, that is.
On Apr 3, 2015, at 12:45 PM, Pat Ferrel wrote:
Could I leave them unspecified, would that work around the problem?
On Apr 3, 2015, at 12:38 PM, Dmitriy Lyubimov wrote:
Ah, yes, I believe it is a bug in non-slim A'A, similar to one I fixed for
AB' some time ago. It makes er
Could I leave them unspecified, would that work around the problem?
On Apr 3, 2015, at 12:38 PM, Dmitriy Lyubimov wrote:
Ah, yes, I believe it is a bug in non-slim A'A, similar to one I fixed for
AB' some time ago. It makes an error in computing parallelism and split ranges
of the final product.
On
OK, it was. Is there a workaround I can try?
On Apr 3, 2015, at 12:22 PM, Dmitriy Lyubimov wrote:
Although... I am not aware of one in A'A.
Could be a faulty vector length in a matrix if the matrix was created by drmWrap
with explicit specification of ncol.
On Fri, Apr 3, 2015 at 12:20 PM, Dmitriy Ly
Ah, yes, I believe it is a bug in non-slim A'A, similar to one I fixed for
AB' some time ago. It makes an error in computing parallelism and split ranges
of the final product.
On Fri, Apr 3, 2015 at 12:22 PM, Dmitriy Lyubimov wrote:
> Although... I am not aware of one in A'A.
>
> Could be a faulty vecto
Although... I am not aware of one in A'A.
Could be a faulty vector length in a matrix if the matrix was created by drmWrap
with explicit specification of ncol.
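For reference, a minimal sketch of the drmWrap usage being discussed, written against the 0.10-era Samsara Scala DSL; the package paths and the helper name are assumptions, not code from this thread. Leaving nrow/ncol unspecified lets Mahout infer the geometry, while an explicit ncol has to cover the widest row vector:

import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

def buildAtA(rows: DrmRdd[Int]): DrmLike[Int] = {
  val a = drmWrap(rows)   // nrow and ncol left unspecified; Mahout computes them
  // val a = drmWrap(rows, ncol = nItems)  // explicit ncol must cover every row vector
  a.t %*% a               // the non-slim A'A discussed in this thread
}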
On Fri, Apr 3, 2015 at 12:20 PM, Dmitriy Lyubimov wrote:
> It's a bug. There are a number of similar ones in operator A'B.
>
> On Fri, Apr 3, 20
It's a bug. There are a number of similar ones in operator A'B.
On Fri, Apr 3, 2015 at 6:23 AM, Michael Kelly wrote:
> Hi Pat,
>
> I've done some further digging and it looks like the problem is
> occurring when the input files are split up into parts. The input
> to the item-similarity matrix
Good, looking at it now.
On Apr 3, 2015, at 11:36 AM, Michael Kelly wrote:
Yes, I updated recently, when running on a cluster I checked out the
latest master of mahout, and locally I've probably updated in the last
week or so.
On Fri, Apr 3, 2015 at 7:04 PM, Pat Ferrel wrote:
> OK, got it to r
Yes, I updated recently, when running on a cluster I checked out the
latest master of mahout, and locally I've probably updated in the last
week or so.
On Fri, Apr 3, 2015 at 7:04 PM, Pat Ferrel wrote:
> OK, got it to reproduce. This is not what I expected. It's too many columns in
> a vector, hmm
OK, got it to reproduce. This is not what I expected. It's too many columns in a
vector, hmm. Found the other user’s issue, which was null input, not a bug.
BTW, when did you update Mahout? I just put in the ability to point to dirs, so I
assume recently?
On Apr 3, 2015, at 9:08 AM, Pat Ferrel wrote
Yeah, that’s exactly what the other user is doing. This should be a common
architecture in the future. I’m already looking at the other one, so I will add this
too. Thanks a bunch for the data.
On Apr 3, 2015, at 8:58 AM, Michael Kelly wrote:
Yes, we are using a spark streaming job to create the inp
Yes, we are using a spark streaming job to create the input, and I
wasn't repartitioning it, so there were a lot of parts. I'm testing it
out now with repartitioning to see if that works.
This is just a single interaction type.
Thanks again,
Michael
On Fri, Apr 3, 2015 at 4:52 PM, Pat Ferrel wr
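A sketch of the repartitioning Michael mentions above, with assumed names, input source, and output paths (none are from this thread); each streaming batch is written as a single part file instead of many:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object InteractionWriter {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("interaction-writer"), Seconds(60))
    val interactions = ssc.socketTextStream("localhost", 9999)  // placeholder input source
    interactions.foreachRDD { (rdd, time) =>
      if (rdd.take(1).nonEmpty)  // skip empty batches
        rdd.repartition(1)       // one part file per batch instead of many
           .saveAsTextFile(s"hdfs:///input/interactions/${time.milliseconds}")
    }
    ssc.start()
    ssc.awaitTermination()
  }
}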
This sounds like a bug. Thanks for the sample input and narrowing it down. I’ll
look at it today.
I got a similar question from another user with a lot of part files. A Spark
streaming job creates the part files. Is that what you are doing?
Is this a single interaction type?
On Apr 3, 2015,
Hi Pat,
I've done some further digging and it looks like the problem is
occurring when the input files are split up into parts. The input
to the item-similarity matrix is the output from a spark job and it
ends up in about 2000 parts (on the hadoop file system). I have
reproduced the error loca
The input must be tuples (if not using a filter), so the CLI you have expects
user and item ids in the form
user-id1,item-id1
user-id500,item-id3000
…
The ids must be tokenized because it doesn’t use a full CSV parser, only lines
of delimited text.
If this doesn’t help, can you supply a snippet of
Hi all,
I'm running the spark-itemsimilarity job from the CLI on an AWS EMR
cluster, and I'm running into an exception.
The input file format is:
UserId ItemId1 ItemId2 ItemId3 ..
There is only one row per user, and a total of 97,000 rows.
I also tried input with one row per UserId/ItemId pair,
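For reference, a hypothetical sketch (paths and the whitespace delimiter are assumptions) of exploding the one-row-per-user format into the user-id,item-id tuples described above:

import org.apache.spark.{SparkConf, SparkContext}

object ExplodeRows {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("explode-rows"))
    sc.textFile("hdfs:///input/rows")                      // "UserId ItemId1 ItemId2 ..."
      .flatMap { line =>
        val tokens = line.split("\\s+")
        tokens.tail.map(item => s"${tokens.head},$item")   // one "user,item" per line
      }
      .saveAsTextFile("hdfs:///input/pairs")
    sc.stop()
  }
}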