Hi Guys,
Per your advice I upgraded to Mahout 0.6 and made a bunch of API
changes. In the meantime I realized I had a bug with my input matrix:
zero rows were being read from Solr because multiple fields in Solr
were indexed, not just the one I was interested in. That issue is
fixed, and I now have a matrix with these dimensions: (.numCols mat)
1000, (.numRows mat) 15932 (or the transpose).
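A sanity check along the following lines would flag the empty-row
problem before the matrix goes to the decomposer. This is only a
sketch: it works on a plain double[][] rather than Mahout's matrix
classes, and the names ZeroRowCheck and countZeroRows are made up for
illustration.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Mahout or lsa4solr.
class ZeroRowCheck {

    // Returns how many rows contain only zeros, e.g. documents whose
    // field of interest produced no terms.
    static int countZeroRows(double[][] rows) {
        int zeroRows = 0;
        for (double[] row : rows) {
            boolean allZero = Arrays.stream(row).allMatch(v -> v == 0.0);
            if (allZero) {
                zeroRows++;
            }
        }
        return zeroRows;
    }

    public static void main(String[] args) {
        double[][] docTermMatrix = {
            {0.0, 2.0, 1.0},
            {0.0, 0.0, 0.0},   // an empty document row
            {3.0, 0.0, 0.0}
        };
        System.out.println("zero rows: " + countZeroRows(docTermMatrix));
    }
}
```

If the count is close to the number of rows, the input is effectively
a zero matrix and the decomposition has nothing to work with.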
Unfortunately I'm now getting the error below. In the context of some
other Mahout algorithm there was a mention of '/tmp' vs '/_tmp'
causing this issue, but in this particular case the matrix is in
memory! I'm using this Google package: guava-r09.jar
SEVERE: java.util.NoSuchElementException
    at com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
    at org.apache.mahout.math.hadoop.TimesSquaredJob.retrieveTimesSquaredOutputVector(TimesSquaredJob.java:190)
    at org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:238)
    at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
    at lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:165)
Any suggestions?
Thanks,
Peyman
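
For anyone hitting the same trace: the top frame is
AbstractIterator.next, which throws NoSuchElementException when next()
is called on an iterator that has no elements left. That much is
standard iterator behavior. That the iterator in
retrieveTimesSquaredOutputVector was empty because the job output was
missing or empty is only a guess from the trace, not something
confirmed in this thread. A minimal sketch of the failure mode with
java.util iterators (not Mahout or Guava code; EmptyIteratorDemo,
firstOrNull and unguardedNextThrows are hypothetical names):

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class EmptyIteratorDemo {

    // Hypothetical helper: returns the first element, or null when the
    // iterator is empty, instead of letting next() throw.
    static <T> T firstOrNull(Iterator<T> it) {
        return it.hasNext() ? it.next() : null;
    }

    // Returns true if calling next() without a hasNext() check throws.
    static boolean unguardedNextThrows(Iterator<?> it) {
        try {
            it.next();
            return false;
        } catch (NoSuchElementException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Stands in for a job output that contains no records.
        List<String> noOutput = Collections.emptyList();
        System.out.println(unguardedNextThrows(noOutput.iterator()));
        System.out.println(firstOrNull(noOutput.iterator()));
    }
}
```

The guard only turns the exception into an explicit "no output" case;
it does not explain why the output would be empty in the first place.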
On Mon, Feb 20, 2012 at 10:38 AM, Dmitriy Lyubimov <[email protected]> wrote:
> Peyman,
>
>
> Yes, what Ted said. Please take the 0.6 release. Also try ssvd; it may
> benefit you in some regards compared to Lanczos.
>
> -d
>
> On Sun, Feb 19, 2012 at 10:34 AM, Peyman Mohajerian <[email protected]>
> wrote:
>> Hi Dmitriy & Others,
>>
>> Dmitriy, thanks for your previous response.
>> I have a follow-up question about my LSA project. I have managed to
>> upload 1,500 documents from two different newsgroups (one about
>> graphics and one about atheism,
>> http://people.csail.mit.edu/jrennie/20Newsgroups/) to Solr. However, my
>> LanczosSolver in Mahout 0.4 does not find any nonzero eigenvalues (there
>> are eigenvectors, as you can see in the logs that follow).
>> The only thing I'm doing differently from lsa4solr
>> (https://github.com/algoriffic/lsa4solr) is that I'm not using the
>> 'Summary' field but rather the actual 'text' field in Solr. I'm
>> assuming the issue is that the Summary field already removes the noise
>> and makes the clustering work, whereas the raw index data does not. Am I
>> correct, or are there other potential explanations? For the desired
>> rank I'm using values between 10 and 100 and looking for between 2 and
>> 10 clusters (different values for different trials), but the result is
>> always the same: no clusters found.
>> If my issue is related to not having summarization done, how can that
>> be done in Solr? I wasn't able to find a Summary field in Solr.
>>
>> Thanks
>> Peyman
>>
>>
>> Feb 19, 2012 3:25:20 AM
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> INFO: Lanczos iteration complete - now to diagonalize the tri-diagonal
>> auxiliary matrix.
>> Feb 19, 2012 3:25:20 AM
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> INFO: Eigenvector 0 found with eigenvalue 0.0
>> Feb 19, 2012 3:25:20 AM
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> INFO: Eigenvector 1 found with eigenvalue 0.0
>> Feb 19, 2012 3:25:20 AM
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> INFO: Eigenvector 2 found with eigenvalue 0.0
>> Feb 19, 2012 3:25:20 AM
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> INFO: Eigenvector 3 found with eigenvalue 0.0
>> Feb 19, 2012 3:25:20 AM
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> INFO: Eigenvector 4 found with eigenvalue 0.0
>> Feb 19, 2012 3:25:20 AM
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> INFO: Eigenvector 5 found with eigenvalue 0.0
>> Feb 19, 2012 3:25:20 AM
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> INFO: Eigenvector 6 found with eigenvalue 0.0
>> Feb 19, 2012 3:25:20 AM
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> INFO: Eigenvector 7 found with eigenvalue 0.0
>> Feb 19, 2012 3:25:20 AM
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> INFO: Eigenvector 8 found with eigenvalue 0.0
>> Feb 19, 2012 3:25:20 AM
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> INFO: Eigenvector 9 found with eigenvalue 0.0
>> Feb 19, 2012 3:25:20 AM
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> INFO: Eigenvector 10 found with eigenvalue 0.0
>> Feb 19, 2012 3:25:20 AM
>> org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>> INFO: LanczosSolver finished.
>>
>>
>> On Sun, Jan 1, 2012 at 10:06 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> In Mahout, an LSA pipeline is possible with the seqdirectory, seq2sparse
>>> and ssvd commands. The nuances are understanding the dictionary format
>>> and LLR analysis of n-grams, and perhaps using a slightly better
>>> lemmatizer than the default one.
>>>
>>> With the indexing part you are on your own at this point.
>>> On Jan 1, 2012 2:28 PM, "Peyman Mohajerian" <[email protected]> wrote:
>>>
>>>> Hi Guys,
>>>>
>>>> I'm interested in this work:
>>>>
>>>> http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
>>>>
>>>> I looked at some of the comments and noticed that there was interest
>>>> in incorporating it into Mahout, back in 2010. I'm also having issues
>>>> running this code due to dependencies on an older version of Mahout.
>>>>
>>>> I was wondering if LSA is now directly available in Mahout? Also if I
>>>> upgrade to the latest Mahout would this Clojure code work?
>>>>
>>>> Thanks
>>>> Peyman
>>>>