Also, you are printing your input path -- what does it actually look like? The path it complains about, SSVDOutput/data, should in fact be the input path. That's what's perplexing.
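One possible reading of the error, offered as an assumption rather than a confirmed diagnosis: in the Hadoop 0.20.x line, `SequenceFileInputFormat.listStatus` appears to treat any directory found inside an input path as a MapFile and to substitute `<dir>/data` for it. The quoted code builds `outputPath` as a child of the input path (`rawPath` plus "/SSVD-out"), so a later job that lists the input directory could end up looking for `SSVD-out/data`. The plain-Java sketch below only imitates that substitution with `java.nio.file`; the class name `MapFileListingSketch` and the method `resolveInputs` are hypothetical, and no Hadoop or Mahout classes are involved.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class MapFileListingSketch {

    // Imitates (does not call) the directory handling assumed above:
    // every sub-directory of the input is replaced by "<dir>/data",
    // and a missing "data" entry is reported as a missing file.
    static List<Path> resolveInputs(Path inputDir) throws IOException {
        List<Path> resolved = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(inputDir)) {
            for (Path entry : entries) {
                if (Files.isDirectory(entry)) {
                    Path data = entry.resolve("data");
                    if (!Files.exists(data)) {
                        throw new FileNotFoundException("File does not exist: " + data);
                    }
                    resolved.add(data);
                } else {
                    resolved.add(entry);
                }
            }
        }
        return resolved;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the matrix row path (hypothetical local temp directory).
        Path input = Files.createTempDirectory("transpose");
        Files.createFile(input.resolve("part-00000"));
        // Output nested inside the input, as in the quoted Clojure code.
        Files.createDirectories(input.resolve("SSVD-out").resolve("Q-job"));
        try {
            resolveInputs(input);
            System.out.println("listing succeeded");
        } catch (FileNotFoundException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If this is indeed what happens, pointing `outputPath` at a location outside the input directory should avoid the substitution; that has not been verified against this setup.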
We are talking about the Hadoop job setup process here, nothing specific to the solution itself. Job setup/directory management fails for some reason.

On Thu, Apr 5, 2012 at 11:45 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> Any chance you could test it with its current dependency, 0.20.204? Or
> would that be hard to stage?
>
> A newer Hadoop version is frankly all I can think of as the reason for this.
>
> On Thu, Apr 5, 2012 at 11:35 AM, Peyman Mohajerian <mohaj...@gmail.com> wrote:
>> Hi Dmitriy,
>>
>> It is Clojure code from: https://github.com/algoriffic/lsa4solr
>> Of course I modified it to use the Mahout 0.6 distribution, also running on
>> hadoop-0.20.205.0. Here is the Clojure code that I changed;
>> the lines after ' decomposer (doto (.run ssvdSolver)) ' still need
>> modification b/c I'm not reading the eigenValue/Vector from the solver
>> correctly. Originally this code was based on Mahout 0.4. I'm creating the
>> Matrix from Solr 3.1.0, very similar to what was done on:
>> 'https://github.com/algoriffic/lsa4solr'
>>
>> Thanks,
>>
>> (defn decompose-svd
>>   [mat k]
>>   ;(println "input path " (.getRowPath mat))
>>   ;(println "dd " (into-array [(.getRowPath mat)]))
>>   ;(println "numCol " (.numCols mat))
>>   ;(println "numrow " (.numRows mat))
>>   (let [eigenvalues (new java.util.ArrayList)
>>         eigenvectors (DenseMatrix. (+ k 2) (.numCols mat))
>>         numCol (.numCols mat)
>>         config (.getConf mat)
>>         rawPath (.getRowPath mat)
>>         outputPath (Path. (str (.toString rawPath) "/SSVD-out"))
>>         inputPath (into-array [rawPath])
>>         ssvdSolver (SSVDSolver. config inputPath outputPath 1000 k 60 3)
>>         decomposer (doto (.run ssvdSolver))
>>         V (normalize-matrix-columns (.viewPart (.transpose eigenvectors)
>>                                                (int-array [0 0])
>>                                                (int-array [(.numCols mat) k])))
>>         U (mmult mat V)
>>         S (diag (take k (reverse eigenvalues)))]
>>     {:U U
>>      :S S
>>      :V V}))
>>
>> On Thu, Apr 5, 2012 at 11:10 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>
>>> Yeah. I don't see how it may have arrived at that error.
>>>
>>> Peyman,
>>>
>>> I need to know more -- it looks like you are using the embedded API, not the
>>> command line, so I need to see how you initialize the solver and
>>> also which version of the Mahout libraries you are using (your stack trace
>>> numbers do not correspond to anything reasonable on current trunk).
>>>
>>> thanks.
>>>
>>> -d
>>>
>>> On Thu, Apr 5, 2012 at 10:55 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>> > Hm. I never saw that and am not sure where this folder comes from. Which
>>> > Hadoop version are you using? This may be a result of incompatible
>>> > support for multiple outputs in the newer Hadoop versions. I tested
>>> > it with CDH3u0/u3 and it was fine. This folder should normally appear
>>> > in the conversation, i suspect it is an internal hadoop thing.
>>> >
>>> > This is without me actually looking at the code per the stack trace.
>>> >
>>> > On Thu, Apr 5, 2012 at 5:22 AM, Peyman Mohajerian <mohaj...@gmail.com> wrote:
>>> >> Hi Guys,
>>> >> I'm now using ssvd for my LSA code and get the following error. At the time
>>> >> of the error, all I have under the 'SSVD-out' folder is:
>>> >> Q-job/QHat-m-00000 <http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2FQHat-m-00000&namenodeInfoPort=50070>
>>> >> R-m-00000 <http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2FR-m-00000&namenodeInfoPort=50070>
>>> >> _SUCCESS <http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2F_SUCCESS&namenodeInfoPort=50070>
>>> >> part-m-00000.deflate <http://localhost:50075/browseDirectory.jsp?dir=%2Flsa4solr%2Fmatrix%2F14099700861483%2Ftranspose-213%2FSSVD-out%2FQ-job%2Fpart-m-00000.deflate&namenodeInfoPort=50070>
>>> >>
>>> >> I'm not clear where the '/data' folder is supposed to be set. Is it part of
>>> >> the output of the QJob? I don't see any error in the QJob.
>>> >>
>>> >> Thanks,
>>> >>
>>> >> SEVERE: java.io.FileNotFoundException: File does not exist:
>>> >> hdfs://localhost:9000/lsa4solr/matrix/15835804941333/transpose-120/SSVD-out/data
>>> >>   at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:534)
>>> >>   at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
>>> >>   at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
>>> >>   at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:954)
>>> >>   at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:971)
>>> >>   at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:172)
>>> >>   at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:889)
>>> >>   at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:842)
>>> >>   at java.security.AccessController.doPrivileged(Native Method)
>>> >>   at javax.security.auth.Subject.doAs(Subject.java:396)
>>> >>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>> >>   at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:842)
>>> >>   at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
>>> >>   at org.apache.mahout.math.hadoop.stochasticsvd.BtJob.run(BtJob.java:505)
>>> >>   at org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:347)
>>> >>   at lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:188)
>>> >>   at lsa4solr.clustering_protocol$decompose_term_doc_matrix.invoke(clustering_protocol.clj:125)
>>> >>   at lsa4solr.clustering_protocol$cluster_kmeans_docs.invoke(clustering_protocol.clj:142)
>>> >>   at lsa4solr.cluster$cluster_dispatch.invoke(cluster.clj:72)
>>> >>   at lsa4solr.cluster$_cluster.invoke(cluster.clj:103)
>>> >>   at lsa4solr.cluster.LSAClusteringEngine.cluster(Unknown Source)
>>> >>   at org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
>>> >>   at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
>>> >>   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>>> >>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
>>> >>   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
>>> >>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
>>> >>   at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>>> >>   at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>>> >>
>>> >> On Sun, Feb 26, 2012 at 4:56 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>> >>
>>> >>> For the third time: in the context of LSA, a faster and hence perhaps better
>>> >>> alternative to Lanczos is ssvd. Is there any specific reason you want
>>> >>> to use the Lanczos solver in the context of LSA?
>>> >>>
>>> >>> -d
>>> >>>
>>> >>> On Sun, Feb 26, 2012 at 6:40 AM, Peyman Mohajerian <mohaj...@gmail.com> wrote:
>>> >>> > Hi Guys,
>>> >>> >
>>> >>> > Per your advice I did upgrade to Mahout 0.6 and made a bunch of API
>>> >>> > changes, and in the meantime realized I had a bug with my input matrix:
>>> >>> > zero rows were read from Solr b/c multiple fields in Solr were indexed and
>>> >>> > not just the one I was interested in. That issue is fixed and I have
>>> >>> > a matrix with these dimensions: (.numCols mat) 1000 (.numRows mat)
>>> >>> > 15932 (or the transpose)
>>> >>> > Unfortunately I'm getting the below error now. In the context of some
>>> >>> > other Mahout algorithm there was a mention of '/tmp' vs '/_tmp'
>>> >>> > causing this issue, but in this particular case the matrix is in
>>> >>> > memory!! I'm using this google package: guava-r09.jar
>>> >>> >
>>> >>> > SEVERE: java.util.NoSuchElementException
>>> >>> >   at com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
>>> >>> >   at org.apache.mahout.math.hadoop.TimesSquaredJob.retrieveTimesSquaredOutputVector(TimesSquaredJob.java:190)
>>> >>> >   at org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:238)
>>> >>> >   at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
>>> >>> >   at lsa4solr.mahout_matrix$decompose_svd.invoke(mahout_matrix.clj:165)
>>> >>> >
>>> >>> > Any suggestion?
>>> >>> > Thanks,
>>> >>> > Peyman
>>> >>> >
>>> >>> > On Mon, Feb 20, 2012 at 10:38 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>> >>> >> Peyman,
>>> >>> >>
>>> >>> >> Yes, what Ted said. Please take the 0.6 release. Also try ssvd, it may
>>> >>> >> benefit you in some regards compared to Lanczos.
>>> >>> >>
>>> >>> >> -d
>>> >>> >>
>>> >>> >> On Sun, Feb 19, 2012 at 10:34 AM, Peyman Mohajerian <mohaj...@gmail.com> wrote:
>>> >>> >>> Hi Dmitriy & Others,
>>> >>> >>>
>>> >>> >>> Dmitriy, thanks for your previous response.
>>> >>> >>> I have a follow-up question on my LSA project. I have managed to
>>> >>> >>> upload 1,500 documents from two different newsgroups (one about
>>> >>> >>> graphics and one about atheism,
>>> >>> >>> http://people.csail.mit.edu/jrennie/20Newsgroups/) to Solr. However, my
>>> >>> >>> LanczosSolver in Mahout 0.4 does not find any eigenvalues (there are
>>> >>> >>> eigenvectors, as you see in the logs that follow).
>>> >>> >>> The only thing I'm doing differently from
>>> >>> >>> (https://github.com/algoriffic/lsa4solr) is that I'm not using the
>>> >>> >>> 'Summary' field but rather the actual 'text' field in Solr. I'm
>>> >>> >>> assuming the issue is that the Summary field already removes the noise and
>>> >>> >>> makes the clustering work, and the raw index data does not do that. Am I
>>> >>> >>> correct, or are there other potential explanations? For the desired
>>> >>> >>> rank I'm using values between 10-100 and looking for #clusters between
>>> >>> >>> 2-10 (different values for different trials), but always the same
>>> >>> >>> result comes out: no clusters found.
>>> >>> >>> If my issue is related to not having summarization done, how can that
>>> >>> >>> be done in Solr? I wasn't able to find a Summary field in Solr.
>>> >>> >>>
>>> >>> >>> Thanks
>>> >>> >>> Peyman
>>> >>> >>>
>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Lanczos iteration complete - now to diagonalize the tri-diagonal auxiliary matrix.
>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 0 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 1 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 2 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 3 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 4 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 5 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 6 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 7 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 8 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 9 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: Eigenvector 10 found with eigenvalue 0.0
>>> >>> >>> Feb 19, 2012 3:25:20 AM org.apache.mahout.math.decomposer.lanczos.LanczosSolver solve
>>> >>> >>> INFO: LanczosSolver finished.
>>> >>> >>>
>>> >>> >>> On Sun, Jan 1, 2012 at 10:06 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>> >>> >>>> In Mahout, an LSA pipeline is possible with the seqdirectory, seq2sparse and ssvd
>>> >>> >>>> commands. The nuances are understanding the dictionary format and LLR analysis of
>>> >>> >>>> n-grams, and perhaps using a slightly better lemmatizer than the default one.
>>> >>> >>>>
>>> >>> >>>> With the indexing part you are on your own at this point.
>>> >>> >>>> On Jan 1, 2012 2:28 PM, "Peyman Mohajerian" <mohaj...@gmail.com> wrote:
>>> >>> >>>>
>>> >>> >>>>> Hi Guys,
>>> >>> >>>>>
>>> >>> >>>>> I'm interested in this work:
>>> >>> >>>>>
>>> >>> >>>>> http://www.ccri.com/blog/2010/4/2/latent-semantic-analysis-in-solr-using-clojure.html
>>> >>> >>>>>
>>> >>> >>>>> I looked at some of the comments and noticed that there was interest
>>> >>> >>>>> in incorporating it into Mahout, back in 2010. I'm also having issues
>>> >>> >>>>> running this code due to dependencies on an older version of Mahout.
>>> >>> >>>>>
>>> >>> >>>>> I was wondering if LSA is now directly available in Mahout? Also, if I
>>> >>> >>>>> upgrade to the latest Mahout, would this Clojure code work?
>>> >>> >>>>>
>>> >>> >>>>> Thanks
>>> >>> >>>>> Peyman