Re: matrix inversion in plan ?

2015-10-04 Thread Allen McIntosh
1) Is m sparse?
2) Once you have computed "inverse", what are you going to do with it?
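If the answer to (2) is just "multiply it into a few vectors", a direct
solve is usually enough.  A minimal in-core sketch with the Mahout math
API (untested; it assumes m and a right-hand side b are already built):

    import org.apache.mahout.math.DenseMatrix;
    import org.apache.mahout.math.Matrix;
    import org.apache.mahout.math.QRDecomposition;

    // Solve m * x = b instead of forming inverse(m) explicitly.
    Matrix b = new DenseMatrix(m.rowSize(), 1);   // one right-hand side
    // ... fill b ...
    Matrix x = new QRDecomposition(m).solve(b);   // x = inverse(m) * b

That avoids materializing a dense inverse when only a few of its columns
are ever used.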

On 10/04/2015 10:31 PM, go canal wrote:
> Thank you all. The solver would be something like this, am I correct?
> Matrix m = 
> Matrix inverse = new QRDecomposition(m).solve(new DiagonalMatrix(1, m.rowSize()));
> 
> The problem I have is that the matrix is too big; I need a distributed or
> out-of-core solution.
> 
>  thanks, canal 
> 
> 
> On Monday, October 5, 2015 6:25 AM, Peter Jaumann wrote:
>
>  This should be done with a matrix solver indeed!!!
> 
> 
> 
> On Oct 4, 2015 11:53 AM, "Ted Dunning"  wrote:
>>
>>
>> It is almost certain that starting with an inversion is a serious error.
>>
>> Are you sure you don't want a matrix solver instead?
>>
>> Sent from my iPhone
>>
>>> On Oct 3, 2015, at 20:09, go canal  wrote:
>>>
>>> oh, it is so unfortunate that the first step of my project requires the
>>> inversion of a very large matrix. I will have to fall back to ScaLAPACK
>>> or MR-based solutions, I guess.
>>>   thanks, canal
>>>
>>>
>>> On Saturday, October 3, 2015 11:31 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>>
>>>
>>> I doubt seriously that Samsara will support matrix inversion per se. The
>>> problems are:
>>>
>>> a) it densifies sparse matrices
>>>
>>> b) it is much more costly than solving a linear system
>>>
>>> Samsara is roughly memory-based, but different back-ends will try to
>>> spill to disk if necessary.  It is likely that the resulting degradation
>>> in performance would be dramatic and thus unacceptable to most users.
>>>
>>>
>>>
>>>> On Fri, Oct 2, 2015 at 8:47 PM, go canal wrote:
>>>>
>>>> Hi, I saw some distributed matrix functions included in Samsara now.
>>>> Wondering if we have a plan to support matrix inversion? BTW, am I
>>>> correct that it is distributed memory-based, not out-of-core? thanks, canal



Re: seq2sparse dropping tokens

2015-06-02 Thread Allen McIntosh
I suspect that this was not a bug, but rather that the words were
singletons.  I'm a little surprised, given that these were financial
documents (is "check" really a singleton?), but it looks like that's what
happened.

When tomorrow's demo for the client is over I will run seq2sparse with the
frequency floor set to 1 instead of the default 2 and see if that changes
things.
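
For reference, the rerun would look something like this (an untested
command line assuming the 0.10 flag names, with placeholder input/output
paths; -s / --minSupport is the floor in question):

    bin/mahout seq2sparse -i tokenized-docs-seqfile -o vectors -s 1 \
        -a org.apache.lucene.analysis.core.SimpleAnalyzer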

On 05/29/2015 03:13 PM, Suneel Marthi wrote:
> Allen, could you please file a JIRA for this?
>
> On Fri, May 29, 2015 at 8:58 AM, Allen McIntosh amcint...@appcomsci.com
> wrote:
 
>> [...]



seq2sparse dropping tokens

2015-05-29 Thread Allen McIntosh
This shows up with Mahout 0.10.0 (the distribution archive) and Hadoop 2.2.0

When I run seq2sparse on a document containing the following tokens:

cash cash equival cash cash equival consist highli liquid instrument
commerci paper time deposit other monei market instrument which origin
matur three month less aggreg cash balanc bank reclassifi neg balanc
consist mainli unclear check account payabl neg balanc reclassifi
account payabl decemb

the tokens "mainli", "check", and "unclear" are dropped on the floor (they
do not appear in the dictionary file).  The issue persists if I change the
analyzer to SimpleAnalyzer (-a
org.apache.lucene.analysis.core.SimpleAnalyzer).  I can understand an
English analyzer doing something like this, but it seems a little strange
that it would happen with SimpleAnalyzer.  (I wonder if it is a coincidence
that these tokens appear consecutively in the input.)

What I am trying to do:  The standard analyzers don't do enough, and I
have no access to the client's cluster to preload a custom analyzer.
Processing the text before stuffing it into the initial sequence file
seemed to be the cleanest alternative, since there doesn't seem to be
any way to add a custom jar when using a stock Mahout app.
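
A minimal sketch of that preprocessing route (untested; it assumes only
that seq2sparse wants the same Text key / Text value SequenceFile layout
that seqdirectory produces, and the output path and processedText are
placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Write already-processed document text as the Text/Text pairs
    // seq2sparse expects: key = document id, value = document body.
    Configuration conf = new Configuration();
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(new Path("tokenized-docs-seqfile")),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(Text.class))) {
      writer.append(new Text("/doc-0001"), new Text(processedText));
    }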

Why dropped or mangled tokens matter, other than as missing information:
ultimately what I need to do is calculate topic weights for an arbitrary
chunk of text.  (See next post.)  If I can't get the tokens right, I don't
think I can do this.





LDA topic weights for an arbitrary document

2015-05-29 Thread Allen McIntosh
Is there any way to calculate topic weights for an arbitrary new
document?  Summing (term weight * term count) over the document's terms
doesn't reproduce the existing topic weights, or a constant multiple of them.


Re: seq2sparse dropping tokens

2015-05-29 Thread Allen McIntosh

On 05/29/2015 03:13 PM, Suneel Marthi wrote:
> Allen, could you please file a JIRA for this?

Sure.  Do you have any idea what it is?

On the other question I had: after getting a few hours of sleep I was able
to formulate the right Google query :-) and was pointed to
http://jayaniwithanawasam.blogspot.com, which directed me to TopicModel and
gave me a running start on the coding.


However, I ran into a tiny problem.  TopicModel seems to expect to read an
existing model from a single file, or from several files passed in via
varargs.  Since the model is now spread out over several part files, it
would save some trauma if the documentation warned about this.
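
For the archives, the rough shape of the inference loop (an untested
sketch: eta, alpha, dictionary and numTopics must match the trained model,
doc is the new document as a term-frequency Vector over the same
dictionary, and the part-file paths stand in for the varargs mentioned
above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.clustering.lda.cvb.TopicModel;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Matrix;
    import org.apache.mahout.math.SparseRowMatrix;
    import org.apache.mahout.math.Vector;

    // Load the trained model from its part files (numThreads = 1,
    // modelWeight = 1.0), then iterate trainDocTopicModel() until the
    // doc-topic distribution settles.
    TopicModel model = new TopicModel(new Configuration(), eta, alpha,
        dictionary, 1, 1.0,
        new Path("model/part-r-00000"), new Path("model/part-r-00001"));
    Vector docTopics = new DenseVector(numTopics).assign(1.0 / numTopics);
    Matrix docTopicModel = new SparseRowMatrix(numTopics, doc.size());
    for (int i = 0; i < 20; i++) {
      model.trainDocTopicModel(doc, docTopics, docTopicModel);
    }
    // docTopics now holds the topic weights for the new document.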




> On Fri, May 29, 2015 at 8:58 AM, Allen McIntosh amcint...@appcomsci.com
> wrote:
>
>> [...]