Re: Automated close of PR's ?

2015-12-30 Thread Mridul Muralidharan
I am not sure about others, but I had a PR closed from under me even though
there was ongoing discussion on it as recently as two weeks ago.
Given that, I assumed it was an automated close and not a manual one!

When a change was opened is not a good metric for its viability
(particularly when it touches code which is rarely modified, and so may
only be merged after multiple releases).

Regards,
Mridul

On Wed, Dec 30, 2015 at 7:12 PM, Reynold Xin  wrote:
> No there is not. I actually manually closed them to cut down the number of
> open pull requests. Feel free to reopen individual ones.
>
>
> On Wednesday, December 30, 2015, Mridul Muralidharan 
> wrote:
>>
>> Is there a script running to close "old" PRs? I was not aware of any
>> discussion about this on the dev list.
>>
>> - Mridul
>>
>>
>




Automated close of PR's ?

2015-12-30 Thread Mridul Muralidharan
Is there a script running to close "old" PRs? I was not aware of any
discussion about this on the dev list.

- Mridul




Re: problem with reading source code - pull out nondeterministic expressions

2015-12-30 Thread Michael Armbrust
The goal here is to ensure that the non-deterministic value is evaluated
only once, so the result won't change for a given row (i.e. when sorting).
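
To make that concrete (this is just a hand-written sketch of the effect, not
the analyzer code itself): for something like df.sort(rand()), the
non-deterministic expression is first "pulled out" into its own projected
column, the sort runs on that materialized column, and a final projection
drops it again. Roughly, against the 1.6 DataFrame API (names made up):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.rand

val sc = new SparkContext(new SparkConf().setAppName("pull-out-sketch").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = sc.parallelize(1 to 10).toDF("id")

// Hand-rolled equivalent of the rewrite: evaluate rand() once per row into
// a column, sort on that stable column, then project it away.
val pulledOut = df
  .withColumn("_nondeterministic", rand())  // evaluated exactly once per row
  .sort("_nondeterministic")                // the sort reuses the materialized value
  .drop("_nondeterministic")

pulledOut.show()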

On Tue, Dec 29, 2015 at 10:57 PM, 汪洋  wrote:

> Hi fellas,
> I am new to Spark and I have a newbie question. I am currently reading the
> source code of the Spark SQL Catalyst analyzer. I don't quite understand the
> partial function in PullOutNondeterministic. What does it mean by "pull
> out"? Why do we have to do the "pulling out"?
> I would really appreciate it if somebody could explain it to me.
> Thanks.
>


New processes / tools for changing dependencies in Spark

2015-12-30 Thread Josh Rosen
I just merged https://github.com/apache/spark/pull/10461, a PR that adds
new automated tooling to help us reason about dependency changes in Spark.
Here's a summary of the changes:

   - The dev/run-tests script (used in the SBT Jenkins builds and for
   testing Spark pull requests) now generates a file which contains Spark's
   resolved runtime classpath for each Hadoop profile, then compares that file
   to a copy which is checked into the repository. These dependency lists are
   found at https://github.com/apache/spark/tree/master/dev/deps; there is
   a separate list for each Hadoop profile.

   - If a pull request changes dependencies without updating these manifest
   files, our test script will fail the build and the build console output
   will list the dependency diff.

   - If you are intentionally changing dependencies, run
   ./dev/test-dependencies.sh --replace-manifest to re-generate these
   dependency manifests, then commit the changed files and include them with
   your pull request.

The goal of this change is to make it simpler to reason about build
changes: it should now be much easier to verify whether dependency
exclusions worked properly or determine whether transitive dependencies
changed in a way that affects the final classpath.

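In case it helps, the intended workflow when you do change a dependency is
roughly the following (a sketch, assuming you have already edited the
relevant pom.xml):

./dev/test-dependencies.sh --replace-manifest   # regenerate the dev/deps manifests
git diff dev/deps                               # sanity-check the dependency diff
git add dev/deps                                # commit the updated manifests
git commit                                      #   and include them in your PR
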
Let me know if you have any questions about this change and, as always,
feel free to submit pull requests if you would like to make any
enhancements to this script.

Thanks,
Josh


Re: IndentationCheck of checkstyle

2015-12-30 Thread Ted Yu
Right. 

Pardon my carelessness. 

> On Dec 29, 2015, at 9:58 PM, Reynold Xin  wrote:
> 
> OK to close the loop - this thread has nothing to do with Spark?
> 
> 
>> On Tue, Dec 29, 2015 at 9:55 PM, Ted Yu  wrote:
>> Oops, wrong list :-)
>> 
>>> On Dec 29, 2015, at 9:48 PM, Reynold Xin  wrote:
>>> 
>>> +Herman
>>> 
>>> Is this coming from the newly merged Hive parser?
>>> 
>>> 
>>> 
 On Tue, Dec 29, 2015 at 9:46 PM, Allen Zhang  wrote:
 
 
 format issue I think, go ahead
 
 
 
 
 At 2015-12-30 13:36:05, "Ted Yu"  wrote:
 Hi,
 I noticed that there are a lot of checkstyle warnings in the following 
 form:
 
 source="com.puppycrawl.tools.checkstyle.checks.indentation.IndentationCheck"/>
 
 To my knowledge, we use two spaces for each tab. Not sure why all of a 
 sudden we have so many IndentationCheck warnings:
 
 grep 'hild have incorrect indentati' trunkCheckstyle.xml | wc
 3133   52645  678294
 
 If there is no objection, I would create a JIRA and relax IndentationCheck 
 warning.
 
 Cheers
> 


Re: running lda in spark throws exception

2015-12-30 Thread Li Li
I used a small data set and reproduced the problem.
But I don't know whether my code is correct, because I am not familiar
with Spark, so I will first post my code here. If it is correct, then I will
post the data.
One line of my data looks like:

{"time":"08-09-17","cmtUrl":"2094361","rvId":"rev_1020","webpageUrl":"http://www.dianping.com/shop/2094361","word_vec":[0,1,2,3,4,5,6,2,7,8,9,10,11,12,13,14,15,16,8,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,32,35,36,37,38,15,39,40,41,42,5,43,44,17,45,46,42,47,26,48,49]}

It's a JSON file which contains webpageUrl and word_vec, which holds the
encoded words.
The first step is to parse the input RDD into an RDD of VectorUrl.
BTW, if public VectorUrl call(String s) returns null, is that OK?
Then I follow the example "Index documents with unique IDs".
Then I create an RDD that maps id to url, so that after LDA training I can
find the url of each document, and save this RDD to HDFS.
Then I create the corpus RDD and train.

The exception stack is

15/12/30 20:45:42 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.IndexOutOfBoundsException: (454,0) not in [-58,58) x [-100,100)
java.lang.IndexOutOfBoundsException: (454,0) not in [-58,58) x [-100,100)
at breeze.linalg.DenseMatrix$mcD$sp.update$mcD$sp(DenseMatrix.scala:112)
at org.apache.spark.mllib.clustering.DistributedLDAModel$$anonfun$topicsMatrix$1.apply(LDAModel.scala:534)
at org.apache.spark.mllib.clustering.DistributedLDAModel$$anonfun$topicsMatrix$1.apply(LDAModel.scala:531)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.mllib.clustering.DistributedLDAModel.topicsMatrix$lzycompute(LDAModel.scala:531)
at org.apache.spark.mllib.clustering.DistributedLDAModel.topicsMatrix(LDAModel.scala:523)
at com.mobvoi.knowledgegraph.textmining.lda.ReviewLDA.main(ReviewLDA.java:89)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:525)


== here is my code ==

// Imports, for reference:
import com.google.gson.JsonArray;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.clustering.DistributedLDAModel;
import org.apache.spark.mllib.clustering.LDA;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import scala.Tuple2;

SparkConf conf = new SparkConf().setAppName(ReviewLDA.class.getName());
JavaSparkContext sc = new JavaSparkContext(conf);

// Load and parse the data
JavaRDD<String> data = sc.textFile(inputDir + "/*");
JavaRDD<VectorUrl> parsedData = data.map(new Function<String, VectorUrl>() {
  public VectorUrl call(String s) {
    JsonParser parser = new JsonParser();
    JsonObject jo = parser.parse(s).getAsJsonObject();
    if (!jo.has("word_vec") || !jo.has("webpageUrl")) {
      return null;
    }
    JsonArray word_vec = jo.get("word_vec").getAsJsonArray();
    String url = jo.get("webpageUrl").getAsString();
    double[] values = new double[word_vec.size()];
    for (int i = 0; i < values.length; i++) {
      values[i] = word_vec.get(i).getAsInt();
    }
    return new VectorUrl(Vectors.dense(values), url);
  }
});

// Index documents with unique IDs
JavaPairRDD<Long, VectorUrl> id2doc = JavaPairRDD.fromJavaRDD(
    parsedData.zipWithIndex().map(
        new Function<Tuple2<VectorUrl, Long>, Tuple2<Long, VectorUrl>>() {
          public Tuple2<Long, VectorUrl> call(Tuple2<VectorUrl, Long> doc_id) {
            return doc_id.swap();
          }
        }));

// Keep an id -> url mapping so document urls can be looked up after training
JavaPairRDD<Long, String> id2Url = JavaPairRDD.fromJavaRDD(id2doc
    .map(new Function<Tuple2<Long, VectorUrl>, Tuple2<Long, String>>() {
      @Override
      public Tuple2<Long, String> call(Tuple2<Long, VectorUrl> id2doc) throws Exception {
        return new Tuple2<Long, String>(id2doc._1, id2doc._2.url);
      }
    }));
id2Url.saveAsTextFile(id2UrlPath);

// Build the corpus of (documentId, vector) pairs
JavaPairRDD<Long, Vector> corpus = JavaPairRDD.fromJavaRDD(id2doc
    .map(new Function<Tuple2<Long, VectorUrl>, Tuple2<Long, Vector>>() {
      @Override
      public Tuple2<Long, Vector> call(Tuple2<Long, VectorUrl> id2doc) throws Exception {
        return new Tuple2<Long, Vector>(id2doc._1, id2doc._2.vec);
      }
    }));
corpus.cache();

// Cluster the documents into topicNumber topics using LDA
DistributedLDAModel ldaModel = (DistributedLDAModel) new LDA()
    .setMaxIterations(iterNumber)
    .setK(topicNumber)
    .run(corpus);

On Wed, Dec 30, 2015 at 3:34 PM, Li Li  wrote:
> I will use a portion of the data and try. Will the HDFS block affect
> Spark? (If so, it's hard to reproduce.)
>
> On Wed, Dec 30, 2015 at 3:22 AM, Joseph Bradley  wrote:
>> Hi Li,
>>
>> I'm wondering if you're running into the same bug reported here:
>> https://issues.apache.org/jira/browse/SPARK-12488
>>
>> I haven't figured out yet what is causing it.  Do you have a 

Re: Data and Model Parallelism in MLPC

2015-12-30 Thread Disha Shrivastava
Hi,

I went through the code for the implementation of MLPC and couldn't understand
why stacking/unstacking of the input data has been done. The description
says: "Block size for stacking input data in matrices to speed up the
computation. Data is stacked within partitions. If block size is more than
remaining data in a partition then it is adjusted to the size of this data.
Recommended size is between 10 and 1000. Default: 128". I am not quite sure
what this means and how it speeds up the computation.

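If I understand the intent, stacking turns many matrix-vector products into a
single matrix-matrix product per block, so the underlying BLAS can process a
whole block of examples at once. A toy Breeze sketch of that guess (names and
sizes are made up; this is not the MLPC code):

import breeze.linalg.{DenseMatrix, DenseVector}

val numFeatures = 4
val numHidden = 3
val blockSize = 128

val weights = DenseMatrix.rand(numHidden, numFeatures)            // one layer's weights
val examples = Array.fill(blockSize)(DenseVector.rand(numFeatures))

// Unstacked: one matrix-vector product (gemv) per example.
val oneByOne = examples.map(x => weights * x)

// Stacked: copy the block of examples into a single matrix, one column per
// example, and do one matrix-matrix product (gemm) for the whole block.
val block = DenseMatrix.zeros[Double](numFeatures, blockSize)
for (j <- 0 until blockSize) block(::, j) := examples(j)
val wholeBlock = weights * block                                  // same results, one call

Is that roughly what is going on, and is the speed-up simply from using one
gemm instead of many gemv calls?
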
Also, I couldn't find exactly how data parallelism, as depicted in
http://static.googleusercontent.com/media/research.google.com/hi//archive/large_deep_networks_nips2012.pdf,
is incorporated in the existing code. There seems to be no notion of a
parameter server, and the optimization routine is plain LBFGS rather than
Sandblaster LBFGS. The only parallelism seems to come from the way the
input data is read and stored.

Please correct me if I am wrong and clarify my doubt.

Thanks and Regards,
Disha

On Tue, Dec 29, 2015 at 5:40 PM, Disha Shrivastava 
wrote:

> Hi Alexander,
>
> Thanks a lot for your response. Yes, I am considering the use case where the
> weight matrix is too large to fit into the main memory of a single machine.
>
> Can you tell me ways of dividing the weight matrix? According to my
> investigation so far, we can do this in two ways:
>
> 1. By parallelizing the weight matrix RDD using sc.parallelize and then
> using suitable map functions in the forward and backward passes.
> 2. By using RowMatrix / BlockMatrix to represent the weight matrix and doing
> the calculations on it.
>
> Which of these methods would be more efficient to use? Also, I came across an
> implementation using Akka where layer-by-layer partitioning of the network
> has been done (
> http://alexminnaar.com/implementing-the-distbelief-deep-neural-network-training-framework-with-akka.html),
> which I believe is model parallelism in the true sense.
>
> Please suggest any other ways/implementation that can help. I would love
> to hear your remarks on the above.
>
> Thanks and Regards,
> Disha
>
> On Wed, Dec 9, 2015 at 1:29 AM, Ulanov, Alexander <
> alexander.ula...@hpe.com> wrote:
>
>> Hi Disha,
>>
>>
>>
>> Which use case do you have in mind that would require model parallelism?
>> It would need to have a large number of weights, so that it could not fit
>> into the memory of a single machine. For example, multilayer perceptron
>> topologies that are used for speech recognition have up to 100M weights.
>> Present hardware is capable of accommodating this in main memory. That might
>> be a problem for GPUs, but that is a different topic.
>>
>>
>>
>> The straightforward way of model parallelism for fully connected neural
>> networks is to distribute horizontal (or vertical) blocks of the weight
>> matrices across several nodes. That means that the input data has to be
>> replicated on all these nodes. The forward and the backward passes will
>> require re-assembling the outputs and the errors on each of the nodes after
>> each layer, because each node can produce only partial results, since it
>> holds only a part of the weights. According to my estimates, this is
>> inefficient due to the large intermediate traffic between the nodes and
>> should be used only if the model does not fit in the memory of a single
>> machine. Another way of doing model parallelism would be to represent the
>> network as a graph and use GraphX to write the forward and backward
>> propagation. However, this option does not seem very practical to me.
>>
>>
>>
>> Best regards, Alexander
>>
>>
>>
>> *From:* Disha Shrivastava [mailto:dishu@gmail.com]
>> *Sent:* Tuesday, December 08, 2015 11:19 AM
>> *To:* Ulanov, Alexander
>> *Cc:* dev@spark.apache.org
>> *Subject:* Re: Data and Model Parallelism in MLPC
>>
>>
>>
>> Hi Alexander,
>>
>> Thanks for your response. Can you suggest ways to incorporate model
>> parallelism in MLPC? I am trying to do the same in Spark. I got hold of
>> your post
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Model-parallelism-with-RDD-td13141.html
>> where you divided the weight matrix across different worker machines. I
>> have two basic questions in this regard:
>>
>> 1. How can I actually visualize/analyze and control how the nodes of the
>> neural network / weights are divided across different workers?
>>
>> 2. Is there any alternative way to achieve model parallelism for MLPC in
>> Spark? I believe we need some kind of synchronization and control
>> for the updating of weights shared across different workers during
>> backpropagation.
>>
>> Looking forward to your views on this.
>>
>> Thanks and Regards,
>>
>> Disha
>>
>>
>>
>> On Wed, Dec 9, 2015 at 12:36 AM, Ulanov, Alexander <
>> alexander.ula...@hpe.com> wrote:
>>
>> Hi Disha,
>>
>>
>>
>> Multilayer perceptron classifier in Spark implements data parallelism.
>>
>>
>>
>> Best regards, Alexander
>>
>>
>>
>> *From:* Disha Shrivastava [mailto:dishu@gmail.com]
>> *Sent:*