RE: LDA and Maximum Iterations

2016-09-20 Thread Yang, Yuhao
Hi Frank,

Which version of Spark are you using? Also, can you share more information about 
the exception?

If it’s not confidential, you can send the data sample to me 
(yuhao.y...@intel.com) and I can try to investigate.
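
In the meantime, one setting that is sometimes relevant when raising the iteration 
count with the (default) EM optimizer is checkpointing, since many iterations build 
a long RDD lineage. A rough sketch only, reusing the corpus from your snippet, with 
a placeholder checkpoint path; this is a guess, not a confirmed fix for this 
particular failure:

import org.apache.spark.mllib.clustering.LDA

// Guess: long EM lineages can cause driver/stack issues; checkpointing truncates them.
sc.setCheckpointDir("/tmp/spark-checkpoints")   // placeholder path

val ldaModel = new LDA()
  .setK(3)
  .setMaxIterations(500)
  .setCheckpointInterval(10)   // checkpoint every 10 iterations
  .run(corpus)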

Regards,
Yuhao

From: Frank Zhang [mailto:dataminin...@yahoo.com.INVALID]
Sent: Monday, September 19, 2016 9:20 PM
To: user@spark.apache.org
Subject: LDA and Maximum Iterations

Hi all,

   I have a question about parameter settings for the LDA model. When I try to set 
a large number like 500 for setMaxIterations, the program always fails. There is a 
very straightforward LDA tutorial using an example data set in the mllib package: 
http://stackoverflow.com/questions/36631991/latent-dirichlet-allocation-lda-algorithm-not-printing-results-in-spark-scala.
The code is here:

import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors
// Load and parse the data
val data = sc.textFile("/data/mllib/sample_lda_data.txt") // you might need to change the path for the data set
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
// Index documents with unique IDs
val corpus = parsedData.zipWithIndex.map(_.swap).cache()
// Cluster the documents into three topics using LDA
val ldaModel = new LDA().setK(3).run(corpus)

But if I change the last line to
val ldaModel = new LDA().setK(3).setMaxIterations(500).run(corpus), the program 
fails.

I greatly appreciate your help!

Best,

Frank





RE: MLLIB - Storing the Trained Model

2015-06-23 Thread Yang, Yuhao
Hi Samsudhin,

  If possible, can you please share part of the code? Or perhaps try the unit tests 
in RandomForestSuite to see if the issue reproduces.
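
For reference, the save/load round trip I would expect to work looks roughly like 
this (a minimal sketch; the data path, output path, and training parameters are 
placeholders):

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils

// Train a small forest on the bundled sample data (placeholder path).
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val model = RandomForest.trainClassifier(data, 2, Map[Int, Int](),
  3, "auto", "gini", 4, 32)

// Save to a directory that does not exist yet, then load it back.
model.save(sc, "/tmp/myRandomForestModel")
val sameModel = RandomForestModel.load(sc, "/tmp/myRandomForestModel")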

Regards,
yuhao

-Original Message-
From: samsudhin [mailto:samsud...@pigstick.com] 
Sent: Tuesday, June 23, 2015 2:14 PM
To: user@spark.apache.org
Subject: MLLIB - Storing the Trained Model

Hi All,

I was trying to store a trained model on the local hard disk. I am able to save it 
using the save() function, but when I try to retrieve the stored model using the 
load() function I end up with the following error. Kindly help me with this.

scala> val sameModel = RandomForestModel.load(sc, "/home/ec2-user/myModel")
15/06/23 02:04:25 INFO MemoryStore: ensureFreeSpace(255260) called with 
curMem=592097, maxMem=278302556
15/06/23 02:04:25 INFO MemoryStore: Block broadcast_6 stored as values in 
memory (estimated size 249.3 KB, free 264.6 MB)
15/06/23 02:04:25 INFO MemoryStore: ensureFreeSpace(36168) called with 
curMem=847357, maxMem=278302556
15/06/23 02:04:25 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in 
memory (estimated size 35.3 KB, free 264.6 MB)
15/06/23 02:04:25 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on 
localhost:42290 (size: 35.3 KB, free: 265.3 MB)
15/06/23 02:04:25 INFO BlockManagerMaster: Updated info of block
broadcast_6_piece0
15/06/23 02:04:25 INFO SparkContext: Created broadcast 6 from textFile at
modelSaveLoad.scala:125
15/06/23 02:04:25 INFO FileInputFormat: Total input paths to process : 1
15/06/23 02:04:25 INFO SparkContext: Starting job: first at
modelSaveLoad.scala:125
15/06/23 02:04:25 INFO DAGScheduler: Got job 3 (first at
modelSaveLoad.scala:125) with 1 output partitions (allowLocal=true)
15/06/23 02:04:25 INFO DAGScheduler: Final stage: Stage 3(first at
modelSaveLoad.scala:125)
15/06/23 02:04:25 INFO DAGScheduler: Parents of final stage: List()
15/06/23 02:04:25 INFO DAGScheduler: Missing parents: List()
15/06/23 02:04:25 INFO DAGScheduler: Submitting Stage 3 
(/home/ec2-user/myModel/metadata MapPartitionsRDD[7] at textFile at 
modelSaveLoad.scala:125), which has no missing parents
15/06/23 02:04:25 INFO MemoryStore: ensureFreeSpace(2680) called with 
curMem=883525, maxMem=278302556
15/06/23 02:04:25 INFO MemoryStore: Block broadcast_7 stored as values in 
memory (estimated size 2.6 KB, free 264.6 MB)
15/06/23 02:04:25 INFO MemoryStore: ensureFreeSpace(1965) called with 
curMem=886205, maxMem=278302556
15/06/23 02:04:25 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes in 
memory (estimated size 1965.0 B, free 264.6 MB)
15/06/23 02:04:25 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on 
localhost:42290 (size: 1965.0 B, free: 265.3 MB)
15/06/23 02:04:25 INFO BlockManagerMaster: Updated info of block
broadcast_7_piece0
15/06/23 02:04:25 INFO SparkContext: Created broadcast 7 from broadcast at
DAGScheduler.scala:839
15/06/23 02:04:25 INFO DAGScheduler: Submitting 1 missing tasks from Stage 3 
(/home/ec2-user/myModel/metadata MapPartitionsRDD[7] at textFile at
modelSaveLoad.scala:125)
15/06/23 02:04:25 INFO TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
15/06/23 02:04:25 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 3, 
localhost, PROCESS_LOCAL, 1311 bytes)
15/06/23 02:04:25 INFO Executor: Running task 0.0 in stage 3.0 (TID 3)
15/06/23 02:04:25 INFO HadoopRDD: Input split:
file:/home/ec2-user/myModel/metadata/part-0:0+97
15/06/23 02:04:25 INFO Executor: Finished task 0.0 in stage 3.0 (TID 3).
1989 bytes result sent to driver
15/06/23 02:04:25 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID
3) in 10 ms on localhost (1/1)
15/06/23 02:04:25 INFO DAGScheduler: Stage 3 (first at
modelSaveLoad.scala:125) finished in 0.010 s
15/06/23 02:04:25 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have 
all completed, from pool
15/06/23 02:04:25 INFO DAGScheduler: Job 3 finished: first at 
modelSaveLoad.scala:125, took 0.016193 s
15/06/23 02:04:25 WARN FSInputChecker: Problem opening checksum file:
file:/home/ec2-user/myModel/data/_temporary/0/_temporary/attempt_201506230149_0027_r_01_0/part-r-2.parquet.
 
Ignoring exception:
java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.init(ChecksumFileSystem.java:149)
at
org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
at
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:402)
at
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:298)
at
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:297)
at
scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
at

RE: The explanation of input text format using LDA in Spark

2015-05-08 Thread Yang, Yuhao
Hi Cui,

Try reading the Scala version of LDAExample:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala
 

The matrix you're referring to is the corpus after vectorization. 

For example, given the dictionary [apple, orange, banana], these 3 documents:
apple orange
orange banana
apple banana
can be represented by the dense term-count vectors:
1, 1, 0
0, 1, 1
1, 0, 1
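
Concretely, a minimal sketch of going from those raw strings to an LDA corpus 
(assuming a SparkContext sc is available; the tokenization here is deliberately 
naive):

import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Toy vocabulary and documents from the example above.
val vocab = Seq("apple", "orange", "banana")
val docs = Seq("apple orange", "orange banana", "apple banana")

// Each document becomes a vector of term counts over the vocabulary.
val corpus = sc.parallelize(docs)
  .map { doc =>
    val counts = doc.toLowerCase.split("\\s+").groupBy(identity).mapValues(_.length)
    Vectors.dense(vocab.map(w => counts.getOrElse(w, 0).toDouble).toArray)
  }
  .zipWithIndex.map(_.swap)   // LDA expects (docId, termCountVector) pairs
  .cache()

val ldaModel = new LDA().setK(2).run(corpus)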

Cheers,
Yuhao


-Original Message-
From: Cui xp [mailto:lifeiniao...@gmail.com] 
Sent: Wednesday, May 6, 2015 4:28 PM
To: user@spark.apache.org
Subject: The explanation of input text format using LDA in Spark

Hi all,
   After reading the example code using LDA in Spark, I found that the input text 
in the code is a matrix. The format of the text is as follows:
1 2 6 0 2 3 1 1 0 0 3
1 3 0 1 3 0 0 2 0 0 1
1 4 1 0 0 4 9 0 1 2 0
2 1 0 3 0 0 5 0 2 3 9
3 1 1 9 3 0 2 0 0 1 3
4 2 0 3 4 5 1 1 1 4 0
2 1 0 3 0 0 5 0 2 2 9
1 1 1 9 2 1 2 0 0 1 3
4 4 0 3 4 2 1 3 0 0 0
2 8 2 0 3 0 2 0 2 7 2
1 1 1 9 0 2 2 0 0 3 3
4 1 0 0 4 5 1 3 0 1 0
But I don't know what each row or each column means. And if I have several text 
documents, how do I process them in order to use LDA in Spark? Thanks.


Cui xp







RE: Easy way to partition an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Yang, Yuhao
Check spark/mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala

It can be used through sliding(windowSize: Int) in
spark/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala
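
A rough usage sketch (assuming a SparkContext sc; note that sliding produces 
overlapping windows of consecutive elements, which is different from Guava-style 
disjoint chunks):

import org.apache.spark.mllib.rdd.RDDFunctions._   // brings sliding() into scope

val rdd = sc.parallelize(1 to 10, 2)
rdd.sliding(3).collect()
// e.g. Array(Array(1, 2, 3), Array(2, 3, 4), ..., Array(8, 9, 10))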

Yuhao

From: Mark Hamstra [mailto:m...@clearstorydata.com]
Sent: Thursday, February 12, 2015 7:00 AM
To: Corey Nolet
Cc: user
Subject: Re: Easy way to partition an RDD into chunks like Guava's 
Iterables.partition

No, only each group should need to fit.

On Wed, Feb 11, 2015 at 2:56 PM, Corey Nolet cjno...@gmail.com wrote:
Doesn't iter still need to fit entirely into memory?

On Wed, Feb 11, 2015 at 5:55 PM, Mark Hamstra m...@clearstorydata.com wrote:
rdd.mapPartitions { iter =>
  val grouped = iter.grouped(batchSize)
  for (group <- grouped) { ... }
}
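
Spelled out a little more as a runnable sketch (the batch size and the per-group 
action are placeholders; each partition is grouped locally, so there is no shuffle):

import org.apache.spark.rdd.RDD

val people: RDD[Int] = sc.parallelize(1 to 100, 4)   // stand-in for an RDD[People]
val batchSize = 30

// Group each partition locally into batches of up to batchSize elements.
val batched: RDD[Seq[Int]] = people.mapPartitions(_.grouped(batchSize))

batched.foreach { group =>
  println(s"processing ${group.size} records")   // process one batch at a time
}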

On Wed, Feb 11, 2015 at 2:44 PM, Corey Nolet cjno...@gmail.com wrote:
I think the word partition here is a tad different from the term partition that we 
use in Spark. Basically, I want something similar to Guava's Iterables.partition 
[1]; that is, if I have an RDD[People] and I want to run an algorithm that can be 
optimized by working on 30 people at a time, I'd like to be able to say:

val rdd: RDD[People] = .
val partitioned: RDD[Seq[People]] = rdd.partition(30)

I also don't want any shuffling; everything can still be processed locally.


[1] 
http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect/Iterables.html#partition(java.lang.Iterable,%20int)