Re: Why don't we implement some adaptive learning rate methods, such as AdaDelta and Adam?

2016-11-30 Thread WangJianfei
Yes, thank you. I know the implementation is very simple, but I want to know why
Spark MLlib doesn't implement it.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Why-don-t-we-imp-some-adaptive-learning-rate-methods-such-as-adadelat-adam-tp20057p20060.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Why don't we implement some adaptive learning rate methods, such as AdaDelta and Adam?

2016-11-30 Thread WangJianfei
Hi devs:
Normally, adaptive learning rate methods converge faster than standard SGD,
so why don't we implement them?
See the link for more details:
http://sebastianruder.com/optimizing-gradient-descent/index.html#adadelta
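
For reference, here is a minimal sketch of a single Adam update step in plain Scala,
just to make the idea concrete. The object name, the array-based representation, and
the default hyperparameters are my own illustrative assumptions; this is not Spark
MLlib code.

// A minimal, illustrative sketch of one Adam update step in plain Scala.
// This is NOT Spark MLlib code; names and hyperparameters are assumptions
// chosen to show the idea of per-coordinate adaptive learning rates.
object AdamSketch {
  def adamStep(
      weights: Array[Double],   // current model weights
      grad: Array[Double],      // gradient for the current mini-batch
      m: Array[Double],         // first-moment estimate (running mean of gradients)
      v: Array[Double],         // second-moment estimate (running mean of squared gradients)
      t: Int,                   // iteration counter, starting at 1
      lr: Double = 0.001,
      beta1: Double = 0.9,
      beta2: Double = 0.999,
      eps: Double = 1e-8): Unit = {
    var i = 0
    while (i < weights.length) {
      m(i) = beta1 * m(i) + (1 - beta1) * grad(i)
      v(i) = beta2 * v(i) + (1 - beta2) * grad(i) * grad(i)
      // bias-corrected moment estimates
      val mHat = m(i) / (1 - math.pow(beta1, t))
      val vHat = v(i) / (1 - math.pow(beta2, t))
      // per-coordinate step: larger for rarely-updated coordinates, smaller for noisy ones
      weights(i) -= lr * mHat / (math.sqrt(vHat) + eps)
      i += 1
    }
  }
}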



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Why-don-t-we-imp-some-adaptive-learning-rate-methods-such-as-adadelat-adam-tp20057.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Question about spark.mllib.GradientDescent

2016-11-29 Thread WangJianfei
Hi devs:
   I think it's unnecessary to use c1._1 += c2._1 in the combOp operation; I
think using c1._1 + c2._1 would give the same result. See the code below
from GradientDescent.scala:

    val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i)
      .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
        seqOp = (c, v) => {
          // c: (grad, loss, count), v: (label, features)
          // c._1 (i.e. grad) will be updated in gradient.compute
          val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
          (c._1, c._2 + l, c._3 + 1)
        },
        combOp = (c1, c2) => {
          // c: (grad, loss, count)
          (c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
        })
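
For what it's worth, a small sketch (not Spark source) of why the in-place += is
still preferable even though both forms give the same numbers; the assumption is a
large dense gradient, as in the BDV accumulator above.

// Illustrative sketch: both expressions produce the same values, but for a
// d-dimensional dense gradient, c1._1 + c2._1 allocates a fresh d-element
// vector at every merge, while c1._1 += c2._1 reuses the accumulator buffer
// that treeAggregate already owns (aggregate functions may mutate their
// first argument).
import breeze.linalg.DenseVector

object CombOpSketch {
  def main(args: Array[String]): Unit = {
    val d = 1000000                          // e.g. a high-dimensional model (~8 MB of doubles)
    val acc  = DenseVector.zeros[Double](d)
    val part = DenseVector.ones[Double](d)

    // Allocating merge: a brand-new d-element vector per merged partition.
    val merged = acc + part

    // In-place merge: no new allocation; acc itself becomes the result.
    acc += part

    println(s"${merged(0)} ${acc(0)}")       // 1.0 1.0 -- same values either way
  }
}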



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Question-about-spark-mllib-GradientDescent-tp20052.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Reduce the memory usage if we do sample first in GradientBoostedTrees if subsamplingRate < 1.0

2016-11-15 Thread WangJianfei
With predError.zip(input) we get a zipped RDD. We could sample just predError
or just input, but then we could no longer zip them, since zip requires the same
number of elements in each partition. Thank you!
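
To make the point concrete, here is a minimal sketch, not the actual
GradientBoostedTrees code, of sampling after the zip so the pairs stay aligned;
the type of predError and the per-iteration seed are assumptions on my part.

// A minimal sketch, not the actual GradientBoostedTrees code: if we sample
// AFTER zipping, predictions and points stay paired, so zip's alignment
// requirement is never violated.
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

object SampleAfterZipSketch {
  def residualSample(
      predError: RDD[(Double, Double)],   // assumed: (prediction, error) per point
      input: RDD[LabeledPoint],
      subsamplingRate: Double,
      iteration: Int): RDD[(Double, Double, LabeledPoint)] = {
    val zipped = predError.zip(input).map { case ((pred, err), point) => (pred, err, point) }
    if (subsamplingRate < 1.0) {
      // A fresh sample per iteration, keeping prediction and point together.
      zipped.sample(withReplacement = false, subsamplingRate, seed = 42L + iteration)
    } else {
      zipped
    }
  }
}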



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Reduce-the-memory-usage-if-we-do-sample-first-in-GradientBoostedTrees-with-the-condition-that-subsam0-tp19826p19905.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Reduce the memory usage if we do sample first in GradientBoostedTrees if subsamplingRate < 1.0

2016-11-15 Thread WangJianfei
With predError.zip(input) we get a zipped RDD. We could sample just predError
or just input, but then we could no longer zip them, since zip requires the same
number of elements in each partition. Thank you!




-- Original message --
From: "Joseph Bradley [via Apache Spark Developers List]"
Sent: Wednesday, November 16, 2016, 3:54 AM
To: "WangJianfei"

Subject: Re: Reduce the memory usage if we do sample first in GradientBoostedTrees if
subsamplingRate < 1.0



Thanks for the suggestion.  That would be faster, but less accurate in
most cases.  It's generally better to use a new random sample on each
iteration, based on literature and results I've seen.

Joseph


On Fri, Nov 11, 2016 at 5:13 AM, WangJianfei <[hidden email]> wrote:
 when we train the model, we will use the data with a subsamplingRate, so if the
 subsamplingRate < 1.0, we can do a sample first to reduce the memory usage.
 See the code below in GradientBoostedTrees.boost():

  while (m < numIterations && !doneLearning) {
    // Update data with pseudo-residuals (residual errors)
    val data = predError.zip(input).map { case ((pred, _), point) =>
      LabeledPoint(-loss.gradient(pred, point.label), point.features)
    }

    timer.start(s"building tree $m")
    logDebug("###")
    logDebug("Gradient boosting tree iteration " + m)
    logDebug("###")
    val dt = new DecisionTreeRegressor().setSeed(seed + m)
    val model = dt.train(data, treeStrategy)
 
 
 
 
 
 --
 View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Reduce-the-memory-usage-if-we-do-same-first-in-GradientBoostedTrees-if-subsamplingRate-1-0-tp19826.html
 Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
 
 -
 To unsubscribe e-mail: [hidden email]
 
 








--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Reduce-the-memory-usage-if-we-do-same-first-inGradientBoostedTrees-if-subsamplingRate-1-0-tp19904.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Does the design of Spark consider Scala parallel collections?

2016-11-12 Thread WangJianfei
Hi devs:
   According to the Scala docs, Scala has parallel collections, and in my
experiments parallel collections can indeed accelerate operations such as map.
So I want to know whether Spark uses Scala parallel collections, and whether
Spark will consider doing so. Thank you!
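
For clarity, a small sketch of what I mean by Scala parallel collections
(illustrative only, and not a claim about how Spark works internally):

// Illustrative sketch: Scala parallel collections parallelize across the
// cores of a single JVM, while Spark RDDs parallelize across executors.
// (In Scala 2.11/2.12, .par is available without extra dependencies.)
object ParCollectionsSketch {
  def main(args: Array[String]): Unit = {
    val xs = (1 to 1000000).toVector

    // Single-machine, multi-core: work is split over a local ForkJoinPool.
    val localSquares = xs.par.map(x => x.toLong * x)

    println(localSquares.sum)
  }
}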



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/does-The-Design-of-spark-consider-the-scala-parallelize-collections-tp19833.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Reduce the memory usage if we do sample first in GradientBoostedTrees if subsamplingRate < 1.0

2016-11-11 Thread WangJianfei
when we train the model, we will use the data with a subsamplingRate, so if the
subsamplingRate < 1.0, we can do a sample first to reduce the memory usage.
See the code below in GradientBoostedTrees.boost():

 while (m < numIterations && !doneLearning) {
   // Update data with pseudo-residuals (residual errors)
   val data = predError.zip(input).map { case ((pred, _), point) =>
     LabeledPoint(-loss.gradient(pred, point.label), point.features)
   }

   timer.start(s"building tree $m")
   logDebug("###")
   logDebug("Gradient boosting tree iteration " + m)
   logDebug("###")
   val dt = new DecisionTreeRegressor().setSeed(seed + m)
   val model = dt.train(data, treeStrategy)




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Reduce-the-memory-usage-if-we-do-same-first-in-GradientBoostedTrees-if-subsamplingRate-1-0-tp19826.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



If we run sc.textFile(path, xxx) many times, will the elements be the same in each partition?

2016-11-10 Thread WangJianfei
Hi Devs:
If I run sc.textFile(path, xxx) many times, will the elements be the
same (same elements, same order) in each partition?
My experiments show that they are the same, but that may not cover all the
cases. Thank you!
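
For reference, this is roughly the kind of check I ran; the path, partition count,
and local master are assumptions, and of course the result only covers one setup:

// A minimal sketch of one way to compare per-partition contents across two
// reads of the same file. The path and partition count are assumptions.
import org.apache.spark.{SparkConf, SparkContext}

object TextFileStabilityCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("textFile-check").setMaster("local[4]"))

    def partitionContents(minPartitions: Int): Array[List[String]] =
      sc.textFile("/tmp/sample.txt", minPartitions)
        .mapPartitions(it => Iterator(it.toList))   // keep per-partition order
        .collect()

    val first  = partitionContents(8)
    val second = partitionContents(8)

    // Element-by-element comparison of each partition, preserving order.
    val identical = first.zip(second).forall { case (a, b) => a == b }
    println(s"Partitions identical across reads: $identical")

    sc.stop()
  }
}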



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/If-we-run-sc-textfile-path-xxx-many-times-will-the-elements-be-the-same-in-each-partition-tp19814.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-16 Thread WangJianfei
thank you!
 But I think it's user-unfriendly to process a standard JSON file with
DataFrame. Should we provide a new overloaded method to do this?



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Why-the-json-file-used-by-sparkSession-read-json-must-be-a-valid-json-object-per-line-tp19464p19468.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-16 Thread WangJianfei
Thank you very much! I will have a look about your link.




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Why-the-json-file-used-by-sparkSession-read-json-must-be-a-valid-json-object-per-line-tp19464p19466.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-15 Thread WangJianfei
Hi devs:
   I have a doubt about the design of spark.read.json: why is the input not a
standard JSON file, but one JSON object per line? Can anyone tell me the
internal reason? Any advice is appreciated.
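
As far as I understand, the one-object-per-line (JSON Lines) layout lets Spark split
the file by line across partitions without parsing it first. For a standard
multi-line JSON file, one possible workaround is sketched below; the path is an
assumption and this is only one approach, not an official recommendation.

// A hedged sketch of a workaround for standard (pretty-printed) JSON files:
// read each file as a single string with wholeTextFiles, then hand the
// strings to read.json, which parses each RDD element as one JSON document,
// so newlines inside a document are fine.
import org.apache.spark.sql.SparkSession

object WholeFileJsonSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("whole-file-json").master("local[*]").getOrCreate()

    // One element per file: (path, full file content)
    val wholeFiles = spark.sparkContext.wholeTextFiles("/data/pretty-printed-json/")
    val df = spark.read.json(wholeFiles.values)   // parse each whole file as one JSON document

    df.printSchema()
    df.show()
    spark.stop()
  }
}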



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Why-the-json-file-used-by-sparkSession-read-json-must-be-a-valid-json-object-per-line-tp19464.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Broadcast big dataset

2016-09-28 Thread WangJianfei
First thank you very much!
  My executor memory is also 4G, but my Spark version is 1.5. Could the Spark
version be causing the trouble?




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Broadcast-big-dataset-tp19127p19143.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Broadcast big dataset

2016-09-28 Thread WangJianfei
Hi Devs
 In my application, I broadcast a dataset (about 500 MB) to the
executors (100+), and I got a Java heap error:
Jmartad-7219.hadoop.jd.local:53591 (size: 4.0 MB, free: 3.3 GB)
16/09/28 15:56:48 INFO BlockManagerInfo: Added broadcast_9_piece19 in memory
on BJHC-Jmartad-9012.hadoop.jd.local:53197 (size: 4.0 MB, free: 3.3 GB)
16/09/28 15:56:49 INFO BlockManagerInfo: Added broadcast_9_piece8 in memory
on BJHC-Jmartad-84101.hadoop.jd.local:52044 (size: 4.0 MB, free: 3.3 GB)
16/09/28 15:56:58 INFO BlockManagerInfo: Removed broadcast_8_piece0 on
172.22.176.114:37438 in memory (size: 2.7 KB, free: 3.1 GB)
16/09/28 15:56:58 WARN TaskSetManager: Lost task 125.0 in stage 7.0 (TID 130, BJHC-Jmartad-9376.hadoop.jd.local): java.lang.OutOfMemoryError: Java heap space
        at java.io.ObjectInputStream$HandleTable.grow(ObjectInputStream.java:3465)
        at java.io.ObjectInputStream$HandleTable.assign(ObjectInputStream.java:3271)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1789)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)

My configuration is 4 GB of memory for the driver. Any advice is appreciated.
Thank you!
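
A minimal sketch of the kind of setup involved, with illustrative values (the
table contents, the sizes in the comments, and the Kryo choice are assumptions
made for the sketch, not a description of my actual job):

// Hedged sketch: the OOM in the log happens while an executor deserializes the
// broadcast, so executor heap is usually the first thing to raise, e.g. via
//   spark-submit --driver-memory 8g --executor-memory 8g ...
// (the 8g values are illustrative assumptions, not tuned recommendations).
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("broadcast-sketch")
      // Kryo usually shrinks the serialized payload compared to Java serialization.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    // Illustrative large lookup table; in the real job this comes from data.
    val bigTable: Map[Long, String] = (0L until 1000000L).map(i => i -> ("value" + i)).toMap
    val bc = sc.broadcast(bigTable)

    val hits = sc.parallelize(0L until 10000L).filter(k => bc.value.contains(k)).count()
    println(s"hits = $hits")

    bc.destroy()   // release broadcast blocks once no longer needed
    sc.stop()
  }
}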



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Broadcast-big-dataset-tp19127.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: What's the use of RangePartitioner.hashCode

2016-09-24 Thread WangJianfei
thank you!



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/What-s-the-use-of-RangePartitioner-hashCode-tp18953p19037.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: What's the use of RangePartitioner.hashCode

2016-09-21 Thread WangJianfei
Thank you very much, sir! But what I want to know is whether the hashCode
overflow will cause any trouble. Thank you!



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/What-s-the-use-of-RangePartitioner-hashCode-tp18953p18996.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: java.lang.NoClassDefFoundError, is this a bug?

2016-09-17 Thread WangJianfei
If I remove this abstract class A[T : Encoder] {}, it works fine!



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/java-lang-NoClassDefFoundError-is-this-a-bug-tp18972p18980.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: java.lang.NoClassDefFoundError, is this a bug?

2016-09-17 Thread WangJianfei
Do you run this in YARN mode, or in some other mode?



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/java-lang-NoClassDefFoundError-is-this-a-bug-tp18972p18978.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Fwd: Question regarding merging two RDDs

2016-09-17 Thread WangJianfei
Maybe you can use a DataFrame, with the header file providing the schema.
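
A rough sketch of that idea, with everything about the data assumed (a one-line
header file with column names and a matching comma-separated data file):

// Illustrative sketch only: build a schema from an assumed header file and
// apply it to an assumed data file; the paths and string-typed columns are
// placeholders, not part of the original thread.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object HeaderAsSchemaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("header-as-schema").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Assumed inputs: "/data/header.txt" contains "id,name,age" on one line,
    // "/data/rows.txt" contains the matching comma-separated records.
    val header = sc.textFile("/data/header.txt").first().split(",")
    val schema = StructType(header.map(name => StructField(name, StringType, nullable = true)))

    val rows = sc.textFile("/data/rows.txt").map(line => Row.fromSeq(line.split(",", -1).toSeq))
    val df = spark.createDataFrame(rows, schema)

    df.show()
    spark.stop()
  }
}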



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Fwd-Question-regarding-merging-to-two-RDDs-tp18971p18977.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Doubt about ExternalSorter.spillMemoryIteratorToDisk

2016-09-16 Thread WangJianfei
We can see that flush() is called when the number of objects written reaches
serializerBatchSize. But what happens if the data written exceeds the default
buffer size before that? In that situation, will flush() be called
automatically?

private[this] def spillMemoryIteratorToDisk(
    inMemoryIterator: WritablePartitionedIterator): SpilledFile = {

  // ignore some code here
  try {
    while (inMemoryIterator.hasNext) {
      val partitionId = inMemoryIterator.nextPartition()
      require(partitionId >= 0 && partitionId < numPartitions,
        s"partition Id: ${partitionId} should be in the range [0, ${numPartitions})")
      inMemoryIterator.writeNext(writer)
      elementsPerPartition(partitionId) += 1
      objectsWritten += 1
      if (objectsWritten == serializerBatchSize) {
        flush()
      }
    }

  // ignore some code here
  SpilledFile(file, blockId, batchSizes.toArray, elementsPerPartition)
}



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Doubt-about-ExternalSorter-spillMemoryIteratorToDisk-tp18969.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: What's the meaning when the number of partitions is zero?

2016-09-16 Thread WangJianfei
If so, we will get an exception when numPartitions is 0:

  def getPartition(key: Any): Int = key match {
    case null => 0
    // case None => 0
    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
  }
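
A tiny sketch that just confirms the failure mode; nonNegativeModSketch mirrors the
logic of Utils.nonNegativeMod for illustration, since the real method is
private[spark]:

// With numPartitions == 0, the modulo inside getPartition throws, so a
// 0-partition partitioner is only safe if it never sees a key.
object ZeroPartitionsSketch {
  def nonNegativeModSketch(x: Int, mod: Int): Int = {
    val rawMod = x % mod                    // throws ArithmeticException when mod == 0
    rawMod + (if (rawMod < 0) mod else 0)
  }

  def main(args: Array[String]): Unit = {
    try {
      nonNegativeModSketch("someKey".hashCode, 0)
    } catch {
      case e: ArithmeticException =>
        println(s"getPartition would fail with: $e")
    }
  }
}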



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/What-s-the-meaning-when-the-partitions-is-zero-tp18957p18967.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



What's the meaning when the number of partitions is zero?

2016-09-15 Thread WangJianfei
class HashPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")

The source code only requires partitions >= 0, but I don't understand why it
makes sense for partitions to be 0.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/What-s-the-meaning-when-the-partitions-is-zero-tp18957.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Why do we get 0 when the key is null?

2016-09-15 Thread WangJianfei
Even when the key is not in the RDD, I can still get a value; that just feels a
little strange to me.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Why-we-get-0-when-the-key-is-null-tp18952p18955.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



What's the use of RangePartitioner.hashCode

2016-09-15 Thread WangJianfei
Can anyone give me an example of the use of RangePartitioner.hashCode? Thank
you!
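
One small sketch of why a Partitioner wants a hashCode that matches its equals at
all; HashPartitioner is used here only because it can be built without an RDD, and
my understanding is that RangePartitioner overrides hashCode for the same
equals/hashCode contract:

// Spark compares partitioners with equals (e.g. to decide whether a join
// needs a shuffle), and the general JVM contract says equal objects must
// hash equally, otherwise hash-based collections misbehave.
import org.apache.spark.HashPartitioner

object PartitionerHashCodeSketch {
  def main(args: Array[String]): Unit = {
    val p1 = new HashPartitioner(10)
    val p2 = new HashPartitioner(10)

    println(p1 == p2)                       // true: same number of partitions
    println(p1.hashCode == p2.hashCode)     // true: required by the contract, so that...
    println(Set(p1, p2).size)               // ...a hash-based Set sees one partitioner, not two
  }
}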



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/What-s-the-use-of-RangePartitioner-hashCode-tp18953.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Why do we get 0 when the key is null?

2016-09-15 Thread WangJianfei
This function is in Partitioner.scala:

  def getPartition(key: Any): Int = key match {
    case null => 0
    // case None => 0
    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
  }
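
A tiny sketch of my reading of that null case: a null key has no usable hashCode to
feed into nonNegativeMod, so the match pins every null key to partition 0 instead.

// Calling hashCode on a null reference throws, which is presumably why the
// null case is handled before the hash-based branch.
object NullKeySketch {
  def main(args: Array[String]): Unit = {
    val key: Any = null
    try {
      key.hashCode                          // NullPointerException: null cannot be hashed
    } catch {
      case _: NullPointerException =>
        println("null has no hashCode, so all null keys go to partition 0")
    }
  }
}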



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Why-we-get-0-when-the-key-is-null-tp18952.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org