[mllib] Is there any bugs to divide a Breeze sparse vectors at Spark v1.3.0-rc3?

2015-03-15 Thread Yu Ishikawa
Hi all,

Are there any bugs in dividing a Breeze sparse vector at Spark v1.3.0-rc3? When
I tried to divide a sparse vector by a scalar at Spark v1.3.0-rc3, I got a wrong
result if the target vector has any zero values.

Spark v1.3.0-rc3 depends on Breeze v0.11.1, and Breeze v0.11.1 seems to have a
bug in dividing a sparse vector by a scalar value. When dividing a Breeze sparse
vector which has any zero values, the result seems to be a zero vector. However,
the same code works on Spark v1.2.x.

However, there is no problem with multiplying a Breeze sparse vector. I asked
the Breeze community about this problem in the issue below.
https://github.com/scalanlp/breeze/issues/382

For example,
```
test("dividing a breeze sparse vector") {
  val vec = Vectors.sparse(6, Array(0, 4), Array(0.0, 10.0)).toBreeze
  val n = 60.0
  val answer1 = vec :/ n
  val answer2 = vec.toDenseVector :/ n
  println(vec)
  println(answer1)
  println(answer2)
  assert(answer1.toDenseVector === answer2)
}

SparseVector((0,0.0), (4,10.0))
SparseVector()
DenseVector(0.0, 0.0, 0.0, 0.0, 0.1, 0.0)

DenseVector(0.0, 0.0, 0.0, 0.0, 0.0, 0.0) did not equal DenseVector(0.0, 0.0, 0.0, 0.0, 0.1, 0.0)
org.scalatest.exceptions.TestFailedException: DenseVector(0.0, 0.0, 0.0, 0.0, 0.0, 0.0) did not equal DenseVector(0.0, 0.0, 0.0, 0.0, 0.1, 0.0)
```
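
Since scalar multiplication on sparse vectors still works, one possible interim workaround is to scale by the reciprocal instead of using :/ on the vector. A minimal sketch, not from the original thread and only an untested illustration against Breeze's API:

```scala
import breeze.linalg.{SparseVector => BSV}

// Workaround sketch: multiplication on Breeze sparse vectors is reported to be
// unaffected by the bug, so scale by 1/n instead of dividing by n.
val vec = new BSV[Double](Array(0, 4), Array(0.0, 10.0), 6)
val n = 60.0
val divided = vec :* (1.0 / n)  // only the stored entries are scaled
println(divided)
```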

Thanks,
Yu Ishikawa






Re: [mllib] Is there any bugs to divide a Breeze sparse vectors at Spark v1.3.0-rc3?

2015-03-15 Thread Yu Ishikawa
David Hall, the creator of Breeze, told me that it's a bug, so I made a JIRA
ticket about this issue. We need to upgrade Breeze from 0.11.1 to 0.11.2 or
later in order to fix the bug, once the new version of Breeze is released.

[SPARK-6341] Upgrade breeze from 0.11.1 to 0.11.2 or later - ASF JIRA
https://issues.apache.org/jira/browse/SPARK-6341
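
For anyone who wants to pick up the fixed release in their own project once it is published, the change amounts to bumping the Breeze dependency. A hypothetical sbt snippet, assuming 0.11.2 is the first version containing the fix:

```scala
// build.sbt sketch: depend on the Breeze release that contains the division fix.
libraryDependencies += "org.scalanlp" %% "breeze" % "0.11.2"
```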

Thanks,
Yu Ishikawa






Re: [mllib] Is there any bugs to divide a Breeze sparse vectors at Spark v1.3.0-rc3?

2015-03-15 Thread DB Tsai
It's a bug on Breeze's side. Once David fixes it and publishes it to Maven,
we can upgrade to Breeze 0.11.2. Please file a JIRA ticket for this issue.
Thanks.

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com





Re: Wrong version on the Spark documentation page

2015-03-15 Thread Ted Yu
When I enter http://spark.apache.org/docs/latest/ into the Chrome address bar,
I see 1.3.0.

Cheers





Wrong version on the Spark documentation page

2015-03-15 Thread Cheng Lian

It's still marked as 1.2.1 here http://spark.apache.org/docs/latest/

But this page is updated (1.3.0) 
http://spark.apache.org/docs/latest/index.html


Cheng




Re: Wrong version on the Spark documentation page

2015-03-15 Thread Patrick Wendell
Cheng - what if you hold shift+refresh? For me the /latest link
correctly points to 1.3.0

On Sun, Mar 15, 2015 at 10:40 AM, Cheng Lian lian.cs@gmail.com wrote:
 It's still marked as 1.2.1 here http://spark.apache.org/docs/latest/

 But this page is updated (1.3.0)
 http://spark.apache.org/docs/latest/index.html

 Cheng




Re: broadcast hang out

2015-03-15 Thread Mridul Muralidharan
Cross-region as in different data centers?

- Mridul




Re: broadcast hang out

2015-03-15 Thread lonely Feb
yes




Re: broadcast hang out

2015-03-15 Thread lonely Feb
Thanks. But this method is in Spark's BlockTransferService.scala, which I cannot
replace unless I rewrite the core code. I wonder if the timeout is already
handled somewhere.

2015-03-16 11:27 GMT+08:00 Chester Chen ches...@alpinenow.com:

 can you just replace Duration.Inf with a shorter duration  ? how about

   import scala.concurrent.duration._
   val timeout = new Timeout(10 seconds)
   Await.result(result.future, timeout.duration)

   or

   val timeout = new FiniteDuration(10, TimeUnit.SECONDS)
   Await.result(result.future, timeout)

   or simply
   import scala.concurrent.duration._
   Await.result(result.future, 10 seconds)








Re: [mllib] Is there any bugs to divide a Breeze sparse vectors at Spark v1.3.0-rc3?

2015-03-15 Thread David Hall
The snapshot is pushed. If you verify it, I'll publish the new artifacts.
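
For anyone verifying against the snapshot, something along these lines should work. A hypothetical sbt snippet; it assumes the snapshot is published to the Sonatype OSS snapshots repository under the version 0.11.2-SNAPSHOT:

```scala
// Sketch for testing the unreleased Breeze snapshot (assumed coordinates).
resolvers += "Sonatype OSS Snapshots" at
  "https://oss.sonatype.org/content/repositories/snapshots"
libraryDependencies += "org.scalanlp" %% "breeze" % "0.11.2-SNAPSHOT"
```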




Re: SparkSQL 1.3.0 (RC3) failed to read parquet file generated by 1.1.1

2015-03-15 Thread Pei-Lun Lee
Thanks!

On Sat, Mar 14, 2015 at 3:31 AM, Michael Armbrust mich...@databricks.com
wrote:

 Here is the JIRA: https://issues.apache.org/jira/browse/SPARK-6315

 On Thu, Mar 12, 2015 at 11:00 PM, Michael Armbrust mich...@databricks.com
 
 wrote:

  We are looking at the issue and will likely fix it for Spark 1.3.1.
 
  On Thu, Mar 12, 2015 at 8:25 PM, giive chen thegi...@gmail.com wrote:
 
  Hi all
 
  My team has the same issue. It looks like Spark 1.3's Spark SQL cannot read
  Parquet files generated by Spark 1.1. It will cost a lot of migration work
  when we want to upgrade to Spark 1.3.

  Is there anyone who can help me?
 
 
  Thanks
 
  Wisely Chen
 
 
  On Tue, Mar 10, 2015 at 5:06 PM, Pei-Lun Lee pl...@appier.com wrote:
 
   Hi,
  
    I found that if I try to read a Parquet file generated by Spark 1.1.1 using
    1.3.0-rc3 with default settings, I get this error:
  
    com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'StructType': was expecting ('true', 'false' or 'null')
     at [Source: StructType(List(StructField(a,IntegerType,false))); line: 1, column: 11]
      at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1419)
      at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:508)
      at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._reportInvalidToken(ReaderBasedJsonParser.java:2300)
      at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue(ReaderBasedJsonParser.java:1459)
      at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:683)
      at com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:3105)
      at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3051)
      at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2161)
      at org.json4s.jackson.JsonMethods$class.parse(JsonMethods.scala:19)
      at org.json4s.jackson.JsonMethods$.parse(JsonMethods.scala:44)
      at org.apache.spark.sql.types.DataType$.fromJson(dataTypes.scala:41)
      at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$readSchema$1$$anonfun$25.apply(newParquet.scala:675)
      at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$readSchema$1$$anonfun$25.apply(newParquet.scala:675)
  
  
  
    This is how I saved the Parquet file with 1.1.1:

    sql("select 1 as a").saveAsParquetFile("/tmp/foo")

    And this is the metadata of the 1.1.1 Parquet file:

    creator: parquet-mr version 1.4.3
    extra:   org.apache.spark.sql.parquet.row.metadata =
    StructType(List(StructField(a,IntegerType,false)))
  
  
  
    By comparison, this is the 1.3.0 metadata:

    creator: parquet-mr version 1.6.0rc3
    extra:   org.apache.spark.sql.parquet.row.metadata =
    {"type":"struct","fields":[{"name":"a","type":"integer","nullable":t
    [more]...
  
  
  
    It looks like ParquetRelation2 is now used to load Parquet files by default,
    and it only recognizes the JSON schema format, but the 1.1.1 schema was in
    case-class string format.

    Setting spark.sql.parquet.useDataSourceApi to false will fix it, but I don't
    know the differences. Is this considered a bug? We have a lot of Parquet
    files from 1.1.1; should we disable the data source API in order to read
    them if we want to upgrade to 1.3?
  
   Thanks,
   --
   Pei-Lun
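
A minimal sketch of the fallback Pei-Lun describes above, assuming a Spark 1.3 SQLContext; the app name and path are placeholders, and this is an untested illustration rather than a recommended fix:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Read 1.1.x-era Parquet files by falling back to the old code path,
// as described in the message above.
val sc = new SparkContext(new SparkConf().setAppName("read-legacy-parquet"))
val sqlContext = new SQLContext(sc)
sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")
val df = sqlContext.parquetFile("/tmp/foo")  // path from the example above
df.printSchema()
```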
  
 
 
 



broadcast hang out

2015-03-15 Thread lonely Feb
Hi all, I have run into a problem where torrent broadcast hangs in my Spark
cluster (1.2, standalone), and it is particularly serious when the driver and
executors are in different regions. When I read the broadcast code I found a
synchronous block read here:

  def fetchBlockSync(host: String, port: Int, execId: String, blockId: String): ManagedBuffer = {
    // A monitor for the thread to wait on.
    val result = Promise[ManagedBuffer]()
    fetchBlocks(host, port, execId, Array(blockId),
      new BlockFetchingListener {
        override def onBlockFetchFailure(blockId: String, exception: Throwable): Unit = {
          result.failure(exception)
        }
        override def onBlockFetchSuccess(blockId: String, data: ManagedBuffer): Unit = {
          val ret = ByteBuffer.allocate(data.size.toInt)
          ret.put(data.nioByteBuffer())
          ret.flip()
          result.success(new NioManagedBuffer(ret))
        }
      })

    Await.result(result.future, Duration.Inf)
  }

It seems that the fetchBlockSync method does not have a timeout limit and can
wait forever. Can anybody show me how to control the timeout here?


Re: broadcast hang out

2015-03-15 Thread Chester Chen
Can you just replace Duration.Inf with a shorter duration? How about:

  import scala.concurrent.duration._
  import akka.util.Timeout
  val timeout = new Timeout(10 seconds)
  Await.result(result.future, timeout.duration)

  or

  import java.util.concurrent.TimeUnit
  import scala.concurrent.duration.FiniteDuration
  val timeout = new FiniteDuration(10, TimeUnit.SECONDS)
  Await.result(result.future, timeout)

  or simply

  import scala.concurrent.duration._
  Await.result(result.future, 10 seconds)
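
A slightly more self-contained sketch along these lines, which turns the hang into an explicit error; the helper name and the 10-second default are illustrative assumptions, not Spark code:

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

// Illustrative helper: bound the wait on the fetch result and fail loudly
// instead of blocking forever on Duration.Inf.
def awaitBlockFetch[T](result: Promise[T], timeout: FiniteDuration = 10.seconds): T =
  try {
    Await.result(result.future, timeout)
  } catch {
    case e: TimeoutException =>
      throw new RuntimeException(s"Block fetch did not complete within $timeout", e)
  }
```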






Re: broadcast hang out

2015-03-15 Thread lonely Feb
Can anyone help? Thanks a lot!
