[jira] [Created] (SPARK-2517) Remove as many compilation warning messages as possible
Reynold Xin created SPARK-2517: -- Summary: Remove as many compilation warning messages as possible Key: SPARK-2517 URL: https://issues.apache.org/jira/browse/SPARK-2517 Project: Spark Issue Type: Improvement Reporter: Reynold Xin
Some examples:
{code}
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/util/FileAppenderSuite.scala:138: abstract type ExpectedAppender is unchecked since it is eliminated by erasure
[warn] assert(appender.isInstanceOf[ExpectedAppender])
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/util/FileAppenderSuite.scala:143: abstract type ExpectedRollingPolicy is unchecked since it is eliminated by erasure
[warn] rollingPolicy.isInstanceOf[ExpectedRollingPolicy]
[warn] ^
{code}
{code}
[warn] /scratch/rxin/spark/streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala:386: method connect in class IOManager is deprecated: use the new implementation in package akka.io instead
[warn] override def preStart = IOManager(context.system).connect(new InetSocketAddress(port))
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:207: non-variable type argument String in type pattern Map[String,Any] is unchecked since it is eliminated by erasure
[warn] case (key: String, struct: Map[String, Any]) => {
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:238: non-variable type argument String in type pattern java.util.Map[String,Object] is unchecked since it is eliminated by erasure
[warn] case map: java.util.Map[String, Object] =>
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:243: non-variable type argument Object in type pattern java.util.List[Object] is unchecked since it is eliminated by erasure
[warn] case list: java.util.List[Object] =>
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:323: non-variable type argument String in type pattern Map[String,Any] is unchecked since it is eliminated by erasure
[warn] case value: Map[String, Any] => toJsonObjectString(value)
[warn] ^
[info] Compiling 2 Scala sources to /scratch/rxin/spark/repl/target/scala-2.10/test-classes...
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala:382: method mapWith in class RDD is deprecated: use mapPartitionsWithIndex
[warn] val randoms = ones.mapWith(
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala:400: method flatMapWith in class RDD is deprecated: use mapPartitionsWithIndex and flatMap
[warn] val randoms = ones.flatMapWith(
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala:421: method filterWith in class RDD is deprecated: use mapPartitionsWithIndex and filter
[warn] val sample = ints.filterWith(
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/serializer/ProactiveClosureSerializationSuite.scala:76: method mapWith in class RDD is deprecated: use mapPartitionsWithIndex
[warn] x.mapWith(x => x.toString)((x,y) => x + uc.op(y))
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/serializer/ProactiveClosureSerializationSuite.scala:82: method filterWith in class RDD is deprecated: use mapPartitionsWithIndex and filter
[warn] x.filterWith(x => x.toString)((x,y) => uc.pred(y))
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/util/VectorSuite.scala:29: class Vector in package util is deprecated: Use Vectors.dense from Spark's mllib.linalg package instead.
[warn] def verifyVector(vector: Vector, expectedLength: Int) = {
[warn] ^
[warn] one warning found
{code}
{code}
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:238: non-variable type argument String in type pattern java.util.Map[String,Object] is unchecked since it is eliminated by erasure
[warn] case map: java.util.Map[String, Object] =>
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:243: non-variable type argument Object in type pattern java.util.List[Object] is unchecked since it is eliminated by erasure
[warn] case list: java.util.List[Object] =>
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:323:
{code}
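A note on the first group: the unchecked warnings come from type tests that erasure makes meaningless at runtime. A minimal standalone Scala sketch (not Spark code) of the usual remedies — match on the erased shape, or carry a ClassTag when the runtime class actually matters:
{code}
import scala.reflect.ClassTag

object ErasureDemo {
  // Matching Map[String, Any] is unchecked: the type arguments are erased,
  // so only the raw shape Map[_, _] can really be tested at runtime.
  def describe(v: Any): String = v match {
    case m: Map[_, _] => s"a map with ${m.size} entries" // no unchecked warning
    case _            => "something else"
  }

  // For isInstanceOf checks against an abstract type (as in FileAppenderSuite),
  // a ClassTag keeps the runtime class available despite erasure.
  def isExpected[T: ClassTag](v: Any): Boolean =
    implicitly[ClassTag[T]].runtimeClass.isInstance(v)
}
{code}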
[jira] [Updated] (SPARK-2517) Remove as many compilation warning messages as possible
[ https://issues.apache.org/jira/browse/SPARK-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2517: --- Assignee: Yin Huai Remove as many compilation warning messages as possible --- Key: SPARK-2517 URL: https://issues.apache.org/jira/browse/SPARK-2517 Project: Spark Issue Type: Improvement Reporter: Reynold Xin Assignee: Yin Huai
Some examples:
{code}
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/util/FileAppenderSuite.scala:138: abstract type ExpectedAppender is unchecked since it is eliminated by erasure
[warn] assert(appender.isInstanceOf[ExpectedAppender])
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/util/FileAppenderSuite.scala:143: abstract type ExpectedRollingPolicy is unchecked since it is eliminated by erasure
[warn] rollingPolicy.isInstanceOf[ExpectedRollingPolicy]
[warn] ^
{code}
{code}
[warn] /scratch/rxin/spark/streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala:386: method connect in class IOManager is deprecated: use the new implementation in package akka.io instead
[warn] override def preStart = IOManager(context.system).connect(new InetSocketAddress(port))
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:207: non-variable type argument String in type pattern Map[String,Any] is unchecked since it is eliminated by erasure
[warn] case (key: String, struct: Map[String, Any]) => {
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:238: non-variable type argument String in type pattern java.util.Map[String,Object] is unchecked since it is eliminated by erasure
[warn] case map: java.util.Map[String, Object] =>
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:243: non-variable type argument Object in type pattern java.util.List[Object] is unchecked since it is eliminated by erasure
[warn] case list: java.util.List[Object] =>
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:323: non-variable type argument String in type pattern Map[String,Any] is unchecked since it is eliminated by erasure
[warn] case value: Map[String, Any] => toJsonObjectString(value)
[warn] ^
[info] Compiling 2 Scala sources to /scratch/rxin/spark/repl/target/scala-2.10/test-classes...
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala:382: method mapWith in class RDD is deprecated: use mapPartitionsWithIndex
[warn] val randoms = ones.mapWith(
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala:400: method flatMapWith in class RDD is deprecated: use mapPartitionsWithIndex and flatMap
[warn] val randoms = ones.flatMapWith(
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala:421: method filterWith in class RDD is deprecated: use mapPartitionsWithIndex and filter
[warn] val sample = ints.filterWith(
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/serializer/ProactiveClosureSerializationSuite.scala:76: method mapWith in class RDD is deprecated: use mapPartitionsWithIndex
[warn] x.mapWith(x => x.toString)((x,y) => x + uc.op(y))
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/serializer/ProactiveClosureSerializationSuite.scala:82: method filterWith in class RDD is deprecated: use mapPartitionsWithIndex and filter
[warn] x.filterWith(x => x.toString)((x,y) => uc.pred(y))
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/util/VectorSuite.scala:29: class Vector in package util is deprecated: Use Vectors.dense from Spark's mllib.linalg package instead.
[warn] def verifyVector(vector: Vector, expectedLength: Int) = {
[warn] ^
[warn] one warning found
{code}
{code}
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:238: non-variable type argument String in type pattern java.util.Map[String,Object] is unchecked since it is eliminated by erasure
[warn] case map: java.util.Map[String, Object] =>
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:243:
{code}
[jira] [Created] (SPARK-2518) Fix foldability of Substring expression.
Takuya Ueshin created SPARK-2518: Summary: Fix foldability of Substring expression. Key: SPARK-2518 URL: https://issues.apache.org/jira/browse/SPARK-2518 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin This is a follow-up of [#1428|https://github.com/apache/spark/pull/1428]. -- This message was sent by Atlassian JIRA (v6.2#6252)
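For background, Catalyst expressions carry a foldable flag that lets the optimizer evaluate constant subtrees at planning time. A hedged sketch of the shape of such a fix, on a simplified model (the real classes live in org.apache.spark.sql.catalyst.expressions; signatures here are illustrative):
{code}
trait Expression {
  def children: Seq[Expression]
  // An expression is foldable when it can be evaluated before execution.
  def foldable: Boolean = false
}

case class Literal(value: Any) extends Expression {
  def children = Nil
  override def foldable = true
}

case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {
  def children = Seq(str, pos, len)
  // Foldable exactly when all three children are, so e.g.
  // Substring(Literal("abcd"), Literal(0), Literal(2)) can be pre-computed.
  override def foldable = str.foldable && pos.foldable && len.foldable
}
{code}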
[jira] [Commented] (SPARK-2518) Fix foldability of Substring expression.
[ https://issues.apache.org/jira/browse/SPARK-2518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063212#comment-14063212 ] Takuya Ueshin commented on SPARK-2518: -- PRed: https://github.com/apache/spark/pull/1432 Fix foldability of Substring expression. Key: SPARK-2518 URL: https://issues.apache.org/jira/browse/SPARK-2518 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin This is a follow-up of [#1428|https://github.com/apache/spark/pull/1428]. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2518) Fix foldability of Substring expression.
[ https://issues.apache.org/jira/browse/SPARK-2518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2518: --- Assignee: Takuya Ueshin Fix foldability of Substring expression. Key: SPARK-2518 URL: https://issues.apache.org/jira/browse/SPARK-2518 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Assignee: Takuya Ueshin This is a follow-up of [#1428|https://github.com/apache/spark/pull/1428]. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2517) Remove as many compilation warning messages as possible
[ https://issues.apache.org/jira/browse/SPARK-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2517: --- Description: We should probably treat warnings as failures in Jenkins.
Some examples:
{code}
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/util/FileAppenderSuite.scala:138: abstract type ExpectedAppender is unchecked since it is eliminated by erasure
[warn] assert(appender.isInstanceOf[ExpectedAppender])
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/util/FileAppenderSuite.scala:143: abstract type ExpectedRollingPolicy is unchecked since it is eliminated by erasure
[warn] rollingPolicy.isInstanceOf[ExpectedRollingPolicy]
[warn] ^
{code}
{code}
[warn] /scratch/rxin/spark/streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala:386: method connect in class IOManager is deprecated: use the new implementation in package akka.io instead
[warn] override def preStart = IOManager(context.system).connect(new InetSocketAddress(port))
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:207: non-variable type argument String in type pattern Map[String,Any] is unchecked since it is eliminated by erasure
[warn] case (key: String, struct: Map[String, Any]) => {
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:238: non-variable type argument String in type pattern java.util.Map[String,Object] is unchecked since it is eliminated by erasure
[warn] case map: java.util.Map[String, Object] =>
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:243: non-variable type argument Object in type pattern java.util.List[Object] is unchecked since it is eliminated by erasure
[warn] case list: java.util.List[Object] =>
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:323: non-variable type argument String in type pattern Map[String,Any] is unchecked since it is eliminated by erasure
[warn] case value: Map[String, Any] => toJsonObjectString(value)
[warn] ^
[info] Compiling 2 Scala sources to /scratch/rxin/spark/repl/target/scala-2.10/test-classes...
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala:382: method mapWith in class RDD is deprecated: use mapPartitionsWithIndex
[warn] val randoms = ones.mapWith(
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala:400: method flatMapWith in class RDD is deprecated: use mapPartitionsWithIndex and flatMap
[warn] val randoms = ones.flatMapWith(
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala:421: method filterWith in class RDD is deprecated: use mapPartitionsWithIndex and filter
[warn] val sample = ints.filterWith(
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/serializer/ProactiveClosureSerializationSuite.scala:76: method mapWith in class RDD is deprecated: use mapPartitionsWithIndex
[warn] x.mapWith(x => x.toString)((x,y) => x + uc.op(y))
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/serializer/ProactiveClosureSerializationSuite.scala:82: method filterWith in class RDD is deprecated: use mapPartitionsWithIndex and filter
[warn] x.filterWith(x => x.toString)((x,y) => uc.pred(y))
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/util/VectorSuite.scala:29: class Vector in package util is deprecated: Use Vectors.dense from Spark's mllib.linalg package instead.
[warn] def verifyVector(vector: Vector, expectedLength: Int) = {
[warn] ^
[warn] one warning found
{code}
{code}
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:238: non-variable type argument String in type pattern java.util.Map[String,Object] is unchecked since it is eliminated by erasure
[warn] case map: java.util.Map[String, Object] =>
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:243: non-variable type argument Object in type pattern java.util.List[Object] is unchecked since it is eliminated by erasure
[warn] case list: java.util.List[Object] =>
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:323: non-variable type argument String in type pattern Map[String,Any] is unchecked since
{code}
[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support
[ https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063247#comment-14063247 ] Chris Fregly commented on SPARK-1981: - PR: https://github.com/apache/spark/pull/1434 Add AWS Kinesis streaming support - Key: SPARK-1981 URL: https://issues.apache.org/jira/browse/SPARK-1981 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Chris Fregly Assignee: Chris Fregly Add AWS Kinesis support to Spark Streaming. Initial discussion occurred here: https://github.com/apache/spark/pull/223 I discussed this with Parviz from AWS recently and we agreed that I would take this over. Look for a new PR that takes into account all the feedback from the earlier PR including spark-1.0-compliant implementation, AWS-license-aware build support, tests, comments, and style guide compliance. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1761) Add broadcast information on SparkUI storage tab
[ https://issues.apache.org/jira/browse/SPARK-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063250#comment-14063250 ] Reynold Xin commented on SPARK-1761: This would be very useful actually. Add broadcast information on SparkUI storage tab Key: SPARK-1761 URL: https://issues.apache.org/jira/browse/SPARK-1761 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Or Fix For: 1.1.0 It would be nice to know where the broadcast blocks are persisted. More details coming. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2274) spark SQL query hang up sometimes
[ https://issues.apache.org/jira/browse/SPARK-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2274: --- Component/s: SQL spark SQL query hang up sometimes - Key: SPARK-2274 URL: https://issues.apache.org/jira/browse/SPARK-2274 Project: Spark Issue Type: Question Components: SQL Environment: spark 1.0.0 Reporter: jackielihf
When I run a Spark SQL query, it hangs sometimes:
1) a simple SQL query works, such as select * from a left outer join b on a.id=b.id
2) BUT if it has more joins, such as select * from a left outer join b on a.id=b.id left outer join c on a.id=c.id..., the spark shell seems to hang up.
spark shell prints:
scala> hc.hql("select A.id,A.tit,A.sub_tit,B.abst,B.cont,A.aut,A.com_name,A.med_name,A.pub_dt,A.upd_time,A.ent_time,A.info_lvl,A.is_pic,A.lnk_addr,A.is_ann,A.info_open_lvl,A.keyw_name,C.typ_code as type,D.evt_dir,D.evt_st,E.secu_id,E.typ_codeii,E.exch_code,E.trd_code,E.secu_sht from txt_nws_bas_update A left outer join txt_nws_bas_txt B on A.id=B.orig_id left outer join txt_nws_typ C on A.id=C.orig_id left outer join txt_nws_secu D on A.id=D.orig_id left outer join bas_secu_info E on D.secu_id=E.secu_id where D.secu_id is not null limit 5").foreach(println)
14/06/25 13:32:25 INFO ParseDriver: Parsing command: select A.id,A.tit,A.sub_tit,B.abst,B.cont,A.aut,A.com_name,A.med_name,A.pub_dt,A.upd_time,A.ent_time,A.info_lvl,A.is_pic,A.lnk_addr,A.is_ann,A.info_open_lvl,A.keyw_name,C.typ_code as type,D.evt_dir,D.evt_st,E.secu_id,E.typ_codeii,E.exch_code,E.trd_code,E.secu_sht from txt_nws_bas_update A left outer join txt_nws_bas_txt B on A.id=B.orig_id left outer join txt_nws_typ C on A.id=C.orig_id left outer join txt_nws_secu D on A.id=D.orig_id left outer join bas_secu_info E on D.secu_id=E.secu_id where D.secu_id is not null limit 5
14/06/25 13:32:25 INFO ParseDriver: Parse Completed
14/06/25 13:32:25 INFO Analyzer: Max iterations (2) reached for batch MultiInstanceRelations
14/06/25 13:32:25 INFO Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences
14/06/25 13:32:27 INFO MemoryStore: ensureFreeSpace(220923) called with curMem=0, maxMem=311387750
14/06/25 13:32:27 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 215.7 KB, free 296.8 MB)
14/06/25 13:32:27 INFO MemoryStore: ensureFreeSpace(220971) called with curMem=220923, maxMem=311387750
14/06/25 13:32:27 INFO MemoryStore: Block broadcast_1 stored as values to memory (estimated size 215.8 KB, free 296.5 MB)
14/06/25 13:32:28 INFO MemoryStore: ensureFreeSpace(220971) called with curMem=441894, maxMem=311387750
14/06/25 13:32:28 INFO MemoryStore: Block broadcast_2 stored as values to memory (estimated size 215.8 KB, free 296.3 MB)
14/06/25 13:32:28 INFO MemoryStore: ensureFreeSpace(220971) called with curMem=662865, maxMem=311387750
14/06/25 13:32:28 INFO MemoryStore: Block broadcast_3 stored as values to memory (estimated size 215.8 KB, free 296.1 MB)
14/06/25 13:32:28 INFO MemoryStore: ensureFreeSpace(220971) called with curMem=883836, maxMem=311387750
14/06/25 13:32:28 INFO MemoryStore: Block broadcast_4 stored as values to memory (estimated size 215.8 KB, free 295.9 MB)
14/06/25 13:32:29 INFO SQLContext$$anon$1: Max iterations (2) reached for batch Add exchange
14/06/25 13:32:29 INFO SQLContext$$anon$1: Max iterations (2) reached for batch Prepare Expressions
14/06/25 13:32:30 INFO FileInputFormat: Total input paths to process : 1
14/06/25 13:32:30 INFO SparkContext: Starting job: collect at joins.scala:184
14/06/25 13:32:30 INFO DAGScheduler: Got job 0 (collect at joins.scala:184) with 2 output partitions (allowLocal=false)
14/06/25 13:32:30 INFO DAGScheduler: Final stage: Stage 0(collect at joins.scala:184)
14/06/25 13:32:30 INFO DAGScheduler: Parents of final stage: List()
14/06/25 13:32:30 INFO DAGScheduler: Missing parents: List()
14/06/25 13:32:30 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[7] at map at joins.scala:184), which has no missing parents
14/06/25 13:32:30 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[7] at map at joins.scala:184)
14/06/25 13:32:30 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/06/25 13:32:30 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor 0: 192.168.56.100 (PROCESS_LOCAL)
14/06/25 13:32:30 INFO TaskSetManager: Serialized task 0.0:0 as 4088 bytes in 3 ms
14/06/25 13:32:30 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on executor 1: 192.168.56.100 (PROCESS_LOCAL)
14/06/25 13:32:30 INFO TaskSetManager: Serialized task 0.0:1 as 4088 bytes in 2 ms
14/06/25 13:32:44 INFO BlockManagerInfo: Added taskresult_1 in memory on 192.168.56.100:47102 (size: 15.0 MB, free: 281.9 MB)
14/06/25
[jira] [Updated] (SPARK-2274) spark SQL query hang up sometimes
[ https://issues.apache.org/jira/browse/SPARK-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2274: --- Assignee: Michael Armbrust spark SQL query hang up sometimes - Key: SPARK-2274 URL: https://issues.apache.org/jira/browse/SPARK-2274 Project: Spark Issue Type: Question Components: SQL Environment: spark 1.0.0 Reporter: jackielihf Assignee: Michael Armbrust
When I run a Spark SQL query, it hangs sometimes:
1) a simple SQL query works, such as select * from a left outer join b on a.id=b.id
2) BUT if it has more joins, such as select * from a left outer join b on a.id=b.id left outer join c on a.id=c.id..., the spark shell seems to hang up.
spark shell prints:
scala> hc.hql("select A.id,A.tit,A.sub_tit,B.abst,B.cont,A.aut,A.com_name,A.med_name,A.pub_dt,A.upd_time,A.ent_time,A.info_lvl,A.is_pic,A.lnk_addr,A.is_ann,A.info_open_lvl,A.keyw_name,C.typ_code as type,D.evt_dir,D.evt_st,E.secu_id,E.typ_codeii,E.exch_code,E.trd_code,E.secu_sht from txt_nws_bas_update A left outer join txt_nws_bas_txt B on A.id=B.orig_id left outer join txt_nws_typ C on A.id=C.orig_id left outer join txt_nws_secu D on A.id=D.orig_id left outer join bas_secu_info E on D.secu_id=E.secu_id where D.secu_id is not null limit 5").foreach(println)
14/06/25 13:32:25 INFO ParseDriver: Parsing command: select A.id,A.tit,A.sub_tit,B.abst,B.cont,A.aut,A.com_name,A.med_name,A.pub_dt,A.upd_time,A.ent_time,A.info_lvl,A.is_pic,A.lnk_addr,A.is_ann,A.info_open_lvl,A.keyw_name,C.typ_code as type,D.evt_dir,D.evt_st,E.secu_id,E.typ_codeii,E.exch_code,E.trd_code,E.secu_sht from txt_nws_bas_update A left outer join txt_nws_bas_txt B on A.id=B.orig_id left outer join txt_nws_typ C on A.id=C.orig_id left outer join txt_nws_secu D on A.id=D.orig_id left outer join bas_secu_info E on D.secu_id=E.secu_id where D.secu_id is not null limit 5
14/06/25 13:32:25 INFO ParseDriver: Parse Completed
14/06/25 13:32:25 INFO Analyzer: Max iterations (2) reached for batch MultiInstanceRelations
14/06/25 13:32:25 INFO Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences
14/06/25 13:32:27 INFO MemoryStore: ensureFreeSpace(220923) called with curMem=0, maxMem=311387750
14/06/25 13:32:27 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 215.7 KB, free 296.8 MB)
14/06/25 13:32:27 INFO MemoryStore: ensureFreeSpace(220971) called with curMem=220923, maxMem=311387750
14/06/25 13:32:27 INFO MemoryStore: Block broadcast_1 stored as values to memory (estimated size 215.8 KB, free 296.5 MB)
14/06/25 13:32:28 INFO MemoryStore: ensureFreeSpace(220971) called with curMem=441894, maxMem=311387750
14/06/25 13:32:28 INFO MemoryStore: Block broadcast_2 stored as values to memory (estimated size 215.8 KB, free 296.3 MB)
14/06/25 13:32:28 INFO MemoryStore: ensureFreeSpace(220971) called with curMem=662865, maxMem=311387750
14/06/25 13:32:28 INFO MemoryStore: Block broadcast_3 stored as values to memory (estimated size 215.8 KB, free 296.1 MB)
14/06/25 13:32:28 INFO MemoryStore: ensureFreeSpace(220971) called with curMem=883836, maxMem=311387750
14/06/25 13:32:28 INFO MemoryStore: Block broadcast_4 stored as values to memory (estimated size 215.8 KB, free 295.9 MB)
14/06/25 13:32:29 INFO SQLContext$$anon$1: Max iterations (2) reached for batch Add exchange
14/06/25 13:32:29 INFO SQLContext$$anon$1: Max iterations (2) reached for batch Prepare Expressions
14/06/25 13:32:30 INFO FileInputFormat: Total input paths to process : 1
14/06/25 13:32:30 INFO SparkContext: Starting job: collect at joins.scala:184
14/06/25 13:32:30 INFO DAGScheduler: Got job 0 (collect at joins.scala:184) with 2 output partitions (allowLocal=false)
14/06/25 13:32:30 INFO DAGScheduler: Final stage: Stage 0(collect at joins.scala:184)
14/06/25 13:32:30 INFO DAGScheduler: Parents of final stage: List()
14/06/25 13:32:30 INFO DAGScheduler: Missing parents: List()
14/06/25 13:32:30 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[7] at map at joins.scala:184), which has no missing parents
14/06/25 13:32:30 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[7] at map at joins.scala:184)
14/06/25 13:32:30 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/06/25 13:32:30 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor 0: 192.168.56.100 (PROCESS_LOCAL)
14/06/25 13:32:30 INFO TaskSetManager: Serialized task 0.0:0 as 4088 bytes in 3 ms
14/06/25 13:32:30 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on executor 1: 192.168.56.100 (PROCESS_LOCAL)
14/06/25 13:32:30 INFO TaskSetManager: Serialized task 0.0:1 as 4088 bytes in 2 ms
14/06/25 13:32:44 INFO BlockManagerInfo: Added taskresult_1 in memory on
[jira] [Commented] (SPARK-2519) Eliminate pattern-matching on Tuple2 in performance-critical aggregation code
[ https://issues.apache.org/jira/browse/SPARK-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063251#comment-14063251 ] Sandy Ryza commented on SPARK-2519: --- https://github.com/apache/spark/pull/1435 Eliminate pattern-matching on Tuple2 in performance-critical aggregation code - Key: SPARK-2519 URL: https://issues.apache.org/jira/browse/SPARK-2519 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252)
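To illustrate what the sub-task targets: destructuring a Tuple2 per record in a hot loop can cost an extra closure call and match per element, depending on how scalac compiles it. A hedged before/after sketch (hypothetical loop, not the actual Aggregator code):
{code}
// Before: pattern match per record.
def sumByMatch(iter: Iterator[(String, Int)]): Int = {
  var total = 0
  iter.foreach { case (_, v) => total += v }
  total
}

// After: a while loop reading the fields directly keeps the hot path
// to plain _2 accessor calls.
def sumByAccessor(iter: Iterator[(String, Int)]): Int = {
  var total = 0
  while (iter.hasNext) {
    total += iter.next()._2
  }
  total
}
{code}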
[jira] [Updated] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.
[ https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2433: - Target Version/s: 0.9.2 In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug. Key: SPARK-2433 URL: https://issues.apache.org/jira/browse/SPARK-2433 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 0.9.1 Environment: Any Reporter: Rahul K Bhojwani Labels: easyfix, test Original Estimate: 1h Remaining Estimate: 1h
Don't have much experience with reporting errors. This is the first time. If something is not clear please feel free to contact me (details given below).
In the pyspark mllib library. Path : \spark-0.9.1\python\pyspark\mllib\classification.py
Class: NaiveBayesModel
Method: self.predict
Earlier Implementation:
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
New Implementation:
No:1
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
No:2
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + dot(x,self.theta.T))
Explanation: No:1 is correct according to me. Don't know about No:2.
Error one: The matrix self.theta is of dimension [n_classes, n_features], while the matrix x is of dimension [1, n_features]. Taking the dot will not work as it's [1, n_features] x [n_classes, n_features]. It will always give the error: ValueError: matrices are not aligned. In the commented example given in classification.py, n_classes = n_features = 2; that's why there is no error. Both Implementation no.1 and Implementation no.2 take care of it.
Error 2: The basic implementation of naive bayes is: P(class_n | sample) = count_feature_1 * P(feature_1 | class_n) * ... * count_feature_n * P(feature_n | class_n) * P(class_n) / (THE CONSTANT P(SAMPLE)), taking the class with max value. That's what implementation 1 is doing. Implementation 2 is basically the class with max value: (exp(count_feature_1) * P(feature_1 | class_n) * ... * exp(count_feature_n) * P(feature_n | class_n) * P(class_n)). Don't know if it gives the exact result.
Thanks
Rahul Bhojwani
rahulbhojwani2...@gmail.com
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2521) Broadcast RDD object once per TaskSet (instead of sending it for every task)
[ https://issues.apache.org/jira/browse/SPARK-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063256#comment-14063256 ] Reynold Xin commented on SPARK-2521: cc [~matei] Broadcast RDD object once per TaskSet (instead of sending it for every task) Key: SPARK-2521 URL: https://issues.apache.org/jira/browse/SPARK-2521 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin This can substantially reduce task size, as well as being able to support very large closures (e.g. closures that reference large variables). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2521) Broadcast RDD object once per TaskSet (instead of sending it for every task)
[ https://issues.apache.org/jira/browse/SPARK-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2521: --- Component/s: Spark Core Broadcast RDD object once per TaskSet (instead of sending it for every task) Key: SPARK-2521 URL: https://issues.apache.org/jira/browse/SPARK-2521 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin This can substantially reduce task size, as well as being able to support very large closures (e.g. closures that reference large variables). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2521) Broadcast RDD object once per TaskSet (instead of sending it for every task)
[ https://issues.apache.org/jira/browse/SPARK-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2521: --- Description: This can substantially reduce task size, as well as being able to support very large closures (e.g. closures that reference large variables). Once this is in, we can also remove broadcasting the Hadoop JobConf. was: This can substantially reduce task size, as well as being able to support very large closures (e.g. closures that reference large variables). Broadcast RDD object once per TaskSet (instead of sending it for every task) Key: SPARK-2521 URL: https://issues.apache.org/jira/browse/SPARK-2521 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin This can substantially reduce task size, as well as being able to support very large closures (e.g. closures that reference large variables). Once this is in, we can also remove broadcasting the Hadoop JobConf. -- This message was sent by Atlassian JIRA (v6.2#6252)
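The mechanism is the same one users already see with broadcast variables: ship a large object once per node and reference it by handle, instead of serializing it into every task closure. A minimal sketch, assuming a running SparkContext sc and a hypothetical loadBigTable() helper:
{code}
val lookup: Map[String, Int] = loadBigTable()   // hypothetical large object
val lookupBc = sc.broadcast(lookup)             // shipped once, not per task

val resolved = sc.textFile("hdfs:///input")
  .map(line => lookupBc.value.getOrElse(line, -1))  // tasks read via the handle
{code}
SPARK-2521 applies the same idea to the RDD object and closure themselves, broadcasting them once per TaskSet.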
[jira] [Commented] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.
[ https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063268#comment-14063268 ] Xiangrui Meng commented on SPARK-2433: -- [~rahul1993] Thanks for reporting this bug! It was fixed in branch-1.0 but not in branch-0.9. So you can send a PR similar to https://github.com/apache/spark/pull/463 to branch-0.9, following https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark . I'll try to cut a release candidate for v0.9.2 tomorrow. So if you don't have time to send the PR, I may have to do that myself. In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug. Key: SPARK-2433 URL: https://issues.apache.org/jira/browse/SPARK-2433 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 0.9.1 Environment: Any Reporter: Rahul K Bhojwani Labels: easyfix, test Original Estimate: 1h Remaining Estimate: 1h
Don't have much experience with reporting errors. This is the first time. If something is not clear please feel free to contact me (details given below).
In the pyspark mllib library. Path : \spark-0.9.1\python\pyspark\mllib\classification.py
Class: NaiveBayesModel
Method: self.predict
Earlier Implementation:
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
New Implementation:
No:1
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
No:2
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + dot(x,self.theta.T))
Explanation: No:1 is correct according to me. Don't know about No:2.
Error one: The matrix self.theta is of dimension [n_classes, n_features], while the matrix x is of dimension [1, n_features]. Taking the dot will not work as it's [1, n_features] x [n_classes, n_features]. It will always give the error: ValueError: matrices are not aligned. In the commented example given in classification.py, n_classes = n_features = 2; that's why there is no error. Both Implementation no.1 and Implementation no.2 take care of it.
Error 2: The basic implementation of naive bayes is: P(class_n | sample) = count_feature_1 * P(feature_1 | class_n) * ... * count_feature_n * P(feature_n | class_n) * P(class_n) / (THE CONSTANT P(SAMPLE)), taking the class with max value. That's what implementation 1 is doing. Implementation 2 is basically the class with max value: (exp(count_feature_1) * P(feature_1 | class_n) * ... * exp(count_feature_n) * P(feature_n | class_n) * P(class_n)). Don't know if it gives the exact result.
Thanks
Rahul Bhojwani
rahulbhojwani2...@gmail.com
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2522) Use TorrentBroadcastFactory as the default broadcast factory
Xiangrui Meng created SPARK-2522: Summary: Use TorrentBroadcastFactory as the default broadcast factory Key: SPARK-2522 URL: https://issues.apache.org/jira/browse/SPARK-2522 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Xiangrui Meng Assignee: Xiangrui Meng HttpBroadcastFactory is the current default broadcast factory. It sends the broadcast data to each worker one by one, which is slow when the cluster is big. TorrentBroadcastFactory scales much better than http. Maybe we should make torrent the default broadcast method. -- This message was sent by Atlassian JIRA (v6.2#6252)
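For anyone who wants torrent behavior before the default changes, the factory is selectable through SparkConf; a sketch using the Spark 1.x configuration key:
{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("torrent-broadcast-demo")
  // Replace the HTTP default with the BitTorrent-like implementation.
  .set("spark.broadcast.factory", "org.apache.spark.broadcast.TorrentBroadcastFactory")
val sc = new SparkContext(conf)
{code}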
[jira] [Created] (SPARK-2523) Potential Bugs if SerDe is not the identical among partitions and table
Cheng Hao created SPARK-2523: Summary: Potential Bugs if SerDe is not the identical among partitions and table Key: SPARK-2523 URL: https://issues.apache.org/jira/browse/SPARK-2523 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao In HiveTableScan.scala, ObjectInspector was created for all of the partition based records, which probably causes ClassCastException if the object inspector is not identical among table partitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
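The direction of the fix can be sketched as deriving the deserializer, and hence the ObjectInspector, per partition instead of once from the table. A hedged sketch against the Hive metadata API (the method shapes and surrounding plumbing are assumptions, not the actual HiveTableScan code):
{code}
import org.apache.hadoop.hive.ql.metadata.Partition
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector

// Each partition may carry its own SerDe, so the inspector must come from
// the partition's deserializer rather than the table's.
def inspectors(partitions: Seq[Partition]): Seq[ObjectInspector] =
  partitions.map(_.getDeserializer.getObjectInspector)
{code}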
[jira] [Commented] (SPARK-2523) Potential Bugs if SerDe is not the identical among partitions and table
[ https://issues.apache.org/jira/browse/SPARK-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063288#comment-14063288 ] Cheng Hao commented on SPARK-2523: -- This is the follow up for https://github.com/apache/spark/pull/1408 https://github.com/apache/spark/pull/1390 Potential Bugs if SerDe is not the identical among partitions and table --- Key: SPARK-2523 URL: https://issues.apache.org/jira/browse/SPARK-2523 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao In HiveTableScan.scala, ObjectInspector was created for all of the partition based records, which probably causes ClassCastException if the object inspector is not identical among table partitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2523) Potential Bugs if SerDe is not the identical among partitions and table
[ https://issues.apache.org/jira/browse/SPARK-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063288#comment-14063288 ] Cheng Hao edited comment on SPARK-2523 at 7/16/14 8:42 AM: --- This is the follow up for https://github.com/apache/spark/pull/1408 https://github.com/apache/spark/pull/1390 The PR is https://github.com/apache/spark/pull/1439 was (Author: chenghao): This is the follow up for https://github.com/apache/spark/pull/1408 https://github.com/apache/spark/pull/1390 Potential Bugs if SerDe is not the identical among partitions and table --- Key: SPARK-2523 URL: https://issues.apache.org/jira/browse/SPARK-2523 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao In HiveTableScan.scala, ObjectInspector was created for all of the partition based records, which probably causes ClassCastException if the object inspector is not identical among table partitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2523) Potential Bugs if SerDe is not the identical among partitions and table
[ https://issues.apache.org/jira/browse/SPARK-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063289#comment-14063289 ] Cheng Hao commented on SPARK-2523: -- [~yhuai] Can you review the code for me? Potential Bugs if SerDe is not the identical among partitions and table --- Key: SPARK-2523 URL: https://issues.apache.org/jira/browse/SPARK-2523 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao In HiveTableScan.scala, ObjectInspector was created for all of the partition based records, which probably causes ClassCastException if the object inspector is not identical among table partitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2520) the executor is thrown java.io.StreamCorruptedException
[ https://issues.apache.org/jira/browse/SPARK-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-2520: --- Description: This issue occurs with a very small probability. I can not reproduce it. The executor log:
{code}
14/07/15 21:54:50 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to spark@sanshan:34429
14/07/15 21:54:50 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to spark@sanshan:31934
14/07/15 21:54:50 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to spark@sanshan:30557
14/07/15 21:54:50 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to spark@sanshan:42606
14/07/15 21:54:50 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to spark@sanshan:37314
14/07/15 21:54:50 INFO scheduler.TaskSetManager: Starting task 0.0:166 as TID 4948 on executor 20: tuan221 (PROCESS_LOCAL)
14/07/15 21:54:50 INFO scheduler.TaskSetManager: Serialized task 0.0:166 as 3129 bytes in 1 ms
14/07/15 21:54:50 WARN scheduler.TaskSetManager: Lost TID 4868 (task 0.0:86)
14/07/15 21:54:50 WARN scheduler.TaskSetManager: Loss was due to java.io.StreamCorruptedException
java.io.StreamCorruptedException: invalid type code: AC
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1377)
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
  at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
  at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
  at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
  at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
  at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:87)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$3.apply(PairRDDFunctions.scala:101)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$3.apply(PairRDDFunctions.scala:100)
  at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
  at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
  at org.apache.spark.scheduler.Task.run(Task.scala:51)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:744)
14/07/15 21:54:50 INFO scheduler.TaskSetManager: Starting task 0.0:86 as TID 4949 on executor 20: tuan221 (PROCESS_LOCAL)
14/07/15 21:54:50 INFO scheduler.TaskSetManager: Serialized task 0.0:86 as 3129 bytes in 0 ms
14/07/15 21:54:50 WARN scheduler.TaskSetManager: Lost TID 4785 (task 0.0:3)
{code}
was: This issue occurs with a very small probability. I can not reproduce it. The executor log:
{code}
java.io.StreamCorruptedException: invalid type code: AC
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1377)
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
  at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
  at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
  at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
  at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
  at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:87)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$3.apply(PairRDDFunctions.scala:101)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$3.apply(PairRDDFunctions.scala:100)
  at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
  at
{code}
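As general background (not a confirmed diagnosis of this report): invalid type code: AC classically means the reader hit a second serialization-stream header (bytes 0xAC 0xED) mid-stream, which happens when more than one ObjectOutputStream was opened over the same underlying stream. A minimal standalone reproduction:
{code}
import java.io._

object StreamCorruptedDemo extends App {
  val bytes = new ByteArrayOutputStream()

  // Each ObjectOutputStream writes its own stream header (0xAC 0xED).
  new ObjectOutputStream(bytes).writeObject("first")
  new ObjectOutputStream(bytes).writeObject("second") // second header mid-stream

  val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
  println(in.readObject()) // "first"
  println(in.readObject()) // throws StreamCorruptedException: invalid type code: AC
}
{code}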
[jira] [Commented] (SPARK-2465) Use long as user / item ID for ALS
[ https://issues.apache.org/jira/browse/SPARK-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063344#comment-14063344 ] Sean Owen commented on SPARK-2465: -- Yeah that's a good separate point. My hunch is that serialization would remove much of the difference. I think this may need to be closed for now as there's not quite support; I'll comment separately on the PR. Use long as user / item ID for ALS -- Key: SPARK-2465 URL: https://issues.apache.org/jira/browse/SPARK-2465 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.1 Reporter: Sean Owen Priority: Minor Attachments: ALS using MEMORY_AND_DISK.png, ALS using MEMORY_AND_DISK_SER.png, Screen Shot 2014-07-13 at 8.49.40 PM.png I'd like to float this for consideration: use longs instead of ints for user and product IDs in the ALS implementation. The main reason is that identifiers are not generally numeric at all, and will be hashed to an integer. (This is a separate issue.) Hashing to 32 bits means collisions are likely after hundreds of thousands of users and items, which is not unrealistic. Hashing to 64 bits pushes this back to billions. It would also mean numeric IDs that happen to be larger than the largest int can be used directly as identifiers. On the downside of course: 8 bytes instead of 4 bytes of memory used per Rating. Thoughts? I will post a PR so as to show what the change would be. -- This message was sent by Atlassian JIRA (v6.2#6252)
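The collision claim is just the birthday bound, P(collision) ≈ 1 − exp(−n² / 2d) for n IDs hashed into a space of d values; a quick sketch to check the orders of magnitude:
{code}
object HashCollisionEstimate extends App {
  // Birthday approximation for n random IDs in a space of 2^bits values.
  def pCollision(n: Double, bits: Int): Double =
    1.0 - math.exp(-n * n / (2.0 * math.pow(2.0, bits)))

  println(pCollision(500000, 32)) // ~1.0: collisions near-certain in 32 bits
  println(pCollision(500000, 64)) // ~7e-9: negligible in 64 bits
}
{code}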
[jira] [Commented] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop
[ https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063350#comment-14063350 ] Kostiantyn Kudriavtsev commented on SPARK-2356: --- and the use case when I got this exception - I didn't touch hadoop at all. My code works only with local files, not HDFS! It was very strange to get stuck in this kind of issue. I believe it must be marked as critical and fixed asap! Exception: Could not locate executable null\bin\winutils.exe in the Hadoop --- Key: SPARK-2356 URL: https://issues.apache.org/jira/browse/SPARK-2356 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Kostiantyn Kudriavtsev
I'm trying to run some transformation on Spark; it works fine on a cluster (YARN, linux machines). However, when I try to run it on a local machine (Windows 7) under a unit test, I get errors (I don't use Hadoop, I read files from the local filesystem):
14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
  at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
  at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
  at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
  at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
  at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
  at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
  at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
  at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
  at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
  at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
  at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
  at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:97)
It happens because the Hadoop config is initialised each time a Spark context is created, regardless of whether Hadoop is required or not. I propose to add some special flag to indicate whether the Hadoop config is required (or to start this configuration manually). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop
[ https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kostiantyn Kudriavtsev updated SPARK-2356: -- Priority: Critical (was: Major) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop --- Key: SPARK-2356 URL: https://issues.apache.org/jira/browse/SPARK-2356 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Kostiantyn Kudriavtsev Priority: Critical
I'm trying to run some transformation on Spark; it works fine on a cluster (YARN, linux machines). However, when I try to run it on a local machine (Windows 7) under a unit test, I get errors (I don't use Hadoop, I read files from the local filesystem):
14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
  at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
  at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
  at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
  at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
  at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
  at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
  at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
  at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
  at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
  at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
  at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
  at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:97)
It happens because the Hadoop config is initialised each time a Spark context is created, regardless of whether Hadoop is required or not. I propose to add some special flag to indicate whether the Hadoop config is required (or to start this configuration manually). -- This message was sent by Atlassian JIRA (v6.2#6252)
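A common workaround, assuming the usual Hadoop client behavior (Shell consults the hadoop.home.dir system property, then HADOOP_HOME) rather than any Spark-side fix: point Hadoop at a directory containing bin\winutils.exe before the context is created. A sketch with a hypothetical local path:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// C:\hadoop is a placeholder; the directory must contain bin\winutils.exe.
System.setProperty("hadoop.home.dir", "C:\\hadoop")

val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("windows-local-test"))
{code}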
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063380#comment-14063380 ] Sean Owen commented on SPARK-2420: -- I think Jetty is the only actual issue here. Guava actually won't be an issue if it's set to 11 in Spark, or downstream. (Spark does not use anything in Guava 12+, or else it wouldn't work in some Hadoop contexts.) Servlet 3.0 really truly does work for all of this. The trick is actually removing all the other copies of Servlet 2.5! However, Jetty versions could be a real stumbling block. I'd like to focus on that. There are basically two namespaces that Jetty uses, from its older incarnation and newer versions. You have to harmonize both. What does Hive need vs what's in Spark? The reason I have some hope, Xuefu, is that obviously Spark already works in an assembly with Hive classes. However, it may not be quite the same version, and I know in one case hive-exec had to be shaded because it is not available as a non-assembly jar. There are some devils in the details. Having seen a lot of this first-hand here, I can try to help. Can this be elaborated with specific problems? Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because Spark build contains versions of libraries that's vastly different from current major Hadoop version. It would be nice if we can choose versions that's in line with Hadoop or shading them in the assembly. Here are the wish list: 1. Upgrade protobuf version to 2.5.0 from current 2.4.1 2. Shading Spark's jetty and servlet dependency in the assembly. 3. guava version difference. Spark is using a higher version. I'm not sure what's the best solution for this. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied on Spark in order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
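One way to express the Jetty/Guava relocation discussed above — hedged, since Spark's build of this era is Maven-based and this shows the later sbt-assembly API instead — is a pair of shade rules in build.sbt; the org.sparkproject target prefixes are illustrative:
{code}
assemblyShadeRules in assembly := Seq(
  // Relocate the conflicting packages inside the assembly jar.
  ShadeRule.rename("org.eclipse.jetty.**" -> "org.sparkproject.jetty.@1").inAll,
  ShadeRule.rename("com.google.common.**" -> "org.sparkproject.guava.@1").inAll
)
{code}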
[jira] [Commented] (SPARK-2454) Separate driver spark home from executor spark home
[ https://issues.apache.org/jira/browse/SPARK-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063402#comment-14063402 ] Nan Zhu commented on SPARK-2454: this will make sparkHome an application-specific parameter explicitly; I just thought it will confuse the user, since sparkHome is actually a global setup for all applications/executors run on the same machine. The good thing here is it can support users running applications on different versions of Spark sharing the same cluster (especially when you are doing Spark dev work). Separate driver spark home from executor spark home --- Key: SPARK-2454 URL: https://issues.apache.org/jira/browse/SPARK-2454 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 The driver may not always share the same directory structure as the executors. It makes little sense to always re-use the driver's spark home on the executors. https://github.com/apache/spark/pull/1244/ is an open effort to fix this. However, this still requires us to set SPARK_HOME on all the executor nodes. Really we should separate this out into something like spark.driver.home spark.executor.home rather than re-using SPARK_HOME everywhere. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063419#comment-14063419 ] Sean Owen commented on SPARK-2341: -- OK is it worth a pull request for changing the boolean multiclass argument to a string? I wanted to ask if that was your intent before I do that. libsvm format support is certainly important. It happens to have to encode non-numeric input as numbers. It need not be that way throughout MLlib, since it isn't that way in other input formats. (In this API method, it's pretty minor, since libsvm does by definition use this encoding.) So yes that would be great if data sets or API objects didn't assume that categorical data was numeric, but encoded type in the data set or even in the object model itself. I think it's mostly a design and type-safety argument -- same reason we have String instead of just byte[] everywhere. Sure I will have to build this conversion at some point anyway and can share the result then. loadLibSVMFile doesn't handle regression datasets - Key: SPARK-2341 URL: https://issues.apache.org/jira/browse/SPARK-2341 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Eustache Priority: Minor Labels: easyfix Many datasets exist in LibSVM format for regression tasks [1] but currently the loadLibSVMFile primitive doesn't handle regression datasets. More precisely, the LabelParser is either a MulticlassLabelParser or a BinaryLabelParser. What happens then is that the file is loaded but in multiclass mode: each target value is interpreted as a class name! The fix would be to write a RegressionLabelParser which converts target values to Double and plug it into the loadLibSVMFile routine. [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html -- This message was sent by Atlassian JIRA (v6.2#6252)
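The proposed fix is small; a hedged sketch modeled on the MLlib 1.0 label-parser hook (the trait shape is simplified here, not the exact MLlib source):
{code}
trait LabelParser extends Serializable {
  def parse(labelString: String): Double
}

// Binary parsing folds labels to {0, 1}; a regression parser must instead
// keep the raw numeric target.
object RegressionLabelParser extends LabelParser {
  def parse(labelString: String): Double = labelString.toDouble
}
{code}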
[jira] [Commented] (SPARK-2190) Specialized ColumnType for Timestamp
[ https://issues.apache.org/jira/browse/SPARK-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063431#comment-14063431 ] Cheng Lian commented on SPARK-2190: --- PR https://github.com/apache/spark/pull/1440 Specialized ColumnType for Timestamp Key: SPARK-2190 URL: https://issues.apache.org/jira/browse/SPARK-2190 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Critical I'm going to call this a bug since currently it's like 300X slower than it needs to be. -- This message was sent by Atlassian JIRA (v6.2#6252)
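To see why specialization helps: a generic column type boxes and serializes every value, while a specialized one writes a fixed binary layout. A hedged, self-contained sketch on a simplified trait (the real abstraction of the time lived under org.apache.spark.sql.columnar):
{code}
import java.nio.ByteBuffer
import java.sql.Timestamp

trait SimpleColumnType[T] {
  def append(v: T, buffer: ByteBuffer): Unit
  def extract(buffer: ByteBuffer): T
}

// millis + nanos: 12 fixed bytes per value, no generic serialization round-trip.
object TimestampColumnType extends SimpleColumnType[Timestamp] {
  def append(v: Timestamp, buffer: ByteBuffer): Unit = {
    buffer.putLong(v.getTime)
    buffer.putInt(v.getNanos)
  }
  def extract(buffer: ByteBuffer): Timestamp = {
    val ts = new Timestamp(buffer.getLong)
    ts.setNanos(buffer.getInt)
    ts
  }
}
{code}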
[jira] [Commented] (SPARK-953) Latent Dirichlet Association (LDA model)
[ https://issues.apache.org/jira/browse/SPARK-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063447#comment-14063447 ] Masaki Rikitoku commented on SPARK-953: --- Parallel Gibbs sampling for LDA (PLDA) may be usable. Latent Dirichlet Allocation (LDA model) Key: SPARK-953 URL: https://issues.apache.org/jira/browse/SPARK-953 Project: Spark Issue Type: Story Components: Examples Affects Versions: 0.7.3 Reporter: caizhua Priority: Critical This code is for learning the LDA model. However, with an input of 2.5 M documents per machine and a dictionary with 1 words, running on EC2 m2.4xlarge instances with 68 G of memory each, it is really, really slow. For five iterations, the time costs are 8145, 24725, 51688, 58674, and 56850 seconds. The shuffling is quite slow. LDA.tbl is the simulated data set for the program, and it is quite fast. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2313) PySpark should accept port via a command line argument rather than STDIN
[ https://issues.apache.org/jira/browse/SPARK-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063454#comment-14063454 ] Matthew Farrellee commented on SPARK-2313: -- As this stands, having another communication mechanism for py4j that can be controlled by the parent is the proper solution. Using something like a domain socket may also assist in the return path from py4j (tmp file). FYI, a recent change pushed all existing output to stderr in the spark-class/spark-submit path. I'm not actively working on this. PySpark should accept port via a command line argument rather than STDIN Key: SPARK-2313 URL: https://issues.apache.org/jira/browse/SPARK-2313 Project: Spark Issue Type: Bug Components: PySpark Reporter: Patrick Wendell Relying on stdin is a brittle mechanism and has broken several times in the past. From what I can tell this is used only to bootstrap worker.py one time. It would be strictly simpler to just pass it as a command line argument. -- This message was sent by Atlassian JIRA (v6.2#6252)
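A generic sketch of the two handoffs being contrasted, written from the parent (JVM) side; this is not Spark's actual launch code, and "worker.py" plus the port value are just placeholders:
{code}
// Proposal: the child reads the port from its command line at startup.
val port = 51234
val viaArgv = new ProcessBuilder("python", "worker.py", port.toString).start()

// Status quo the description calls brittle: write the port to the child's
// stdin after launch and hope nothing else interferes with the stream.
val viaStdin = new ProcessBuilder("python", "worker.py").start()
viaStdin.getOutputStream.write((port + "\n").getBytes("UTF-8"))
viaStdin.getOutputStream.flush()
{code}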
[jira] [Commented] (SPARK-2443) Reading from Partitioned Tables is Slow
[ https://issues.apache.org/jira/browse/SPARK-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063455#comment-14063455 ] Teng Qiu commented on SPARK-2443: - Hi, how can you access a parquet table using HiveContext / hql in Spark? I tried adding the parquet-hive jar or parquet-hive-bundle jar to driver-class-path and spark.executor.extraClassPath, but it doesn't work. CDH 5.0.2, Hive 0.12.0-cdh5. Thanks Reading from Partitioned Tables is Slow --- Key: SPARK-2443 URL: https://issues.apache.org/jira/browse/SPARK-2443 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Zongheng Yang Fix For: 1.1.0, 1.0.2 Here are some numbers; all queries return ~20 million:
{code}
SELECT COUNT(*) FROM non partitioned table                                5.496467726 s
SELECT COUNT(*) FROM partitioned table stored in parquet                  50.26947 s
SELECT COUNT(*) FROM same table as previous but loaded with parquetFile
  instead of through hive                                                 2 s
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2111) pyspark errors when SPARK_PRINT_LAUNCH_COMMAND=1
[ https://issues.apache.org/jira/browse/SPARK-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063457#comment-14063457 ] Matthew Farrellee commented on SPARK-2111: -- this was resolved by https://github.com/apache/spark/pull/1050 pyspark errors when SPARK_PRINT_LAUNCH_COMMAND=1 Key: SPARK-2111 URL: https://issues.apache.org/jira/browse/SPARK-2111 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Thomas Graves If you set SPARK_PRINT_LAUNCH_COMMAND=1 to see what java command is being used to launch spark and then try to run pyspark, it errors out with a very non-useful error message:
Traceback (most recent call last):
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/shell.py", line 43, in <module>
    sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/context.py", line 94, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway)
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/context.py", line 184, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway()
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/java_gateway.py", line 51, in launch_gateway
    gateway_port = int(proc.stdout.readline())
ValueError: invalid literal for int() with base 10: 'Spark Command: /home/gs/java/jdk/bin/java -cp :/home/gs/hadoop/current/share/hadoop/common/hadoop-gpl-compression.jar:/home/gs/hadoop/current/share/hadoop/hdfs/lib/YahooDNSToSwitchMapping-0.2.14020207'
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2111) pyspark errors when SPARK_PRINT_LAUNCH_COMMAND=1
[ https://issues.apache.org/jira/browse/SPARK-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063462#comment-14063462 ] Matthew Farrellee commented on SPARK-2111: -- [~pwendell] please close this issue as resolved pyspark errors when SPARK_PRINT_LAUNCH_COMMAND=1 Key: SPARK-2111 URL: https://issues.apache.org/jira/browse/SPARK-2111 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Thomas Graves If you set SPARK_PRINT_LAUNCH_COMMAND=1 to see what java command is being used to launch spark and then try to run pyspark, it errors out with a very non-useful error message:
Traceback (most recent call last):
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/shell.py", line 43, in <module>
    sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/context.py", line 94, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway)
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/context.py", line 184, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway()
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/java_gateway.py", line 51, in launch_gateway
    gateway_port = int(proc.stdout.readline())
ValueError: invalid literal for int() with base 10: 'Spark Command: /home/gs/java/jdk/bin/java -cp :/home/gs/hadoop/current/share/hadoop/common/hadoop-gpl-compression.jar:/home/gs/hadoop/current/share/hadoop/hdfs/lib/YahooDNSToSwitchMapping-0.2.14020207'
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2111) pyspark errors when SPARK_PRINT_LAUNCH_COMMAND=1
[ https://issues.apache.org/jira/browse/SPARK-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-2111. -- Resolution: Fixed Fix Version/s: 1.1.0 1.0.1 pyspark errors when SPARK_PRINT_LAUNCH_COMMAND=1 Key: SPARK-2111 URL: https://issues.apache.org/jira/browse/SPARK-2111 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Thomas Graves Fix For: 1.0.1, 1.1.0 If you set SPARK_PRINT_LAUNCH_COMMAND=1 to see what java command is being used to launch spark and then try to run pyspark, it errors out with a very non-useful error message:
Traceback (most recent call last):
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/shell.py", line 43, in <module>
    sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/context.py", line 94, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway)
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/context.py", line 184, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway()
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/java_gateway.py", line 51, in launch_gateway
    gateway_port = int(proc.stdout.readline())
ValueError: invalid literal for int() with base 10: 'Spark Command: /home/gs/java/jdk/bin/java -cp :/home/gs/hadoop/current/share/hadoop/common/hadoop-gpl-compression.jar:/home/gs/hadoop/current/share/hadoop/hdfs/lib/YahooDNSToSwitchMapping-0.2.14020207'
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2111) pyspark errors when SPARK_PRINT_LAUNCH_COMMAND=1
[ https://issues.apache.org/jira/browse/SPARK-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-2111: - Assignee: Prashant Sharma pyspark errors when SPARK_PRINT_LAUNCH_COMMAND=1 Key: SPARK-2111 URL: https://issues.apache.org/jira/browse/SPARK-2111 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Thomas Graves Assignee: Prashant Sharma Fix For: 1.0.1, 1.1.0 If you set SPARK_PRINT_LAUNCH_COMMAND=1 to see what java command is being used to launch spark and then try to run pyspark, it errors out with a very non-useful error message:
Traceback (most recent call last):
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/shell.py", line 43, in <module>
    sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/context.py", line 94, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway)
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/context.py", line 184, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway()
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/java_gateway.py", line 51, in launch_gateway
    gateway_port = int(proc.stdout.readline())
ValueError: invalid literal for int() with base 10: 'Spark Command: /home/gs/java/jdk/bin/java -cp :/home/gs/hadoop/current/share/hadoop/common/hadoop-gpl-compression.jar:/home/gs/hadoop/current/share/hadoop/hdfs/lib/YahooDNSToSwitchMapping-0.2.14020207'
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2443) Reading from Partitioned Tables is Slow
[ https://issues.apache.org/jira/browse/SPARK-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063530#comment-14063530 ] Michael Armbrust commented on SPARK-2443: - [~chutium], I would recommend using the native parquet interface if possible (explained in the programming guide: http://spark.apache.org/docs/latest/sql-programming-guide.html). That said, others have reported success using the hive serde for parquet as well. Please ask future questions on the user mailing list and not JIRA: https://spark.apache.org/community.html Reading from Partitioned Tables is Slow --- Key: SPARK-2443 URL: https://issues.apache.org/jira/browse/SPARK-2443 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Zongheng Yang Fix For: 1.1.0, 1.0.2 Here are some numbers; all queries return ~20 million:
{code}
SELECT COUNT(*) FROM non partitioned table                                5.496467726 s
SELECT COUNT(*) FROM partitioned table stored in parquet                  50.26947 s
SELECT COUNT(*) FROM same table as previous but loaded with parquetFile
  instead of through hive                                                 2 s
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] RJ Nowling updated SPARK-2308: -- Attachment: uneven_centers.pdf many_small_centers.pdf Add KMeans MiniBatch clustering algorithm to MLlib -- Key: SPARK-2308 URL: https://issues.apache.org/jira/browse/SPARK-2308 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Priority: Minor Attachments: many_small_centers.pdf, uneven_centers.pdf Mini-batch is a version of KMeans that uses a randomly-sampled subset of the data points in each iteration instead of the full set of data points, improving performance (and in some cases, accuracy). The mini-batch version is compatible with the KMeans|| initialization algorithm currently implemented in MLlib. I suggest adding KMeans Mini-batch as an alternative. I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063601#comment-14063601 ] RJ Nowling commented on SPARK-2308: --- I tested kmeans vs minibatch kmeans under 2 scenarios:
* 4 centers of 1000, 100, 10, and 1 data points
* 100 centers with 10 points each
The proposed centers were generated along a grid. The data points were generated by adding samples from N(0, 1.0) in each dimension to the centers. I found the expected centers by averaging the points generated from each proposed center. I ran KMeans and MiniBatch KMeans for each set of data points with 30 iterations and k-means++ initialization. I plotted the expected centers (blue), KMeans centers (red), and MiniBatch centers (green). The two methods showed similar results. They both struggled with the small clusters and ended up finding two centers for the large cluster, ignoring the single data point. For the 100 even clusters, both methods got most of the centers reasonably correct and, in a few cases, had 2 centers where there should be 1. I've attached the plots (many_small_centers.pdf, uneven_centers.pdf). In reviewing the scikit-learn implementation, I saw that they handle small clusters as special cases: one of the points in the cluster is randomly chosen as the center instead of computing the center as a running average of the sampled points. Add KMeans MiniBatch clustering algorithm to MLlib -- Key: SPARK-2308 URL: https://issues.apache.org/jira/browse/SPARK-2308 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Priority: Minor Attachments: many_small_centers.pdf, uneven_centers.pdf Mini-batch is a version of KMeans that uses a randomly-sampled subset of the data points in each iteration instead of the full set of data points, improving performance (and in some cases, accuracy). The mini-batch version is compatible with the KMeans|| initialization algorithm currently implemented in MLlib. I suggest adding KMeans Mini-batch as an alternative. I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.2#6252)
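To make the algorithm under discussion concrete, here is a self-contained sketch of one mini-batch iteration over dense points represented as Array[Double]; findClosest, the sampling call, and the averaging update are illustrative assumptions, not the proposed MLlib code:
{code}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

def findClosest(centers: Array[Array[Double]], p: Array[Double]): Int =
  centers.indices.minBy { i =>
    centers(i).zip(p).map { case (c, x) => (c - x) * (c - x) }.sum
  }

def miniBatchStep(
    data: RDD[Array[Double]],
    centers: Array[Array[Double]],
    fraction: Double,
    seed: Long): Array[Array[Double]] = {
  // Assign only a random subset of the points, then average per center.
  val sums = data.sample(withReplacement = false, fraction, seed)
    .map(p => (findClosest(centers, p), (p, 1L)))
    .reduceByKey { case ((s1, n1), (s2, n2)) =>
      (s1.zip(s2).map { case (a, b) => a + b }, n1 + n2)
    }
    .collectAsMap()
  // Centers that received no sampled points keep their old position.
  centers.indices.map { i =>
    sums.get(i).map { case (s, n) => s.map(_ / n) }.getOrElse(centers(i))
  }.toArray
}
{code}
The scikit-learn behavior mentioned above (reseeding small clusters from a randomly chosen point) would be special-case handling layered on top of this basic update.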
[jira] [Created] (SPARK-2525) Remove as many compilation warning messages as possible in Spark SQL
Yin Huai created SPARK-2525: --- Summary: Remove as many compilation warning messages as possible in Spark SQL Key: SPARK-2525 URL: https://issues.apache.org/jira/browse/SPARK-2525 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2525) Remove as many compilation warning messages as possible in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-2525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14063701#comment-14063701 ] Yin Huai commented on SPARK-2525: - Those deprecation warnings in Spark SQL are caused by using Hive's Serializer and Deserializer (these two are marked deprecated). We have to use them for now. We can resolve these warnings once SerDe interfaces in Hive have been cleaned up. Remove as many compilation warning messages as possible in Spark SQL Key: SPARK-2525 URL: https://issues.apache.org/jira/browse/SPARK-2525 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Reopened] (SPARK-2314) RDD actions are only overridden in Scala, not java or python
[ https://issues.apache.org/jira/browse/SPARK-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reopened SPARK-2314: - Yeah, I think we might need to do this in Python too, as at least take() is implemented manually in rdd.py. RDD actions are only overridden in Scala, not java or python Key: SPARK-2314 URL: https://issues.apache.org/jira/browse/SPARK-2314 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0, 1.0.1 Reporter: Michael Armbrust Assignee: Aaron Staple Labels: starter Fix For: 1.1.0, 1.0.2 For example, collect and take(). We should keep these two in sync, or move this code to schemaRDD if possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2526) Simplify make-distribution.sh
Patrick Wendell created SPARK-2526: -- Summary: Simplify make-distribution.sh Key: SPARK-2526 URL: https://issues.apache.org/jira/browse/SPARK-2526 Project: Spark Issue Type: Bug Components: Build Reporter: Patrick Wendell Assignee: Patrick Wendell There is some complexity in make-distribution.sh around selecting profiles. This is both annoying to maintain and limits the ways that packagers can use the script. For instance, it's not possible to build with separate HDFS and YARN versions, and supporting this with our current flags would get pretty complicated. We should just allow the user to pass a list of profiles directly to make-distribution.sh - the Maven build itself is already parameterized to support this. We also now have good docs explaining the use of profiles in the Maven build. All of this logic was more necessary when we used SBT for the package build, but we haven't done that for several versions. -- This message was sent by Atlassian JIRA (v6.2#6252)
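As an illustration of the proposed interface, a hypothetical invocation once options pass straight through (the -P/-D flags are Maven's own; the exact set of script flags is whatever the change settles on):
{code}
./make-distribution.sh --tgz -Phadoop-2.3 -Pyarn -Dhadoop.version=2.3.0
{code}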
[jira] [Updated] (SPARK-2526) Simplify make-distribution.sh to just pass through Maven options
[ https://issues.apache.org/jira/browse/SPARK-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2526: --- Summary: Simplify make-distribution.sh to just pass through Maven options (was: Simplify make-distribution.sh) Simplify make-distribution.sh to just pass through Maven options Key: SPARK-2526 URL: https://issues.apache.org/jira/browse/SPARK-2526 Project: Spark Issue Type: Bug Components: Build Reporter: Patrick Wendell Assignee: Patrick Wendell There is some complexity in make-distribution.sh around selecting profiles. This is both annoying to maintain and limits the ways that packagers can use the script. For instance, it's not possible to build with separate HDFS and YARN versions, and supporting this with our current flags would get pretty complicated. We should just allow the user to pass a list of profiles directly to make-distribution.sh - the Maven build itself is already parameterized to support this. We also now have good docs explaining the use of profiles in the Maven build. All of this logic was more necessary when we used SBT for the package build, but we haven't done that for several versions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2523) Potential Bugs if SerDe is not identical among partitions and table
[ https://issues.apache.org/jira/browse/SPARK-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063753#comment-14063753 ] Yin Huai commented on SPARK-2523: - Yeah, no problem. Can you add a case that triggers the bug to the description? Potential Bugs if SerDe is not identical among partitions and table --- Key: SPARK-2523 URL: https://issues.apache.org/jira/browse/SPARK-2523 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao In HiveTableScan.scala, a single ObjectInspector is created for all of the partition-based records, which probably causes a ClassCastException if the object inspectors are not identical among table partitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2495) Ability to re-create ML models
[ https://issues.apache.org/jira/browse/SPARK-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063762#comment-14063762 ] Alexander Albul commented on SPARK-2495: Yes, I can work on it, but first I need to understand the reason for making the constructors *private*. I can propose an approach with a utility object that we can use to create different models: ModelLoader.loadLogisticRegression(weights: Vector, intercept: Double): LogisticRegressionModel Alternatively, we could put a load method into *LogisticRegressionWithSGD*, for example, but I do not like this approach because we can load models that were trained without SGD as well, so it is not directly related. But first of all, if they are private by mistake, we can just open the constructors. WDYT? Ability to re-create ML models -- Key: SPARK-2495 URL: https://issues.apache.org/jira/browse/SPARK-2495 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.1 Reporter: Alexander Albul Hi everyone. Previously (prior to Spark 1.0) we were working with MLlib like this: 1) Calculate the model (a costly operation). 2) Take the model and collect its fields, like weights, intercept, etc. 3) Store the model somewhere in our own format. 4) Do predictions by loading the model attributes, creating a new model, and predicting using it. Now I see that the models' constructors have the *private* modifier, so models cannot be created from outside. If you want to hide implementation details and keep these constructors as developer API, why not at least create a method which takes weights and an intercept (for example) and materializes the model? A good example of the kind of model I am talking about is *LinearRegressionModel*. I know that the *LinearRegressionWithSGD* class has a *createModel* method, but the problem is that it has the *protected* modifier as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
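A sketch of the proposed utility object. Note it only compiles if the model constructors are made accessible (for example private[mllib] with this object in the same package), which is exactly the open question in the comment:
{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionModel

object ModelLoader {
  // Re-create a model from externally stored attributes.
  def loadLogisticRegression(weights: Vector, intercept: Double): LogisticRegressionModel =
    new LogisticRegressionModel(weights, intercept)

  def loadLinearRegression(weights: Vector, intercept: Double): LinearRegressionModel =
    new LinearRegressionModel(weights, intercept)
}
{code}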
[jira] [Comment Edited] (SPARK-2495) Ability to re-create ML models
[ https://issues.apache.org/jira/browse/SPARK-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063762#comment-14063762 ] Alexander Albul edited comment on SPARK-2495 at 7/16/14 5:34 PM: - Yes, I can work on it, but first I need to understand the reason for making the constructors *private*. I can propose an approach with a utility object that we can use to create different models: *ModelLoader.loadLogisticRegression(weights: Vector, intercept: Double): LogisticRegressionModel* Alternatively, we could put a *load* method into *LogisticRegressionWithSGD*, for example, but I do not like this approach because we can load models that were trained with a different approach (not SGD), so it is not directly related. But first of all, if they are private by mistake, we can just open the constructors. WDYT? was (Author: gorenuru): Yes, I can work on it, but first I need to understand the reason for making the constructors *private*. I can propose an approach with a utility object that we can use to create different models: *ModelLoader.loadLogisticRegression(weights: Vector, intercept: Double): LogisticRegressionModel* Alternatively, we could put a *load* method into *LogisticRegressionWithSGD*, for example, but I do not like this approach because we can load models that were trained without SGD as well, so it is not directly related. But first of all, if they are private by mistake, we can just open the constructors. WDYT? Ability to re-create ML models -- Key: SPARK-2495 URL: https://issues.apache.org/jira/browse/SPARK-2495 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.1 Reporter: Alexander Albul Hi everyone. Previously (prior to Spark 1.0) we were working with MLlib like this: 1) Calculate the model (a costly operation). 2) Take the model and collect its fields, like weights, intercept, etc. 3) Store the model somewhere in our own format. 4) Do predictions by loading the model attributes, creating a new model, and predicting using it. Now I see that the models' constructors have the *private* modifier, so models cannot be created from outside. If you want to hide implementation details and keep these constructors as developer API, why not at least create a method which takes weights and an intercept (for example) and materializes the model? A good example of the kind of model I am talking about is *LinearRegressionModel*. I know that the *LinearRegressionWithSGD* class has a *createModel* method, but the problem is that it has the *protected* modifier as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2527) incorrect persistence level shown in Spark UI after repersisting
Diana Carroll created SPARK-2527: Summary: incorrect persistence level shown in Spark UI after repersisting Key: SPARK-2527 URL: https://issues.apache.org/jira/browse/SPARK-2527 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.0 Reporter: Diana Carroll If I persist an RDD at one level, unpersist it, then repersist it at another level, the UI will continue to show the RDD at the first level...but correctly show individual partitions at the second level.
{code}
import org.apache.spark.api.java.StorageLevels._

val test1 = sc.parallelize(Array(1,2,3))
test1.persist(StorageLevels.DISK_ONLY)
test1.count()
test1.unpersist()
test1.persist(StorageLevels.MEMORY_ONLY)
test1.count()
{code}
After the first call to persist and count, the Spark App web UI shows:
RDD Storage Info for 14
Storage Level: Disk Serialized 1x Replicated
rdd_14_0: Disk Serialized 1x Replicated
After the second call, it shows:
RDD Storage Info for 14
Storage Level: Disk Serialized 1x Replicated
rdd_14_0: Memory Deserialized 1x Replicated
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2528) spark-ec2 security group permissions are too open
Nicholas Chammas created SPARK-2528: --- Summary: spark-ec2 security group permissions are too open Key: SPARK-2528 URL: https://issues.apache.org/jira/browse/SPARK-2528 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.0.0 Reporter: Nicholas Chammas Priority: Minor {{spark-ec2}} configures EC2 security groups with ports [open to the world | https://github.com/apache/spark/blob/master/ec2/spark_ec2.py#L280]. This is an unnecessary security risk, even for a short-lived cluster. Wherever possible, it would be better if, when launching a new cluster, {{spark-ec2}} detects the host's external IP address (e.g. via {{icanhazip.com}}) and grants access specifically to that IP address. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2527) incorrect persistence level shown in Spark UI after repersisting
[ https://issues.apache.org/jira/browse/SPARK-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Diana Carroll updated SPARK-2527: - Description: If I persist an RDD at one level, unpersist it, then repersist it at another level, the UI will continue to show the RDD at the first level...but correctly show individual partitions at the second level.
{code}
import org.apache.spark.api.java.StorageLevels
import org.apache.spark.api.java.StorageLevels._

val test1 = sc.parallelize(Array(1,2,3))
test1.persist(StorageLevels.DISK_ONLY)
test1.count()
test1.unpersist()
test1.persist(StorageLevels.MEMORY_ONLY)
test1.count()
{code}
After the first call to persist and count, the Spark App web UI shows:
RDD Storage Info for 14
Storage Level: Disk Serialized 1x Replicated
rdd_14_0: Disk Serialized 1x Replicated
After the second call, it shows:
RDD Storage Info for 14
Storage Level: Disk Serialized 1x Replicated
rdd_14_0: Memory Deserialized 1x Replicated
was: If I persist an RDD at one level, unpersist it, then repersist it at another level, the UI will continue to show the RDD at the first level...but correctly show individual partitions at the second level.
{code}
import org.apache.spark.api.java.StorageLevels._

val test1 = sc.parallelize(Array(1,2,3))
test1.persist(StorageLevels.DISK_ONLY)
test1.count()
test1.unpersist()
test1.persist(StorageLevels.MEMORY_ONLY)
test1.count()
{code}
After the first call to persist and count, the Spark App web UI shows:
RDD Storage Info for 14
Storage Level: Disk Serialized 1x Replicated
rdd_14_0: Disk Serialized 1x Replicated
After the second call, it shows:
RDD Storage Info for 14
Storage Level: Disk Serialized 1x Replicated
rdd_14_0: Memory Deserialized 1x Replicated
incorrect persistence level shown in Spark UI after repersisting Key: SPARK-2527 URL: https://issues.apache.org/jira/browse/SPARK-2527 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.0 Reporter: Diana Carroll Attachments: persistbug1.png, persistbug2.png If I persist an RDD at one level, unpersist it, then repersist it at another level, the UI will continue to show the RDD at the first level...but correctly show individual partitions at the second level.
{code}
import org.apache.spark.api.java.StorageLevels
import org.apache.spark.api.java.StorageLevels._

val test1 = sc.parallelize(Array(1,2,3))
test1.persist(StorageLevels.DISK_ONLY)
test1.count()
test1.unpersist()
test1.persist(StorageLevels.MEMORY_ONLY)
test1.count()
{code}
After the first call to persist and count, the Spark App web UI shows:
RDD Storage Info for 14
Storage Level: Disk Serialized 1x Replicated
rdd_14_0: Disk Serialized 1x Replicated
After the second call, it shows:
RDD Storage Info for 14
Storage Level: Disk Serialized 1x Replicated
rdd_14_0: Memory Deserialized 1x Replicated
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2495) Ability to re-create ML models
[ https://issues.apache.org/jira/browse/SPARK-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063762#comment-14063762 ] Alexander Albul edited comment on SPARK-2495 at 7/16/14 5:33 PM: - Yes, I can work on it, but first I need to understand the reason for making the constructors *private*. I can propose an approach with a utility object that we can use to create different models: *ModelLoader.loadLogisticRegression(weights: Vector, intercept: Double): LogisticRegressionModel* Alternatively, we could put a *load* method into *LogisticRegressionWithSGD*, for example, but I do not like this approach because we can load models that were trained without SGD as well, so it is not directly related. But first of all, if they are private by mistake, we can just open the constructors. WDYT? was (Author: gorenuru): Yes, I can work on it, but first I need to understand the reason for making the constructors *private*. I can propose an approach with a utility object that we can use to create different models: ModelLoader.loadLogisticRegression(weights: Vector, intercept: Double): LogisticRegressionModel Alternatively, we could put a load method into *LogisticRegressionWithSGD*, for example, but I do not like this approach because we can load models that were trained without SGD as well, so it is not directly related. But first of all, if they are private by mistake, we can just open the constructors. WDYT? Ability to re-create ML models -- Key: SPARK-2495 URL: https://issues.apache.org/jira/browse/SPARK-2495 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.1 Reporter: Alexander Albul Hi everyone. Previously (prior to Spark 1.0) we were working with MLlib like this: 1) Calculate the model (a costly operation). 2) Take the model and collect its fields, like weights, intercept, etc. 3) Store the model somewhere in our own format. 4) Do predictions by loading the model attributes, creating a new model, and predicting using it. Now I see that the models' constructors have the *private* modifier, so models cannot be created from outside. If you want to hide implementation details and keep these constructors as developer API, why not at least create a method which takes weights and an intercept (for example) and materializes the model? A good example of the kind of model I am talking about is *LinearRegressionModel*. I know that the *LinearRegressionWithSGD* class has a *createModel* method, but the problem is that it has the *protected* modifier as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2519) Eliminate pattern-matching on Tuple2 in performance-critical aggregation code
[ https://issues.apache.org/jira/browse/SPARK-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14063833#comment-14063833 ] Reynold Xin commented on SPARK-2519: Does this pull request fix all of them? Eliminate pattern-matching on Tuple2 in performance-critical aggregation code - Key: SPARK-2519 URL: https://issues.apache.org/jira/browse/SPARK-2519 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Reporter: Sandy Ryza Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252)
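For context, an illustrative before/after of the kind of change this subtask covers (not the actual Spark diff): destructuring a pair in a hot loop goes through a pattern match per element, while direct _1/_2 access does not.
{code}
def sumValuesMatch(iter: Iterator[(String, Int)]): Long = {
  var total = 0L
  iter.foreach { case (_, v) => total += v } // pattern match per element
  total
}

def sumValuesDirect(iter: Iterator[(String, Int)]): Long = {
  var total = 0L
  while (iter.hasNext) {
    total += iter.next()._2 // direct field access, no match
  }
  total
}
{code}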
[jira] [Updated] (SPARK-2269) Clean up and add unit tests for resourceOffers in MesosSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2269: --- Assignee: Tim Chen Clean up and add unit tests for resourceOffers in MesosSchedulerBackend --- Key: SPARK-2269 URL: https://issues.apache.org/jira/browse/SPARK-2269 Project: Spark Issue Type: Bug Components: Mesos Reporter: Patrick Wendell Assignee: Tim Chen This function could be simplified a bit. We could re-write it without offerableIndices or creating the mesosTasks array as large as the offer list. There is a lot of logic around making sure you get the correct index into mesosTasks and offers; really, we should just build mesosTasks directly from the offers we get back. To associate the tasks we are launching with the offers, we can just create a HashMap from the slaveId to the original offer. The basic logic of the function is that you take the mesos offers, convert them to spark offers, then convert the results back. One reason I think it might be designed as it is now is to deal with the case where Mesos gives multiple offers for a single slave. I checked directly with the Mesos team and they said this won't ever happen; you'll get at most one offer per Mesos slave within a set of offers. -- This message was sent by Atlassian JIRA (v6.2#6252)
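A sketch of the HashMap-based association suggested above; the Offer accessors are from the Mesos Java API, everything else is illustrative:
{code}
import org.apache.mesos.Protos.Offer
import scala.collection.mutable

def indexOffers(offers: Seq[Offer]): mutable.HashMap[String, Offer] = {
  val bySlave = new mutable.HashMap[String, Offer]
  for (o <- offers) {
    bySlave(o.getSlaveId.getValue) = o // at most one offer per slave
  }
  bySlave
}
{code}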
[jira] [Resolved] (SPARK-2522) Use TorrentBroadcastFactory as the default broadcast factory
[ https://issues.apache.org/jira/browse/SPARK-2522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2522. Resolution: Fixed Fix Version/s: 1.1.0 Use TorrentBroadcastFactory as the default broadcast factory Key: SPARK-2522 URL: https://issues.apache.org/jira/browse/SPARK-2522 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.1.0 HttpBroadcastFactory is the current default broadcast factory. It sends the broadcast data to each worker one by one, which is slow when the cluster is big. TorrentBroadcastFactory scales much better than HTTP. Maybe we should make torrent the default broadcast method. -- This message was sent by Atlassian JIRA (v6.2#6252)
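For reference, a user can already opt in without waiting for the default to change; a minimal sketch using the documented 1.x configuration key:
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("torrent-broadcast-example")
  .set("spark.broadcast.factory",
       "org.apache.spark.broadcast.TorrentBroadcastFactory")
{code}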
[jira] [Resolved] (SPARK-2317) Improve task logging
[ https://issues.apache.org/jira/browse/SPARK-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2317. Resolution: Fixed Fix Version/s: 1.1.0 Improve task logging Key: SPARK-2317 URL: https://issues.apache.org/jira/browse/SPARK-2317 Project: Spark Issue Type: Improvement Affects Versions: 1.0.0, 1.0.1 Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.1.0 We use the TID to identify tasks in logging. However, the TID itself does not capture the stage or retries, making log messages harder to correlate with the application itself. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2463) Creating multiple StreamingContexts from shell generates duplicate Streaming tabs in UI
[ https://issues.apache.org/jira/browse/SPARK-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14063944#comment-14063944 ] Andrew Or commented on SPARK-2463: -- It probably won't be too much work, because the functionality of detaching a tab is already there (just not used). The trickier bits are probably not the UI code but in the streaming code, where we have to make sure that a SparkContext only has one StreamingContext at any given time. We can maintain a static map to track this, but there may be a few gotchas in sharing the same SparkContext across multiple StreamingContexts (e.g. we have to make sure that blocks cached by the first ssc cannot be read by the second ssc). Creating multiple StreamingContexts from shell generates duplicate Streaming tabs in UI --- Key: SPARK-2463 URL: https://issues.apache.org/jira/browse/SPARK-2463 Project: Spark Issue Type: Bug Components: Streaming, Web UI Affects Versions: 1.0.1 Reporter: Nicholas Chammas Start a {{StreamingContext}} from the interactive shell and then stop it. Go to {{http://master_url:4040/streaming/}} and you will see a tab in the UI for Streaming. Now from the same shell, create and start a new {{StreamingContext}}. There will now be a duplicate tab for Streaming in the UI. Repeating this process generates additional Streaming tabs. They all link to the same information. -- This message was sent by Atlassian JIRA (v6.2#6252)
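A minimal sketch of the "static map" idea from the comment, assuming the bookkeeping lives in a helper object rather than in StreamingContext itself:
{code}
import org.apache.spark.SparkContext
import org.apache.spark.streaming.StreamingContext
import scala.collection.mutable

object ActiveStreamingContexts {
  // At most one live StreamingContext per SparkContext.
  private val active = new mutable.HashMap[SparkContext, StreamingContext]

  def register(sc: SparkContext, ssc: StreamingContext): Unit = synchronized {
    require(!active.contains(sc),
      "This SparkContext already has an active StreamingContext")
    active(sc) = ssc
  }

  def unregister(sc: SparkContext): Unit = synchronized { active -= sc }
}
{code}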
[jira] [Created] (SPARK-2530) Relax incorrect assumption of one ExternalAppendOnlyMap per thread
Andrew Or created SPARK-2530: Summary: Relax incorrect assumption of one ExternalAppendOnlyMap per thread Key: SPARK-2530 URL: https://issues.apache.org/jira/browse/SPARK-2530 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Originally reported by Matei. Our current implementation of EAOM assumes only one map is created per task. This is not true in the following case, however: {code} rdd1.join(rdd2).reduceByKey(...) {code} This is because reduceByKey does a map-side combine, which creates an EAOM that streams from an EAOM previously created by the same thread, in order to aggregate values from the join. The more concerning thing is the following: we currently maintain a global shuffle memory map (thread ID -> memory used by that thread for shuffles). If we create two EAOMs in the same thread, the memory occupied by the first map may be clobbered by that occupied by the second. This has very adverse consequences if the first map is huge but the second is just starting out, in which case we end up believing that we use much less memory than we actually do. -- This message was sent by Atlassian JIRA (v6.2#6252)
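A toy illustration of the clobbering described above (the real map lives inside Spark's shuffle code; this just shows why last-writer-wins per thread ID loses information):
{code}
import scala.collection.mutable

val shuffleMemoryMap = new mutable.HashMap[Long, Long]
val tid = Thread.currentThread().getId

shuffleMemoryMap(tid) = 512L * 1024 * 1024 // first (huge) map's usage
shuffleMemoryMap(tid) = 1L * 1024 * 1024   // second map starts, clobbers it
// The thread now appears to use 1 MB for shuffles instead of 513 MB.
{code}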
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063998#comment-14063998 ] Xuefu Zhang commented on SPARK-2420: Thanks for your comments, [~srowen]. I mostly agree with your assessment. 1. Guava is indeed a problem, and the problem is confirmed. The solution also seems simple: if it can be set to 11 in Spark, that solves the problem we have. 2. For jetty, it was a problem with the Hive on Spark POC, possibly because we shipped all libraries from the Hive process's classpath to the Spark cluster. We have a task (HIVE-7371) to identify a minimum set of jars to be shipped. With that, the story might change. We will confirm whether Jetty is a problem once we have a better idea on HIVE-7371. Also, we will check if the problem, if it exists, can be fixed on the Hive side first. If not, we'd like to get help from Spark and also provide a reason why. Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because the Spark build contains versions of libraries that are vastly different from the current major Hadoop version. It would be nice if we could choose versions that are in line with Hadoop, or shade them in the assembly. Here is the wish list: 1. Upgrade protobuf version to 2.5.0 from the current 2.4.1. 2. Shade Spark's jetty and servlet dependencies in the assembly. 3. The guava version difference: Spark is using a higher version. I'm not sure what the best solution for this is. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied on Spark in order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064000#comment-14064000 ] Reynold Xin commented on SPARK-2420: Thanks for looking into this. Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because the Spark build contains versions of libraries that are vastly different from the current major Hadoop version. It would be nice if we could choose versions that are in line with Hadoop, or shade them in the assembly. Here is the wish list: 1. Upgrade protobuf version to 2.5.0 from the current 2.4.1. 2. Shade Spark's jetty and servlet dependencies in the assembly. 3. The guava version difference: Spark is using a higher version. I'm not sure what the best solution for this is. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied on Spark in order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Issue Comment Deleted] (SPARK-1215) Clustering: Index out of bounds error
[ https://issues.apache.org/jira/browse/SPARK-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-1215: - Comment: was deleted (was: Just to let you know, I'll give the go-ahead for this tomorrow. Based on Patrick's recommendation, I'm blocking on this JIRA until tomorrow to give Prashant ( https://issues.apache.org/jira/browse/SPARK-2497 ) a chance to fix a MIMA issue. If it's not fixed by tomorrow, then I'll get Manish to do a temp filter workaround. On Tue, Jul 15, 2014 at 9:45 PM, Xiangrui Meng (JIRA) j...@apache.org ) Clustering: Index out of bounds error - Key: SPARK-1215 URL: https://issues.apache.org/jira/browse/SPARK-1215 Project: Spark Issue Type: Bug Components: MLlib Reporter: dewshick Assignee: Joseph K. Bradley Attachments: test.csv code:
import org.apache.spark.mllib.clustering._
val test = sc.makeRDD(Array(4,4,4,4,4).map(e => Array(e.toDouble)))
val kmeans = new KMeans().setK(4)
kmeans.run(test)
evals with java.lang.ArrayIndexOutOfBoundsException error:
14/01/17 12:35:54 INFO scheduler.DAGScheduler: Stage 25 (collectAsMap at KMeans.scala:243) finished in 0.047 s
14/01/17 12:35:54 INFO spark.SparkContext: Job finished: collectAsMap at KMeans.scala:243, took 16.389537116 s
Exception in thread main java.lang.reflect.InvocationTargetException
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at com.simontuffs.onejar.Boot.run(Boot.java:340)
  at com.simontuffs.onejar.Boot.main(Boot.java:166)
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
  at org.apache.spark.mllib.clustering.LocalKMeans$.kMeansPlusPlus(LocalKMeans.scala:47)
  at org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:247)
  at org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:244)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
  at scala.collection.immutable.Range.foreach(Range.scala:81)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
  at scala.collection.immutable.Range.map(Range.scala:46)
  at org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:244)
  at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:124)
  at Clustering$$anonfun$1.apply$mcDI$sp(Clustering.scala:21)
  at Clustering$$anonfun$1.apply(Clustering.scala:19)
  at Clustering$$anonfun$1.apply(Clustering.scala:19)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
  at scala.collection.immutable.Range.foreach(Range.scala:78)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
  at scala.collection.immutable.Range.map(Range.scala:46)
  at Clustering$.main(Clustering.scala:19)
  at Clustering.main(Clustering.scala)
  ... 6 more
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2504) Fix nullability of Substring expression.
[ https://issues.apache.org/jira/browse/SPARK-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2504. - Resolution: Fixed Fix Version/s: 1.0.2 1.1.0 Assignee: Takuya Ueshin Fix nullability of Substring expression. Key: SPARK-2504 URL: https://issues.apache.org/jira/browse/SPARK-2504 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Assignee: Takuya Ueshin Fix For: 1.1.0, 1.0.2 This is a follow-up of [#1359|https://github.com/apache/spark/pull/1359] with nullability narrowing. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064019#comment-14064019 ] Sean Owen commented on SPARK-2420: -- It'd be best to say what problem you are seeing with Guava -- but hold the phone, scratch my comments. I just tried compiling with Guava 11.0.2 and it looks like there is now some code that does use methods not present in 11. Just a few, but they exist. I assume it works on Hadoop now because Guava 14 wins in the assembly. I don't think it's a lot of work to make changes to accommodate Guava 11 here, but it seems pretty unpalatable to do so. Hadoop really needs to upgrade. But then that takes forever to trickle down to the majority of deployments anyway. Is there any feasibility of just upgrading Hive-on-Spark? So the options are: 1) Downgrade Spark, not so much to accommodate Hive, although that's nice, but to avoid latent clashes with Hadoop. 2) Somehow use Guava 14 in Hive on Spark. 3) Shade, shade, shade. Anyone else have opinions? Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because the Spark build contains versions of libraries that are vastly different from the current major Hadoop version. It would be nice if we could choose versions that are in line with Hadoop, or shade them in the assembly. Here is the wish list: 1. Upgrade protobuf version to 2.5.0 from the current 2.4.1. 2. Shade Spark's jetty and servlet dependencies in the assembly. 3. The guava version difference: Spark is using a higher version. I'm not sure what the best solution for this is. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied on Spark in order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2454) Separate driver spark home from executor spark home
[ https://issues.apache.org/jira/browse/SPARK-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064036#comment-14064036 ] Andrew Or commented on SPARK-2454: -- There may be multiple installations of Spark on the executor machine, in which case a global SPARK_HOME environment variable is not sufficient. What I am suggesting is that we should still keep the option of allowing the driver to overwrite executor spark homes (spark.executor.home), and only overwrite the executor SPARK_HOME if this is specified. Separate driver spark home from executor spark home --- Key: SPARK-2454 URL: https://issues.apache.org/jira/browse/SPARK-2454 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 The driver may not always share the same directory structure as the executors. It makes little sense to always re-use the driver's spark home on the executors. https://github.com/apache/spark/pull/1244/ is an open effort to fix this. However, this still requires us to set SPARK_HOME on all the executor nodes. Really we should separate this out into something like spark.driver.home spark.executor.home rather than re-using SPARK_HOME everywhere. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2454) Separate driver spark home from executor spark home
[ https://issues.apache.org/jira/browse/SPARK-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2454: - Description: The driver may not always share the same directory structure as the executors. It makes little sense to always re-use the driver's spark home on the executors. https://github.com/apache/spark/pull/1244/ is an open effort to fix this. However, this still requires us to set SPARK_HOME on all the executor nodes. Really we should separate this out into something like `spark.executor.home` rather than re-using SPARK_HOME everywhere. was: The driver may not always share the same directory structure as the executors. It makes little sense to always re-use the driver's spark home on the executors. https://github.com/apache/spark/pull/1244/ is an open effort to fix this. However, this still requires us to set SPARK_HOME on all the executor nodes. Really we should separate this out into something like spark.driver.home spark.executor.home rather than re-using SPARK_HOME everywhere. Separate driver spark home from executor spark home --- Key: SPARK-2454 URL: https://issues.apache.org/jira/browse/SPARK-2454 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 The driver may not always share the same directory structure as the executors. It makes little sense to always re-use the driver's spark home on the executors. https://github.com/apache/spark/pull/1244/ is an open effort to fix this. However, this still requires us to set SPARK_HOME on all the executor nodes. Really we should separate this out into something like `spark.executor.home` rather than re-using SPARK_HOME everywhere. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (SPARK-2465) Use long as user / item ID for ALS
[ https://issues.apache.org/jira/browse/SPARK-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen closed SPARK-2465. Resolution: Won't Fix Will possibly revisit this in the long term, or look at creating a parallel LongALS / LongRating API. Use long as user / item ID for ALS -- Key: SPARK-2465 URL: https://issues.apache.org/jira/browse/SPARK-2465 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.1 Reporter: Sean Owen Priority: Minor Attachments: ALS using MEMORY_AND_DISK.png, ALS using MEMORY_AND_DISK_SER.png, Screen Shot 2014-07-13 at 8.49.40 PM.png I'd like to float this for consideration: use longs instead of ints for user and product IDs in the ALS implementation. The main reason is that identifiers are not generally numeric at all, and will be hashed to an integer. (This is a separate issue.) Hashing to 32 bits means collisions are likely after hundreds of thousands of users and items, which is not unrealistic. Hashing to 64 bits pushes this back to billions. It would also mean numeric IDs that happen to be larger than the largest int can be used directly as identifiers. On the downside of course: 8 bytes instead of 4 bytes of memory used per Rating. Thoughts? I will post a PR so as to show what the change would be. -- This message was sent by Atlassian JIRA (v6.2#6252)
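As a sanity check on the collision claim, the standard birthday bound (my arithmetic, not from the issue):
{code}
// Approximate number of uniformly hashed IDs at which a collision becomes
// ~50% likely for a b-bit hash: n ≈ sqrt(2 * 2^b * ln 2).
def collisionPoint(bits: Int): Double =
  math.sqrt(2.0 * math.pow(2.0, bits) * math.log(2.0))

collisionPoint(32) // ≈ 7.7e4 -> tens to hundreds of thousands of IDs
collisionPoint(64) // ≈ 5.1e9 -> billions of IDs
{code}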
[jira] [Created] (SPARK-2531) Make BroadcastNestedLoopJoin take into account a BuildSide
Zongheng Yang created SPARK-2531: Summary: Make BroadcastNestedLoopJoin take into account a BuildSide Key: SPARK-2531 URL: https://issues.apache.org/jira/browse/SPARK-2531 Project: Spark Issue Type: Bug Reporter: Zongheng Yang -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2454) Separate driver spark home from executor spark home
[ https://issues.apache.org/jira/browse/SPARK-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064085#comment-14064085 ] Nan Zhu commented on SPARK-2454: I see, it makes sense to me... Separate driver spark home from executor spark home --- Key: SPARK-2454 URL: https://issues.apache.org/jira/browse/SPARK-2454 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 The driver may not always share the same directory structure as the executors. It makes little sense to always re-use the driver's spark home on the executors. https://github.com/apache/spark/pull/1244/ is an open effort to fix this. However, this still requires us to set SPARK_HOME on all the executor nodes. Really we should separate this out into something like `spark.executor.home` rather than re-using SPARK_HOME everywhere. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2411) Standalone Master - direct users to turn on event logs
[ https://issues.apache.org/jira/browse/SPARK-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2411: - Attachment: (was: Master event logs.png) Standalone Master - direct users to turn on event logs -- Key: SPARK-2411 URL: https://issues.apache.org/jira/browse/SPARK-2411 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Attachments: Master history not found.png Right now, if the user attempts to click on a finished application's UI, it simply refreshes. This is simply because the event logs are not there, in which case we set the href to an empty string. We could provide more information by pointing them to configure spark.eventLog.enabled if they click on the empty link. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2519) Eliminate pattern-matching on Tuple2 in performance-critical aggregation code
[ https://issues.apache.org/jira/browse/SPARK-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064102#comment-14064102 ] Sandy Ryza commented on SPARK-2519: --- I looked in ShuffledRDD, ExternalAppendOnlyMap, AppendOnlyMap, SizeTrackingAppendOnlyMap, and Aggregator. Just looked in CoGroupedRDD and PairRDDFunctions and found a few more - https://github.com/apache/spark/pull/1447. Eliminate pattern-matching on Tuple2 in performance-critical aggregation code - Key: SPARK-2519 URL: https://issues.apache.org/jira/browse/SPARK-2519 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Reporter: Sandy Ryza Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2531) Make BroadcastNestedLoopJoin take into account a BuildSide
[ https://issues.apache.org/jira/browse/SPARK-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064106#comment-14064106 ] Zongheng Yang commented on SPARK-2531: -- Github PR: https://github.com/apache/spark/pull/1448 Make BroadcastNestedLoopJoin take into account a BuildSide -- Key: SPARK-2531 URL: https://issues.apache.org/jira/browse/SPARK-2531 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.1 Reporter: Zongheng Yang Assignee: Zongheng Yang Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2154) Worker goes down.
[ https://issues.apache.org/jira/browse/SPARK-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2154: --- Assignee: Aaron Davidson Worker goes down. - Key: SPARK-2154 URL: https://issues.apache.org/jira/browse/SPARK-2154 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.8.1, 0.9.0, 1.0.0 Environment: Spark on cluster of three nodes on Ubuntu 12.04.4 LTS Reporter: siva venkat gogineni Assignee: Aaron Davidson Labels: patch Fix For: 1.1.0, 1.0.2 Attachments: Sccreenhot at various states of driver ..jpg The worker dies when I try to submit more drivers than the allocated cores. When I submit 9 drivers, with one core for each driver, on a cluster having 8 cores altogether, the worker dies as soon as I submit the 9th driver. It works fine until it reaches 8 cores; as soon as I submit the 9th driver, its status remains Submitted and the worker crashes. I understand that we cannot run more drivers than the allocated cores, but the problem here is that instead of the 9th driver being queued, it is executed, and as a result it crashes the worker. Let me know if there is a way to get around this issue, or whether it is being fixed in an upcoming version. Cluster Details: Spark 1.0.0, 2 nodes with 4 cores each. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2154) Worker goes down.
[ https://issues.apache.org/jira/browse/SPARK-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2154. Resolution: Fixed Fix Version/s: 1.1.0 1.0.2 Issue resolved by pull request 1405 [https://github.com/apache/spark/pull/1405] Worker goes down. - Key: SPARK-2154 URL: https://issues.apache.org/jira/browse/SPARK-2154 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.8.1, 0.9.0, 1.0.0 Environment: Spark on a cluster of three nodes on Ubuntu 12.04.4 LTS Reporter: siva venkat gogineni Labels: patch Fix For: 1.0.2, 1.1.0 Attachments: Sccreenhot at various states of driver ..jpg The worker dies when I try to submit more drivers than there are allocated cores. When I submit 9 drivers, each requesting one core, on a cluster with 8 cores altogether, the worker dies as soon as I submit the 9th driver. It works fine until all 8 cores are in use; as soon as I submit the 9th driver, its status remains SUBMITTED and the worker crashes. I understand that we cannot run more drivers than the allocated cores, but the problem is that instead of the 9th driver being queued, it is executed, and as a result it crashes the worker. Let me know if there is a way to work around this issue, or whether it is being fixed in an upcoming version. Cluster details: Spark 1.0.0, 2 nodes with 4 cores each. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (SPARK-2154) Worker goes down.
[ https://issues.apache.org/jira/browse/SPARK-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] siva venkat gogineni closed SPARK-2154. --- Fixed in future releases. Worker goes down. - Key: SPARK-2154 URL: https://issues.apache.org/jira/browse/SPARK-2154 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.8.1, 0.9.0, 1.0.0 Environment: Spark on a cluster of three nodes on Ubuntu 12.04.4 LTS Reporter: siva venkat gogineni Assignee: Aaron Davidson Labels: patch Fix For: 1.1.0, 1.0.2 Attachments: Sccreenhot at various states of driver ..jpg The worker dies when I try to submit more drivers than there are allocated cores. When I submit 9 drivers, each requesting one core, on a cluster with 8 cores altogether, the worker dies as soon as I submit the 9th driver. It works fine until all 8 cores are in use; as soon as I submit the 9th driver, its status remains SUBMITTED and the worker crashes. I understand that we cannot run more drivers than the allocated cores, but the problem is that instead of the 9th driver being queued, it is executed, and as a result it crashes the worker. Let me know if there is a way to work around this issue, or whether it is being fixed in an upcoming version. Cluster details: Spark 1.0.0, 2 nodes with 4 cores each. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2532) Fix issues with consolidated shuffle
Mridul Muralidharan created SPARK-2532: -- Summary: Fix issues with consolidated shuffle Key: SPARK-2532 URL: https://issues.apache.org/jira/browse/SPARK-2532 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: All Reporter: Mridul Muralidharan Assignee: Mridul Muralidharan Priority: Critical Fix For: 1.1.0 Will file a PR with the changes as soon as the merge is done (an earlier merge became outdated in 2 weeks, unfortunately :) ). Consolidated shuffle is broken in multiple ways in Spark:
a) Task failure(s) can cause the state to become inconsistent.
b) Multiple reverts, or a combination of close/revert/close, can leave the state inconsistent (as part of exception/error handling).
c) Some of the API in the block writer causes implementation issues. For example, a revert is always followed by a close, but the implementation tries to keep them separate, creating surface area for errors.
d) Fetching data from consolidated shuffle files can go badly wrong if the file is being actively written to: a segment's length is computed by subtracting its offset from the next offset (or from the file length if it is the last segment), and the latter fails when a fetch happens in parallel with a write. Note that this happens even if there are no task failures of any kind! It usually results in stream corruption or decompression errors.
-- This message was sent by Atlassian JIRA (v6.2#6252)
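Point (d) is the subtle one; a simplified sketch of the failure mode follows (hypothetical names, not the Spark source): segment lengths are derived from neighbouring offsets, so a concurrent append shifts the apparent length of the last segment.
{code}
// Simplified sketch of (d): deriving a segment's length from the next offset
// (or from the current file length for the last segment) is only safe once
// the file is no longer being written to.
object SegmentLengthSketch {
  def segmentLength(offsets: IndexedSeq[Long], fileLength: Long, i: Int): Long =
    if (i == offsets.length - 1)
      fileLength - offsets(i)     // races with a concurrent append: length may grow
    else
      offsets(i + 1) - offsets(i) // races if offsets(i + 1) is not final yet
}
{code}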
[jira] [Updated] (SPARK-2434) Generate runtime warnings for naive implementations
[ https://issues.apache.org/jira/browse/SPARK-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2434: - Assignee: Burak Yavuz Generate runtime warnings for naive implementations --- Key: SPARK-2434 URL: https://issues.apache.org/jira/browse/SPARK-2434 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Xiangrui Meng Assignee: Burak Yavuz Priority: Minor Labels: Starter There is some example code under src/main/scala/org/apache/spark/examples: * LocalALS * LocalFileLR * LocalKMeans * LocalLP * SparkALS * SparkHdfsLR * SparkKMeans * SparkLR These provide naive implementations of some machine learning algorithms that are already covered in MLlib. It is okay to keep them because the implementations are straightforward and easy to read. However, we should generate warning messages at runtime and point users to MLlib's implementations, in case users rely on these examples in practice; see the sketch below. -- This message was sent by Atlassian JIRA (v6.2#6252)
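A minimal sketch of the requested change (the wording and mechanism are assumptions, not the committed patch): each example prints a pointer to the corresponding MLlib implementation on startup.
{code}
// Hypothetical warning helper for the example programs.
object NaiveExampleWarning {
  def show(example: String, mllibClass: String): Unit =
    System.err.println(
      s"WARN: $example is a naive implementation for demonstration only. " +
      s"Please use $mllibClass from MLlib for real applications.")
}

// e.g. at the top of SparkKMeans.main:
// NaiveExampleWarning.show("SparkKMeans", "org.apache.spark.mllib.clustering.KMeans")
{code}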
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064168#comment-14064168 ] Xuefu Zhang commented on SPARK-2420: As to the guava conflict, HIVE-7387 has more details and analysis. Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because the Spark build contains versions of libraries that are vastly different from those of the current major Hadoop version. It would be nice if we could choose versions in line with Hadoop's, or shade them in the assembly. Here is the wish list: 1. Upgrade the protobuf version to 2.5.0 from the current 2.4.1. 2. Shade Spark's jetty and servlet dependencies in the assembly. 3. Guava version difference: Spark is using a higher version, and I'm not sure what the best solution for this is. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied to Spark in order to make Spark work with Hive. It gives an idea of the scope of the changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2154) Worker goes down.
[ https://issues.apache.org/jira/browse/SPARK-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064172#comment-14064172 ] Patrick Wendell commented on SPARK-2154: [~talk2siva8] Yes, that's correct. Worker goes down. - Key: SPARK-2154 URL: https://issues.apache.org/jira/browse/SPARK-2154 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.8.1, 0.9.0, 1.0.0 Environment: Spark on a cluster of three nodes on Ubuntu 12.04.4 LTS Reporter: siva venkat gogineni Assignee: Aaron Davidson Labels: patch Fix For: 1.1.0, 1.0.2 Attachments: Sccreenhot at various states of driver ..jpg The worker dies when I try to submit more drivers than there are allocated cores. When I submit 9 drivers, each requesting one core, on a cluster with 8 cores altogether, the worker dies as soon as I submit the 9th driver. It works fine until all 8 cores are in use; as soon as I submit the 9th driver, its status remains SUBMITTED and the worker crashes. I understand that we cannot run more drivers than the allocated cores, but the problem is that instead of the 9th driver being queued, it is executed, and as a result it crashes the worker. Let me know if there is a way to work around this issue, or whether it is being fixed in an upcoming version. Cluster details: Spark 1.0.0, 2 nodes with 4 cores each. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2495) Ability to re-create ML models
[ https://issues.apache.org/jira/browse/SPARK-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2495: - Assignee: Alexander Albul Ability to re-create ML models -- Key: SPARK-2495 URL: https://issues.apache.org/jira/browse/SPARK-2495 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.1 Reporter: Alexander Albul Assignee: Alexander Albul Hi everyone. Previously (prior to Spark 1.0) we were working with MLlib like this: 1) Calculate the model (a costly operation). 2) Take the model and collect its fields, like weights, intercept, etc. 3) Store the model somewhere in our own format. 4) Do predictions by loading the model attributes, creating a new model, and predicting with it. Now I see that the models' constructors have the *private* modifier, so models cannot be created from outside. If you want to hide implementation details and keep these constructors as developer API, why not at least create a method that takes the weights and intercept (for example) and materializes the model? A good example of the kind of model I am talking about is *LinearRegressionModel*. I know that the *LinearRegressionWithSGD* class has a *createModel* method, but the problem is that it has the *protected* modifier as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
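What the reporter is asking for, as a sketch (this assumes a public constructor, which is exactly what Spark 1.0 does not expose; the weight values are placeholders):
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LinearRegressionModel

// Recreate a previously trained model from externally stored attributes.
val weights = Vectors.dense(0.5, -1.2) // loaded from the user's own storage
val model = new LinearRegressionModel(weights, 0.1) // does not compile while the constructor is private
val prediction = model.predict(Vectors.dense(1.0, 2.0))
{code}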
[jira] [Updated] (SPARK-1997) Update breeze to version 0.8.1
[ https://issues.apache.org/jira/browse/SPARK-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1997: - Target Version/s: 1.1.0 Update breeze to version 0.8.1 -- Key: SPARK-1997 URL: https://issues.apache.org/jira/browse/SPARK-1997 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Guoqiang Li Assignee: Guoqiang Li {{breeze 0.7}} does not support {{scala 2.11}}. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2533) Show summary of locality level of completed tasks in each stage page of the web UI
[ https://issues.apache.org/jira/browse/SPARK-2533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masayoshi TSUZUKI updated SPARK-2533: - Summary: Show summary of locality level of completed tasks in each stage page of the web UI (was: Show summary of locality level of completed tasks in each stage page of the web UI) Show summary of locality level of completed tasks in each stage page of the web UI -- Key: SPARK-2533 URL: https://issues.apache.org/jira/browse/SPARK-2533 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.0.0 Reporter: Masayoshi TSUZUKI Priority: Minor When the number of tasks is very large, it is impossible to tell from the stage page of the web UI how many tasks were executed at each locality level (PROCESS_LOCAL/NODE_LOCAL/RACK_LOCAL). It would be better to show a summary of task locality levels in the web UI. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2533) Show summary of locality level of completed tasks in each stage page of the web UI
Masayoshi TSUZUKI created SPARK-2533: Summary: Show summary of locality level of completed tasks in each stage page of the web UI Key: SPARK-2533 URL: https://issues.apache.org/jira/browse/SPARK-2533 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.0.0 Reporter: Masayoshi TSUZUKI Priority: Minor When the number of tasks is very large, it is impossible to tell from the stage page of the web UI how many tasks were executed at each locality level (PROCESS_LOCAL/NODE_LOCAL/RACK_LOCAL). It would be better to show a summary of task locality levels in the web UI. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2534) Avoid pulling in the entire RDD in groupByKey
Reynold Xin created SPARK-2534: -- Summary: Avoid pulling in the entire RDD in groupByKey Key: SPARK-2534 URL: https://issues.apache.org/jira/browse/SPARK-2534 Project: Spark Issue Type: Bug Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical The way groupByKey is written actually pulls the entire PairRDDFunctions into the 3 closures, sometimes resulting in gigantic task sizes:
{code}
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  def createCombiner(v: V) = ArrayBuffer(v)
  def mergeValue(buf: ArrayBuffer[V], v: V) = buf += v
  def mergeCombiners(c1: ArrayBuffer[V], c2: ArrayBuffer[V]) = c1 ++ c2
  val bufs = combineByKey[ArrayBuffer[V]](
    createCombiner _, mergeValue _, mergeCombiners _, partitioner, mapSideCombine=false)
  bufs.mapValues(_.toIterable)
}
{code}
Changing the functions from def to val would solve it. -- This message was sent by Atlassian JIRA (v6.2#6252)
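Why the def-to-val change helps, as a standalone sketch (hypothetical class, not the Spark patch): eta-expanding a nested def produces a function object that closes over the enclosing instance, while a val already is a free-standing function object.
{code}
import scala.collection.mutable.ArrayBuffer

class Enclosing extends Serializable {
  val bigState = new Array[Byte](4 << 20) // stand-in for PairRDDFunctions' state

  // `createCombiner _` eta-expands a nested def; the nested def is compiled as
  // a method of the enclosing class, so the resulting closure keeps a reference
  // to `this`, dragging bigState into the serialized task.
  def combinerFromDef: Int => ArrayBuffer[Int] = {
    def createCombiner(v: Int) = ArrayBuffer(v)
    createCombiner _
  }

  // A val is already a function object and references nothing on `this`.
  def combinerFromVal: Int => ArrayBuffer[Int] = {
    val createCombiner = (v: Int) => ArrayBuffer(v)
    createCombiner
  }
}
{code}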
[jira] [Commented] (SPARK-2534) Avoid pulling in the entire RDD in groupByKey
[ https://issues.apache.org/jira/browse/SPARK-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064333#comment-14064333 ] Sandy Ryza commented on SPARK-2534: --- Yowza Avoid pulling in the entire RDD in groupByKey - Key: SPARK-2534 URL: https://issues.apache.org/jira/browse/SPARK-2534 Project: Spark Issue Type: Bug Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical The way groupByKey is written actually pulls the entire PairRDDFunctions into the 3 closures, sometimes resulting in gigantic task sizes:
{code}
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  def createCombiner(v: V) = ArrayBuffer(v)
  def mergeValue(buf: ArrayBuffer[V], v: V) = buf += v
  def mergeCombiners(c1: ArrayBuffer[V], c2: ArrayBuffer[V]) = c1 ++ c2
  val bufs = combineByKey[ArrayBuffer[V]](
    createCombiner _, mergeValue _, mergeCombiners _, partitioner, mapSideCombine=false)
  bufs.mapValues(_.toIterable)
}
{code}
Changing the functions from def to val would solve it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2501) Handle stage re-submissions properly in the UI
[ https://issues.apache.org/jira/browse/SPARK-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064350#comment-14064350 ] Masayoshi TSUZUKI commented on SPARK-2501: -- [SPARK-2299] seems to include the problem that the key of some hash maps should be stageId + attemptId instead of stageId only. Handle stage re-submissions properly in the UI -- Key: SPARK-2501 URL: https://issues.apache.org/jira/browse/SPARK-2501 Project: Spark Issue Type: Bug Components: Web UI Reporter: Patrick Wendell Assignee: Masayoshi TSUZUKI Priority: Critical -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2535) Add StringComparison case to NullPropagation.
Takuya Ueshin created SPARK-2535: Summary: Add StringComparison case to NullPropagation. Key: SPARK-2535 URL: https://issues.apache.org/jira/browse/SPARK-2535 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin Cases for {{StringComparison}} expressions involving {{null}} literals could be added to {{NullPropagation}}. -- This message was sent by Atlassian JIRA (v6.2#6252)
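A toy, self-contained illustration of the idea (a hypothetical mini-AST, not Catalyst code): a string comparison with a null literal operand can be folded to a null literal during optimization.
{code}
// Toy sketch of null propagation for string comparisons.
sealed trait Expr
case class Lit(value: Any) extends Expr
case class StartsWith(str: Expr, prefix: Expr) extends Expr

object NullPropagationSketch {
  def propagateNulls(e: Expr): Expr = e match {
    case StartsWith(Lit(null), _) => Lit(null) // null STARTSWITH x => null
    case StartsWith(_, Lit(null)) => Lit(null) // x STARTSWITH null => null
    case other                    => other
  }
}
{code}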
[jira] [Comment Edited] (SPARK-2501) Handle stage re-submissions properly in the UI
[ https://issues.apache.org/jira/browse/SPARK-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064350#comment-14064350 ] Masayoshi TSUZUKI edited comment on SPARK-2501 at 7/16/14 11:58 PM: [SPARK-2299] seems to include the problem in JobProgressListener that the key of some hash maps should be stageId + attemptId instead of stageId only. was (Author: tsudukim): [SPARK-2299] seems to include the problem that the key of some hash maps should be stageId + attemptId instead of stageId only. Handle stage re-submissions properly in the UI -- Key: SPARK-2501 URL: https://issues.apache.org/jira/browse/SPARK-2501 Project: Spark Issue Type: Bug Components: Web UI Reporter: Patrick Wendell Assignee: Masayoshi TSUZUKI Priority: Critical -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2536) Update the MLlib page of Spark website
Xiangrui Meng created SPARK-2536: Summary: Update the MLlib page of Spark website Key: SPARK-2536 URL: https://issues.apache.org/jira/browse/SPARK-2536 Project: Spark Issue Type: Improvement Components: Documentation, MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng It stills shows v0.9. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2536) Update the MLlib page of Spark website
[ https://issues.apache.org/jira/browse/SPARK-2536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2536: - Description: It still shows v0.9. (was: It stills shows v0.9.) Update the MLlib page of Spark website -- Key: SPARK-2536 URL: https://issues.apache.org/jira/browse/SPARK-2536 Project: Spark Issue Type: Improvement Components: Documentation, MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng It still shows v0.9. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2538) External aggregation in Python
Davies Liu created SPARK-2538: - Summary: External aggregation in Python Key: SPARK-2538 URL: https://issues.apache.org/jira/browse/SPARK-2538 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.0.0, 1.0.1 Reporter: Davies Liu Fix For: 1.0.1, 1.0.0 For huge reduce tasks, users will get an out-of-memory exception when all the data cannot fit in memory. It should spill some of the data to disk and then merge the pieces back together, just like what we do in Scala. -- This message was sent by Atlassian JIRA (v6.2#6252)
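The Scala-side behaviour being referenced, as a hedged toy sketch (plain Scala, hypothetical helper; not Spark's ExternalAppendOnlyMap): spill partial aggregates to disk once the in-memory map grows too large, then merge the spills at the end.
{code}
import java.io._
import scala.collection.mutable

object ExternalReduceSketch {
  // Toy external aggregation: word-count style reduce with spilling.
  def externalReduce(items: Iterator[(String, Int)], maxKeys: Int): Map[String, Int] = {
    val spills = mutable.Buffer[File]()
    var current = mutable.Map[String, Int]()

    def spill(): Unit = {
      val f = File.createTempFile("agg-spill", ".tmp")
      val out = new ObjectOutputStream(new FileOutputStream(f))
      out.writeObject(current.toMap)
      out.close()
      spills += f
      current = mutable.Map[String, Int]()
    }

    for ((k, v) <- items) {
      current(k) = current.getOrElse(k, 0) + v
      if (current.size >= maxKeys) spill() // keep memory bounded
    }

    // Merge the in-memory remainder with all spilled partial maps.
    val merged = mutable.Map[String, Int]() ++ current
    for (f <- spills) {
      val in = new ObjectInputStream(new FileInputStream(f))
      for ((k, v) <- in.readObject().asInstanceOf[Map[String, Int]])
        merged(k) = merged.getOrElse(k, 0) + v
      in.close()
      f.delete()
    }
    merged.toMap
  }
}
{code}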
[jira] [Commented] (SPARK-2495) Ability to re-create ML models
[ https://issues.apache.org/jira/browse/SPARK-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064374#comment-14064374 ] Alexander Albul commented on SPARK-2495: Hi Meng, here is the list of models that could potentially be created by users by hand: * LogisticRegressionModel * NaiveBayesModel * SVMModel * KMeansModel * MatrixFactorizationModel * LassoModel * LinearRegressionModel * RidgeRegressionModel I propose to open all of them. Even if the final form of the NaiveBayesModel parameters has not been decided yet, that should not affect the model constructor's visibility, IMHO. We can make it visible and mark it with the DeveloperApi annotation, for example. Anyway, I propose to change the constructor visibility at least for all models except NaiveBayesModel. Ability to re-create ML models -- Key: SPARK-2495 URL: https://issues.apache.org/jira/browse/SPARK-2495 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.1 Reporter: Alexander Albul Assignee: Alexander Albul Hi everyone. Previously (prior to Spark 1.0) we were working with MLlib like this: 1) Calculate the model (a costly operation). 2) Take the model and collect its fields, like weights, intercept, etc. 3) Store the model somewhere in our own format. 4) Do predictions by loading the model attributes, creating a new model, and predicting with it. Now I see that the models' constructors have the *private* modifier, so models cannot be created from outside. If you want to hide implementation details and keep these constructors as developer API, why not at least create a method that takes the weights and intercept (for example) and materializes the model? A good example of the kind of model I am talking about is *LinearRegressionModel*. I know that the *LinearRegressionWithSGD* class has a *createModel* method, but the problem is that it has the *protected* modifier as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2537) Workaround Timezone specific Hive tests
[ https://issues.apache.org/jira/browse/SPARK-2537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-2537: -- Description: Several Hive tests in {{HiveCompatibilitySuite}} are timezone-sensitive: - {{timestamp_1}} - {{timestamp_2}} - {{timestamp_3}} - {{timestamp_udf}} Their answers differ between different timezones. Caching golden answers naively causes build failures in other timezones. Currently these tests are blacklisted. A not-so-clever solution is to cache golden answers of all timezones for these tests, then select the right version for the current build according to the system timezone. was: Several Hive tests in {{HiveCompatibilitySuite}} are timezone-sensitive: - {{timestamp_1}} - {{timestamp_2}} - {{timestamp_3}} - {{timestamp_udf}} Their answers differ between different timezones. Caching golden answers naively causes build failures in other timezones. A not-so-clever solution is to cache golden answers of all timezones for these tests, then select the right version for the current build according to the system timezone. Workaround Timezone specific Hive tests --- Key: SPARK-2537 URL: https://issues.apache.org/jira/browse/SPARK-2537 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.1, 1.1.0 Reporter: Cheng Lian Priority: Minor Several Hive tests in {{HiveCompatibilitySuite}} are timezone-sensitive: - {{timestamp_1}} - {{timestamp_2}} - {{timestamp_3}} - {{timestamp_udf}} Their answers differ between different timezones. Caching golden answers naively causes build failures in other timezones. Currently these tests are blacklisted. A not-so-clever solution is to cache golden answers of all timezones for these tests, then select the right version for the current build according to the system timezone. -- This message was sent by Atlassian JIRA (v6.2#6252)
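That "not-so-clever" workaround could look roughly like this (hypothetical helper and file layout, purely illustrative): pick the cached golden answer that matches the timezone of the current build.
{code}
import java.util.TimeZone

object GoldenAnswerSelector {
  // Select the golden answer directory matching the build machine's timezone.
  def goldenAnswerPath(testName: String): String = {
    val tz = TimeZone.getDefault.getID.replace('/', '-') // e.g. "America-Los_Angeles"
    s"src/test/resources/golden-by-tz/$tz/$testName"
  }
}
{code}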
[jira] [Commented] (SPARK-2535) Add StringComparison case to NullPropagation.
[ https://issues.apache.org/jira/browse/SPARK-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064381#comment-14064381 ] Takuya Ueshin commented on SPARK-2535: -- PR: https://github.com/apache/spark/pull/1451 Add StringComparison case to NullPropagation. - Key: SPARK-2535 URL: https://issues.apache.org/jira/browse/SPARK-2535 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin Cases for {{StringComparison}} expressions involving {{null}} literals could be added to {{NullPropagation}}. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2539) ConnectionManager should handle Uncaught Exception
Kousuke Saruta created SPARK-2539: - Summary: ConnectionManager should handle Uncaught Exception Key: SPARK-2539 URL: https://issues.apache.org/jira/browse/SPARK-2539 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Kousuke Saruta ConnectionManager uses a thread pool, and worker threads run on threads spawned by the pool. In the current implementation, if an uncaught exception is thrown from one of those threads, nobody handles it. If the exception is fatal, it should be handled and the executor should exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
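One way such handling could look (a sketch under assumptions, not the merged patch): install an UncaughtExceptionHandler via the pool's ThreadFactory and exit the process on fatal errors.
{code}
import java.util.concurrent.{Executors, ThreadFactory}

object UncaughtHandlerSketch {
  val factory = new ThreadFactory {
    override def newThread(r: Runnable): Thread = {
      val t = new Thread(r, "connection-manager-worker")
      t.setDaemon(true)
      t.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
        override def uncaughtException(thread: Thread, e: Throwable): Unit = e match {
          case fatal: Error =>
            // Fatal JVM error: log and terminate so the executor doesn't limp on.
            System.err.println(s"Fatal error in ${thread.getName}: $fatal")
            System.exit(1)
          case other =>
            System.err.println(s"Uncaught exception in ${thread.getName}: $other")
        }
      })
      t
    }
  }
  val pool = Executors.newFixedThreadPool(4, factory)
}
{code}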
[jira] [Created] (SPARK-2540) Add support for more types in unwrapData of HiveUDF
Cheng Hao created SPARK-2540: Summary: Add support for more types in unwrapData of HiveUDF Key: SPARK-2540 URL: https://issues.apache.org/jira/browse/SPARK-2540 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Minor In the function HiveInspectors.unwrapData, HiveVarcharObjectInspector and HiveDecimalObjectInspector are currently not supported. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2433) In MLlib, the Naive Bayes implementation in Spark 0.9.1 has a bug.
[ https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064421#comment-14064421 ] Xiangrui Meng commented on SPARK-2433: -- PR for branch-0.9: https://github.com/apache/spark/pull/1453 In MLlib, the Naive Bayes implementation in Spark 0.9.1 has a bug. Key: SPARK-2433 URL: https://issues.apache.org/jira/browse/SPARK-2433 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 0.9.1 Environment: Any Reporter: Rahul K Bhojwani Labels: easyfix, test Original Estimate: 1h Remaining Estimate: 1h I don't have much experience with reporting errors; this is my first time. If something is not clear, please feel free to contact me (details given below). In the PySpark MLlib library. Path: \spark-0.9.1\python\pyspark\mllib\classification.py Class: NaiveBayesModel Method: self.predict
Earlier implementation:
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta), x)))
New implementation No. 1:
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta), x)))
New implementation No. 2:
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + dot(x, self.theta.T))
Explanation: No. 1 is correct according to me; I don't know about No. 2.
Error one: The matrix self.theta has dimension [n_classes, n_features], while the matrix x has dimension [1, n_features]. Taking the dot product will not work, as it is [1, n_features] x [n_classes, n_features]; it will always give the error "ValueError: matrices are not aligned". In the commented example given in classification.py, n_classes = n_features = 2, which is why there is no error. Both implementation No. 1 and implementation No. 2 take care of it.
Error two: The basic implementation of naive Bayes is P(class_n | sample) = count_feature_1 * P(feature_1 | class_n) * ... * count_feature_n * P(feature_n | class_n) * P(class_n) / P(sample) (a constant), taking the class with the max value. That's what implementation No. 1 is doing. Implementation No. 2 basically takes the class with the max value of exp(count_feature_1) * P(feature_1 | class_n) * ... * exp(count_feature_n) * P(feature_n | class_n) * P(class_n). I don't know if it gives the exact result.
Thanks, Rahul Bhojwani rahulbhojwani2...@gmail.com -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2433) In MLlib, the Naive Bayes implementation in Spark 0.9.1 has a bug.
[ https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2433: - Fix Version/s: 1.0.0 In MLlib, the Naive Bayes implementation in Spark 0.9.1 has a bug. Key: SPARK-2433 URL: https://issues.apache.org/jira/browse/SPARK-2433 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 0.9.1 Environment: Any Reporter: Rahul K Bhojwani Labels: easyfix, test Fix For: 1.0.0 Original Estimate: 1h Remaining Estimate: 1h I don't have much experience with reporting errors; this is my first time. If something is not clear, please feel free to contact me (details given below). In the PySpark MLlib library. Path: \spark-0.9.1\python\pyspark\mllib\classification.py Class: NaiveBayesModel Method: self.predict
Earlier implementation:
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta), x)))
New implementation No. 1:
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta), x)))
New implementation No. 2:
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + dot(x, self.theta.T))
Explanation: No. 1 is correct according to me; I don't know about No. 2.
Error one: The matrix self.theta has dimension [n_classes, n_features], while the matrix x has dimension [1, n_features]. Taking the dot product will not work, as it is [1, n_features] x [n_classes, n_features]; it will always give the error "ValueError: matrices are not aligned". In the commented example given in classification.py, n_classes = n_features = 2, which is why there is no error. Both implementation No. 1 and implementation No. 2 take care of it.
Error two: The basic implementation of naive Bayes is P(class_n | sample) = count_feature_1 * P(feature_1 | class_n) * ... * count_feature_n * P(feature_n | class_n) * P(class_n) / P(sample) (a constant), taking the class with the max value. That's what implementation No. 1 is doing. Implementation No. 2 basically takes the class with the max value of exp(count_feature_1) * P(feature_1 | class_n) * ... * exp(count_feature_n) * P(feature_n | class_n) * P(class_n). I don't know if it gives the exact result.
Thanks, Rahul Bhojwani rahulbhojwani2...@gmail.com -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2438) Streaming + MLlib
[ https://issues.apache.org/jira/browse/SPARK-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2438: - Assignee: Jeremy Freeman Streaming + MLlib - Key: SPARK-2438 URL: https://issues.apache.org/jira/browse/SPARK-2438 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Jeremy Freeman Assignee: Jeremy Freeman Labels: features This is a ticket to track progress on developing streaming analyses in MLlib. Many streaming applications benefit from or require fitting models online, where the parameters of a model (e.g. regression, clustering) are updated continually as new data arrive. This can be accomplished by incorporating MLlib algorithms into model-updating operations over DStreams. In some cases this can be achieved using existing updaters (e.g. those based on SGD), but in other cases it will require custom update rules (e.g. for KMeans). The goal is to have streaming versions of many common algorithms, in particular regression, classification, clustering, and possibly dimensionality reduction. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2541) Standalone mode can't access secure HDFS anymore
Thomas Graves created SPARK-2541: Summary: Standalone mode can't access secure HDFS anymore Key: SPARK-2541 URL: https://issues.apache.org/jira/browse/SPARK-2541 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0, 1.0.1 Reporter: Thomas Graves In Spark 0.9.x you could access secure HDFS from a standalone deployment; that doesn't work in 1.x anymore. It looks like the issue is in SparkHadoopUtil.runAsSparkUser. Previously it wouldn't do the doAs if currentUser == user. I'm not sure how this affects the case when the daemons run as a superuser but SPARK_USER is set to someone else. -- This message was sent by Atlassian JIRA (v6.2#6252)
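A sketch of the suspected regression site (hedged and heavily simplified; not the actual SparkHadoopUtil source): the 0.9.x behaviour skipped the doAs when the target user was already the current user, preserving the existing kerberized UGI, while always wrapping in doAs replaces it.
{code}
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

object RunAsSparkUserSketch {
  def runAsSparkUser(user: String)(func: () => Unit): Unit = {
    val current = UserGroupInformation.getCurrentUser
    if (current.getShortUserName == user) {
      // 0.9.x-style behaviour: keep the current (possibly kerberized) UGI.
      func()
    } else {
      // Otherwise impersonate the requested user.
      val ugi = UserGroupInformation.createRemoteUser(user)
      ugi.doAs(new PrivilegedExceptionAction[Unit] {
        override def run(): Unit = func()
      })
    }
  }
}
{code}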
[jira] [Commented] (SPARK-2406) Partitioned Parquet Support
[ https://issues.apache.org/jira/browse/SPARK-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064450#comment-14064450 ] Pat McDonough commented on SPARK-2406: -- Okay, so it sounds like these are completely different use cases then, which might deserve their own JIRAs: * Native support for partitioned Parquet tables (probably what this JIRA is meant to address) * Automatic use of Native Parquet readers when using a Hive parquet table (what my previous comment described) Does that sound correct? Partitioned Parquet Support --- Key: SPARK-2406 URL: https://issues.apache.org/jira/browse/SPARK-2406 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Priority: Critical -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2481) The environment variable SPARK_HISTORY_OPTS is overridden in start-history-server.sh
[ https://issues.apache.org/jira/browse/SPARK-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064471#comment-14064471 ] Masayoshi TSUZUKI commented on SPARK-2481: -- Ah, I understand what you meant, and I agree with you. The environment variable SPARK_HISTORY_OPTS is overridden in start-history-server.sh -- Key: SPARK-2481 URL: https://issues.apache.org/jira/browse/SPARK-2481 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Guoqiang Li Assignee: Guoqiang Li If we have the following code in conf/spark-env.sh: {{export SPARK_HISTORY_OPTS=-DSpark.history.XX=XX}} then the environment variable SPARK_HISTORY_OPTS is overridden in [start-history-server.sh|https://github.com/apache/spark/blob/master/sbin/start-history-server.sh]:
{code}
if [ $# != 0 ]; then
  echo "Using command line arguments for setting the log directory is deprecated. Please"
  echo "set the spark.history.fs.logDirectory configuration option instead."
  export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=$1"
fi
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)