[jira] [Created] (SPARK-2517) Remove as many compilation warning messages as possible
Reynold Xin created SPARK-2517: -- Summary: Remove as many compilation warning messages as possible Key: SPARK-2517 URL: https://issues.apache.org/jira/browse/SPARK-2517 Project: Spark Issue Type: Improvement Reporter: Reynold Xin
Some examples:
{code}
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/util/FileAppenderSuite.scala:138: abstract type ExpectedAppender is unchecked since it is eliminated by erasure
[warn] assert(appender.isInstanceOf[ExpectedAppender])
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/util/FileAppenderSuite.scala:143: abstract type ExpectedRollingPolicy is unchecked since it is eliminated by erasure
[warn] rollingPolicy.isInstanceOf[ExpectedRollingPolicy]
[warn] ^
{code}
{code}
[warn] /scratch/rxin/spark/streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala:386: method connect in class IOManager is deprecated: use the new implementation in package akka.io instead
[warn] override def preStart = IOManager(context.system).connect(new InetSocketAddress(port))
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:207: non-variable type argument String in type pattern Map[String,Any] is unchecked since it is eliminated by erasure
[warn] case (key: String, struct: Map[String, Any]) => {
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:238: non-variable type argument String in type pattern java.util.Map[String,Object] is unchecked since it is eliminated by erasure
[warn] case map: java.util.Map[String, Object] =>
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:243: non-variable type argument Object in type pattern java.util.List[Object] is unchecked since it is eliminated by erasure
[warn] case list: java.util.List[Object] =>
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:323: non-variable type argument String in type pattern Map[String,Any] is unchecked since it is eliminated by erasure
[warn] case value: Map[String, Any] => toJsonObjectString(value)
[warn] ^
[info] Compiling 2 Scala sources to /scratch/rxin/spark/repl/target/scala-2.10/test-classes...
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala:382: method mapWith in class RDD is deprecated: use mapPartitionsWithIndex
[warn] val randoms = ones.mapWith(
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala:400: method flatMapWith in class RDD is deprecated: use mapPartitionsWithIndex and flatMap
[warn] val randoms = ones.flatMapWith(
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala:421: method filterWith in class RDD is deprecated: use mapPartitionsWithIndex and filter
[warn] val sample = ints.filterWith(
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/serializer/ProactiveClosureSerializationSuite.scala:76: method mapWith in class RDD is deprecated: use mapPartitionsWithIndex
[warn] x.mapWith(x => x.toString)((x,y) => x + uc.op(y))
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/serializer/ProactiveClosureSerializationSuite.scala:82: method filterWith in class RDD is deprecated: use mapPartitionsWithIndex and filter
[warn] x.filterWith(x => x.toString)((x,y) => uc.pred(y))
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/util/VectorSuite.scala:29: class Vector in package util is deprecated: Use Vectors.dense from Spark's mllib.linalg package instead.
[warn] def verifyVector(vector: Vector, expectedLength: Int) = {
[warn] ^
[warn] one warning found
{code}
{code}
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:238: non-variable type argument String in type pattern java.util.Map[String,Object] is unchecked since it is eliminated by erasure
[warn] case map: java.util.Map[String, Object] =>
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:243: non-variable type argument Object in type pattern java.util.List[Object] is unchecked since it is eliminated by erasure
[warn] case list: java.util.List[Object] =>
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:323:
{code}
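A note on the first group: the unchecked warnings come from type tests that erasure makes meaningless at runtime. A minimal standalone Scala sketch (not Spark code) of the usual remedies — match on the erased shape, or carry a ClassTag when the runtime class actually matters:
{code}
import scala.reflect.ClassTag

object ErasureDemo {
  // Matching Map[String, Any] is unchecked: the type arguments are erased,
  // so only the raw shape Map[_, _] can really be tested at runtime.
  def describe(v: Any): String = v match {
    case m: Map[_, _] => s"a map with ${m.size} entries" // no unchecked warning
    case _            => "something else"
  }

  // For isInstanceOf checks against an abstract type (as in FileAppenderSuite),
  // a ClassTag keeps the runtime class available despite erasure.
  def isExpected[T: ClassTag](v: Any): Boolean =
    implicitly[ClassTag[T]].runtimeClass.isInstance(v)
}
{code}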
[jira] [Updated] (SPARK-2517) Remove as many compilation warning messages as possible
[ https://issues.apache.org/jira/browse/SPARK-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2517: --- Assignee: Yin Huai Remove as many compilation warning messages as possible --- Key: SPARK-2517 URL: https://issues.apache.org/jira/browse/SPARK-2517 Project: Spark Issue Type: Improvement Reporter: Reynold Xin Assignee: Yin Huai
Some examples:
{code}
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/util/FileAppenderSuite.scala:138: abstract type ExpectedAppender is unchecked since it is eliminated by erasure
[warn] assert(appender.isInstanceOf[ExpectedAppender])
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/util/FileAppenderSuite.scala:143: abstract type ExpectedRollingPolicy is unchecked since it is eliminated by erasure
[warn] rollingPolicy.isInstanceOf[ExpectedRollingPolicy]
[warn] ^
{code}
{code}
[warn] /scratch/rxin/spark/streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala:386: method connect in class IOManager is deprecated: use the new implementation in package akka.io instead
[warn] override def preStart = IOManager(context.system).connect(new InetSocketAddress(port))
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:207: non-variable type argument String in type pattern Map[String,Any] is unchecked since it is eliminated by erasure
[warn] case (key: String, struct: Map[String, Any]) => {
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:238: non-variable type argument String in type pattern java.util.Map[String,Object] is unchecked since it is eliminated by erasure
[warn] case map: java.util.Map[String, Object] =>
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:243: non-variable type argument Object in type pattern java.util.List[Object] is unchecked since it is eliminated by erasure
[warn] case list: java.util.List[Object] =>
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:323: non-variable type argument String in type pattern Map[String,Any] is unchecked since it is eliminated by erasure
[warn] case value: Map[String, Any] => toJsonObjectString(value)
[warn] ^
[info] Compiling 2 Scala sources to /scratch/rxin/spark/repl/target/scala-2.10/test-classes...
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala:382: method mapWith in class RDD is deprecated: use mapPartitionsWithIndex
[warn] val randoms = ones.mapWith(
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala:400: method flatMapWith in class RDD is deprecated: use mapPartitionsWithIndex and flatMap
[warn] val randoms = ones.flatMapWith(
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala:421: method filterWith in class RDD is deprecated: use mapPartitionsWithIndex and filter
[warn] val sample = ints.filterWith(
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/serializer/ProactiveClosureSerializationSuite.scala:76: method mapWith in class RDD is deprecated: use mapPartitionsWithIndex
[warn] x.mapWith(x => x.toString)((x,y) => x + uc.op(y))
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/serializer/ProactiveClosureSerializationSuite.scala:82: method filterWith in class RDD is deprecated: use mapPartitionsWithIndex and filter
[warn] x.filterWith(x => x.toString)((x,y) => uc.pred(y))
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/util/VectorSuite.scala:29: class Vector in package util is deprecated: Use Vectors.dense from Spark's mllib.linalg package instead.
[warn] def verifyVector(vector: Vector, expectedLength: Int) = {
[warn] ^
[warn] one warning found
{code}
{code}
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:238: non-variable type argument String in type pattern java.util.Map[String,Object] is unchecked since it is eliminated by erasure
[warn] case map: java.util.Map[String, Object] =>
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:243:
{code}
[jira] [Created] (SPARK-2518) Fix foldability of Substring expression.
Takuya Ueshin created SPARK-2518: Summary: Fix foldability of Substring expression. Key: SPARK-2518 URL: https://issues.apache.org/jira/browse/SPARK-2518 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin This is a follow-up of [#1428|https://github.com/apache/spark/pull/1428]. -- This message was sent by Atlassian JIRA (v6.2#6252)
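For background, Catalyst expressions carry a foldable flag that lets the optimizer evaluate constant subtrees at planning time. A hedged sketch of the shape of such a fix, on a simplified model (the real classes live in org.apache.spark.sql.catalyst.expressions; signatures here are illustrative):
{code}
trait Expression {
  def children: Seq[Expression]
  // An expression is foldable when it can be evaluated before execution.
  def foldable: Boolean = false
}

case class Literal(value: Any) extends Expression {
  def children = Nil
  override def foldable = true
}

case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {
  def children = Seq(str, pos, len)
  // Foldable exactly when all three children are, so e.g.
  // Substring(Literal("abcd"), Literal(0), Literal(2)) can be pre-computed.
  override def foldable = str.foldable && pos.foldable && len.foldable
}
{code}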
[jira] [Commented] (SPARK-2518) Fix foldability of Substring expression.
[ https://issues.apache.org/jira/browse/SPARK-2518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063212#comment-14063212 ] Takuya Ueshin commented on SPARK-2518: -- PRed: https://github.com/apache/spark/pull/1432 Fix foldability of Substring expression. Key: SPARK-2518 URL: https://issues.apache.org/jira/browse/SPARK-2518 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin This is a follow-up of [#1428|https://github.com/apache/spark/pull/1428]. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2518) Fix foldability of Substring expression.
[ https://issues.apache.org/jira/browse/SPARK-2518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2518: --- Assignee: Takuya Ueshin Fix foldability of Substring expression. Key: SPARK-2518 URL: https://issues.apache.org/jira/browse/SPARK-2518 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Assignee: Takuya Ueshin This is a follow-up of [#1428|https://github.com/apache/spark/pull/1428]. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2517) Remove as many compilation warning messages as possible
[ https://issues.apache.org/jira/browse/SPARK-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2517: --- Description: We should probably treat warnings as failures in Jenkins.
Some examples:
{code}
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/util/FileAppenderSuite.scala:138: abstract type ExpectedAppender is unchecked since it is eliminated by erasure
[warn] assert(appender.isInstanceOf[ExpectedAppender])
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/util/FileAppenderSuite.scala:143: abstract type ExpectedRollingPolicy is unchecked since it is eliminated by erasure
[warn] rollingPolicy.isInstanceOf[ExpectedRollingPolicy]
[warn] ^
{code}
{code}
[warn] /scratch/rxin/spark/streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala:386: method connect in class IOManager is deprecated: use the new implementation in package akka.io instead
[warn] override def preStart = IOManager(context.system).connect(new InetSocketAddress(port))
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:207: non-variable type argument String in type pattern Map[String,Any] is unchecked since it is eliminated by erasure
[warn] case (key: String, struct: Map[String, Any]) => {
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:238: non-variable type argument String in type pattern java.util.Map[String,Object] is unchecked since it is eliminated by erasure
[warn] case map: java.util.Map[String, Object] =>
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:243: non-variable type argument Object in type pattern java.util.List[Object] is unchecked since it is eliminated by erasure
[warn] case list: java.util.List[Object] =>
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:323: non-variable type argument String in type pattern Map[String,Any] is unchecked since it is eliminated by erasure
[warn] case value: Map[String, Any] => toJsonObjectString(value)
[warn] ^
[info] Compiling 2 Scala sources to /scratch/rxin/spark/repl/target/scala-2.10/test-classes...
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala:382: method mapWith in class RDD is deprecated: use mapPartitionsWithIndex
[warn] val randoms = ones.mapWith(
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala:400: method flatMapWith in class RDD is deprecated: use mapPartitionsWithIndex and flatMap
[warn] val randoms = ones.flatMapWith(
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala:421: method filterWith in class RDD is deprecated: use mapPartitionsWithIndex and filter
[warn] val sample = ints.filterWith(
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/serializer/ProactiveClosureSerializationSuite.scala:76: method mapWith in class RDD is deprecated: use mapPartitionsWithIndex
[warn] x.mapWith(x => x.toString)((x,y) => x + uc.op(y))
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/serializer/ProactiveClosureSerializationSuite.scala:82: method filterWith in class RDD is deprecated: use mapPartitionsWithIndex and filter
[warn] x.filterWith(x => x.toString)((x,y) => uc.pred(y))
[warn] ^
[warn] /scratch/rxin/spark/core/src/test/scala/org/apache/spark/util/VectorSuite.scala:29: class Vector in package util is deprecated: Use Vectors.dense from Spark's mllib.linalg package instead.
[warn] def verifyVector(vector: Vector, expectedLength: Int) = {
[warn] ^
[warn] one warning found
{code}
{code}
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:238: non-variable type argument String in type pattern java.util.Map[String,Object] is unchecked since it is eliminated by erasure
[warn] case map: java.util.Map[String, Object] =>
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:243: non-variable type argument Object in type pattern java.util.List[Object] is unchecked since it is eliminated by erasure
[warn] case list: java.util.List[Object] =>
[warn] ^
[warn] /scratch/rxin/spark/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala:323: non-variable type argument String in type pattern Map[String,Any] is unchecked since
{code}
[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support
[ https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063247#comment-14063247 ] Chris Fregly commented on SPARK-1981: - PR: https://github.com/apache/spark/pull/1434 Add AWS Kinesis streaming support - Key: SPARK-1981 URL: https://issues.apache.org/jira/browse/SPARK-1981 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Chris Fregly Assignee: Chris Fregly Add AWS Kinesis support to Spark Streaming. Initial discussion occurred here: https://github.com/apache/spark/pull/223 I discussed this with Parviz from AWS recently and we agreed that I would take this over. Look for a new PR that takes into account all the feedback from the earlier PR including spark-1.0-compliant implementation, AWS-license-aware build support, tests, comments, and style guide compliance. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1761) Add broadcast information on SparkUI storage tab
[ https://issues.apache.org/jira/browse/SPARK-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063250#comment-14063250 ] Reynold Xin commented on SPARK-1761: This would be very useful actually. Add broadcast information on SparkUI storage tab Key: SPARK-1761 URL: https://issues.apache.org/jira/browse/SPARK-1761 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Or Fix For: 1.1.0 It would be nice to know where the broadcast blocks are persisted. More details coming. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2274) spark SQL query hang up sometimes
[ https://issues.apache.org/jira/browse/SPARK-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2274: --- Component/s: SQL spark SQL query hang up sometimes - Key: SPARK-2274 URL: https://issues.apache.org/jira/browse/SPARK-2274 Project: Spark Issue Type: Question Components: SQL Environment: spark 1.0.0 Reporter: jackielihf
When I run a Spark SQL query, it hangs sometimes:
1) a simple SQL query works, such as select * from a left outer join b on a.id=b.id
2) BUT if it has more joins, such as select * from a left outer join b on a.id=b.id left outer join c on a.id=c.id..., the spark shell seems to hang up.
spark shell prints:
scala> hc.hql("select A.id,A.tit,A.sub_tit,B.abst,B.cont,A.aut,A.com_name,A.med_name,A.pub_dt,A.upd_time,A.ent_time,A.info_lvl,A.is_pic,A.lnk_addr,A.is_ann,A.info_open_lvl,A.keyw_name,C.typ_code as type,D.evt_dir,D.evt_st,E.secu_id,E.typ_codeii,E.exch_code,E.trd_code,E.secu_sht from txt_nws_bas_update A left outer join txt_nws_bas_txt B on A.id=B.orig_id left outer join txt_nws_typ C on A.id=C.orig_id left outer join txt_nws_secu D on A.id=D.orig_id left outer join bas_secu_info E on D.secu_id=E.secu_id where D.secu_id is not null limit 5").foreach(println)
14/06/25 13:32:25 INFO ParseDriver: Parsing command: select A.id,A.tit,A.sub_tit,B.abst,B.cont,A.aut,A.com_name,A.med_name,A.pub_dt,A.upd_time,A.ent_time,A.info_lvl,A.is_pic,A.lnk_addr,A.is_ann,A.info_open_lvl,A.keyw_name,C.typ_code as type,D.evt_dir,D.evt_st,E.secu_id,E.typ_codeii,E.exch_code,E.trd_code,E.secu_sht from txt_nws_bas_update A left outer join txt_nws_bas_txt B on A.id=B.orig_id left outer join txt_nws_typ C on A.id=C.orig_id left outer join txt_nws_secu D on A.id=D.orig_id left outer join bas_secu_info E on D.secu_id=E.secu_id where D.secu_id is not null limit 5
14/06/25 13:32:25 INFO ParseDriver: Parse Completed
14/06/25 13:32:25 INFO Analyzer: Max iterations (2) reached for batch MultiInstanceRelations
14/06/25 13:32:25 INFO Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences
14/06/25 13:32:27 INFO MemoryStore: ensureFreeSpace(220923) called with curMem=0, maxMem=311387750
14/06/25 13:32:27 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 215.7 KB, free 296.8 MB)
14/06/25 13:32:27 INFO MemoryStore: ensureFreeSpace(220971) called with curMem=220923, maxMem=311387750
14/06/25 13:32:27 INFO MemoryStore: Block broadcast_1 stored as values to memory (estimated size 215.8 KB, free 296.5 MB)
14/06/25 13:32:28 INFO MemoryStore: ensureFreeSpace(220971) called with curMem=441894, maxMem=311387750
14/06/25 13:32:28 INFO MemoryStore: Block broadcast_2 stored as values to memory (estimated size 215.8 KB, free 296.3 MB)
14/06/25 13:32:28 INFO MemoryStore: ensureFreeSpace(220971) called with curMem=662865, maxMem=311387750
14/06/25 13:32:28 INFO MemoryStore: Block broadcast_3 stored as values to memory (estimated size 215.8 KB, free 296.1 MB)
14/06/25 13:32:28 INFO MemoryStore: ensureFreeSpace(220971) called with curMem=883836, maxMem=311387750
14/06/25 13:32:28 INFO MemoryStore: Block broadcast_4 stored as values to memory (estimated size 215.8 KB, free 295.9 MB)
14/06/25 13:32:29 INFO SQLContext$$anon$1: Max iterations (2) reached for batch Add exchange
14/06/25 13:32:29 INFO SQLContext$$anon$1: Max iterations (2) reached for batch Prepare Expressions
14/06/25 13:32:30 INFO FileInputFormat: Total input paths to process : 1
14/06/25 13:32:30 INFO SparkContext: Starting job: collect at joins.scala:184
14/06/25 13:32:30 INFO DAGScheduler: Got job 0 (collect at joins.scala:184) with 2 output partitions (allowLocal=false)
14/06/25 13:32:30 INFO DAGScheduler: Final stage: Stage 0(collect at joins.scala:184)
14/06/25 13:32:30 INFO DAGScheduler: Parents of final stage: List()
14/06/25 13:32:30 INFO DAGScheduler: Missing parents: List()
14/06/25 13:32:30 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[7] at map at joins.scala:184), which has no missing parents
14/06/25 13:32:30 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[7] at map at joins.scala:184)
14/06/25 13:32:30 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/06/25 13:32:30 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor 0: 192.168.56.100 (PROCESS_LOCAL)
14/06/25 13:32:30 INFO TaskSetManager: Serialized task 0.0:0 as 4088 bytes in 3 ms
14/06/25 13:32:30 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on executor 1: 192.168.56.100 (PROCESS_LOCAL)
14/06/25 13:32:30 INFO TaskSetManager: Serialized task 0.0:1 as 4088 bytes in 2 ms
14/06/25 13:32:44 INFO BlockManagerInfo: Added taskresult_1 in memory on 192.168.56.100:47102 (size: 15.0 MB, free: 281.9 MB)
14/06/25
[jira] [Updated] (SPARK-2274) spark SQL query hang up sometimes
[ https://issues.apache.org/jira/browse/SPARK-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2274: --- Assignee: Michael Armbrust spark SQL query hang up sometimes - Key: SPARK-2274 URL: https://issues.apache.org/jira/browse/SPARK-2274 Project: Spark Issue Type: Question Components: SQL Environment: spark 1.0.0 Reporter: jackielihf Assignee: Michael Armbrust
When I run a Spark SQL query, it hangs sometimes:
1) a simple SQL query works, such as select * from a left outer join b on a.id=b.id
2) BUT if it has more joins, such as select * from a left outer join b on a.id=b.id left outer join c on a.id=c.id..., the spark shell seems to hang up.
spark shell prints:
scala> hc.hql("select A.id,A.tit,A.sub_tit,B.abst,B.cont,A.aut,A.com_name,A.med_name,A.pub_dt,A.upd_time,A.ent_time,A.info_lvl,A.is_pic,A.lnk_addr,A.is_ann,A.info_open_lvl,A.keyw_name,C.typ_code as type,D.evt_dir,D.evt_st,E.secu_id,E.typ_codeii,E.exch_code,E.trd_code,E.secu_sht from txt_nws_bas_update A left outer join txt_nws_bas_txt B on A.id=B.orig_id left outer join txt_nws_typ C on A.id=C.orig_id left outer join txt_nws_secu D on A.id=D.orig_id left outer join bas_secu_info E on D.secu_id=E.secu_id where D.secu_id is not null limit 5").foreach(println)
14/06/25 13:32:25 INFO ParseDriver: Parsing command: select A.id,A.tit,A.sub_tit,B.abst,B.cont,A.aut,A.com_name,A.med_name,A.pub_dt,A.upd_time,A.ent_time,A.info_lvl,A.is_pic,A.lnk_addr,A.is_ann,A.info_open_lvl,A.keyw_name,C.typ_code as type,D.evt_dir,D.evt_st,E.secu_id,E.typ_codeii,E.exch_code,E.trd_code,E.secu_sht from txt_nws_bas_update A left outer join txt_nws_bas_txt B on A.id=B.orig_id left outer join txt_nws_typ C on A.id=C.orig_id left outer join txt_nws_secu D on A.id=D.orig_id left outer join bas_secu_info E on D.secu_id=E.secu_id where D.secu_id is not null limit 5
14/06/25 13:32:25 INFO ParseDriver: Parse Completed
14/06/25 13:32:25 INFO Analyzer: Max iterations (2) reached for batch MultiInstanceRelations
14/06/25 13:32:25 INFO Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences
14/06/25 13:32:27 INFO MemoryStore: ensureFreeSpace(220923) called with curMem=0, maxMem=311387750
14/06/25 13:32:27 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 215.7 KB, free 296.8 MB)
14/06/25 13:32:27 INFO MemoryStore: ensureFreeSpace(220971) called with curMem=220923, maxMem=311387750
14/06/25 13:32:27 INFO MemoryStore: Block broadcast_1 stored as values to memory (estimated size 215.8 KB, free 296.5 MB)
14/06/25 13:32:28 INFO MemoryStore: ensureFreeSpace(220971) called with curMem=441894, maxMem=311387750
14/06/25 13:32:28 INFO MemoryStore: Block broadcast_2 stored as values to memory (estimated size 215.8 KB, free 296.3 MB)
14/06/25 13:32:28 INFO MemoryStore: ensureFreeSpace(220971) called with curMem=662865, maxMem=311387750
14/06/25 13:32:28 INFO MemoryStore: Block broadcast_3 stored as values to memory (estimated size 215.8 KB, free 296.1 MB)
14/06/25 13:32:28 INFO MemoryStore: ensureFreeSpace(220971) called with curMem=883836, maxMem=311387750
14/06/25 13:32:28 INFO MemoryStore: Block broadcast_4 stored as values to memory (estimated size 215.8 KB, free 295.9 MB)
14/06/25 13:32:29 INFO SQLContext$$anon$1: Max iterations (2) reached for batch Add exchange
14/06/25 13:32:29 INFO SQLContext$$anon$1: Max iterations (2) reached for batch Prepare Expressions
14/06/25 13:32:30 INFO FileInputFormat: Total input paths to process : 1
14/06/25 13:32:30 INFO SparkContext: Starting job: collect at joins.scala:184
14/06/25 13:32:30 INFO DAGScheduler: Got job 0 (collect at joins.scala:184) with 2 output partitions (allowLocal=false)
14/06/25 13:32:30 INFO DAGScheduler: Final stage: Stage 0(collect at joins.scala:184)
14/06/25 13:32:30 INFO DAGScheduler: Parents of final stage: List()
14/06/25 13:32:30 INFO DAGScheduler: Missing parents: List()
14/06/25 13:32:30 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[7] at map at joins.scala:184), which has no missing parents
14/06/25 13:32:30 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[7] at map at joins.scala:184)
14/06/25 13:32:30 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/06/25 13:32:30 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor 0: 192.168.56.100 (PROCESS_LOCAL)
14/06/25 13:32:30 INFO TaskSetManager: Serialized task 0.0:0 as 4088 bytes in 3 ms
14/06/25 13:32:30 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on executor 1: 192.168.56.100 (PROCESS_LOCAL)
14/06/25 13:32:30 INFO TaskSetManager: Serialized task 0.0:1 as 4088 bytes in 2 ms
14/06/25 13:32:44 INFO BlockManagerInfo: Added taskresult_1 in memory on
[jira] [Commented] (SPARK-2519) Eliminate pattern-matching on Tuple2 in performance-critical aggregation code
[ https://issues.apache.org/jira/browse/SPARK-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063251#comment-14063251 ] Sandy Ryza commented on SPARK-2519: --- https://github.com/apache/spark/pull/1435 Eliminate pattern-matching on Tuple2 in performance-critical aggregation code - Key: SPARK-2519 URL: https://issues.apache.org/jira/browse/SPARK-2519 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252)
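To illustrate what the sub-task targets: destructuring a Tuple2 per record in a hot loop can cost an extra closure call and match per element, depending on how scalac compiles it. A hedged before/after sketch (hypothetical loop, not the actual Aggregator code):
{code}
// Before: pattern match per record.
def sumByMatch(iter: Iterator[(String, Int)]): Int = {
  var total = 0
  iter.foreach { case (_, v) => total += v }
  total
}

// After: a while loop reading the fields directly keeps the hot path
// to plain _2 accessor calls.
def sumByAccessor(iter: Iterator[(String, Int)]): Int = {
  var total = 0
  while (iter.hasNext) {
    total += iter.next()._2
  }
  total
}
{code}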
[jira] [Updated] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.
[ https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2433: - Target Version/s: 0.9.2 In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug. Key: SPARK-2433 URL: https://issues.apache.org/jira/browse/SPARK-2433 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 0.9.1 Environment: Any Reporter: Rahul K Bhojwani Labels: easyfix, test Original Estimate: 1h Remaining Estimate: 1h
Don't have much experience with reporting errors. This is the first time. If something is not clear please feel free to contact me (details given below).
In the pyspark mllib library. Path : \spark-0.9.1\python\pyspark\mllib\classification.py
Class: NaiveBayesModel
Method: self.predict
Earlier Implementation:
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
New Implementation:
No:1
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
No:2
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + dot(x,self.theta.T))
Explanation: No:1 is correct according to me. Don't know about No:2.
Error one: The matrix self.theta is of dimension [n_classes, n_features], while the matrix x is of dimension [1, n_features]. Taking the dot will not work as it's [1, n_features] x [n_classes, n_features]. It will always give the error: ValueError: matrices are not aligned. In the commented example given in classification.py, n_classes = n_features = 2; that's why there is no error. Both Implementation no.1 and Implementation no.2 take care of it.
Error 2: The basic implementation of naive bayes is: P(class_n | sample) = count_feature_1 * P(feature_1 | class_n) * ... * count_feature_n * P(feature_n | class_n) * P(class_n) / (THE CONSTANT P(SAMPLE)), taking the class with max value. That's what implementation 1 is doing. Implementation 2 is basically the class with max value: (exp(count_feature_1) * P(feature_1 | class_n) * ... * exp(count_feature_n) * P(feature_n | class_n) * P(class_n)). Don't know if it gives the exact result.
Thanks
Rahul Bhojwani
rahulbhojwani2...@gmail.com
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2521) Broadcast RDD object once per TaskSet (instead of sending it for every task)
[ https://issues.apache.org/jira/browse/SPARK-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063256#comment-14063256 ] Reynold Xin commented on SPARK-2521: cc [~matei] Broadcast RDD object once per TaskSet (instead of sending it for every task) Key: SPARK-2521 URL: https://issues.apache.org/jira/browse/SPARK-2521 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin This can substantially reduce task size, as well as being able to support very large closures (e.g. closures that reference large variables). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2521) Broadcast RDD object once per TaskSet (instead of sending it for every task)
[ https://issues.apache.org/jira/browse/SPARK-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2521: --- Component/s: Spark Core Broadcast RDD object once per TaskSet (instead of sending it for every task) Key: SPARK-2521 URL: https://issues.apache.org/jira/browse/SPARK-2521 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin This can substantially reduce task size, as well as being able to support very large closures (e.g. closures that reference large variables). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2521) Broadcast RDD object once per TaskSet (instead of sending it for every task)
[ https://issues.apache.org/jira/browse/SPARK-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2521: --- Description: This can substantially reduce task size, as well as being able to support very large closures (e.g. closures that reference large variables). Once this is in, we can also remove broadcasting the Hadoop JobConf. was: This can substantially reduce task size, as well as being able to support very large closures (e.g. closures that reference large variables). Broadcast RDD object once per TaskSet (instead of sending it for every task) Key: SPARK-2521 URL: https://issues.apache.org/jira/browse/SPARK-2521 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin This can substantially reduce task size, as well as being able to support very large closures (e.g. closures that reference large variables). Once this is in, we can also remove broadcasting the Hadoop JobConf. -- This message was sent by Atlassian JIRA (v6.2#6252)
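The mechanism is the same one users already see with broadcast variables: ship a large object once per node and reference it by handle, instead of serializing it into every task closure. A minimal sketch, assuming a running SparkContext sc and a hypothetical loadBigTable() helper:
{code}
val lookup: Map[String, Int] = loadBigTable()   // hypothetical large object
val lookupBc = sc.broadcast(lookup)             // shipped once, not per task

val resolved = sc.textFile("hdfs:///input")
  .map(line => lookupBc.value.getOrElse(line, -1))  // tasks read via the handle
{code}
SPARK-2521 applies the same idea to the RDD object and closure themselves, broadcasting them once per TaskSet.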
[jira] [Commented] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.
[ https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063268#comment-14063268 ] Xiangrui Meng commented on SPARK-2433: -- [~rahul1993] Thanks for reporting this bug! It was fixed in branch-1.0 but not in branch-0.9. So you can send a PR similar to https://github.com/apache/spark/pull/463 to branch-0.9, following https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark . I'll try to cut a release candidate for v0.9.2 tomorrow. So if you don't have time to send the PR, I may have to do that myself. In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug. Key: SPARK-2433 URL: https://issues.apache.org/jira/browse/SPARK-2433 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 0.9.1 Environment: Any Reporter: Rahul K Bhojwani Labels: easyfix, test Original Estimate: 1h Remaining Estimate: 1h
Don't have much experience with reporting errors. This is the first time. If something is not clear please feel free to contact me (details given below).
In the pyspark mllib library. Path : \spark-0.9.1\python\pyspark\mllib\classification.py
Class: NaiveBayesModel
Method: self.predict
Earlier Implementation:
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
New Implementation:
No:1
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta),x)))
No:2
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + dot(x,self.theta.T))
Explanation: No:1 is correct according to me. Don't know about No:2.
Error one: The matrix self.theta is of dimension [n_classes, n_features], while the matrix x is of dimension [1, n_features]. Taking the dot will not work as it's [1, n_features] x [n_classes, n_features]. It will always give the error: ValueError: matrices are not aligned. In the commented example given in classification.py, n_classes = n_features = 2; that's why there is no error. Both Implementation no.1 and Implementation no.2 take care of it.
Error 2: The basic implementation of naive bayes is: P(class_n | sample) = count_feature_1 * P(feature_1 | class_n) * ... * count_feature_n * P(feature_n | class_n) * P(class_n) / (THE CONSTANT P(SAMPLE)), taking the class with max value. That's what implementation 1 is doing. Implementation 2 is basically the class with max value: (exp(count_feature_1) * P(feature_1 | class_n) * ... * exp(count_feature_n) * P(feature_n | class_n) * P(class_n)). Don't know if it gives the exact result.
Thanks
Rahul Bhojwani
rahulbhojwani2...@gmail.com
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2522) Use TorrentBroadcastFactory as the default broadcast factory
Xiangrui Meng created SPARK-2522: Summary: Use TorrentBroadcastFactory as the default broadcast factory Key: SPARK-2522 URL: https://issues.apache.org/jira/browse/SPARK-2522 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Xiangrui Meng Assignee: Xiangrui Meng HttpBroadcastFactory is the current default broadcast factory. It sends the broadcast data to each worker one by one, which is slow when the cluster is big. TorrentBroadcastFactory scales much better than http. Maybe we should make torrent the default broadcast method. -- This message was sent by Atlassian JIRA (v6.2#6252)
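For anyone who wants torrent behavior before the default changes, the factory is selectable through SparkConf; a sketch using the Spark 1.x configuration key:
{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("torrent-broadcast-demo")
  // Replace the HTTP default with the BitTorrent-like implementation.
  .set("spark.broadcast.factory", "org.apache.spark.broadcast.TorrentBroadcastFactory")
val sc = new SparkContext(conf)
{code}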
[jira] [Created] (SPARK-2523) Potential Bugs if SerDe is not the identical among partitions and table
Cheng Hao created SPARK-2523: Summary: Potential Bugs if SerDe is not the identical among partitions and table Key: SPARK-2523 URL: https://issues.apache.org/jira/browse/SPARK-2523 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao In HiveTableScan.scala, ObjectInspector was created for all of the partition based records, which probably causes ClassCastException if the object inspector is not identical among table partitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
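The direction of the fix can be sketched as deriving the deserializer, and hence the ObjectInspector, per partition instead of once from the table. A hedged sketch against the Hive metadata API (the method shapes and surrounding plumbing are assumptions, not the actual HiveTableScan code):
{code}
import org.apache.hadoop.hive.ql.metadata.Partition
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector

// Each partition may carry its own SerDe, so the inspector must come from
// the partition's deserializer rather than the table's.
def inspectors(partitions: Seq[Partition]): Seq[ObjectInspector] =
  partitions.map(_.getDeserializer.getObjectInspector)
{code}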
[jira] [Commented] (SPARK-2523) Potential Bugs if SerDe is not the identical among partitions and table
[ https://issues.apache.org/jira/browse/SPARK-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063288#comment-14063288 ] Cheng Hao commented on SPARK-2523: -- This is the follow up for https://github.com/apache/spark/pull/1408 https://github.com/apache/spark/pull/1390 Potential Bugs if SerDe is not the identical among partitions and table --- Key: SPARK-2523 URL: https://issues.apache.org/jira/browse/SPARK-2523 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao In HiveTableScan.scala, ObjectInspector was created for all of the partition based records, which probably causes ClassCastException if the object inspector is not identical among table partitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2523) Potential Bugs if SerDe is not the identical among partitions and table
[ https://issues.apache.org/jira/browse/SPARK-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063288#comment-14063288 ] Cheng Hao edited comment on SPARK-2523 at 7/16/14 8:42 AM: --- This is the follow up for https://github.com/apache/spark/pull/1408 https://github.com/apache/spark/pull/1390 The PR is https://github.com/apache/spark/pull/1439 was (Author: chenghao): This is the follow up for https://github.com/apache/spark/pull/1408 https://github.com/apache/spark/pull/1390 Potential Bugs if SerDe is not the identical among partitions and table --- Key: SPARK-2523 URL: https://issues.apache.org/jira/browse/SPARK-2523 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao In HiveTableScan.scala, ObjectInspector was created for all of the partition based records, which probably causes ClassCastException if the object inspector is not identical among table partitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2523) Potential Bugs if SerDe is not the identical among partitions and table
[ https://issues.apache.org/jira/browse/SPARK-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063289#comment-14063289 ] Cheng Hao commented on SPARK-2523: -- [~yhuai] Can you review the code for me? Potential Bugs if SerDe is not the identical among partitions and table --- Key: SPARK-2523 URL: https://issues.apache.org/jira/browse/SPARK-2523 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao In HiveTableScan.scala, ObjectInspector was created for all of the partition based records, which probably causes ClassCastException if the object inspector is not identical among table partitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2520) the executor is thrown java.io.StreamCorruptedException
[ https://issues.apache.org/jira/browse/SPARK-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-2520: --- Description: This issue occurs with a very small probability. I can not reproduce it. The executor log:
{code}
14/07/15 21:54:50 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to spark@sanshan:34429
14/07/15 21:54:50 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to spark@sanshan:31934
14/07/15 21:54:50 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to spark@sanshan:30557
14/07/15 21:54:50 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to spark@sanshan:42606
14/07/15 21:54:50 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to spark@sanshan:37314
14/07/15 21:54:50 INFO scheduler.TaskSetManager: Starting task 0.0:166 as TID 4948 on executor 20: tuan221 (PROCESS_LOCAL)
14/07/15 21:54:50 INFO scheduler.TaskSetManager: Serialized task 0.0:166 as 3129 bytes in 1 ms
14/07/15 21:54:50 WARN scheduler.TaskSetManager: Lost TID 4868 (task 0.0:86)
14/07/15 21:54:50 WARN scheduler.TaskSetManager: Loss was due to java.io.StreamCorruptedException
java.io.StreamCorruptedException: invalid type code: AC
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1377)
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
  at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
  at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
  at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
  at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
  at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:87)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$3.apply(PairRDDFunctions.scala:101)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$3.apply(PairRDDFunctions.scala:100)
  at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
  at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
  at org.apache.spark.scheduler.Task.run(Task.scala:51)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:744)
14/07/15 21:54:50 INFO scheduler.TaskSetManager: Starting task 0.0:86 as TID 4949 on executor 20: tuan221 (PROCESS_LOCAL)
14/07/15 21:54:50 INFO scheduler.TaskSetManager: Serialized task 0.0:86 as 3129 bytes in 0 ms
14/07/15 21:54:50 WARN scheduler.TaskSetManager: Lost TID 4785 (task 0.0:3)
{code}
was: This issue occurs with a very small probability. I can not reproduce it. The executor log:
{code}
java.io.StreamCorruptedException: invalid type code: AC
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1377)
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
  at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
  at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
  at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
  at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
  at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:87)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$3.apply(PairRDDFunctions.scala:101)
  at org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$3.apply(PairRDDFunctions.scala:100)
  at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
  at
{code}
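As general background (not a confirmed diagnosis of this report): invalid type code: AC classically means the reader hit a second serialization-stream header (bytes 0xAC 0xED) mid-stream, which happens when more than one ObjectOutputStream was opened over the same underlying stream. A minimal standalone reproduction:
{code}
import java.io._

object StreamCorruptedDemo extends App {
  val bytes = new ByteArrayOutputStream()

  // Each ObjectOutputStream writes its own stream header (0xAC 0xED).
  new ObjectOutputStream(bytes).writeObject("first")
  new ObjectOutputStream(bytes).writeObject("second") // second header mid-stream

  val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
  println(in.readObject()) // "first"
  println(in.readObject()) // throws StreamCorruptedException: invalid type code: AC
}
{code}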
[jira] [Commented] (SPARK-2465) Use long as user / item ID for ALS
[ https://issues.apache.org/jira/browse/SPARK-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063344#comment-14063344 ] Sean Owen commented on SPARK-2465: -- Yeah that's a good separate point. My hunch is that serialization would remove much of the difference. I think this may need to be closed for now as there's not quite support; I'll comment separately on the PR. Use long as user / item ID for ALS -- Key: SPARK-2465 URL: https://issues.apache.org/jira/browse/SPARK-2465 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.1 Reporter: Sean Owen Priority: Minor Attachments: ALS using MEMORY_AND_DISK.png, ALS using MEMORY_AND_DISK_SER.png, Screen Shot 2014-07-13 at 8.49.40 PM.png I'd like to float this for consideration: use longs instead of ints for user and product IDs in the ALS implementation. The main reason is that identifiers are not generally numeric at all, and will be hashed to an integer. (This is a separate issue.) Hashing to 32 bits means collisions are likely after hundreds of thousands of users and items, which is not unrealistic. Hashing to 64 bits pushes this back to billions. It would also mean numeric IDs that happen to be larger than the largest int can be used directly as identifiers. On the downside of course: 8 bytes instead of 4 bytes of memory used per Rating. Thoughts? I will post a PR so as to show what the change would be. -- This message was sent by Atlassian JIRA (v6.2#6252)
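The collision claim is just the birthday bound, P(collision) ≈ 1 − exp(−n² / 2d) for n IDs hashed into a space of d values; a quick sketch to check the orders of magnitude:
{code}
object HashCollisionEstimate extends App {
  // Birthday approximation for n random IDs in a space of 2^bits values.
  def pCollision(n: Double, bits: Int): Double =
    1.0 - math.exp(-n * n / (2.0 * math.pow(2.0, bits)))

  println(pCollision(500000, 32)) // ~1.0: collisions near-certain in 32 bits
  println(pCollision(500000, 64)) // ~7e-9: negligible in 64 bits
}
{code}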
[jira] [Commented] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop
[ https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063350#comment-14063350 ] Kostiantyn Kudriavtsev commented on SPARK-2356: --- and the use case when I got this exception - I didn't touch hadoop at all. My code works only with local files, not HDFS! It was very strange to get stuck in this kind of issue. I believe it must be marked as critical and fixed asap! Exception: Could not locate executable null\bin\winutils.exe in the Hadoop --- Key: SPARK-2356 URL: https://issues.apache.org/jira/browse/SPARK-2356 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Kostiantyn Kudriavtsev
I'm trying to run some transformation on Spark; it works fine on a cluster (YARN, linux machines). However, when I try to run it on a local machine (Windows 7) under a unit test, I get errors (I don't use Hadoop, I read files from the local filesystem):
14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
  at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
  at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
  at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
  at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
  at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
  at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
  at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
  at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
  at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
  at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
  at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
  at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:97)
It happens because the Hadoop config is initialised each time a Spark context is created, regardless of whether Hadoop is required or not. I propose to add some special flag to indicate whether the Hadoop config is required (or to start this configuration manually). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop
[ https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kostiantyn Kudriavtsev updated SPARK-2356: -- Priority: Critical (was: Major) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop --- Key: SPARK-2356 URL: https://issues.apache.org/jira/browse/SPARK-2356 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Kostiantyn Kudriavtsev Priority: Critical
I'm trying to run some transformation on Spark; it works fine on a cluster (YARN, linux machines). However, when I try to run it on a local machine (Windows 7) under a unit test, I get errors (I don't use Hadoop, I read files from the local filesystem):
14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
  at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
  at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
  at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
  at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
  at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
  at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
  at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
  at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
  at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
  at org.apache.spark.deploy.SparkHadoopUtil.<init>(SparkHadoopUtil.scala:36)
  at org.apache.spark.deploy.SparkHadoopUtil$.<init>(SparkHadoopUtil.scala:109)
  at org.apache.spark.deploy.SparkHadoopUtil$.<clinit>(SparkHadoopUtil.scala)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:228)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:97)
It happens because the Hadoop config is initialised each time a Spark context is created, regardless of whether Hadoop is required or not. I propose to add some special flag to indicate whether the Hadoop config is required (or to start this configuration manually). -- This message was sent by Atlassian JIRA (v6.2#6252)
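A common workaround, assuming the usual Hadoop client behavior (Shell consults the hadoop.home.dir system property, then HADOOP_HOME) rather than any Spark-side fix: point Hadoop at a directory containing bin\winutils.exe before the context is created. A sketch with a hypothetical local path:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// C:\hadoop is a placeholder; the directory must contain bin\winutils.exe.
System.setProperty("hadoop.home.dir", "C:\\hadoop")

val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("windows-local-test"))
{code}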
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063380#comment-14063380 ] Sean Owen commented on SPARK-2420: -- I think Jetty is the only actual issue here. Guava actually won't be an issue if it's set to 11 in Spark, or downstream. (Spark does not use anything in Guava 12+, or else it wouldn't work in some Hadoop contexts.) Servlet 3.0 really truly does work for all of this. The trick is actually removing all the other copies of Servlet 2.5! However, Jetty versions could be a real stumbling block. I'd like to focus on that. There are basically two namespaces that Jetty uses, from its older incarnation and newer versions. You have to harmonize both. What does Hive need vs what's in Spark? The reason I have some hope, Xuefu, is that obviously Spark already works in an assembly with Hive classes. However, it may not be quite the same version, and I know in one case hive-exec had to be shaded because it is not available as a non-assembly jar. There are some devils in the details. Having seen a lot of this first-hand here, I can try to help. Can this be elaborated with specific problems? Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because Spark build contains versions of libraries that's vastly different from current major Hadoop version. It would be nice if we can choose versions that's in line with Hadoop or shading them in the assembly. Here are the wish list: 1. Upgrade protobuf version to 2.5.0 from current 2.4.1 2. Shading Spark's jetty and servlet dependency in the assembly. 3. guava version difference. Spark is using a higher version. I'm not sure what's the best solution for this. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied on Spark in order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
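One way to express the Jetty/Guava relocation discussed above — hedged, since Spark's build of this era is Maven-based and this shows the later sbt-assembly API instead — is a pair of shade rules in build.sbt; the org.sparkproject target prefixes are illustrative:
{code}
assemblyShadeRules in assembly := Seq(
  // Relocate the conflicting packages inside the assembly jar.
  ShadeRule.rename("org.eclipse.jetty.**" -> "org.sparkproject.jetty.@1").inAll,
  ShadeRule.rename("com.google.common.**" -> "org.sparkproject.guava.@1").inAll
)
{code}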
[jira] [Commented] (SPARK-2454) Separate driver spark home from executor spark home
[ https://issues.apache.org/jira/browse/SPARK-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063402#comment-14063402 ] Nan Zhu commented on SPARK-2454: this will make sparkHome an application-specific parameter explicitly; I just thought it will confuse the user, since sparkHome is actually a global setup for all applications/executors run on the same machine. The good thing here is it can support users running applications on different versions of Spark sharing the same cluster (especially when you are doing Spark dev work). Separate driver spark home from executor spark home --- Key: SPARK-2454 URL: https://issues.apache.org/jira/browse/SPARK-2454 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 The driver may not always share the same directory structure as the executors. It makes little sense to always re-use the driver's spark home on the executors. https://github.com/apache/spark/pull/1244/ is an open effort to fix this. However, this still requires us to set SPARK_HOME on all the executor nodes. Really we should separate this out into something like spark.driver.home spark.executor.home rather than re-using SPARK_HOME everywhere. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063419#comment-14063419 ] Sean Owen commented on SPARK-2341: -- OK is it worth a pull request for changing the boolean multiclass argument to a string? I wanted to ask if that was your intent before I do that. libsvm format support is certainly important. It happens to have to encode non-numeric input as numbers. It need not be that way throughout MLlib, since it isn't that way in other input formats. (In this API method, it's pretty minor, since libsvm does by definition use this encoding.) So yes that would be great if data sets or API objects didn't assume that categorical data was numeric, but encoded type in the data set or even in the object model itself. I think it's mostly a design and type-safety argument -- same reason we have String instead of just byte[] everywhere. Sure I will have to build this conversion at some point anyway and can share the result then. loadLibSVMFile doesn't handle regression datasets - Key: SPARK-2341 URL: https://issues.apache.org/jira/browse/SPARK-2341 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Eustache Priority: Minor Labels: easyfix Many datasets exist in LibSVM format for regression tasks [1] but currently the loadLibSVMFile primitive doesn't handle regression datasets. More precisely, the LabelParser is either a MulticlassLabelParser or a BinaryLabelParser. What happens then is that the file is loaded but in multiclass mode: each target value is interpreted as a class name! The fix would be to write a RegressionLabelParser which converts target values to Double and plug it into the loadLibSVMFile routine. [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html -- This message was sent by Atlassian JIRA (v6.2#6252)
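The proposed fix is small; a hedged sketch modeled on the MLlib 1.0 label-parser hook (the trait shape is simplified here, not the exact MLlib source):
{code}
trait LabelParser extends Serializable {
  def parse(labelString: String): Double
}

// Binary parsing folds labels to {0, 1}; a regression parser must instead
// keep the raw numeric target.
object RegressionLabelParser extends LabelParser {
  def parse(labelString: String): Double = labelString.toDouble
}
{code}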
[jira] [Commented] (SPARK-2190) Specialized ColumnType for Timestamp
[ https://issues.apache.org/jira/browse/SPARK-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063431#comment-14063431 ] Cheng Lian commented on SPARK-2190: --- PR https://github.com/apache/spark/pull/1440 Specialized ColumnType for Timestamp Key: SPARK-2190 URL: https://issues.apache.org/jira/browse/SPARK-2190 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Critical I'm going to call this a bug since currently it's like 300X slower than it needs to be. -- This message was sent by Atlassian JIRA (v6.2#6252)
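To see why specialization helps: a generic column type boxes and serializes every value, while a specialized one writes a fixed binary layout. A hedged, self-contained sketch on a simplified trait (the real abstraction of the time lived under org.apache.spark.sql.columnar):
{code}
import java.nio.ByteBuffer
import java.sql.Timestamp

trait SimpleColumnType[T] {
  def append(v: T, buffer: ByteBuffer): Unit
  def extract(buffer: ByteBuffer): T
}

// millis + nanos: 12 fixed bytes per value, no generic serialization round-trip.
object TimestampColumnType extends SimpleColumnType[Timestamp] {
  def append(v: Timestamp, buffer: ByteBuffer): Unit = {
    buffer.putLong(v.getTime)
    buffer.putInt(v.getNanos)
  }
  def extract(buffer: ByteBuffer): Timestamp = {
    val ts = new Timestamp(buffer.getLong)
    ts.setNanos(buffer.getInt)
    ts
  }
}
{code}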
[jira] [Commented] (SPARK-953) Latent Dirichlet Association (LDA model)
[ https://issues.apache.org/jira/browse/SPARK-953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063447#comment-14063447 ] Masaki Rikitoku commented on SPARK-953: --- Parallel Gibbs sampling for LDA (PLDA) may be usable. Latent Dirichlet Allocation (LDA model) Key: SPARK-953 URL: https://issues.apache.org/jira/browse/SPARK-953 Project: Spark Issue Type: Story Components: Examples Affects Versions: 0.7.3 Reporter: caizhua Priority: Critical This code is for learning the LDA model. However, with an input of 2.5 M documents per machine and a dictionary with 1 words, running on EC2 m2.4xlarge instances with 68 G of memory each, it is really, really slow. For five iterations, the time costs are 8145, 24725, 51688, 58674, and 56850 seconds. The shuffling is quite slow. LDA.tbl is the simulated data set for the program, and it is quite fast. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2313) PySpark should accept port via a command line argument rather than STDIN
[ https://issues.apache.org/jira/browse/SPARK-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063454#comment-14063454 ] Matthew Farrellee commented on SPARK-2313: -- As this stands, having another communication mechanism for py4j that can be controlled by the parent is the proper solution. Using something like a domain socket may also assist in the return path from py4j (tmp file). FYI, a recent change pushed all existing output to stderr in the spark-class/spark-submit path. I'm not actively working on this. PySpark should accept port via a command line argument rather than STDIN Key: SPARK-2313 URL: https://issues.apache.org/jira/browse/SPARK-2313 Project: Spark Issue Type: Bug Components: PySpark Reporter: Patrick Wendell Relying on stdin is a brittle mechanism and has broken several times in the past. From what I can tell this is used only to bootstrap worker.py one time. It would be strictly simpler to just pass it as a command line argument. -- This message was sent by Atlassian JIRA (v6.2#6252)
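A generic sketch of the two handoffs being contrasted, written from the parent (JVM) side; this is not Spark's actual launch code, and "worker.py" plus the port value are just placeholders:
{code}
// Proposal: the child reads the port from its command line at startup.
val port = 51234
val viaArgv = new ProcessBuilder("python", "worker.py", port.toString).start()

// Status quo the description calls brittle: write the port to the child's
// stdin after launch and hope nothing else interferes with the stream.
val viaStdin = new ProcessBuilder("python", "worker.py").start()
viaStdin.getOutputStream.write((port + "\n").getBytes("UTF-8"))
viaStdin.getOutputStream.flush()
{code}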
[jira] [Commented] (SPARK-2443) Reading from Partitioned Tables is Slow
[ https://issues.apache.org/jira/browse/SPARK-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063455#comment-14063455 ] Teng Qiu commented on SPARK-2443: - Hi, how can you access a parquet table using HiveContext / hql in Spark? I tried adding the parquet-hive jar or parquet-hive-bundle jar to driver-class-path and spark.executor.extraClassPath, but it doesn't work. CDH 5.0.2, Hive 0.12.0-cdh5. Thanks Reading from Partitioned Tables is Slow --- Key: SPARK-2443 URL: https://issues.apache.org/jira/browse/SPARK-2443 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Zongheng Yang Fix For: 1.1.0, 1.0.2 Here are some numbers; all queries return ~20 million:
{code}
SELECT COUNT(*) FROM non partitioned table                                5.496467726 s
SELECT COUNT(*) FROM partitioned table stored in parquet                  50.26947 s
SELECT COUNT(*) FROM same table as previous but loaded with parquetFile
  instead of through hive                                                 2 s
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2111) pyspark errors when SPARK_PRINT_LAUNCH_COMMAND=1
[ https://issues.apache.org/jira/browse/SPARK-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063457#comment-14063457 ] Matthew Farrellee commented on SPARK-2111: -- this was resolved by https://github.com/apache/spark/pull/1050 pyspark errors when SPARK_PRINT_LAUNCH_COMMAND=1 Key: SPARK-2111 URL: https://issues.apache.org/jira/browse/SPARK-2111 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Thomas Graves If you set SPARK_PRINT_LAUNCH_COMMAND=1 to see what java command is being used to launch spark and then try to run pyspark, it errors out with a very non-useful error message:
Traceback (most recent call last):
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/shell.py", line 43, in <module>
    sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/context.py", line 94, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway)
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/context.py", line 184, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway()
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/java_gateway.py", line 51, in launch_gateway
    gateway_port = int(proc.stdout.readline())
ValueError: invalid literal for int() with base 10: 'Spark Command: /home/gs/java/jdk/bin/java -cp :/home/gs/hadoop/current/share/hadoop/common/hadoop-gpl-compression.jar:/home/gs/hadoop/current/share/hadoop/hdfs/lib/YahooDNSToSwitchMapping-0.2.14020207'
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2111) pyspark errors when SPARK_PRINT_LAUNCH_COMMAND=1
[ https://issues.apache.org/jira/browse/SPARK-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063462#comment-14063462 ] Matthew Farrellee commented on SPARK-2111: -- [~pwendell] please close this issue as resolved pyspark errors when SPARK_PRINT_LAUNCH_COMMAND=1 Key: SPARK-2111 URL: https://issues.apache.org/jira/browse/SPARK-2111 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Thomas Graves If you set SPARK_PRINT_LAUNCH_COMMAND=1 to see what java command is being used to launch spark and then try to run pyspark, it errors out with a very non-useful error message:
Traceback (most recent call last):
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/shell.py", line 43, in <module>
    sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/context.py", line 94, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway)
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/context.py", line 184, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway()
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/java_gateway.py", line 51, in launch_gateway
    gateway_port = int(proc.stdout.readline())
ValueError: invalid literal for int() with base 10: 'Spark Command: /home/gs/java/jdk/bin/java -cp :/home/gs/hadoop/current/share/hadoop/common/hadoop-gpl-compression.jar:/home/gs/hadoop/current/share/hadoop/hdfs/lib/YahooDNSToSwitchMapping-0.2.14020207'
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2111) pyspark errors when SPARK_PRINT_LAUNCH_COMMAND=1
[ https://issues.apache.org/jira/browse/SPARK-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-2111. -- Resolution: Fixed Fix Version/s: 1.1.0 1.0.1 pyspark errors when SPARK_PRINT_LAUNCH_COMMAND=1 Key: SPARK-2111 URL: https://issues.apache.org/jira/browse/SPARK-2111 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Thomas Graves Fix For: 1.0.1, 1.1.0 If you set SPARK_PRINT_LAUNCH_COMMAND=1 to see what java command is being used to launch spark and then try to run pyspark, it errors out with a very non-useful error message:
Traceback (most recent call last):
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/shell.py", line 43, in <module>
    sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/context.py", line 94, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway)
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/context.py", line 184, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway()
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/java_gateway.py", line 51, in launch_gateway
    gateway_port = int(proc.stdout.readline())
ValueError: invalid literal for int() with base 10: 'Spark Command: /home/gs/java/jdk/bin/java -cp :/home/gs/hadoop/current/share/hadoop/common/hadoop-gpl-compression.jar:/home/gs/hadoop/current/share/hadoop/hdfs/lib/YahooDNSToSwitchMapping-0.2.14020207'
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2111) pyspark errors when SPARK_PRINT_LAUNCH_COMMAND=1
[ https://issues.apache.org/jira/browse/SPARK-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-2111: - Assignee: Prashant Sharma pyspark errors when SPARK_PRINT_LAUNCH_COMMAND=1 Key: SPARK-2111 URL: https://issues.apache.org/jira/browse/SPARK-2111 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Thomas Graves Assignee: Prashant Sharma Fix For: 1.0.1, 1.1.0 If you set SPARK_PRINT_LAUNCH_COMMAND=1 to see what java command is being used to launch spark and then try to run pyspark, it errors out with a very non-useful error message:
Traceback (most recent call last):
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/shell.py", line 43, in <module>
    sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/context.py", line 94, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway)
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/context.py", line 184, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway()
  File "/homes/tgraves/test/hadoop2/y-spark-git/python/pyspark/java_gateway.py", line 51, in launch_gateway
    gateway_port = int(proc.stdout.readline())
ValueError: invalid literal for int() with base 10: 'Spark Command: /home/gs/java/jdk/bin/java -cp :/home/gs/hadoop/current/share/hadoop/common/hadoop-gpl-compression.jar:/home/gs/hadoop/current/share/hadoop/hdfs/lib/YahooDNSToSwitchMapping-0.2.14020207'
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2443) Reading from Partitioned Tables is Slow
[ https://issues.apache.org/jira/browse/SPARK-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063530#comment-14063530 ] Michael Armbrust commented on SPARK-2443: - [~chutium], I would recommend using the native parquet interface if possible (explained in the programming guide: http://spark.apache.org/docs/latest/sql-programming-guide.html). That said, others have reported success using the hive serde for parquet as well. Please ask future questions on the user mailing list and not JIRA: https://spark.apache.org/community.html Reading from Partitioned Tables is Slow --- Key: SPARK-2443 URL: https://issues.apache.org/jira/browse/SPARK-2443 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Zongheng Yang Fix For: 1.1.0, 1.0.2 Here are some numbers; all queries return ~20 million:
{code}
SELECT COUNT(*) FROM non partitioned table                                5.496467726 s
SELECT COUNT(*) FROM partitioned table stored in parquet                  50.26947 s
SELECT COUNT(*) FROM same table as previous but loaded with parquetFile
  instead of through hive                                                 2 s
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] RJ Nowling updated SPARK-2308: -- Attachment: uneven_centers.pdf many_small_centers.pdf Add KMeans MiniBatch clustering algorithm to MLlib -- Key: SPARK-2308 URL: https://issues.apache.org/jira/browse/SPARK-2308 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Priority: Minor Attachments: many_small_centers.pdf, uneven_centers.pdf Mini-batch is a version of KMeans that uses a randomly-sampled subset of the data points in each iteration instead of the full set of data points, improving performance (and in some cases, accuracy). The mini-batch version is compatible with the KMeans|| initialization algorithm currently implemented in MLlib. I suggest adding KMeans Mini-batch as an alternative. I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063601#comment-14063601 ] RJ Nowling commented on SPARK-2308: --- I tested kmeans vs minibatch kmeans under 2 scenarios:
* 4 centers of 1000, 100, 10, and 1 data points
* 100 centers with 10 points each
The proposed centers were generated along a grid. The data points were generated by adding samples from N(0, 1.0) in each dimension to the centers. I found the expected centers by averaging the points generated from each proposed center. I ran KMeans and MiniBatch KMeans for each set of data points with 30 iterations and k-means++ initialization. I plotted the expected centers (blue), KMeans centers (red), and MiniBatch centers (green). The two methods showed similar results. They both struggled with the small clusters and ended up finding two centers for the large cluster, ignoring the single data point. For the 100 even clusters, both methods got most of the centers reasonably correct and, in a few cases, had 2 centers where there should be 1. I've attached the plots (many_small_centers.pdf, uneven_centers.pdf). In reviewing the scikit-learn implementation, I saw that they handle small clusters as special cases: one of the points in the cluster is randomly chosen as the center instead of computing the center as a running average of the sampled points. Add KMeans MiniBatch clustering algorithm to MLlib -- Key: SPARK-2308 URL: https://issues.apache.org/jira/browse/SPARK-2308 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Priority: Minor Attachments: many_small_centers.pdf, uneven_centers.pdf Mini-batch is a version of KMeans that uses a randomly-sampled subset of the data points in each iteration instead of the full set of data points, improving performance (and in some cases, accuracy). The mini-batch version is compatible with the KMeans|| initialization algorithm currently implemented in MLlib. I suggest adding KMeans Mini-batch as an alternative. I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.2#6252)
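To make the algorithm under discussion concrete, here is a self-contained sketch of one mini-batch iteration over dense points represented as Array[Double]; findClosest, the sampling call, and the averaging update are illustrative assumptions, not the proposed MLlib code:
{code}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

def findClosest(centers: Array[Array[Double]], p: Array[Double]): Int =
  centers.indices.minBy { i =>
    centers(i).zip(p).map { case (c, x) => (c - x) * (c - x) }.sum
  }

def miniBatchStep(
    data: RDD[Array[Double]],
    centers: Array[Array[Double]],
    fraction: Double,
    seed: Long): Array[Array[Double]] = {
  // Assign only a random subset of the points, then average per center.
  val sums = data.sample(withReplacement = false, fraction, seed)
    .map(p => (findClosest(centers, p), (p, 1L)))
    .reduceByKey { case ((s1, n1), (s2, n2)) =>
      (s1.zip(s2).map { case (a, b) => a + b }, n1 + n2)
    }
    .collectAsMap()
  // Centers that received no sampled points keep their old position.
  centers.indices.map { i =>
    sums.get(i).map { case (s, n) => s.map(_ / n) }.getOrElse(centers(i))
  }.toArray
}
{code}
The scikit-learn behavior mentioned above (reseeding small clusters from a randomly chosen point) would be special-case handling layered on top of this basic update.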
[jira] [Created] (SPARK-2525) Remove as many compilation warning messages as possible in Spark SQL
Yin Huai created SPARK-2525: --- Summary: Remove as many compilation warning messages as possible in Spark SQL Key: SPARK-2525 URL: https://issues.apache.org/jira/browse/SPARK-2525 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2525) Remove as many compilation warning messages as possible in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-2525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14063701#comment-14063701 ] Yin Huai commented on SPARK-2525: - Those deprecation warnings in Spark SQL are caused by using Hive's Serializer and Deserializer (these two are marked deprecated). We have to use them for now. We can resolve these warnings once SerDe interfaces in Hive have been cleaned up. Remove as many compilation warning messages as possible in Spark SQL Key: SPARK-2525 URL: https://issues.apache.org/jira/browse/SPARK-2525 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Reopened] (SPARK-2314) RDD actions are only overridden in Scala, not java or python
[ https://issues.apache.org/jira/browse/SPARK-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reopened SPARK-2314: - Yeah, I think we might need to do this in Python too, as at least take() is implemented manually in rdd.py. RDD actions are only overridden in Scala, not java or python Key: SPARK-2314 URL: https://issues.apache.org/jira/browse/SPARK-2314 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0, 1.0.1 Reporter: Michael Armbrust Assignee: Aaron Staple Labels: starter Fix For: 1.1.0, 1.0.2 For example, collect and take(). We should keep these two in sync, or move this code to schemaRDD if possible. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2526) Simplify make-distribution.sh
Patrick Wendell created SPARK-2526: -- Summary: Simplify make-distribution.sh Key: SPARK-2526 URL: https://issues.apache.org/jira/browse/SPARK-2526 Project: Spark Issue Type: Bug Components: Build Reporter: Patrick Wendell Assignee: Patrick Wendell There is some complexity in make-distribution.sh around selecting profiles. This is both annoying to maintain and limits the ways that packagers can use the script. For instance, it's not possible to build with separate HDFS and YARN versions, and supporting this with our current flags would get pretty complicated. We should just allow the user to pass a list of profiles directly to make-distribution.sh - the Maven build itself is already parameterized to support this. We also now have good docs explaining the use of profiles in the Maven build. All of this logic was more necessary when we used SBT for the package build, but we haven't done that for several versions. -- This message was sent by Atlassian JIRA (v6.2#6252)
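As an illustration of the proposed interface, a hypothetical invocation once options pass straight through (the -P/-D flags are Maven's own; the exact set of script flags is whatever the change settles on):
{code}
./make-distribution.sh --tgz -Phadoop-2.3 -Pyarn -Dhadoop.version=2.3.0
{code}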
[jira] [Updated] (SPARK-2526) Simplify make-distribution.sh to just pass through Maven options
[ https://issues.apache.org/jira/browse/SPARK-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2526: --- Summary: Simplify make-distribution.sh to just pass through Maven options (was: Simplify make-distribution.sh) Simplify make-distribution.sh to just pass through Maven options Key: SPARK-2526 URL: https://issues.apache.org/jira/browse/SPARK-2526 Project: Spark Issue Type: Bug Components: Build Reporter: Patrick Wendell Assignee: Patrick Wendell There is some complexity in make-distribution.sh around selecting profiles. This is both annoying to maintain and limits the ways that packagers can use the script. For instance, it's not possible to build with separate HDFS and YARN versions, and supporting this with our current flags would get pretty complicated. We should just allow the user to pass a list of profiles directly to make-distribution.sh - the Maven build itself is already parameterized to support this. We also now have good docs explaining the use of profiles in the Maven build. All of this logic was more necessary when we used SBT for the package build, but we haven't done that for several versions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2523) Potential Bugs if SerDe is not identical among partitions and table
[ https://issues.apache.org/jira/browse/SPARK-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063753#comment-14063753 ] Yin Huai commented on SPARK-2523: - Yeah, no problem. Can you add a case that triggers the bug to the description? Potential Bugs if SerDe is not identical among partitions and table --- Key: SPARK-2523 URL: https://issues.apache.org/jira/browse/SPARK-2523 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao In HiveTableScan.scala, a single ObjectInspector is created for all of the partition-based records, which probably causes a ClassCastException if the object inspectors are not identical among table partitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2495) Ability to re-create ML models
[ https://issues.apache.org/jira/browse/SPARK-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063762#comment-14063762 ] Alexander Albul commented on SPARK-2495: Yes, I can work on it, but first I need to understand the reason for making the constructors *private*. I can propose an approach with a utility object that we can use to create different models: ModelLoader.loadLogisticRegression(weights: Vector, intercept: Double): LogisticRegressionModel Alternatively, we could put a load method into *LogisticRegressionWithSGD*, for example, but I do not like this approach because we can load models that were trained without SGD as well, so it is not directly related. But first of all, if they are private by mistake, we can just open the constructors. WDYT? Ability to re-create ML models -- Key: SPARK-2495 URL: https://issues.apache.org/jira/browse/SPARK-2495 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.1 Reporter: Alexander Albul Hi everyone. Previously (prior to Spark 1.0) we were working with MLlib like this: 1) Calculate the model (a costly operation). 2) Take the model and collect its fields, like weights, intercept, etc. 3) Store the model somewhere in our own format. 4) Do predictions by loading the model attributes, creating a new model, and predicting using it. Now I see that the models' constructors have the *private* modifier, so models cannot be created from outside. If you want to hide implementation details and keep these constructors as developer API, why not at least create a method which takes weights and an intercept (for example) and materializes the model? A good example of the kind of model I am talking about is *LinearRegressionModel*. I know that the *LinearRegressionWithSGD* class has a *createModel* method, but the problem is that it has the *protected* modifier as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
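A sketch of the proposed utility object. Note it only compiles if the model constructors are made accessible (for example private[mllib] with this object in the same package), which is exactly the open question in the comment:
{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionModel

object ModelLoader {
  // Re-create a model from externally stored attributes.
  def loadLogisticRegression(weights: Vector, intercept: Double): LogisticRegressionModel =
    new LogisticRegressionModel(weights, intercept)

  def loadLinearRegression(weights: Vector, intercept: Double): LinearRegressionModel =
    new LinearRegressionModel(weights, intercept)
}
{code}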
[jira] [Comment Edited] (SPARK-2495) Ability to re-create ML models
[ https://issues.apache.org/jira/browse/SPARK-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063762#comment-14063762 ] Alexander Albul edited comment on SPARK-2495 at 7/16/14 5:34 PM: - Yes, I can work on it, but first I need to understand the reason for making the constructors *private*. I can propose an approach with a utility object that we can use to create different models: *ModelLoader.loadLogisticRegression(weights: Vector, intercept: Double): LogisticRegressionModel* Alternatively, we could put a *load* method into *LogisticRegressionWithSGD*, for example, but I do not like this approach because we can load models that were trained with a different approach (not SGD), so it is not directly related. But first of all, if they are private by mistake, we can just open the constructors. WDYT? was (Author: gorenuru): Yes, I can work on it, but first I need to understand the reason for making the constructors *private*. I can propose an approach with a utility object that we can use to create different models: *ModelLoader.loadLogisticRegression(weights: Vector, intercept: Double): LogisticRegressionModel* Alternatively, we could put a *load* method into *LogisticRegressionWithSGD*, for example, but I do not like this approach because we can load models that were trained without SGD as well, so it is not directly related. But first of all, if they are private by mistake, we can just open the constructors. WDYT? Ability to re-create ML models -- Key: SPARK-2495 URL: https://issues.apache.org/jira/browse/SPARK-2495 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.1 Reporter: Alexander Albul Hi everyone. Previously (prior to Spark 1.0) we were working with MLlib like this: 1) Calculate the model (a costly operation). 2) Take the model and collect its fields, like weights, intercept, etc. 3) Store the model somewhere in our own format. 4) Do predictions by loading the model attributes, creating a new model, and predicting using it. Now I see that the models' constructors have the *private* modifier, so models cannot be created from outside. If you want to hide implementation details and keep these constructors as developer API, why not at least create a method which takes weights and an intercept (for example) and materializes the model? A good example of the kind of model I am talking about is *LinearRegressionModel*. I know that the *LinearRegressionWithSGD* class has a *createModel* method, but the problem is that it has the *protected* modifier as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2527) incorrect persistence level shown in Spark UI after repersisting
Diana Carroll created SPARK-2527: Summary: incorrect persistence level shown in Spark UI after repersisting Key: SPARK-2527 URL: https://issues.apache.org/jira/browse/SPARK-2527 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.0 Reporter: Diana Carroll If I persist an RDD at one level, unpersist it, then repersist it at another level, the UI will continue to show the RDD at the first level...but correctly show individual partitions at the second level.
{code}
import org.apache.spark.api.java.StorageLevels._

val test1 = sc.parallelize(Array(1,2,3))
test1.persist(StorageLevels.DISK_ONLY)
test1.count()
test1.unpersist()
test1.persist(StorageLevels.MEMORY_ONLY)
test1.count()
{code}
After the first call to persist and count, the Spark App web UI shows:
RDD Storage Info for 14
Storage Level: Disk Serialized 1x Replicated
rdd_14_0: Disk Serialized 1x Replicated
After the second call, it shows:
RDD Storage Info for 14
Storage Level: Disk Serialized 1x Replicated
rdd_14_0: Memory Deserialized 1x Replicated
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2528) spark-ec2 security group permissions are too open
Nicholas Chammas created SPARK-2528: --- Summary: spark-ec2 security group permissions are too open Key: SPARK-2528 URL: https://issues.apache.org/jira/browse/SPARK-2528 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.0.0 Reporter: Nicholas Chammas Priority: Minor {{spark-ec2}} configures EC2 security groups with ports [open to the world | https://github.com/apache/spark/blob/master/ec2/spark_ec2.py#L280]. This is an unnecessary security risk, even for a short-lived cluster. Wherever possible, it would be better if, when launching a new cluster, {{spark-ec2}} detects the host's external IP address (e.g. via {{icanhazip.com}}) and grants access specifically to that IP address. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2527) incorrect persistence level shown in Spark UI after repersisting
[ https://issues.apache.org/jira/browse/SPARK-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Diana Carroll updated SPARK-2527: - Description: If I persist an RDD at one level, unpersist it, then repersist it at another level, the UI will continue to show the RDD at the first level...but correctly show individual partitions at the second level.
{code}
import org.apache.spark.api.java.StorageLevels
import org.apache.spark.api.java.StorageLevels._

val test1 = sc.parallelize(Array(1,2,3))
test1.persist(StorageLevels.DISK_ONLY)
test1.count()
test1.unpersist()
test1.persist(StorageLevels.MEMORY_ONLY)
test1.count()
{code}
After the first call to persist and count, the Spark App web UI shows:
RDD Storage Info for 14
Storage Level: Disk Serialized 1x Replicated
rdd_14_0: Disk Serialized 1x Replicated
After the second call, it shows:
RDD Storage Info for 14
Storage Level: Disk Serialized 1x Replicated
rdd_14_0: Memory Deserialized 1x Replicated
was: If I persist an RDD at one level, unpersist it, then repersist it at another level, the UI will continue to show the RDD at the first level...but correctly show individual partitions at the second level.
{code}
import org.apache.spark.api.java.StorageLevels._

val test1 = sc.parallelize(Array(1,2,3))
test1.persist(StorageLevels.DISK_ONLY)
test1.count()
test1.unpersist()
test1.persist(StorageLevels.MEMORY_ONLY)
test1.count()
{code}
After the first call to persist and count, the Spark App web UI shows:
RDD Storage Info for 14
Storage Level: Disk Serialized 1x Replicated
rdd_14_0: Disk Serialized 1x Replicated
After the second call, it shows:
RDD Storage Info for 14
Storage Level: Disk Serialized 1x Replicated
rdd_14_0: Memory Deserialized 1x Replicated
incorrect persistence level shown in Spark UI after repersisting Key: SPARK-2527 URL: https://issues.apache.org/jira/browse/SPARK-2527 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.0 Reporter: Diana Carroll Attachments: persistbug1.png, persistbug2.png If I persist an RDD at one level, unpersist it, then repersist it at another level, the UI will continue to show the RDD at the first level...but correctly show individual partitions at the second level.
{code}
import org.apache.spark.api.java.StorageLevels
import org.apache.spark.api.java.StorageLevels._

val test1 = sc.parallelize(Array(1,2,3))
test1.persist(StorageLevels.DISK_ONLY)
test1.count()
test1.unpersist()
test1.persist(StorageLevels.MEMORY_ONLY)
test1.count()
{code}
After the first call to persist and count, the Spark App web UI shows:
RDD Storage Info for 14
Storage Level: Disk Serialized 1x Replicated
rdd_14_0: Disk Serialized 1x Replicated
After the second call, it shows:
RDD Storage Info for 14
Storage Level: Disk Serialized 1x Replicated
rdd_14_0: Memory Deserialized 1x Replicated
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2495) Ability to re-create ML models
[ https://issues.apache.org/jira/browse/SPARK-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063762#comment-14063762 ] Alexander Albul edited comment on SPARK-2495 at 7/16/14 5:33 PM: - Yes, I can work on it, but first I need to understand the reason for making the constructors *private*. I can propose an approach with a utility object that we can use to create different models: *ModelLoader.loadLogisticRegression(weights: Vector, intercept: Double): LogisticRegressionModel* Alternatively, we could put a *load* method into *LogisticRegressionWithSGD*, for example, but I do not like this approach because we can load models that were trained without SGD as well, so it is not directly related. But first of all, if they are private by mistake, we can just open the constructors. WDYT? was (Author: gorenuru): Yes, I can work on it, but first I need to understand the reason for making the constructors *private*. I can propose an approach with a utility object that we can use to create different models: ModelLoader.loadLogisticRegression(weights: Vector, intercept: Double): LogisticRegressionModel Alternatively, we could put a load method into *LogisticRegressionWithSGD*, for example, but I do not like this approach because we can load models that were trained without SGD as well, so it is not directly related. But first of all, if they are private by mistake, we can just open the constructors. WDYT? Ability to re-create ML models -- Key: SPARK-2495 URL: https://issues.apache.org/jira/browse/SPARK-2495 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.1 Reporter: Alexander Albul Hi everyone. Previously (prior to Spark 1.0) we were working with MLlib like this: 1) Calculate the model (a costly operation). 2) Take the model and collect its fields, like weights, intercept, etc. 3) Store the model somewhere in our own format. 4) Do predictions by loading the model attributes, creating a new model, and predicting using it. Now I see that the models' constructors have the *private* modifier, so models cannot be created from outside. If you want to hide implementation details and keep these constructors as developer API, why not at least create a method which takes weights and an intercept (for example) and materializes the model? A good example of the kind of model I am talking about is *LinearRegressionModel*. I know that the *LinearRegressionWithSGD* class has a *createModel* method, but the problem is that it has the *protected* modifier as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2519) Eliminate pattern-matching on Tuple2 in performance-critical aggregation code
[ https://issues.apache.org/jira/browse/SPARK-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14063833#comment-14063833 ] Reynold Xin commented on SPARK-2519: Does this pull request fix all of them? Eliminate pattern-matching on Tuple2 in performance-critical aggregation code - Key: SPARK-2519 URL: https://issues.apache.org/jira/browse/SPARK-2519 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Reporter: Sandy Ryza Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252)
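For context, an illustrative before/after of the kind of change this subtask covers (not the actual Spark diff): destructuring a pair in a hot loop goes through a pattern match per element, while direct _1/_2 access does not.
{code}
def sumValuesMatch(iter: Iterator[(String, Int)]): Long = {
  var total = 0L
  iter.foreach { case (_, v) => total += v } // pattern match per element
  total
}

def sumValuesDirect(iter: Iterator[(String, Int)]): Long = {
  var total = 0L
  while (iter.hasNext) {
    total += iter.next()._2 // direct field access, no match
  }
  total
}
{code}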
[jira] [Updated] (SPARK-2269) Clean up and add unit tests for resourceOffers in MesosSchedulerBackend
[ https://issues.apache.org/jira/browse/SPARK-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2269: --- Assignee: Tim Chen Clean up and add unit tests for resourceOffers in MesosSchedulerBackend --- Key: SPARK-2269 URL: https://issues.apache.org/jira/browse/SPARK-2269 Project: Spark Issue Type: Bug Components: Mesos Reporter: Patrick Wendell Assignee: Tim Chen This function could be simplified a bit. We could re-write it without offerableIndices or creating the mesosTasks array as large as the offer list. There is a lot of logic around making sure you get the correct index into mesosTasks and offers; really, we should just build mesosTasks directly from the offers we get back. To associate the tasks we are launching with the offers, we can just create a HashMap from the slaveId to the original offer. The basic logic of the function is that you take the mesos offers, convert them to spark offers, then convert the results back. One reason I think it might be designed as it is now is to deal with the case where Mesos gives multiple offers for a single slave. I checked directly with the Mesos team and they said this won't ever happen; you'll get at most one offer per Mesos slave within a set of offers. -- This message was sent by Atlassian JIRA (v6.2#6252)
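A sketch of the HashMap-based association suggested above; the Offer accessors are from the Mesos Java API, everything else is illustrative:
{code}
import org.apache.mesos.Protos.Offer
import scala.collection.mutable

def indexOffers(offers: Seq[Offer]): mutable.HashMap[String, Offer] = {
  val bySlave = new mutable.HashMap[String, Offer]
  for (o <- offers) {
    bySlave(o.getSlaveId.getValue) = o // at most one offer per slave
  }
  bySlave
}
{code}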
[jira] [Resolved] (SPARK-2522) Use TorrentBroadcastFactory as the default broadcast factory
[ https://issues.apache.org/jira/browse/SPARK-2522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2522. Resolution: Fixed Fix Version/s: 1.1.0 Use TorrentBroadcastFactory as the default broadcast factory Key: SPARK-2522 URL: https://issues.apache.org/jira/browse/SPARK-2522 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.1.0 HttpBroadcastFactory is the current default broadcast factory. It sends the broadcast data to each worker one by one, which is slow when the cluster is big. TorrentBroadcastFactory scales much better than HTTP. Maybe we should make torrent the default broadcast method. -- This message was sent by Atlassian JIRA (v6.2#6252)
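For reference, a user can already opt in without waiting for the default to change; a minimal sketch using the documented 1.x configuration key:
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("torrent-broadcast-example")
  .set("spark.broadcast.factory",
       "org.apache.spark.broadcast.TorrentBroadcastFactory")
{code}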
[jira] [Resolved] (SPARK-2317) Improve task logging
[ https://issues.apache.org/jira/browse/SPARK-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2317. Resolution: Fixed Fix Version/s: 1.1.0 Improve task logging Key: SPARK-2317 URL: https://issues.apache.org/jira/browse/SPARK-2317 Project: Spark Issue Type: Improvement Affects Versions: 1.0.0, 1.0.1 Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.1.0 We use the TID to identify tasks in logging. However, the TID itself does not capture the stage or retries, making log messages harder to correlate with the application itself. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2463) Creating multiple StreamingContexts from shell generates duplicate Streaming tabs in UI
[ https://issues.apache.org/jira/browse/SPARK-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14063944#comment-14063944 ] Andrew Or commented on SPARK-2463: -- It probably won't be too much work, because the functionality of detaching a tab is already there (just not used). The trickier bits are probably not the UI code but in the streaming code, where we have to make sure that a SparkContext only has one StreamingContext at any given time. We can maintain a static map to track this, but there may be a few gotchas in sharing the same SparkContext across multiple StreamingContexts (e.g. we have to make sure that blocks cached by the first ssc cannot be read by the second ssc). Creating multiple StreamingContexts from shell generates duplicate Streaming tabs in UI --- Key: SPARK-2463 URL: https://issues.apache.org/jira/browse/SPARK-2463 Project: Spark Issue Type: Bug Components: Streaming, Web UI Affects Versions: 1.0.1 Reporter: Nicholas Chammas Start a {{StreamingContext}} from the interactive shell and then stop it. Go to {{http://master_url:4040/streaming/}} and you will see a tab in the UI for Streaming. Now from the same shell, create and start a new {{StreamingContext}}. There will now be a duplicate tab for Streaming in the UI. Repeating this process generates additional Streaming tabs. They all link to the same information. -- This message was sent by Atlassian JIRA (v6.2#6252)
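A minimal sketch of the "static map" idea from the comment, assuming the bookkeeping lives in a helper object rather than in StreamingContext itself:
{code}
import org.apache.spark.SparkContext
import org.apache.spark.streaming.StreamingContext
import scala.collection.mutable

object ActiveStreamingContexts {
  // At most one live StreamingContext per SparkContext.
  private val active = new mutable.HashMap[SparkContext, StreamingContext]

  def register(sc: SparkContext, ssc: StreamingContext): Unit = synchronized {
    require(!active.contains(sc),
      "This SparkContext already has an active StreamingContext")
    active(sc) = ssc
  }

  def unregister(sc: SparkContext): Unit = synchronized { active -= sc }
}
{code}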
[jira] [Created] (SPARK-2530) Relax incorrect assumption of one ExternalAppendOnlyMap per thread
Andrew Or created SPARK-2530: Summary: Relax incorrect assumption of one ExternalAppendOnlyMap per thread Key: SPARK-2530 URL: https://issues.apache.org/jira/browse/SPARK-2530 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Originally reported by Matei. Our current implementation of EAOM assumes only one map is created per task. This is not true in the following case, however: {code} rdd1.join(rdd2).reduceByKey(...) {code} This is because reduceByKey does a map-side combine, which creates an EAOM that streams from an EAOM previously created by the same thread, in order to aggregate values from the join. The more concerning thing is the following: we currently maintain a global shuffle memory map (thread ID -> memory used by that thread for shuffles). If we create two EAOMs in the same thread, the memory occupied by the first map may be clobbered by that occupied by the second. This has very adverse consequences if the first map is huge but the second is just starting out, in which case we end up believing that we use much less memory than we actually do. -- This message was sent by Atlassian JIRA (v6.2#6252)
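A toy illustration of the clobbering described above (the real map lives inside Spark's shuffle code; this just shows why last-writer-wins per thread ID loses information):
{code}
import scala.collection.mutable

val shuffleMemoryMap = new mutable.HashMap[Long, Long]
val tid = Thread.currentThread().getId

shuffleMemoryMap(tid) = 512L * 1024 * 1024 // first (huge) map's usage
shuffleMemoryMap(tid) = 1L * 1024 * 1024   // second map starts, clobbers it
// The thread now appears to use 1 MB for shuffles instead of 513 MB.
{code}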
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063998#comment-14063998 ] Xuefu Zhang commented on SPARK-2420: Thanks for your comments, [~srowen]. I mostly agree with your assessment. 1. Guava is indeed a problem, and the problem is confirmed. The solution also seems simple: if it can be set to 11 in Spark, that solves the problem we have. 2. For jetty, it was a problem with the Hive on Spark POC, possibly because we shipped all libraries from the Hive process's classpath to the Spark cluster. We have a task (HIVE-7371) to identify a minimum set of jars to be shipped. With that, the story might change. We will confirm whether Jetty is a problem once we have a better idea on HIVE-7371. Also, we will check if the problem, if it exists, can be fixed on the Hive side first. If not, we'd like to get help from Spark and also provide a reason why. Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because the Spark build contains versions of libraries that are vastly different from the current major Hadoop version. It would be nice if we could choose versions that are in line with Hadoop, or shade them in the assembly. Here is the wish list: 1. Upgrade protobuf version to 2.5.0 from the current 2.4.1. 2. Shade Spark's jetty and servlet dependencies in the assembly. 3. The guava version difference: Spark is using a higher version. I'm not sure what the best solution for this is. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied on Spark in order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064000#comment-14064000 ] Reynold Xin commented on SPARK-2420: Thanks for looking into this. Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because the Spark build contains versions of libraries that are vastly different from the current major Hadoop version. It would be nice if we could choose versions that are in line with Hadoop, or shade them in the assembly. Here is the wish list: 1. Upgrade protobuf version to 2.5.0 from the current 2.4.1. 2. Shade Spark's jetty and servlet dependencies in the assembly. 3. The guava version difference: Spark is using a higher version. I'm not sure what the best solution for this is. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied on Spark in order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Issue Comment Deleted] (SPARK-1215) Clustering: Index out of bounds error
[ https://issues.apache.org/jira/browse/SPARK-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-1215: - Comment: was deleted (was: Just to let you know, I'll give the go-ahead for this tomorrow. Based on Patrick's recommendation, I'm blocking on this JIRA until tomorrow to give Prashant ( https://issues.apache.org/jira/browse/SPARK-2497 ) a chance to fix a MIMA issue. If it's not fixed by tomorrow, then I'll get Manish to do a temp filter workaround. On Tue, Jul 15, 2014 at 9:45 PM, Xiangrui Meng (JIRA) j...@apache.org ) Clustering: Index out of bounds error - Key: SPARK-1215 URL: https://issues.apache.org/jira/browse/SPARK-1215 Project: Spark Issue Type: Bug Components: MLlib Reporter: dewshick Assignee: Joseph K. Bradley Attachments: test.csv code:
import org.apache.spark.mllib.clustering._
val test = sc.makeRDD(Array(4,4,4,4,4).map(e => Array(e.toDouble)))
val kmeans = new KMeans().setK(4)
kmeans.run(test)
evals with java.lang.ArrayIndexOutOfBoundsException error:
14/01/17 12:35:54 INFO scheduler.DAGScheduler: Stage 25 (collectAsMap at KMeans.scala:243) finished in 0.047 s
14/01/17 12:35:54 INFO spark.SparkContext: Job finished: collectAsMap at KMeans.scala:243, took 16.389537116 s
Exception in thread main java.lang.reflect.InvocationTargetException
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at com.simontuffs.onejar.Boot.run(Boot.java:340)
  at com.simontuffs.onejar.Boot.main(Boot.java:166)
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
  at org.apache.spark.mllib.clustering.LocalKMeans$.kMeansPlusPlus(LocalKMeans.scala:47)
  at org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:247)
  at org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:244)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
  at scala.collection.immutable.Range.foreach(Range.scala:81)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
  at scala.collection.immutable.Range.map(Range.scala:46)
  at org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:244)
  at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:124)
  at Clustering$$anonfun$1.apply$mcDI$sp(Clustering.scala:21)
  at Clustering$$anonfun$1.apply(Clustering.scala:19)
  at Clustering$$anonfun$1.apply(Clustering.scala:19)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
  at scala.collection.immutable.Range.foreach(Range.scala:78)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
  at scala.collection.immutable.Range.map(Range.scala:46)
  at Clustering$.main(Clustering.scala:19)
  at Clustering.main(Clustering.scala)
  ... 6 more
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2504) Fix nullability of Substring expression.
[ https://issues.apache.org/jira/browse/SPARK-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2504. - Resolution: Fixed Fix Version/s: 1.0.2 1.1.0 Assignee: Takuya Ueshin Fix nullability of Substring expression. Key: SPARK-2504 URL: https://issues.apache.org/jira/browse/SPARK-2504 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Assignee: Takuya Ueshin Fix For: 1.1.0, 1.0.2 This is a follow-up of [#1359|https://github.com/apache/spark/pull/1359] with nullability narrowing. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064019#comment-14064019 ] Sean Owen commented on SPARK-2420: -- It'd be best to say what problem you are seeing with Guava -- but hold the phone, scratch my comments. I just tried compiling with Guava 11.0.2 and it looks like there is now some code that does use methods not present in 11. Just a few, but they exist. I assume it works on Hadoop now because Guava 14 wins in the assembly. I don't think it's a lot of work to make changes to accommodate Guava 11 here, but it seems pretty unpalatable to do so. Hadoop really needs to upgrade. But then that takes forever to trickle down to the majority of deployments anyway. Is there any feasibility of just upgrading Hive-on-Spark? So the options are: 1) Downgrade Spark, not so much to accommodate Hive, although that's nice, but to avoid latent clashes with Hadoop. 2) Somehow use Guava 14 in Hive on Spark. 3) Shade, shade, shade. Anyone else have opinions? Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because the Spark build contains versions of libraries that are vastly different from the current major Hadoop version. It would be nice if we could choose versions that are in line with Hadoop, or shade them in the assembly. Here is the wish list: 1. Upgrade protobuf version to 2.5.0 from the current 2.4.1. 2. Shade Spark's jetty and servlet dependencies in the assembly. 3. The guava version difference: Spark is using a higher version. I'm not sure what the best solution for this is. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied on Spark in order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2454) Separate driver spark home from executor spark home
[ https://issues.apache.org/jira/browse/SPARK-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064036#comment-14064036 ] Andrew Or commented on SPARK-2454: -- There may be multiple installations of Spark on the executor machine, in which case a global SPARK_HOME environment variable is not sufficient. What I am suggesting is that we should still keep the option of allowing the driver to overwrite executor spark homes (spark.executor.home), and only overwrite the executor SPARK_HOME if this is specified. Separate driver spark home from executor spark home --- Key: SPARK-2454 URL: https://issues.apache.org/jira/browse/SPARK-2454 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 The driver may not always share the same directory structure as the executors. It makes little sense to always re-use the driver's spark home on the executors. https://github.com/apache/spark/pull/1244/ is an open effort to fix this. However, this still requires us to set SPARK_HOME on all the executor nodes. Really we should separate this out into something like spark.driver.home spark.executor.home rather than re-using SPARK_HOME everywhere. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2454) Separate driver spark home from executor spark home
[ https://issues.apache.org/jira/browse/SPARK-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2454: - Description: The driver may not always share the same directory structure as the executors. It makes little sense to always re-use the driver's spark home on the executors. https://github.com/apache/spark/pull/1244/ is an open effort to fix this. However, this still requires us to set SPARK_HOME on all the executor nodes. Really we should separate this out into something like `spark.executor.home` rather than re-using SPARK_HOME everywhere. was: The driver may not always share the same directory structure as the executors. It makes little sense to always re-use the driver's spark home on the executors. https://github.com/apache/spark/pull/1244/ is an open effort to fix this. However, this still requires us to set SPARK_HOME on all the executor nodes. Really we should separate this out into something like spark.driver.home spark.executor.home rather than re-using SPARK_HOME everywhere. Separate driver spark home from executor spark home --- Key: SPARK-2454 URL: https://issues.apache.org/jira/browse/SPARK-2454 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 The driver may not always share the same directory structure as the executors. It makes little sense to always re-use the driver's spark home on the executors. https://github.com/apache/spark/pull/1244/ is an open effort to fix this. However, this still requires us to set SPARK_HOME on all the executor nodes. Really we should separate this out into something like `spark.executor.home` rather than re-using SPARK_HOME everywhere. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (SPARK-2465) Use long as user / item ID for ALS
[ https://issues.apache.org/jira/browse/SPARK-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen closed SPARK-2465. Resolution: Won't Fix Will possibly revisit this in the long term, or look at creating a parallel LongALS / LongRating API. Use long as user / item ID for ALS -- Key: SPARK-2465 URL: https://issues.apache.org/jira/browse/SPARK-2465 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.1 Reporter: Sean Owen Priority: Minor Attachments: ALS using MEMORY_AND_DISK.png, ALS using MEMORY_AND_DISK_SER.png, Screen Shot 2014-07-13 at 8.49.40 PM.png I'd like to float this for consideration: use longs instead of ints for user and product IDs in the ALS implementation. The main reason is that identifiers are not generally numeric at all, and will be hashed to an integer. (This is a separate issue.) Hashing to 32 bits means collisions are likely after hundreds of thousands of users and items, which is not unrealistic. Hashing to 64 bits pushes this back to billions. It would also mean numeric IDs that happen to be larger than the largest int can be used directly as identifiers. On the downside of course: 8 bytes instead of 4 bytes of memory used per Rating. Thoughts? I will post a PR so as to show what the change would be. -- This message was sent by Atlassian JIRA (v6.2#6252)
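As a sanity check on the collision claim, the standard birthday bound (my arithmetic, not from the issue):
{code}
// Approximate number of uniformly hashed IDs at which a collision becomes
// ~50% likely for a b-bit hash: n ≈ sqrt(2 * 2^b * ln 2).
def collisionPoint(bits: Int): Double =
  math.sqrt(2.0 * math.pow(2.0, bits) * math.log(2.0))

collisionPoint(32) // ≈ 7.7e4 -> tens to hundreds of thousands of IDs
collisionPoint(64) // ≈ 5.1e9 -> billions of IDs
{code}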
[jira] [Created] (SPARK-2531) Make BroadcastNestedLoopJoin take into account a BuildSide
Zongheng Yang created SPARK-2531: Summary: Make BroadcastNestedLoopJoin take into account a BuildSide Key: SPARK-2531 URL: https://issues.apache.org/jira/browse/SPARK-2531 Project: Spark Issue Type: Bug Reporter: Zongheng Yang -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2454) Separate driver spark home from executor spark home
[ https://issues.apache.org/jira/browse/SPARK-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064085#comment-14064085 ] Nan Zhu commented on SPARK-2454: I see, it makes sense to me... Separate driver spark home from executor spark home --- Key: SPARK-2454 URL: https://issues.apache.org/jira/browse/SPARK-2454 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 The driver may not always share the same directory structure as the executors. It makes little sense to always re-use the driver's spark home on the executors. https://github.com/apache/spark/pull/1244/ is an open effort to fix this. However, this still requires us to set SPARK_HOME on all the executor nodes. Really we should separate this out into something like `spark.executor.home` rather than re-using SPARK_HOME everywhere. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2411) Standalone Master - direct users to turn on event logs
[ https://issues.apache.org/jira/browse/SPARK-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2411: - Attachment: (was: Master event logs.png) Standalone Master - direct users to turn on event logs -- Key: SPARK-2411 URL: https://issues.apache.org/jira/browse/SPARK-2411 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Attachments: Master history not found.png Right now, if the user attempts to click on a finished application's UI, it simply refreshes. This is simply because the event logs are not there, in which case we set the href to an empty string. We could provide more information by pointing them to configure spark.eventLog.enabled if they click on the empty link. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2519) Eliminate pattern-matching on Tuple2 in performance-critical aggregation code
[ https://issues.apache.org/jira/browse/SPARK-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064102#comment-14064102 ] Sandy Ryza commented on SPARK-2519: --- I looked in ShuffledRDD, ExternalAppendOnlyMap, AppendOnlyMap, SizeTrackingAppendOnlyMap, and Aggregator. Just looked in CoGroupedRDD and PairRDDFunctions and found a few more - https://github.com/apache/spark/pull/1447. Eliminate pattern-matching on Tuple2 in performance-critical aggregation code - Key: SPARK-2519 URL: https://issues.apache.org/jira/browse/SPARK-2519 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Reporter: Sandy Ryza Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2531) Make BroadcastNestedLoopJoin take into account a BuildSide
[ https://issues.apache.org/jira/browse/SPARK-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064106#comment-14064106 ] Zongheng Yang commented on SPARK-2531: -- Github PR: https://github.com/apache/spark/pull/1448 Make BroadcastNestedLoopJoin take into account a BuildSide -- Key: SPARK-2531 URL: https://issues.apache.org/jira/browse/SPARK-2531 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.1 Reporter: Zongheng Yang Assignee: Zongheng Yang Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2154) Worker goes down.
[ https://issues.apache.org/jira/browse/SPARK-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2154: --- Assignee: Aaron Davidson Worker goes down. - Key: SPARK-2154 URL: https://issues.apache.org/jira/browse/SPARK-2154 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.8.1, 0.9.0, 1.0.0 Environment: Spark on cluster of three nodes on Ubuntu 12.04.4 LTS Reporter: siva venkat gogineni Assignee: Aaron Davidson Labels: patch Fix For: 1.1.0, 1.0.2 Attachments: Sccreenhot at various states of driver ..jpg The worker dies when I try to submit more drivers than the allocated cores. When I submit 9 drivers, with one core for each driver, on a cluster having 8 cores altogether, the worker dies as soon as I submit the 9th driver. It works fine until it reaches 8 cores; as soon as I submit the 9th driver, its status remains Submitted and the worker crashes. I understand that we cannot run more drivers than the allocated cores, but the problem here is that instead of the 9th driver being queued, it is executed, and as a result it crashes the worker. Let me know if there is a way to get around this issue, or whether it is being fixed in an upcoming version. Cluster Details: Spark 1.0.0, 2 nodes with 4 cores each. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2154) Worker goes down.
[ https://issues.apache.org/jira/browse/SPARK-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2154. Resolution: Fixed Fix Version/s: 1.1.0 1.0.2 Issue resolved by pull request 1405 [https://github.com/apache/spark/pull/1405] Worker goes down. - Key: SPARK-2154 URL: https://issues.apache.org/jira/browse/SPARK-2154 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.8.1, 0.9.0, 1.0.0 Environment: Spark on a cluster of three nodes on Ubuntu 12.04.4 LTS Reporter: siva venkat gogineni Labels: patch Fix For: 1.0.2, 1.1.0 Attachments: Sccreenhot at various states of driver ..jpg The worker dies when I try to submit more drivers than there are allocated cores. When I submit 9 drivers, each requesting one core, on a cluster with 8 cores altogether, the worker dies as soon as I submit the 9th driver. It works fine until all 8 cores are in use; as soon as I submit the 9th driver, its status remains SUBMITTED and the worker crashes. I understand that we cannot run more drivers than the allocated cores, but the problem is that instead of the 9th driver being queued, it is executed, and as a result it crashes the worker. Let me know if there is a way to work around this issue, or whether it is being fixed in an upcoming version. Cluster details: Spark 1.0.0, 2 nodes with 4 cores each. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (SPARK-2154) Worker goes down.
[ https://issues.apache.org/jira/browse/SPARK-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] siva venkat gogineni closed SPARK-2154. --- Fixed in future releases. Worker goes down. - Key: SPARK-2154 URL: https://issues.apache.org/jira/browse/SPARK-2154 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.8.1, 0.9.0, 1.0.0 Environment: Spark on a cluster of three nodes on Ubuntu 12.04.4 LTS Reporter: siva venkat gogineni Assignee: Aaron Davidson Labels: patch Fix For: 1.1.0, 1.0.2 Attachments: Sccreenhot at various states of driver ..jpg The worker dies when I try to submit more drivers than there are allocated cores. When I submit 9 drivers, each requesting one core, on a cluster with 8 cores altogether, the worker dies as soon as I submit the 9th driver. It works fine until all 8 cores are in use; as soon as I submit the 9th driver, its status remains SUBMITTED and the worker crashes. I understand that we cannot run more drivers than the allocated cores, but the problem is that instead of the 9th driver being queued, it is executed, and as a result it crashes the worker. Let me know if there is a way to work around this issue, or whether it is being fixed in an upcoming version. Cluster details: Spark 1.0.0, 2 nodes with 4 cores each. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2532) Fix issues with consolidated shuffle
Mridul Muralidharan created SPARK-2532: -- Summary: Fix issues with consolidated shuffle Key: SPARK-2532 URL: https://issues.apache.org/jira/browse/SPARK-2532 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: All Reporter: Mridul Muralidharan Assignee: Mridul Muralidharan Priority: Critical Fix For: 1.1.0 Will file a PR with the changes as soon as the merge is done (an earlier merge became outdated in 2 weeks, unfortunately :) ). Consolidated shuffle is broken in multiple ways in Spark:
a) Task failure(s) can cause the state to become inconsistent.
b) Multiple reverts, or a combination of close/revert/close, can leave the state inconsistent (as part of exception/error handling).
c) Some of the API in the block writer causes implementation issues. For example, a revert is always followed by a close, but the implementation tries to keep them separate, creating surface area for errors.
d) Fetching data from consolidated shuffle files can go badly wrong if the file is being actively written to: a segment's length is computed by subtracting its offset from the next offset (or from the file length if it is the last segment), and the latter fails when a fetch happens in parallel with a write. Note that this happens even if there are no task failures of any kind! It usually results in stream corruption or decompression errors.
-- This message was sent by Atlassian JIRA (v6.2#6252)
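Point (d) is the subtle one; a simplified sketch of the failure mode follows (hypothetical names, not the Spark source): segment lengths are derived from neighbouring offsets, so a concurrent append shifts the apparent length of the last segment.
{code}
// Simplified sketch of (d): deriving a segment's length from the next offset
// (or from the current file length for the last segment) is only safe once
// the file is no longer being written to.
object SegmentLengthSketch {
  def segmentLength(offsets: IndexedSeq[Long], fileLength: Long, i: Int): Long =
    if (i == offsets.length - 1)
      fileLength - offsets(i)     // races with a concurrent append: length may grow
    else
      offsets(i + 1) - offsets(i) // races if offsets(i + 1) is not final yet
}
{code}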
[jira] [Updated] (SPARK-2434) Generate runtime warnings for naive implementations
[ https://issues.apache.org/jira/browse/SPARK-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2434: - Assignee: Burak Yavuz Generate runtime warnings for naive implementations --- Key: SPARK-2434 URL: https://issues.apache.org/jira/browse/SPARK-2434 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Xiangrui Meng Assignee: Burak Yavuz Priority: Minor Labels: Starter There is some example code under src/main/scala/org/apache/spark/examples: * LocalALS * LocalFileLR * LocalKMeans * LocalLP * SparkALS * SparkHdfsLR * SparkKMeans * SparkLR These provide naive implementations of some machine learning algorithms that are already covered in MLlib. It is okay to keep them because the implementations are straightforward and easy to read. However, we should generate warning messages at runtime and point users to MLlib's implementations, in case users rely on these examples in practice; see the sketch below. -- This message was sent by Atlassian JIRA (v6.2#6252)
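A minimal sketch of the requested change (the wording and mechanism are assumptions, not the committed patch): each example prints a pointer to the corresponding MLlib implementation on startup.
{code}
// Hypothetical warning helper for the example programs.
object NaiveExampleWarning {
  def show(example: String, mllibClass: String): Unit =
    System.err.println(
      s"WARN: $example is a naive implementation for demonstration only. " +
      s"Please use $mllibClass from MLlib for real applications.")
}

// e.g. at the top of SparkKMeans.main:
// NaiveExampleWarning.show("SparkKMeans", "org.apache.spark.mllib.clustering.KMeans")
{code}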
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064168#comment-14064168 ] Xuefu Zhang commented on SPARK-2420: As to the guava conflict, HIVE-7387 has more details and analysis. Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because the Spark build contains versions of libraries that are vastly different from those of the current major Hadoop version. It would be nice if we could choose versions in line with Hadoop's, or shade them in the assembly. Here is the wish list: 1. Upgrade the protobuf version to 2.5.0 from the current 2.4.1. 2. Shade Spark's jetty and servlet dependencies in the assembly. 3. Guava version difference: Spark is using a higher version, and I'm not sure what the best solution for this is. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied to Spark in order to make Spark work with Hive. It gives an idea of the scope of the changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2154) Worker goes down.
[ https://issues.apache.org/jira/browse/SPARK-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064172#comment-14064172 ] Patrick Wendell commented on SPARK-2154: [~talk2siva8] Yes, that's correct. Worker goes down. - Key: SPARK-2154 URL: https://issues.apache.org/jira/browse/SPARK-2154 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.8.1, 0.9.0, 1.0.0 Environment: Spark on a cluster of three nodes on Ubuntu 12.04.4 LTS Reporter: siva venkat gogineni Assignee: Aaron Davidson Labels: patch Fix For: 1.1.0, 1.0.2 Attachments: Sccreenhot at various states of driver ..jpg The worker dies when I try to submit more drivers than there are allocated cores. When I submit 9 drivers, each requesting one core, on a cluster with 8 cores altogether, the worker dies as soon as I submit the 9th driver. It works fine until all 8 cores are in use; as soon as I submit the 9th driver, its status remains SUBMITTED and the worker crashes. I understand that we cannot run more drivers than the allocated cores, but the problem is that instead of the 9th driver being queued, it is executed, and as a result it crashes the worker. Let me know if there is a way to work around this issue, or whether it is being fixed in an upcoming version. Cluster details: Spark 1.0.0, 2 nodes with 4 cores each. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2495) Ability to re-create ML models
[ https://issues.apache.org/jira/browse/SPARK-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2495: - Assignee: Alexander Albul Ability to re-create ML models -- Key: SPARK-2495 URL: https://issues.apache.org/jira/browse/SPARK-2495 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.1 Reporter: Alexander Albul Assignee: Alexander Albul Hi everyone. Previously (prior to Spark 1.0) we were working with MLlib like this: 1) Calculate the model (a costly operation). 2) Take the model and collect its fields, like weights, intercept, etc. 3) Store the model somewhere in our own format. 4) Do predictions by loading the model attributes, creating a new model, and predicting with it. Now I see that the models' constructors have the *private* modifier, so models cannot be created from outside. If you want to hide implementation details and keep these constructors as developer API, why not at least create a method that takes the weights and intercept (for example) and materializes the model? A good example of the kind of model I am talking about is *LinearRegressionModel*. I know that the *LinearRegressionWithSGD* class has a *createModel* method, but the problem is that it has the *protected* modifier as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
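What the reporter is asking for, as a sketch (this assumes a public constructor, which is exactly what Spark 1.0 does not expose; the weight values are placeholders):
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LinearRegressionModel

// Recreate a previously trained model from externally stored attributes.
val weights = Vectors.dense(0.5, -1.2) // loaded from the user's own storage
val model = new LinearRegressionModel(weights, 0.1) // does not compile while the constructor is private
val prediction = model.predict(Vectors.dense(1.0, 2.0))
{code}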
[jira] [Updated] (SPARK-1997) Update breeze to version 0.8.1
[ https://issues.apache.org/jira/browse/SPARK-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1997: - Target Version/s: 1.1.0 Update breeze to version 0.8.1 -- Key: SPARK-1997 URL: https://issues.apache.org/jira/browse/SPARK-1997 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Guoqiang Li Assignee: Guoqiang Li {{breeze 0.7}} does not support {{scala 2.11}}. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2533) Show summary of locality level of completed tasks in each stage page of the web UI
[ https://issues.apache.org/jira/browse/SPARK-2533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masayoshi TSUZUKI updated SPARK-2533: - Summary: Show summary of locality level of completed tasks in each stage page of the web UI (was: Show summary of locality level of completed tasks in each stage page of the web UI) Show summary of locality level of completed tasks in each stage page of the web UI -- Key: SPARK-2533 URL: https://issues.apache.org/jira/browse/SPARK-2533 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.0.0 Reporter: Masayoshi TSUZUKI Priority: Minor When the number of tasks is very large, it is impossible to tell from the stage page of the web UI how many tasks were executed at each locality level (PROCESS_LOCAL/NODE_LOCAL/RACK_LOCAL). It would be better to show a summary of task locality levels in the web UI. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2533) Show summary of locality level of completed tasks in each stage page of the web UI
Masayoshi TSUZUKI created SPARK-2533: Summary: Show summary of locality level of completed tasks in each stage page of the web UI Key: SPARK-2533 URL: https://issues.apache.org/jira/browse/SPARK-2533 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.0.0 Reporter: Masayoshi TSUZUKI Priority: Minor When the number of tasks is very large, it is impossible to tell from the stage page of the web UI how many tasks were executed at each locality level (PROCESS_LOCAL/NODE_LOCAL/RACK_LOCAL). It would be better to show a summary of task locality levels in the web UI. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2534) Avoid pulling in the entire RDD in groupByKey
Reynold Xin created SPARK-2534: -- Summary: Avoid pulling in the entire RDD in groupByKey Key: SPARK-2534 URL: https://issues.apache.org/jira/browse/SPARK-2534 Project: Spark Issue Type: Bug Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical The way groupByKey is written actually pulls the entire PairRDDFunctions into the 3 closures, sometimes resulting in gigantic task sizes:
{code}
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  def createCombiner(v: V) = ArrayBuffer(v)
  def mergeValue(buf: ArrayBuffer[V], v: V) = buf += v
  def mergeCombiners(c1: ArrayBuffer[V], c2: ArrayBuffer[V]) = c1 ++ c2
  val bufs = combineByKey[ArrayBuffer[V]](
    createCombiner _, mergeValue _, mergeCombiners _, partitioner, mapSideCombine=false)
  bufs.mapValues(_.toIterable)
}
{code}
Changing the functions from def to val would solve it. -- This message was sent by Atlassian JIRA (v6.2#6252)
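Why the def-to-val change helps, as a standalone sketch (hypothetical class, not the Spark patch): eta-expanding a nested def produces a function object that closes over the enclosing instance, while a val already is a free-standing function object.
{code}
import scala.collection.mutable.ArrayBuffer

class Enclosing extends Serializable {
  val bigState = new Array[Byte](4 << 20) // stand-in for PairRDDFunctions' state

  // `createCombiner _` eta-expands a nested def; the nested def is compiled as
  // a method of the enclosing class, so the resulting closure keeps a reference
  // to `this`, dragging bigState into the serialized task.
  def combinerFromDef: Int => ArrayBuffer[Int] = {
    def createCombiner(v: Int) = ArrayBuffer(v)
    createCombiner _
  }

  // A val is already a function object and references nothing on `this`.
  def combinerFromVal: Int => ArrayBuffer[Int] = {
    val createCombiner = (v: Int) => ArrayBuffer(v)
    createCombiner
  }
}
{code}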
[jira] [Commented] (SPARK-2534) Avoid pulling in the entire RDD in groupByKey
[ https://issues.apache.org/jira/browse/SPARK-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064333#comment-14064333 ] Sandy Ryza commented on SPARK-2534: --- Yowza Avoid pulling in the entire RDD in groupByKey - Key: SPARK-2534 URL: https://issues.apache.org/jira/browse/SPARK-2534 Project: Spark Issue Type: Bug Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical The way groupByKey is written actually pulls the entire PairRDDFunctions into the 3 closures, sometimes resulting in gigantic task sizes:
{code}
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  def createCombiner(v: V) = ArrayBuffer(v)
  def mergeValue(buf: ArrayBuffer[V], v: V) = buf += v
  def mergeCombiners(c1: ArrayBuffer[V], c2: ArrayBuffer[V]) = c1 ++ c2
  val bufs = combineByKey[ArrayBuffer[V]](
    createCombiner _, mergeValue _, mergeCombiners _, partitioner, mapSideCombine=false)
  bufs.mapValues(_.toIterable)
}
{code}
Changing the functions from def to val would solve it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2501) Handle stage re-submissions properly in the UI
[ https://issues.apache.org/jira/browse/SPARK-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064350#comment-14064350 ] Masayoshi TSUZUKI commented on SPARK-2501: -- [SPARK-2299] seems to include the problem that the key of some hash maps should be stageId + attemptId instead of stageId only. Handle stage re-submissions properly in the UI -- Key: SPARK-2501 URL: https://issues.apache.org/jira/browse/SPARK-2501 Project: Spark Issue Type: Bug Components: Web UI Reporter: Patrick Wendell Assignee: Masayoshi TSUZUKI Priority: Critical -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2535) Add StringComparison case to NullPropagation.
Takuya Ueshin created SPARK-2535: Summary: Add StringComparison case to NullPropagation. Key: SPARK-2535 URL: https://issues.apache.org/jira/browse/SPARK-2535 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin Cases for {{StringComparison}} expressions involving {{null}} literals could be added to {{NullPropagation}}. -- This message was sent by Atlassian JIRA (v6.2#6252)
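A toy, self-contained illustration of the idea (a hypothetical mini-AST, not Catalyst code): a string comparison with a null literal operand can be folded to a null literal during optimization.
{code}
// Toy sketch of null propagation for string comparisons.
sealed trait Expr
case class Lit(value: Any) extends Expr
case class StartsWith(str: Expr, prefix: Expr) extends Expr

object NullPropagationSketch {
  def propagateNulls(e: Expr): Expr = e match {
    case StartsWith(Lit(null), _) => Lit(null) // null STARTSWITH x => null
    case StartsWith(_, Lit(null)) => Lit(null) // x STARTSWITH null => null
    case other                    => other
  }
}
{code}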
[jira] [Comment Edited] (SPARK-2501) Handle stage re-submissions properly in the UI
[ https://issues.apache.org/jira/browse/SPARK-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064350#comment-14064350 ] Masayoshi TSUZUKI edited comment on SPARK-2501 at 7/16/14 11:58 PM: [SPARK-2299] seems to include the problem in JobProgressListener that the key of some hash maps should be stageId + attemptId instead of stageId only. was (Author: tsudukim): [SPARK-2299] seems to include the problem that the key of some hash maps should be stageId + attemptId instead of stageId only. Handle stage re-submissions properly in the UI -- Key: SPARK-2501 URL: https://issues.apache.org/jira/browse/SPARK-2501 Project: Spark Issue Type: Bug Components: Web UI Reporter: Patrick Wendell Assignee: Masayoshi TSUZUKI Priority: Critical -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2536) Update the MLlib page of Spark website
Xiangrui Meng created SPARK-2536: Summary: Update the MLlib page of Spark website Key: SPARK-2536 URL: https://issues.apache.org/jira/browse/SPARK-2536 Project: Spark Issue Type: Improvement Components: Documentation, MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng It stills shows v0.9. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2536) Update the MLlib page of Spark website
[ https://issues.apache.org/jira/browse/SPARK-2536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2536: - Description: It still shows v0.9. (was: It stills shows v0.9.) Update the MLlib page of Spark website -- Key: SPARK-2536 URL: https://issues.apache.org/jira/browse/SPARK-2536 Project: Spark Issue Type: Improvement Components: Documentation, MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng It still shows v0.9. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2538) External aggregation in Python
Davies Liu created SPARK-2538: - Summary: External aggregation in Python Key: SPARK-2538 URL: https://issues.apache.org/jira/browse/SPARK-2538 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.0.0, 1.0.1 Reporter: Davies Liu Fix For: 1.0.1, 1.0.0 For huge reduce tasks, users will get an out-of-memory exception when all the data cannot fit in memory. It should spill some of the data to disk and then merge the pieces back together, just like what we do in Scala. -- This message was sent by Atlassian JIRA (v6.2#6252)
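The Scala-side behaviour being referenced, as a hedged toy sketch (plain Scala, hypothetical helper; not Spark's ExternalAppendOnlyMap): spill partial aggregates to disk once the in-memory map grows too large, then merge the spills at the end.
{code}
import java.io._
import scala.collection.mutable

object ExternalReduceSketch {
  // Toy external aggregation: word-count style reduce with spilling.
  def externalReduce(items: Iterator[(String, Int)], maxKeys: Int): Map[String, Int] = {
    val spills = mutable.Buffer[File]()
    var current = mutable.Map[String, Int]()

    def spill(): Unit = {
      val f = File.createTempFile("agg-spill", ".tmp")
      val out = new ObjectOutputStream(new FileOutputStream(f))
      out.writeObject(current.toMap)
      out.close()
      spills += f
      current = mutable.Map[String, Int]()
    }

    for ((k, v) <- items) {
      current(k) = current.getOrElse(k, 0) + v
      if (current.size >= maxKeys) spill() // keep memory bounded
    }

    // Merge the in-memory remainder with all spilled partial maps.
    val merged = mutable.Map[String, Int]() ++ current
    for (f <- spills) {
      val in = new ObjectInputStream(new FileInputStream(f))
      for ((k, v) <- in.readObject().asInstanceOf[Map[String, Int]])
        merged(k) = merged.getOrElse(k, 0) + v
      in.close()
      f.delete()
    }
    merged.toMap
  }
}
{code}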
[jira] [Commented] (SPARK-2495) Ability to re-create ML models
[ https://issues.apache.org/jira/browse/SPARK-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064374#comment-14064374 ] Alexander Albul commented on SPARK-2495: Hi Meng, here is the list of models that could potentially be created by users by hand: * LogisticRegressionModel * NaiveBayesModel * SVMModel * KMeansModel * MatrixFactorizationModel * LassoModel * LinearRegressionModel * RidgeRegressionModel I propose to open all of them. Even if the final form of the NaiveBayesModel parameters has not been decided yet, that should not affect the model constructor's visibility, IMHO. We can make it visible and mark it with the DeveloperApi annotation, for example. Anyway, I propose to change the constructor visibility at least for all models except NaiveBayesModel. Ability to re-create ML models -- Key: SPARK-2495 URL: https://issues.apache.org/jira/browse/SPARK-2495 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.1 Reporter: Alexander Albul Assignee: Alexander Albul Hi everyone. Previously (prior to Spark 1.0) we were working with MLlib like this: 1) Calculate the model (a costly operation). 2) Take the model and collect its fields, like weights, intercept, etc. 3) Store the model somewhere in our own format. 4) Do predictions by loading the model attributes, creating a new model, and predicting with it. Now I see that the models' constructors have the *private* modifier, so models cannot be created from outside. If you want to hide implementation details and keep these constructors as developer API, why not at least create a method that takes the weights and intercept (for example) and materializes the model? A good example of the kind of model I am talking about is *LinearRegressionModel*. I know that the *LinearRegressionWithSGD* class has a *createModel* method, but the problem is that it has the *protected* modifier as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2537) Workaround Timezone specific Hive tests
[ https://issues.apache.org/jira/browse/SPARK-2537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-2537: -- Description: Several Hive tests in {{HiveCompatibilitySuite}} are timezone-sensitive: - {{timestamp_1}} - {{timestamp_2}} - {{timestamp_3}} - {{timestamp_udf}} Their answers differ between different timezones. Caching golden answers naively causes build failures in other timezones. Currently these tests are blacklisted. A not-so-clever solution is to cache golden answers of all timezones for these tests, then select the right version for the current build according to the system timezone. was: Several Hive tests in {{HiveCompatibilitySuite}} are timezone-sensitive: - {{timestamp_1}} - {{timestamp_2}} - {{timestamp_3}} - {{timestamp_udf}} Their answers differ between different timezones. Caching golden answers naively causes build failures in other timezones. A not-so-clever solution is to cache golden answers of all timezones for these tests, then select the right version for the current build according to the system timezone. Workaround Timezone specific Hive tests --- Key: SPARK-2537 URL: https://issues.apache.org/jira/browse/SPARK-2537 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.1, 1.1.0 Reporter: Cheng Lian Priority: Minor Several Hive tests in {{HiveCompatibilitySuite}} are timezone-sensitive: - {{timestamp_1}} - {{timestamp_2}} - {{timestamp_3}} - {{timestamp_udf}} Their answers differ between different timezones. Caching golden answers naively causes build failures in other timezones. Currently these tests are blacklisted. A not-so-clever solution is to cache golden answers of all timezones for these tests, then select the right version for the current build according to the system timezone. -- This message was sent by Atlassian JIRA (v6.2#6252)
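That "not-so-clever" workaround could look roughly like this (hypothetical helper and file layout, purely illustrative): pick the cached golden answer that matches the timezone of the current build.
{code}
import java.util.TimeZone

object GoldenAnswerSelector {
  // Select the golden answer directory matching the build machine's timezone.
  def goldenAnswerPath(testName: String): String = {
    val tz = TimeZone.getDefault.getID.replace('/', '-') // e.g. "America-Los_Angeles"
    s"src/test/resources/golden-by-tz/$tz/$testName"
  }
}
{code}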
[jira] [Commented] (SPARK-2535) Add StringComparison case to NullPropagation.
[ https://issues.apache.org/jira/browse/SPARK-2535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064381#comment-14064381 ] Takuya Ueshin commented on SPARK-2535: -- PR: https://github.com/apache/spark/pull/1451 Add StringComparison case to NullPropagation. - Key: SPARK-2535 URL: https://issues.apache.org/jira/browse/SPARK-2535 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin Cases for {{StringComparison}} expressions involving {{null}} literals could be added to {{NullPropagation}}. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2539) ConnectionManager should handle Uncaught Exception
Kousuke Saruta created SPARK-2539: - Summary: ConnectionManager should handle Uncaught Exception Key: SPARK-2539 URL: https://issues.apache.org/jira/browse/SPARK-2539 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Kousuke Saruta ConnectionManager uses a thread pool, and worker threads run on threads spawned by the pool. In the current implementation, if an uncaught exception is thrown from one of those threads, nobody handles it. If the exception is fatal, it should be handled and the executor should exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
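One way such handling could look (a sketch under assumptions, not the merged patch): install an UncaughtExceptionHandler via the pool's ThreadFactory and exit the process on fatal errors.
{code}
import java.util.concurrent.{Executors, ThreadFactory}

object UncaughtHandlerSketch {
  val factory = new ThreadFactory {
    override def newThread(r: Runnable): Thread = {
      val t = new Thread(r, "connection-manager-worker")
      t.setDaemon(true)
      t.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
        override def uncaughtException(thread: Thread, e: Throwable): Unit = e match {
          case fatal: Error =>
            // Fatal JVM error: log and terminate so the executor doesn't limp on.
            System.err.println(s"Fatal error in ${thread.getName}: $fatal")
            System.exit(1)
          case other =>
            System.err.println(s"Uncaught exception in ${thread.getName}: $other")
        }
      })
      t
    }
  }
  val pool = Executors.newFixedThreadPool(4, factory)
}
{code}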
[jira] [Created] (SPARK-2540) Add support for more types in unwrapData of HiveUDF
Cheng Hao created SPARK-2540: Summary: Add support for more types in unwrapData of HiveUDF Key: SPARK-2540 URL: https://issues.apache.org/jira/browse/SPARK-2540 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Minor In the function HiveInspectors.unwrapData, HiveVarcharObjectInspector and HiveDecimalObjectInspector are currently not supported. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2433) In MLlib, the Naive Bayes implementation in Spark 0.9.1 has a bug.
[ https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064421#comment-14064421 ] Xiangrui Meng commented on SPARK-2433: -- PR for branch-0.9: https://github.com/apache/spark/pull/1453 In MLlib, the Naive Bayes implementation in Spark 0.9.1 has a bug. Key: SPARK-2433 URL: https://issues.apache.org/jira/browse/SPARK-2433 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 0.9.1 Environment: Any Reporter: Rahul K Bhojwani Labels: easyfix, test Original Estimate: 1h Remaining Estimate: 1h I don't have much experience with reporting errors; this is my first time. If something is not clear, please feel free to contact me (details given below). In the PySpark MLlib library. Path: \spark-0.9.1\python\pyspark\mllib\classification.py Class: NaiveBayesModel Method: self.predict
Earlier implementation:
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta), x)))
New implementation No. 1:
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta), x)))
New implementation No. 2:
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + dot(x, self.theta.T))
Explanation: No. 1 is correct according to me; I don't know about No. 2.
Error one: The matrix self.theta has dimension [n_classes, n_features], while the matrix x has dimension [1, n_features]. Taking the dot product will not work, as it is [1, n_features] x [n_classes, n_features]; it will always give the error "ValueError: matrices are not aligned". In the commented example given in classification.py, n_classes = n_features = 2, which is why there is no error. Both implementation No. 1 and implementation No. 2 take care of it.
Error two: The basic implementation of naive Bayes is P(class_n | sample) = count_feature_1 * P(feature_1 | class_n) * ... * count_feature_n * P(feature_n | class_n) * P(class_n) / P(sample) (a constant), taking the class with the max value. That's what implementation No. 1 is doing. Implementation No. 2 basically takes the class with the max value of exp(count_feature_1) * P(feature_1 | class_n) * ... * exp(count_feature_n) * P(feature_n | class_n) * P(class_n). I don't know if it gives the exact result.
Thanks, Rahul Bhojwani rahulbhojwani2...@gmail.com -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2433) In MLlib, the Naive Bayes implementation in Spark 0.9.1 has a bug.
[ https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2433: - Fix Version/s: 1.0.0 In MLlib, the Naive Bayes implementation in Spark 0.9.1 has a bug. Key: SPARK-2433 URL: https://issues.apache.org/jira/browse/SPARK-2433 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 0.9.1 Environment: Any Reporter: Rahul K Bhojwani Labels: easyfix, test Fix For: 1.0.0 Original Estimate: 1h Remaining Estimate: 1h I don't have much experience with reporting errors; this is my first time. If something is not clear, please feel free to contact me (details given below). In the PySpark MLlib library. Path: \spark-0.9.1\python\pyspark\mllib\classification.py Class: NaiveBayesModel Method: self.predict
Earlier implementation:
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta), x)))
New implementation No. 1:
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta), x)))
New implementation No. 2:
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + dot(x, self.theta.T))
Explanation: No. 1 is correct according to me; I don't know about No. 2.
Error one: The matrix self.theta has dimension [n_classes, n_features], while the matrix x has dimension [1, n_features]. Taking the dot product will not work, as it is [1, n_features] x [n_classes, n_features]; it will always give the error "ValueError: matrices are not aligned". In the commented example given in classification.py, n_classes = n_features = 2, which is why there is no error. Both implementation No. 1 and implementation No. 2 take care of it.
Error two: The basic implementation of naive Bayes is P(class_n | sample) = count_feature_1 * P(feature_1 | class_n) * ... * count_feature_n * P(feature_n | class_n) * P(class_n) / P(sample) (a constant), taking the class with the max value. That's what implementation No. 1 is doing. Implementation No. 2 basically takes the class with the max value of exp(count_feature_1) * P(feature_1 | class_n) * ... * exp(count_feature_n) * P(feature_n | class_n) * P(class_n). I don't know if it gives the exact result.
Thanks, Rahul Bhojwani rahulbhojwani2...@gmail.com -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2438) Streaming + MLlib
[ https://issues.apache.org/jira/browse/SPARK-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2438: - Assignee: Jeremy Freeman Streaming + MLlib - Key: SPARK-2438 URL: https://issues.apache.org/jira/browse/SPARK-2438 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Jeremy Freeman Assignee: Jeremy Freeman Labels: features This is a ticket to track progress on developing streaming analyses in MLlib. Many streaming applications benefit from or require fitting models online, where the parameters of a model (e.g. regression, clustering) are updated continually as new data arrive. This can be accomplished by incorporating MLlib algorithms into model-updating operations over DStreams. In some cases this can be achieved using existing updaters (e.g. those based on SGD), but in other cases it will require custom update rules (e.g. for KMeans). The goal is to have streaming versions of many common algorithms, in particular regression, classification, clustering, and possibly dimensionality reduction. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2541) Standalone mode can't access secure HDFS anymore
Thomas Graves created SPARK-2541: Summary: Standalone mode can't access secure HDFS anymore Key: SPARK-2541 URL: https://issues.apache.org/jira/browse/SPARK-2541 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0, 1.0.1 Reporter: Thomas Graves In Spark 0.9.x you could access secure HDFS from a standalone deployment; that doesn't work in 1.x anymore. It looks like the issue is in SparkHadoopUtil.runAsSparkUser. Previously it wouldn't do the doAs if currentUser == user. I'm not sure how this affects the case when the daemons run as a superuser but SPARK_USER is set to someone else. -- This message was sent by Atlassian JIRA (v6.2#6252)
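A sketch of the suspected regression site (hedged and heavily simplified; not the actual SparkHadoopUtil source): the 0.9.x behaviour skipped the doAs when the target user was already the current user, preserving the existing kerberized UGI, while always wrapping in doAs replaces it.
{code}
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

object RunAsSparkUserSketch {
  def runAsSparkUser(user: String)(func: () => Unit): Unit = {
    val current = UserGroupInformation.getCurrentUser
    if (current.getShortUserName == user) {
      // 0.9.x-style behaviour: keep the current (possibly kerberized) UGI.
      func()
    } else {
      // Otherwise impersonate the requested user.
      val ugi = UserGroupInformation.createRemoteUser(user)
      ugi.doAs(new PrivilegedExceptionAction[Unit] {
        override def run(): Unit = func()
      })
    }
  }
}
{code}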
[jira] [Commented] (SPARK-2406) Partitioned Parquet Support
[ https://issues.apache.org/jira/browse/SPARK-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064450#comment-14064450 ] Pat McDonough commented on SPARK-2406: -- Okay, so it sounds like these are completely different use cases then, which might deserve their own JIRAs: * Native support for partitioned Parquet tables (probably what this JIRA is meant to address) * Automatic use of Native Parquet readers when using a Hive parquet table (what my previous comment described) Does that sound correct? Partitioned Parquet Support --- Key: SPARK-2406 URL: https://issues.apache.org/jira/browse/SPARK-2406 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Priority: Critical -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2481) The environment variable SPARK_HISTORY_OPTS is overridden in start-history-server.sh
[ https://issues.apache.org/jira/browse/SPARK-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14064471#comment-14064471 ] Masayoshi TSUZUKI commented on SPARK-2481: -- Ah, I understand what you meant, and I agree with you. The environment variable SPARK_HISTORY_OPTS is overridden in start-history-server.sh -- Key: SPARK-2481 URL: https://issues.apache.org/jira/browse/SPARK-2481 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Guoqiang Li Assignee: Guoqiang Li If we have the following code in conf/spark-env.sh: {{export SPARK_HISTORY_OPTS=-DSpark.history.XX=XX}} then the environment variable SPARK_HISTORY_OPTS is overridden in [start-history-server.sh|https://github.com/apache/spark/blob/master/sbin/start-history-server.sh]:
{code}
if [ $# != 0 ]; then
  echo "Using command line arguments for setting the log directory is deprecated. Please"
  echo "set the spark.history.fs.logDirectory configuration option instead."
  export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=$1"
fi
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)