[jira] [Commented] (SPARK-5387) parquet writer runs into OOM during writing when number of rows is large

2015-03-18 Thread Chaozhong Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366973#comment-14366973
 ] 

Chaozhong Yang commented on SPARK-5387:
---

I also encountered the same issue.

 parquet writer runs into OOM during writing when number of rows is large
 

 Key: SPARK-5387
 URL: https://issues.apache.org/jira/browse/SPARK-5387
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 1.1.1
Reporter: Shirley Wu

 When the number of records in the RDD is large, saveAsParquet will run into an OOM error.
 Here is the stack trace:
 15/01/23 10:00:02 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, 
 hdc2-s3.niara.com): java.lang.OutOfMemoryError: Java heap space
 
 parquet.bytes.CapacityByteArrayOutputStream.initSlabs(CapacityByteArrayOutputStream.java:65)
 
 parquet.bytes.CapacityByteArrayOutputStream.init(CapacityByteArrayOutputStream.java:57)
 
 parquet.column.values.rle.RunLengthBitPackingHybridEncoder.init(RunLengthBitPackingHybridEncoder.java:125)
 
 parquet.column.values.rle.RunLengthBitPackingHybridValuesWriter.init(RunLengthBitPackingHybridValuesWriter.java:36)
 
 parquet.column.ParquetProperties.getColumnDescriptorValuesWriter(ParquetProperties.java:61)
 parquet.column.impl.ColumnWriterImpl.init(ColumnWriterImpl.java:73)
 
 parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68)
 
 parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
 
 parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.init(MessageColumnIO.java:124)
 parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:315)
 
 parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:106)
 
 parquet.hadoop.InternalParquetRecordWriter.checkBlockSizeReached(InternalParquetRecordWriter.java:126)
 
 parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:117)
 parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
 parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
 
 org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:303)
 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 org.apache.spark.scheduler.Task.run(Task.scala:54)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:180)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 It seems the writeShard() API needs to flush to disk periodically.
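 A workaround sketch, not a fix for the writer itself: shrink the Parquet row-group buffer so the writer flushes to disk more often and buffers less per task. It assumes an existing SparkContext {{sc}} and a SchemaRDD {{data}}; the 64 MB value and the output path are illustrative only.
 {code}
 // parquet.block.size controls how much data parquet-mr buffers in memory per row group,
 // so lowering it makes the writer flush to disk more frequently.
 sc.hadoopConfiguration.setInt("parquet.block.size", 64 * 1024 * 1024) // 64 MB, illustrative
 data.saveAsParquetFile("/tmp/output.parquet")                         // placeholder path
 {code}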



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5818) unable to use add jar in hql

2015-03-18 Thread Venkata Ramana G (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366811#comment-14366811
 ] 

Venkata Ramana G commented on SPARK-5818:
-

TranslatingClassLoader is used for spark-shell, while Hive's current "add jar" 
works only with a URLClassLoader.

So, in the case of spark-shell, the jar has to be added directly to the Spark 
driver's class loader or to its parent loader.
I am working on this.
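For reference, a minimal sketch of the kind of approach described above: adding a jar URL to a URLClassLoader by reflectively invoking its protected addURL method. Which loader to target from spark-shell is an assumption here, not the actual fix.
{code}
import java.io.File
import java.net.{URL, URLClassLoader}

// Sketch only: reflectively invoke the protected URLClassLoader.addURL.
// The caller must supply a URLClassLoader that is reachable from the driver.
def addJarTo(loader: URLClassLoader, jarPath: String): Unit = {
  val addUrl = classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
  addUrl.setAccessible(true)
  addUrl.invoke(loader, new File(jarPath).toURI.toURL)
}
{code}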

 unable to use add jar in hql
 --

 Key: SPARK-5818
 URL: https://issues.apache.org/jira/browse/SPARK-5818
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.2.1
Reporter: pengxu

 In Spark 1.2.1 and 1.2.0, it is not possible to use the Hive command "add jar" 
 in hql.
 It seems that the problem described in SPARK-2219 still exists.
 The problem can be reproduced as described below. Suppose the jar file 
 is named brickhouse-0.6.0.jar and is placed in the /tmp directory.
 {code}
 spark-shell> import org.apache.spark.sql.hive._
 spark-shell> val sqlContext = new HiveContext(sc)
 spark-shell> import sqlContext._
 spark-shell> hql("add jar /tmp/brickhouse-0.6.0.jar")
 {code}
 The error message is shown below.
 {code:title=Error Log}
 15/02/15 01:36:31 ERROR SessionState: Unable to register 
 /tmp/brickhouse-0.6.0.jar
 Exception: org.apache.spark.repl.SparkIMain$TranslatingClassLoader cannot be 
 cast to java.net.URLClassLoader
 java.lang.ClassCastException: 
 org.apache.spark.repl.SparkIMain$TranslatingClassLoader cannot be cast to 
 java.net.URLClassLoader
   at 
 org.apache.hadoop.hive.ql.exec.Utilities.addToClassPath(Utilities.java:1921)
   at 
 org.apache.hadoop.hive.ql.session.SessionState.registerJar(SessionState.java:599)
   at 
 org.apache.hadoop.hive.ql.session.SessionState$ResourceType$2.preHook(SessionState.java:658)
   at 
 org.apache.hadoop.hive.ql.session.SessionState.add_resource(SessionState.java:732)
   at 
 org.apache.hadoop.hive.ql.session.SessionState.add_resource(SessionState.java:717)
   at 
 org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:54)
   at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:319)
   at 
 org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)
   at 
 org.apache.spark.sql.hive.execution.AddJar.sideEffectResult$lzycompute(commands.scala:74)
   at 
 org.apache.spark.sql.hive.execution.AddJar.sideEffectResult(commands.scala:73)
   at 
 org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
   at org.apache.spark.sql.hive.execution.AddJar.execute(commands.scala:68)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
   at 
 org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
   at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:108)
   at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:102)
   at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:106)
   at 
 $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:24)
   at 
 $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:29)
   at 
 $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:31)
   at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:33)
   at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:35)
   at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:37)
   at $line30.$read$$iwC$$iwC$$iwC$$iwC.init(console:39)
   at $line30.$read$$iwC$$iwC$$iwC.init(console:41)
   at $line30.$read$$iwC$$iwC.init(console:43)
   at $line30.$read$$iwC.init(console:45)
   at $line30.$read.init(console:47)
   at $line30.$read$.init(console:51)
   at $line30.$read$.clinit(console)
   at $line30.$eval$.init(console:7)
   at $line30.$eval$.clinit(console)
   at $line30.$eval.$print(console)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)
   at 
 

[jira] [Created] (SPARK-6396) Add timeout control for broadcast

2015-03-18 Thread Jun Fang (JIRA)
Jun Fang created SPARK-6396:
---

 Summary: Add timeout control for broadcast
 Key: SPARK-6396
 URL: https://issues.apache.org/jira/browse/SPARK-6396
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager, Spark Core
Affects Versions: 1.3.0, 1.3.1
Reporter: Jun Fang
Priority: Minor


TorrentBroadcast uses the fetchBlockSync method of BlockTransferService.scala, which 
calls Await.result(result.future, Duration.Inf). In a production environment this 
may cause a hang when the driver and executor are in different data centers. A 
timeout here would be better.
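A minimal sketch of the idea, not the actual patch: bound the wait with a configurable timeout so a lost remote end surfaces as a TimeoutException instead of an indefinite hang. The configuration key below is a made-up placeholder.
{code}
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

// "spark.broadcast.fetchTimeout" is a hypothetical key, used only for illustration.
val timeout = sys.props.getOrElse("spark.broadcast.fetchTimeout", "120").toLong.seconds
val result = Promise[Array[Byte]]()              // would be completed by the fetch callback
val block = Await.result(result.future, timeout) // throws TimeoutException instead of hanging
{code}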



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6388) Spark 1.3 + Hadoop 2.6 Can't work on Java 8_40

2015-03-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366982#comment-14366982
 ] 

Sean Owen commented on SPARK-6388:
--

I am just using JDK 8 to compile (not targeting Java 8 in javac) and running on 
Java 8. I don't think there is a Java 8 problem in how Spark uses Hadoop or how 
Spark runs. You can see that Spark + Hadoop even has Java 8-specific tests you can 
enable if you want.

 Spark 1.3 + Hadoop 2.6 Can't work on Java 8_40
 --

 Key: SPARK-6388
 URL: https://issues.apache.org/jira/browse/SPARK-6388
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Submit, YARN
Affects Versions: 1.3.0
 Environment: 1. Linux version 3.16.0-30-generic (buildd@komainu) (gcc 
 version 4.9.1 (Ubuntu 4.9.1-16ubuntu6) ) #40-Ubuntu SMP Mon Jan 12 22:06:37 
 UTC 2015
 2. Oracle Java 8 update 40  for Linux X64
 3. Scala 2.10.5
 4. Hadoop 2.6 (pre-build version)
Reporter: John
   Original Estimate: 24h
  Remaining Estimate: 24h

 I built Apache Spark 1.3 manually.
 ---
 JAVA_HOME=PATH_TO_JAVA8
 mvn clean package -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -DskipTests
 ---
 Something goes wrong; Akka always tells me:
 ---
 15/03/17 21:28:10 WARN remote.ReliableDeliverySupervisor: Association with 
 remote system [akka.tcp://sparkYarnAM@Server2:42161] has failed, address is 
 now gated for [5000] ms. Reason is: [Disassociated].
 ---
 I built another version of Spark 1.3 + Hadoop 2.6 under Java 7.
 Everything works well.
 Logs
 ---
 15/03/17 21:27:06 INFO spark.SparkContext: Running Spark version 1.3.0
 15/03/17 21:27:07 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 15/03/17 21:27:08 INFO spark.SecurityManager: Changing view Servers to: hduser
 15/03/17 21:27:08 INFO spark.SecurityManager: Changing modify Servers to: 
 hduser
 15/03/17 21:27:08 INFO spark.SecurityManager: SecurityManager: authentication 
 disabled; ui Servers disabled; users with view permissions: Set(hduser); 
 users with modify permissions: Set(hduser)
 15/03/17 21:27:08 INFO slf4j.Slf4jLogger: Slf4jLogger started
 15/03/17 21:27:08 INFO Remoting: Starting remoting
 15/03/17 21:27:09 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://sparkDriver@Server3:37951]
 15/03/17 21:27:09 INFO util.Utils: Successfully started service 'sparkDriver' 
 on port 37951.
 15/03/17 21:27:09 INFO spark.SparkEnv: Registering MapOutputTracker
 15/03/17 21:27:09 INFO spark.SparkEnv: Registering BlockManagerMaster
 15/03/17 21:27:09 INFO storage.DiskBlockManager: Created local directory at 
 /tmp/spark-0db692bb-cd02-40c8-a8f0-3813c6da18e2/blockmgr-a1d0ad23-ab76-4177-80a0-a6f982a64d80
 15/03/17 21:27:09 INFO storage.MemoryStore: MemoryStore started with capacity 
 265.1 MB
 15/03/17 21:27:09 INFO spark.HttpFileServer: HTTP File server directory is 
 /tmp/spark-502ef3f8-b8cd-45cf-b1df-97df297cdb35/httpd-6303e24d-4b2b-4614-bb1d-74e8d331189b
 15/03/17 21:27:09 INFO spark.HttpServer: Starting HTTP Server
 15/03/17 21:27:09 INFO server.Server: jetty-8.y.z-SNAPSHOT
 15/03/17 21:27:10 INFO server.AbstractConnector: Started 
 SocketConnector@0.0.0.0:48000
 15/03/17 21:27:10 INFO util.Utils: Successfully started service 'HTTP file 
 server' on port 48000.
 15/03/17 21:27:10 INFO spark.SparkEnv: Registering OutputCommitCoordinator
 15/03/17 21:27:10 INFO server.Server: jetty-8.y.z-SNAPSHOT
 15/03/17 21:27:10 INFO server.AbstractConnector: Started 
 SelectChannelConnector@0.0.0.0:4040
 15/03/17 21:27:10 INFO util.Utils: Successfully started service 'SparkUI' on 
 port 4040.
 15/03/17 21:27:10 INFO ui.SparkUI: Started SparkUI at http://Server3:4040
 15/03/17 21:27:10 INFO spark.SparkContext: Added JAR 
 file:/home/hduser/spark-java2.jar at 
 http://192.168.11.42:48000/jars/spark-java2.jar with timestamp 1426598830307
 15/03/17 21:27:10 INFO client.RMProxy: Connecting to ResourceManager at 
 Server3/192.168.11.42:8050
 15/03/17 21:27:11 INFO yarn.Client: Requesting a new application from cluster 
 with 3 NodeManagers
 15/03/17 21:27:11 INFO yarn.Client: Verifying our application has not 
 requested more than the maximum memory capability of the cluster (8192 MB per 
 container)
 15/03/17 21:27:11 INFO yarn.Client: Will allocate AM container, with 896 MB 
 memory including 384 MB overhead
 15/03/17 21:27:11 INFO yarn.Client: Setting up container launch context for 
 our AM
 15/03/17 21:27:11 INFO yarn.Client: Preparing resources for our AM container
 15/03/17 21:27:12 INFO yarn.Client: Uploading resource 
 file:/home/hduser/spark-1.3.0/assembly/target/scala-2.10/spark-assembly-1.3.0-hadoop2.6.0.jar
  - 
 

[jira] [Created] (SPARK-6397) Check the missingInput simply

2015-03-18 Thread Yadong Qi (JIRA)
Yadong Qi created SPARK-6397:


 Summary: Check the missingInput simply
 Key: SPARK-6397
 URL: https://issues.apache.org/jira/browse/SPARK-6397
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yadong Qi






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6396) Add timeout control for broadcast

2015-03-18 Thread Dale Richardson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366799#comment-14366799
 ] 

Dale Richardson commented on SPARK-6396:


If nobody else is looking at this one I'll have a look at it.

 Add timeout control for broadcast
 -

 Key: SPARK-6396
 URL: https://issues.apache.org/jira/browse/SPARK-6396
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager, Spark Core
Affects Versions: 1.3.0, 1.3.1
Reporter: Jun Fang
Priority: Minor

 TorrentBroadcast uses the fetchBlockSync method of BlockTransferService.scala, 
 which calls Await.result(result.future, Duration.Inf). In a production 
 environment this may cause a hang when the driver and executor are in 
 different data centers. A timeout here would be better. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6397) Check the missingInput simply

2015-03-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366918#comment-14366918
 ] 

Apache Spark commented on SPARK-6397:
-

User 'watermen' has created a pull request for this issue:
https://github.com/apache/spark/pull/5082

 Check the missingInput simply
 -

 Key: SPARK-6397
 URL: https://issues.apache.org/jira/browse/SPARK-6397
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yadong Qi





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6195) Specialized in-memory column type for fixed-precision decimal

2015-03-18 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6195:
--
Summary: Specialized in-memory column type for fixed-precision decimal  
(was: Specialized in-memory column type for decimal)

 Specialized in-memory column type for fixed-precision decimal
 -

 Key: SPARK-6195
 URL: https://issues.apache.org/jira/browse/SPARK-6195
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.1, 1.3.0, 1.2.1
Reporter: Cheng Lian
Assignee: Cheng Lian
 Fix For: 1.4.0


 When building in-memory columnar representation, decimal values are currently 
 serialized via a generic serializer, which is unnecessarily slow. Since some 
 decimals (precision < 19) can be represented as annotated long values, we 
 should add a specialized fixed-precision decimal column type to speed up 
 in-memory decimal serialization.
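 A quick illustration of why precision below 19 is enough (the numbers are just an example): such a decimal fits in a Long as an unscaled value plus a scale, so no generic serialization is needed.
 {code}
 import java.math.BigDecimal

 val d = new BigDecimal("12345.6789")                 // precision 9, scale 4
 val unscaled: Long = d.unscaledValue.longValue()     // 123456789L fits in a Long
 val restored = BigDecimal.valueOf(unscaled, d.scale) // 12345.6789 again
 {code}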



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6195) Specialized in-memory column type for decimal

2015-03-18 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6195:
--
Description: When building in-memory columnar representation, decimal 
values are currently serialized via a generic serializer, which is 
unnecessarily slow. Since some decimals (precision < 19) can be represented as 
annotated long values, we should add a specialized fixed-precision decimal 
column type to speed up in-memory decimal serialization.  (was: When building 
in-memory columnar representation, decimal values are currently serialized via 
a generic serializer, which is unnecessarily slow. Since decimals are actually 
annotated long values, we should add a specialized decimal column type to speed 
up in-memory decimal serialization.)

 Specialized in-memory column type for decimal
 -

 Key: SPARK-6195
 URL: https://issues.apache.org/jira/browse/SPARK-6195
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.1, 1.3.0, 1.2.1
Reporter: Cheng Lian
Assignee: Cheng Lian
 Fix For: 1.4.0


 When building in-memory columnar representation, decimal values are currently 
 serialized via a generic serializer, which is unnecessarily slow. Since some 
 decimals (precision < 19) can be represented as annotated long values, we 
 should add a specialized fixed-precision decimal column type to speed up 
 in-memory decimal serialization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6396) Add timeout control for broadcast

2015-03-18 Thread Jun Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366836#comment-14366836
 ] 

Jun Fang commented on SPARK-6396:
-

I am working on it right now; I will open a pull request soon.




 Add timeout control for broadcast
 -

 Key: SPARK-6396
 URL: https://issues.apache.org/jira/browse/SPARK-6396
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager, Spark Core
Affects Versions: 1.3.0, 1.3.1
Reporter: Jun Fang
Priority: Minor

 TorrentBroadcast uses the fetchBlockSync method of BlockTransferService.scala, 
 which calls Await.result(result.future, Duration.Inf). In a production 
 environment this may cause a hang when the driver and executor are in 
 different data centers. A timeout here would be better. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6397) Check the missingInput simply

2015-03-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6397:
-
Priority: Minor  (was: Major)

There's no description and no real info in the title. JIRAs need to state the 
issue briefly but specifically.

 Check the missingInput simply
 -

 Key: SPARK-6397
 URL: https://issues.apache.org/jira/browse/SPARK-6397
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yadong Qi
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6396) Add timeout control for broadcast

2015-03-18 Thread Dale Richardson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366849#comment-14366849
 ] 

Dale Richardson commented on SPARK-6396:


No problems.

 Add timeout control for broadcast
 -

 Key: SPARK-6396
 URL: https://issues.apache.org/jira/browse/SPARK-6396
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager, Spark Core
Affects Versions: 1.3.0, 1.3.1
Reporter: Jun Fang
Priority: Minor

 TorrentBroadcast uses the fetchBlockSync method of BlockTransferService.scala, 
 which calls Await.result(result.future, Duration.Inf). In a production 
 environment this may cause a hang when the driver and executor are in 
 different data centers. A timeout here would be better. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6398) Improve utility of GaussianMixture for higher dimensional data

2015-03-18 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367028#comment-14367028
 ] 

Travis Galoppo commented on SPARK-6398:
---

Please assign to me

 Improve utility of GaussianMixture for higher dimensional data
 -

 Key: SPARK-6398
 URL: https://issues.apache.org/jira/browse/SPARK-6398
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Travis Galoppo

 The current EM implementation for GaussianMixture protects itself from 
 numerical instability at the expense of utility in high dimensions.  A few 
 options exist for extending utility into higher dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6372) spark-submit --conf is not being propagated to child processes

2015-03-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6372:
-
Assignee: Marcelo Vanzin

 spark-submit --conf is not being propagated to child processes
 

 Key: SPARK-6372
 URL: https://issues.apache.org/jira/browse/SPARK-6372
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
Priority: Blocker
 Fix For: 1.4.0


 Thanks to [~irashid] for bringing this up. It seems that the new launcher 
 library is incorrectly handling --conf and not passing it down to the child 
 processes. Fix is simple, PR coming up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5874) How to improve the current ML pipeline API?

2015-03-18 Thread Abou Haydar Elias (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367089#comment-14367089
 ] 

Abou Haydar Elias commented on SPARK-5874:
--

As of now, the Tokenizer converts the input string to lowercase and then splits 
it on whitespace only. 

I suggest more flexibility for the Tokenizer pipeline stage. So we can 
eventually add stemming and text analysis directly into the Tokenizer.

There are many post-tokenization steps that can be done, including (but not 
limited to):

- [Stemming|http://en.wikipedia.org/wiki/Stemming] – Replacing words with their 
stems. For instance, with English stemming, "bikes" is replaced with "bike"; now 
the query "bike" can find both documents containing "bike" and those containing 
"bikes".
- Stop Words Filtering – Common words like "the", "and" and "a" rarely add any 
value to a search. Removing them shrinks the index size and increases 
performance. It may also reduce some noise and actually improve search 
quality.
- [Text Normalization|http://en.wikipedia.org/wiki/Text_normalization] – 
Stripping accents and other character markings can make for better searching.
- Synonym Expansion – Adding in synonyms at the same token position as the 
current word can mean better matching when users search with words in the 
synonym set.

So, what do you think?
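To make the suggestion concrete, here is a plain-Scala sketch of the lowercasing, regex splitting, and stop-word filtering described above. It is not the pipeline Transformer API, and the tiny stop-word list is an illustrative assumption.
{code}
val stopWords = Set("the", "and", "a")

def tokenize(text: String): Seq[String] =
  "[a-z0-9]+".r.findAllIn(text.toLowerCase).toSeq  // lowercase + regex split
    .filterNot(stopWords.contains)                 // stop-word filtering

tokenize("The bikes and a bike")                   // Seq("bikes", "bike")
{code}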

 How to improve the current ML pipeline API?
 ---

 Key: SPARK-5874
 URL: https://issues.apache.org/jira/browse/SPARK-5874
 Project: Spark
  Issue Type: Brainstorming
  Components: ML
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical

 I created this JIRA to collect feedback about the ML pipeline API we 
 introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 
 with confidence, which requires valuable input from the community. I'll 
 create sub-tasks for each major issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6325) YarnAllocator crash with dynamic allocation on

2015-03-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6325:
-
Assignee: Marcelo Vanzin

 YarnAllocator crash with dynamic allocation on
 --

 Key: SPARK-6325
 URL: https://issues.apache.org/jira/browse/SPARK-6325
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.3.0
Reporter: Marcelo Vanzin
Assignee: Marcelo Vanzin
Priority: Critical
 Fix For: 1.4.0, 1.3.1


 Run spark-shell like this:
 {noformat}
 spark-shell --conf spark.shuffle.service.enabled=true \
 --conf spark.dynamicAllocation.enabled=true  \
 --conf spark.dynamicAllocation.minExecutors=1  \
 --conf spark.dynamicAllocation.maxExecutors=20 \
 --conf spark.dynamicAllocation.executorIdleTimeout=10  \
 --conf spark.dynamicAllocation.schedulerBacklogTimeout=5  \
 --conf spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=5
 {noformat}
 Then run this simple test:
 {code}
 scala> val verySmallRdd = sc.parallelize(1 to 10, 10).map { i =>
  |   if (i % 2 == 0) { Thread.sleep(30 * 1000); i } else 0
  | }
 verySmallRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at 
 <console>:21
 scala> verySmallRdd.collect()
 {code}
 When Spark starts ramping down the number of allocated executors, it will hit 
 an assert in YarnAllocator.scala:
 {code}
 assert(targetNumExecutors >= 0, "Allocator killed more executors than are 
 allocated!")
 {code}
 This assert will cause the akka backend to die, but not the AM itself. So the 
 app will be in a zombie-like state, where the driver is alive but can't talk 
 to the AM. Sadness ensues.
 I have a working fix, just need to add unit tests. Stay tuned.
 Thanks to [~wypoon] for finding the problem, and for the test case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6325) YarnAllocator crash with dynamic allocation on

2015-03-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6325.
--
   Resolution: Fixed
Fix Version/s: 1.3.1
   1.4.0

Issue resolved by pull request 5018
[https://github.com/apache/spark/pull/5018]

 YarnAllocator crash with dynamic allocation on
 --

 Key: SPARK-6325
 URL: https://issues.apache.org/jira/browse/SPARK-6325
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.3.0
Reporter: Marcelo Vanzin
Priority: Critical
 Fix For: 1.4.0, 1.3.1


 Run spark-shell like this:
 {noformat}
 spark-shell --conf spark.shuffle.service.enabled=true \
 --conf spark.dynamicAllocation.enabled=true  \
 --conf spark.dynamicAllocation.minExecutors=1  \
 --conf spark.dynamicAllocation.maxExecutors=20 \
 --conf spark.dynamicAllocation.executorIdleTimeout=10  \
 --conf spark.dynamicAllocation.schedulerBacklogTimeout=5  \
 --conf spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=5
 {noformat}
 Then run this simple test:
 {code}
 scala> val verySmallRdd = sc.parallelize(1 to 10, 10).map { i =>
  |   if (i % 2 == 0) { Thread.sleep(30 * 1000); i } else 0
  | }
 verySmallRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at 
 <console>:21
 scala> verySmallRdd.collect()
 {code}
 When Spark starts ramping down the number of allocated executors, it will hit 
 an assert in YarnAllocator.scala:
 {code}
 assert(targetNumExecutors >= 0, "Allocator killed more executors than are 
 allocated!")
 {code}
 This assert will cause the akka backend to die, but not the AM itself. So the 
 app will be in a zombie-like state, where the driver is alive but can't talk 
 to the AM. Sadness ensues.
 I have a working fix, just need to add unit tests. Stay tuned.
 Thanks to [~wypoon] for finding the problem, and for the test case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6398) Improve utility of GaussianMixture for higher dimensional data

2015-03-18 Thread Travis Galoppo (JIRA)
Travis Galoppo created SPARK-6398:
-

 Summary: Improve utility of GaussianMixture for higher dimensional 
data
 Key: SPARK-6398
 URL: https://issues.apache.org/jira/browse/SPARK-6398
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Travis Galoppo


The current EM implementation for GaussianMixture protects itself from 
numerical instability at the expense of utility in high dimensions.  A few 
options exist for extending utility into higher dimensions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6399) Code compiled against 1.3.0 may not run against older Spark versions

2015-03-18 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-6399:
-

 Summary: Code compiled against 1.3.0 may not run against older 
Spark versions
 Key: SPARK-6399
 URL: https://issues.apache.org/jira/browse/SPARK-6399
 Project: Spark
  Issue Type: Bug
  Components: Documentation, Spark Core
Affects Versions: 1.3.0
Reporter: Marcelo Vanzin


Commit 65b987c3 re-organized the implicit conversions of RDDs so that they're 
easier to use. The problem is that scalac now generates code that will not run 
on older Spark versions if those conversions are used.

Basically, even if you explicitly import {{SparkContext._}}, scalac will 
generate references to the new methods in the {{RDD}} object instead. So the 
compiled code will reference code that doesn't exist in older versions of Spark.

You can work around this by explicitly calling the methods in the 
{{SparkContext}} object, although that's a little ugly.
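A sketch of that workaround, with a made-up pair RDD for illustration: call the conversion on the SparkContext object explicitly instead of relying on the implicit that scalac now resolves from the RDD object.
{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// `pairs` is a placeholder; any RDD[(String, Int)] works.
def wordCounts(pairs: RDD[(String, Int)]): RDD[(String, Int)] =
  SparkContext.rddToPairRDDFunctions(pairs).reduceByKey(_ + _)
{code}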

We should at least document this limitation (if there's no way to fix it), 
since I believe forwards compatibility in the API was also a goal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6096) Support model save/load in Python's naive Bayes

2015-03-18 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367125#comment-14367125
 ] 

Xusen Yin commented on SPARK-6096:
--

[~mengxr] Please assign it to me.

 Support model save/load in Python's naive Bayes
 ---

 Key: SPARK-6096
 URL: https://issues.apache.org/jira/browse/SPARK-6096
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Xiangrui Meng





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6286) Handle TASK_ERROR in TaskState

2015-03-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6286.
--
   Resolution: Fixed
Fix Version/s: 1.3.1
   1.4.0
 Assignee: Iulian Dragos

Resolved by https://github.com/apache/spark/pull/5000

 Handle TASK_ERROR in TaskState
 --

 Key: SPARK-6286
 URL: https://issues.apache.org/jira/browse/SPARK-6286
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Iulian Dragos
Assignee: Iulian Dragos
Priority: Minor
  Labels: mesos
 Fix For: 1.4.0, 1.3.1


 Scala warning:
 {code}
 match may not be exhaustive. It would fail on the following input: TASK_ERROR
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6372) spark-submit --conf is not being propagated to child processes

2015-03-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6372.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5057
[https://github.com/apache/spark/pull/5057]

 spark-submit --conf is not being propagated to child processes
 

 Key: SPARK-6372
 URL: https://issues.apache.org/jira/browse/SPARK-6372
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Marcelo Vanzin
Priority: Blocker
 Fix For: 1.4.0


 Thanks to [~irashid] for bringing this up. It seems that the new launcher 
 library is incorrectly handling --conf and not passing it down to the child 
 processes. Fix is simple, PR coming up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6389) YARN app diagnostics report doesn't report NPEs

2015-03-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6389.
--
   Resolution: Fixed
Fix Version/s: 1.4.0
 Assignee: Steve Loughran

Resolved by https://github.com/apache/spark/pull/5070

 YARN app diagnostics report doesn't report NPEs
 ---

 Key: SPARK-6389
 URL: https://issues.apache.org/jira/browse/SPARK-6389
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.3.0
Reporter: Steve Loughran
Assignee: Steve Loughran
Priority: Trivial
 Fix For: 1.4.0


 {{ApplicationMaster.run()}} catches exceptions and calls {{toMessage()}} to 
 get their message included in the YARN diagnostics report visible in the RM 
 UI.
 Except that NPEs don't have a message: if one is raised, the report becomes 
 {{Uncaught exception: null}}, which isn't that useful. The full text and stack 
 trace are logged correctly in the AM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4416) Support Mesos framework authentication

2015-03-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4416.
--
Resolution: Duplicate

 Support Mesos framework authentication
 --

 Key: SPARK-4416
 URL: https://issues.apache.org/jira/browse/SPARK-4416
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen

 Support Mesos framework authentication



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5874) How to improve the current ML pipeline API?

2015-03-18 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367309#comment-14367309
 ] 

Xiangrui Meng commented on SPARK-5874:
--

[~Elie A.] Thanks for your feedback! This JIRA is to discuss the pipeline API 
but not specific components. For text preprocessing, we will certainly add 
standard stemmer and stop words filter as transformers. There is also a 
RegexTokenizer in review: https://github.com/apache/spark/pull/4504



 How to improve the current ML pipeline API?
 ---

 Key: SPARK-5874
 URL: https://issues.apache.org/jira/browse/SPARK-5874
 Project: Spark
  Issue Type: Brainstorming
  Components: ML
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical

 I created this JIRA to collect feedback about the ML pipeline API we 
 introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 
 with confidence, which requires valuable input from the community. I'll 
 create sub-tasks for each major issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6400) It would be great if you could share your test jars in Maven central repository for the Spark SQL module

2015-03-18 Thread JIRA
Óscar Puertas created SPARK-6400:


 Summary: It would be great if you could share your test jars in 
Maven central repository for the Spark SQL module
 Key: SPARK-6400
 URL: https://issues.apache.org/jira/browse/SPARK-6400
 Project: Spark
  Issue Type: Wish
  Components: SQL, Tests
Affects Versions: 1.3.0
Reporter: Óscar Puertas


It would be great if you could share your test jars in the Maven Central 
repository for the Spark SQL module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6096) Support model save/load in Python's naive Bayes

2015-03-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6096:
-
Assignee: Xusen Yin

 Support model save/load in Python's naive Bayes
 ---

 Key: SPARK-6096
 URL: https://issues.apache.org/jira/browse/SPARK-6096
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Xiangrui Meng
Assignee: Xusen Yin





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6096) Support model save/load in Python's naive Bayes

2015-03-18 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367280#comment-14367280
 ] 

Xiangrui Meng commented on SPARK-6096:
--

Done.

 Support model save/load in Python's naive Bayes
 ---

 Key: SPARK-6096
 URL: https://issues.apache.org/jira/browse/SPARK-6096
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Xiangrui Meng
Assignee: Xusen Yin





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6401) Unable to load a old API input format in Spark streaming

2015-03-18 Thread JIRA
Rémy DUBOIS created SPARK-6401:
--

 Summary: Unable to load a old API input format in Spark streaming
 Key: SPARK-6401
 URL: https://issues.apache.org/jira/browse/SPARK-6401
 Project: Spark
  Issue Type: Improvement
Reporter: Rémy DUBOIS


The fileStream method of the JavaStreamingContext class does not allow using an 
old-API InputFormat.
This feature exists in Spark batch but not in Spark Streaming.
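A sketch of the asymmetry being described, using the Scala API for brevity (paths and app name are placeholders): the batch API accepts the old mapred InputFormat, while fileStream is parameterized on the new mapreduce one only.
{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat                                       // old API
import org.apache.hadoop.mapreduce.lib.input.{TextInputFormat => NewTextInputFormat}  // new API
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc = new SparkContext("local[2]", "fileStream-sketch")
val batch = sc.hadoopFile("/data/in", classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text])                                // old API works in batch

val ssc = new StreamingContext(sc, Seconds(10))
val stream = ssc.fileStream[LongWritable, Text, NewTextInputFormat]("/data/in") // new API only
{code}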



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6402) EC2 script and job scheduling documentation still refer to Shark

2015-03-18 Thread Pierre Borckmans (JIRA)
Pierre Borckmans created SPARK-6402:
---

 Summary: EC2 script and job scheduling documentation still refer 
to Shark
 Key: SPARK-6402
 URL: https://issues.apache.org/jira/browse/SPARK-6402
 Project: Spark
  Issue Type: Documentation
Reporter: Pierre Borckmans






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6402) EC2 script and job scheduling documentation still refer to Shark

2015-03-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367440#comment-14367440
 ] 

Apache Spark commented on SPARK-6402:
-

User 'pierre-borckmans' has created a pull request for this issue:
https://github.com/apache/spark/pull/5083

 EC2 script and job scheduling documentation still refer to Shark
 

 Key: SPARK-6402
 URL: https://issues.apache.org/jira/browse/SPARK-6402
 Project: Spark
  Issue Type: Documentation
Reporter: Pierre Borckmans





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6397) Check the missingInput simply

2015-03-18 Thread Santiago M. Mola (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367437#comment-14367437
 ] 

Santiago M. Mola commented on SPARK-6397:
-

I think a proper title would be: "Override QueryPlan.missingInput when necessary 
and rely on it in CheckAnalysis".
And description: "Currently, some LogicalPlans do not override missingInput, but 
they should. The lack of proper missingInput implementations then leaks into 
CheckAnalysis."

(I'm about to create a pull request that fixes this problem in a few more places.)



 Check the missingInput simply
 -

 Key: SPARK-6397
 URL: https://issues.apache.org/jira/browse/SPARK-6397
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yadong Qi
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6374) Add getter for GeneralizedLinearAlgorithm

2015-03-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-6374.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5058
[https://github.com/apache/spark/pull/5058]

 Add getter for GeneralizedLinearAlgorithm
 -

 Key: SPARK-6374
 URL: https://issues.apache.org/jira/browse/SPARK-6374
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.1
Reporter: yuhao yang
Priority: Minor
 Fix For: 1.4.0

   Original Estimate: 1h
  Remaining Estimate: 1h

 I find it's better to have getters for numFeatures and addIntercept within 
 GeneralizedLinearAlgorithm during actual usage; otherwise I'll have to get 
 the values through the debugger.
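 A minimal sketch of the kind of getters being requested, on a simplified stand-in class rather than Spark's actual GeneralizedLinearAlgorithm:
 {code}
 // Simplified stand-in, not Spark's class; it only illustrates the requested accessors.
 class GLASketch {
   protected var numFeatures: Int = -1
   protected var addIntercept: Boolean = false

   def getNumFeatures: Int = numFeatures            // proposed read-only getter
   def isAddInterceptEnabled: Boolean = addIntercept
 }
 {code}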



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3632) ConnectionManager can run out of receive threads with authentication on

2015-03-18 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367863#comment-14367863
 ] 

Thomas Graves commented on SPARK-3632:
--

[~andrewor14] At this point it doesn't seem like we need to port this back to 
branch-1; are you OK if we close this?

 ConnectionManager can run out of receive threads with authentication on
 ---

 Key: SPARK-3632
 URL: https://issues.apache.org/jira/browse/SPARK-3632
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Thomas Graves
Assignee: Thomas Graves
Priority: Critical
  Labels: backport-needed
 Fix For: 1.2.0


 If you turn authentication on and you are using a lot of executors, there is 
 a chance that all of the threads in the handleMessageExecutor could be 
 waiting to send a message because they are blocked waiting on authentication 
 to happen. This can cause a temporary deadlock until the connection times out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6403) Launch master as spot instance on EC2

2015-03-18 Thread Adam Vogel (JIRA)
Adam Vogel created SPARK-6403:
-

 Summary: Launch master as spot instance on EC2
 Key: SPARK-6403
 URL: https://issues.apache.org/jira/browse/SPARK-6403
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Affects Versions: 1.2.1
Reporter: Adam Vogel
Priority: Minor
 Fix For: 1.2.1


Currently the spark_ec2.py script only supports requesting slaves as spot 
instances. Launching the master as a spot instance has potential cost savings, 
at the risk of losing the Spark cluster without warning. Unless users include 
logic for relaunching slaves when lost, it is usually the case that all slaves 
are lost simultaneously. Thus, for jobs which do not require resilience to 
losing spot instances, being able to launch the master as a spot instance saves 
money.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6192) Enhance MLlib's Python API (GSoC 2015)

2015-03-18 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367701#comment-14367701
 ] 

Manoj Kumar commented on SPARK-6192:


Thanks for your feedback. I've fixed it up (same link) adding an Importance 
section, denoting the importance of the project. Let me know if there is 
anything else to be done.

 Enhance MLlib's Python API (GSoC 2015)
 --

 Key: SPARK-6192
 URL: https://issues.apache.org/jira/browse/SPARK-6192
 Project: Spark
  Issue Type: Umbrella
  Components: ML, MLlib, PySpark
Reporter: Xiangrui Meng
Assignee: Manoj Kumar
  Labels: gsoc, gsoc2015, mentor

 This is an umbrella JIRA for [~MechCoder]'s GSoC 2015 project. The main theme 
 is to enhance MLlib's Python API, to make it on par with the Scala/Java API. 
 The main tasks are:
 1. For all models in MLlib, provide save/load method. This also
 includes save/load in Scala.
 2. Python API for evaluation metrics.
 3. Python API for streaming ML algorithms.
 4. Python API for distributed linear algebra.
 5. Simplify MLLibPythonAPI using DataFrames. Currently, we use
 customized serialization, making MLLibPythonAPI hard to maintain. It
 would be nice to use the DataFrames for serialization.
 I'll link the JIRAs for each of the tasks.
 Note that this doesn't mean all these JIRAs are pre-assigned to [~MechCoder]. 
 The TODO list will be dynamic based on the backlog.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5078) Allow setting Akka host name from env vars

2015-03-18 Thread Timothy St. Clair (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367704#comment-14367704
 ] 

Timothy St. Clair commented on SPARK-5078:
--

Cross-listing details of the issue here, for posterity: 
https://groups.google.com/forum/#!topic/akka-user/9RQdf2NjciE 
plus mattf's example https://github.com/mattf/docker-spark fix for k8s.

 Allow setting Akka host name from env vars
 --

 Key: SPARK-5078
 URL: https://issues.apache.org/jira/browse/SPARK-5078
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Critical
 Fix For: 1.3.0, 1.2.1


 Currently Spark lets you set the IP address using SPARK_LOCAL_IP, but this 
 is given to Akka after doing a reverse DNS lookup.  This makes it difficult 
 to run Spark in Docker.  You can already change the hostname that is used 
 programmatically, but it would be nice to be able to do this with an 
 environment variable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6395) Rebuild the schema from a GenericRow

2015-03-18 Thread Chen Song (JIRA)
Chen Song created SPARK-6395:


 Summary: Rebuild the schema from a GenericRow
 Key: SPARK-6395
 URL: https://issues.apache.org/jira/browse/SPARK-6395
 Project: Spark
  Issue Type: Task
  Components: SQL
Reporter: Chen Song


Sometimes we need the schema of the row, but GenericRow doesn't contain schema 
information. So we need a method such as  
val schema = ScalaReflection.rebuildSchema(row)
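To illustrate what such a helper might look like, here is a naive sketch that infers a StructType from a Row's runtime values. The field names and the type mapping are assumptions; this is not an existing Spark API.
{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

def rebuildSchema(row: Row): StructType = StructType(
  (0 until row.length).map { i =>
    val dataType = row.get(i) match {
      case _: Int     => IntegerType
      case _: Long    => LongType
      case _: Double  => DoubleType
      case _: Boolean => BooleanType
      case _          => StringType   // crude fallback for this sketch
    }
    StructField(s"_c$i", dataType, nullable = true)
  }.toArray
)
{code}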



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6404) Call broadcast() in each interval for spark streaming programs.

2015-03-18 Thread Yifan Wang (JIRA)
Yifan Wang created SPARK-6404:
-

 Summary: Call broadcast() in each interval for spark streaming 
programs.
 Key: SPARK-6404
 URL: https://issues.apache.org/jira/browse/SPARK-6404
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Yifan Wang






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6406) Launcher backward compatibility issues

2015-03-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368380#comment-14368380
 ] 

Apache Spark commented on SPARK-6406:
-

User 'nishkamravi2' has created a pull request for this issue:
https://github.com/apache/spark/pull/5085

 Launcher backward compatibility issues
 --

 Key: SPARK-6406
 URL: https://issues.apache.org/jira/browse/SPARK-6406
 Project: Spark
  Issue Type: Bug
Reporter: Nishkam Ravi
Priority: Blocker

 The new launcher library breaks backward compatibility. The "hadoop" string in 
 the Spark assembly should not be mandatory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6152) Spark does not support Java 8 compiled Scala classes

2015-03-18 Thread Jonathan Neufeld (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368138#comment-14368138
 ] 

Jonathan Neufeld edited comment on SPARK-6152 at 3/18/15 11:31 PM:
---

The exception is raised in 
com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader

It looks like it's this line:
if (this.readShort(6) > 51) {
    throw new IllegalArgumentException();

I checked the result of readShort and found it to be 52 in my environment.

It bears the mark of a compiled class file version compatibility problem to me.


was (Author: madmartian):
The exception is raised in 
com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader

It looks like it's this line:
if (this.readShort(6) > 51) {
    throw new IllegalArgumentException();

Which bears the mark of a compiled class file version compatibility problem to 
me.

 Spark does not support Java 8 compiled Scala classes
 

 Key: SPARK-6152
 URL: https://issues.apache.org/jira/browse/SPARK-6152
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1
 Environment: Java 8+
 Scala 2.11
Reporter: Ronald Chen
Priority: Minor

 Spark uses reflectasm to check Scala closures, which fails if the *user 
 defined Scala closures* are compiled to the Java 8 class version.
 The cause is that reflectasm does not support Java 8:
 https://github.com/EsotericSoftware/reflectasm/issues/35
 Workaround:
 Don't compile Scala classes to Java 8; Scala 2.11 does not support or 
 require any Java 8 features.
 Stack trace:
 {code}
 java.lang.IllegalArgumentException
   at 
 com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.init(Unknown
  Source)
   at 
 com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.init(Unknown
  Source)
   at 
 com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.init(Unknown
  Source)
   at 
 org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$getClassReader(ClosureCleaner.scala:41)
   at 
 org.apache.spark.util.ClosureCleaner$.getInnerClasses(ClosureCleaner.scala:84)
   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:107)
   at org.apache.spark.SparkContext.clean(SparkContext.scala:1478)
   at org.apache.spark.rdd.RDD.map(RDD.scala:288)
   at ...my Scala 2.11 compiled to Java 8 code calling into spark
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6405) Spark Kryo buffer should be forced to be max. 2GB

2015-03-18 Thread Matt Cheah (JIRA)
Matt Cheah created SPARK-6405:
-

 Summary: Spark Kryo buffer should be forced to be max. 2GB
 Key: SPARK-6405
 URL: https://issues.apache.org/jira/browse/SPARK-6405
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Matt Cheah
 Fix For: 1.4.0


Kryo buffers used in serialization are backed by Java byte arrays, which have a 
maximum size of 2GB. However, we blindly set the size without checking for 
numeric overflow or respecting the maximum array size. We should enforce a 
maximum buffer size of 2GB and warn the user when they have exceeded that 
amount.

I'm open to the idea of flat-out failing the initialization of the 
SparkContext if the buffer size is over 2GB, but I'm afraid that could break 
backwards compatibility... although one can argue that the user had incorrect 
buffer sizes in the first place.
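A sketch of the kind of guard being proposed. The configuration key matches the one Spark 1.3 uses for the maximum Kryo buffer, but treat the exact name and the threshold handling as assumptions rather than the eventual fix.
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
val maxBufferMb = conf.getInt("spark.kryoserializer.buffer.max.mb", 64)
require(maxBufferMb < 2048,
  s"spark.kryoserializer.buffer.max.mb ($maxBufferMb MB) must be below 2048 MB, " +
    "because Kryo buffers are backed by Java byte arrays capped at 2 GB")
{code}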



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6152) Spark does not support Java 8 compiled Scala classes

2015-03-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368242#comment-14368242
 ] 

Sean Owen commented on SPARK-6152:
--

To your deleted comment -- yes, indeed it looks like the library explicitly does 
not work with Java 8, since class file version 52 == Java 8.
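
As a concrete form of the workaround mentioned in the issue, a build can keep the emitted class-file version below 52. A minimal sbt sketch, with an illustrative Scala version (only the flags matter):

{code}
// build.sbt sketch: emit class-file version 51 (Java 7) so reflectasm's
// shaded ASM can still parse the generated closure classes.
scalaVersion := "2.11.6"

// Scala compiler: do not target the Java 8 bytecode format.
scalacOptions += "-target:jvm-1.7"

// Java sources in the same project, if any, should match.
javacOptions ++= Seq("-source", "1.7", "-target", "1.7")
{code}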

 Spark does not support Java 8 compiled Scala classes
 

 Key: SPARK-6152
 URL: https://issues.apache.org/jira/browse/SPARK-6152
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1
 Environment: Java 8+
 Scala 2.11
Reporter: Ronald Chen
Priority: Minor

 Spark uses reflectasm to check Scala closures, which fails if the *user 
 defined Scala closures* are compiled to the Java 8 class version.
 The cause is that reflectasm does not support Java 8:
 https://github.com/EsotericSoftware/reflectasm/issues/35
 Workaround:
 Don't compile Scala classes to Java 8; Scala 2.11 neither supports nor 
 requires any Java 8 features.
 Stack trace:
 {code}
 java.lang.IllegalArgumentException
   at 
 com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.init(Unknown
  Source)
   at 
 com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.init(Unknown
  Source)
   at 
 com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.init(Unknown
  Source)
   at 
 org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$getClassReader(ClosureCleaner.scala:41)
   at 
 org.apache.spark.util.ClosureCleaner$.getInnerClasses(ClosureCleaner.scala:84)
   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:107)
   at org.apache.spark.SparkContext.clean(SparkContext.scala:1478)
   at org.apache.spark.rdd.RDD.map(RDD.scala:288)
   at ...my Scala 2.11 compiled to Java 8 code calling into spark
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6404) Call broadcast() in each interval for spark streaming programs.

2015-03-18 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368283#comment-14368283
 ] 

Saisai Shao commented on SPARK-6404:


Hi [~heavens...@gmail.com], I think with the current broadcast you cannot update 
data that has already been broadcast. If you want to update the data in each 
interval, I'm not sure whether the snippet below works for you:

{code}
stream.foreachRDD { rdd =>
  val foo = sc.broadcast(Foo)
  rdd.foreach { _ =>
    foo.value.xxx
  }
}
{code}
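
For reference, a minimal self-contained sketch of that per-interval re-broadcast pattern; loadReferenceData() and the socket source are hypothetical placeholders rather than anything from this ticket:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RebroadcastEachInterval {
  // Hypothetical helper: fetch the latest shared data on the driver.
  def loadReferenceData(): Map[String, String] = Map("key" -> "value")

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rebroadcast-example").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)

    // Keep a handle to the current broadcast so it can be replaced every batch.
    var current: Broadcast[Map[String, String]] = null

    lines.foreachRDD { rdd =>
      // foreachRDD runs on the driver, so creating a new broadcast here is legal.
      if (current != null) current.unpersist(blocking = false)
      current = rdd.sparkContext.broadcast(loadReferenceData())
      val ref = current // capture a val, not the var, in the closure
      rdd.foreach { line =>
        // Executors see the freshly broadcast value for this interval.
        if (ref.value.contains(line)) println(line)
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}

Whether re-broadcasting every batch is acceptable depends on the size of the data and the batch interval.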

 Call broadcast() in each interval for spark streaming programs.
 ---

 Key: SPARK-6404
 URL: https://issues.apache.org/jira/browse/SPARK-6404
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Yifan Wang

 If I understand it correctly, Spark’s broadcast() function will be called 
 only once at the beginning of the batch. For streaming applications that need 
 to run 24/7, it is often necessary to update variables that are shared by 
 broadcast() dynamically. It would be ideal if broadcast() could be called at 
 the beginning of each interval.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6397) Override QueryPlan.missingInput when necessary and rely on CheckAnalysis

2015-03-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6397:
-
Description: Currently, some LogicalPlans do not override missingInput, but 
they should. As a result, the lack of proper missingInput implementations leaks 
into CheckAnalysis.
Summary: Override QueryPlan.missingInput when necessary and rely on 
CheckAnalysis  (was: Check the missingInput simply)
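
As a rough illustration of what overriding missingInput buys (simplified names, not the actual Catalyst classes): a node that produces its own attributes must exclude them, otherwise an analysis check reports them as missing.

{code}
// Simplified sketch only; the real definitions live in Catalyst's QueryPlan.
case class Attr(name: String)

trait PlanNode {
  def children: Seq[PlanNode]
  def output: Seq[Attr]
  def references: Set[Attr]
  def inputSet: Set[Attr] = children.flatMap(_.output).toSet
  // Default: anything referenced but not supplied by the children is "missing".
  def missingInput: Set[Attr] = references -- inputSet
}

// An explode-like operator generates new attributes, so it overrides
// missingInput to stop CheckAnalysis-style checks from flagging them.
case class GenerateLike(child: PlanNode, generated: Seq[Attr]) extends PlanNode {
  def children: Seq[PlanNode] = Seq(child)
  def output: Seq[Attr] = child.output ++ generated
  def references: Set[Attr] = output.toSet
  override def missingInput: Set[Attr] = super.missingInput -- generated
}
{code}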

 Override QueryPlan.missingInput when necessary and rely on CheckAnalysis
 

 Key: SPARK-6397
 URL: https://issues.apache.org/jira/browse/SPARK-6397
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yadong Qi
Priority: Minor

 Currently, some LogicalPlans do not override missingInput, but they should. 
 As a result, the lack of proper missingInput implementations leaks into CheckAnalysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6401) Unable to load a old API input format in Spark streaming

2015-03-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368241#comment-14368241
 ] 

Sean Owen commented on SPARK-6401:
--

Yeah it would be more consistent. I suppose I'd be interested to hear whether 
others think it's worth continuing to add support for it or not though. It's 
pretty easy to port or carry a parallel version that uses the new InputFormat, 
right? I think you can even make an adapter.
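
For context, this is roughly what the existing new-API entry point looks like in use; the directory path is illustrative, and porting an old org.apache.hadoop.mapred format means supplying an org.apache.hadoop.mapreduce equivalent here:

{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NewApiFileStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("new-api-file-stream").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(30))

    // fileStream is parameterized on the *new* (mapreduce) InputFormat, so an
    // old-API (mapred) format has to be ported or wrapped before it can be used.
    val stream = ssc.fileStream[LongWritable, Text, TextInputFormat]("hdfs:///incoming")

    stream.map(_._2.toString).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}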

 Unable to load a old API input format in Spark streaming
 

 Key: SPARK-6401
 URL: https://issues.apache.org/jira/browse/SPARK-6401
 Project: Spark
  Issue Type: Improvement
Reporter: Rémy DUBOIS
Priority: Minor

 The fileStream method of the JavaStreamingContext class does not allow using 
 an old API InputFormat.
 This feature exists in Spark batch but not in streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6152) Spark does not support Java 8 compiled Scala classes

2015-03-18 Thread Jonathan Neufeld (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368138#comment-14368138
 ] 

Jonathan Neufeld commented on SPARK-6152:
-

The exception is raised in 
com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader

It looks like it's this line:
if (this.readShort(6) > 51) {
    throw new IllegalArgumentException();

Which bears the mark of a compiled class file version compatibility problem to 
me.

 Spark does not support Java 8 compiled Scala classes
 

 Key: SPARK-6152
 URL: https://issues.apache.org/jira/browse/SPARK-6152
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1
 Environment: Java 8+
 Scala 2.11
Reporter: Ronald Chen
Priority: Minor

 Spark uses reflectasm to check Scala closures, which fails if the *user 
 defined Scala closures* are compiled to the Java 8 class version.
 The cause is that reflectasm does not support Java 8:
 https://github.com/EsotericSoftware/reflectasm/issues/35
 Workaround:
 Don't compile Scala classes to Java 8; Scala 2.11 neither supports nor 
 requires any Java 8 features.
 Stack trace:
 {code}
 java.lang.IllegalArgumentException
   at 
 com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.init(Unknown
  Source)
   at 
 com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.init(Unknown
  Source)
   at 
 com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.init(Unknown
  Source)
   at 
 org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$getClassReader(ClosureCleaner.scala:41)
   at 
 org.apache.spark.util.ClosureCleaner$.getInnerClasses(ClosureCleaner.scala:84)
   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:107)
   at org.apache.spark.SparkContext.clean(SparkContext.scala:1478)
   at org.apache.spark.rdd.RDD.map(RDD.scala:288)
   at ...my Scala 2.11 compiled to Java 8 code calling into spark
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6404) Call broadcast() in each interval for spark streaming programs.

2015-03-18 Thread Yifan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368383#comment-14368383
 ] 

Yifan Wang commented on SPARK-6404:
---

I got an error. Is that expected?

{code}
Traceback (most recent call last):
  File 
/nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/streaming/util.py,
 line 90, in dumps
return bytearray(self.serializer.dumps((func.func, func.deserializers)))
  File 
/nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/serializers.py,
 line 405, in dumps
return cloudpickle.dumps(obj, 2)
  File 
/nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py,
 line 816, in dumps
cp.dump(obj)
  File 
/nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py,
 line 133, in dump
return pickle.Pickler.dump(self, obj)
  File /usr/lib/python2.6/pickle.py, line 224, in dump
self.save(obj)
  File /usr/lib/python2.6/pickle.py, line 286, in save
f(self, obj) # Call unbound method with explicit self
  File /usr/lib/python2.6/pickle.py, line 548, in save_tuple
save(element)
  File /usr/lib/python2.6/pickle.py, line 286, in save
f(self, obj) # Call unbound method with explicit self
  File 
/nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py,
 line 249, in save_function
self.save_function_tuple(obj, modList)
  File 
/nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py,
 line 304, in save_function_tuple
save((code, closure, base_globals))
  File /usr/lib/python2.6/pickle.py, line 286, in save
f(self, obj) # Call unbound method with explicit self
  File /usr/lib/python2.6/pickle.py, line 548, in save_tuple
save(element)
  File /usr/lib/python2.6/pickle.py, line 286, in save
f(self, obj) # Call unbound method with explicit self
  File /usr/lib/python2.6/pickle.py, line 600, in save_list
self._batch_appends(iter(obj))
  File /usr/lib/python2.6/pickle.py, line 636, in _batch_appends
save(tmp[0])
  File /usr/lib/python2.6/pickle.py, line 286, in save
f(self, obj) # Call unbound method with explicit self
  File 
/nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py,
 line 249, in save_function
self.save_function_tuple(obj, modList)
  File 
/nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py,
 line 309, in save_function_tuple
save(f_globals)
  File /usr/lib/python2.6/pickle.py, line 286, in save
f(self, obj) # Call unbound method with explicit self
  File 
/nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py,
 line 174, in save_dict
pickle.Pickler.save_dict(self, obj)
  File /usr/lib/python2.6/pickle.py, line 649, in save_dict
self._batch_setitems(obj.iteritems())
  File /usr/lib/python2.6/pickle.py, line 681, in _batch_setitems
save(v)
  File /usr/lib/python2.6/pickle.py, line 306, in save
rv = reduce(self.proto)
  File 
/nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/context.py,
 line 236, in __getnewargs__
It appears that you are attempting to reference SparkContext from a 
broadcast 
Exception: It appears that you are attempting to reference SparkContext from a 
broadcast variable, action, or transforamtion. SparkContext can only be used on 
the driver, not in code that it run on workers. For more information, see 
SPARK-5063.
{code}

 Call broadcast() in each interval for spark streaming programs.
 ---

 Key: SPARK-6404
 URL: https://issues.apache.org/jira/browse/SPARK-6404
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Yifan Wang

 If I understand it correctly, Spark’s broadcast() function will be called 
 only once at the beginning of the batch. For streaming applications that need 
 to run 24/7, it is often necessary to update variables that are shared by 
 broadcast() dynamically. It would be ideal if broadcast() could be called at 
 the beginning of each interval.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6404) Call broadcast() in each interval for spark streaming programs.

2015-03-18 Thread Yifan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368383#comment-14368383
 ] 

Yifan Wang edited comment on SPARK-6404 at 3/19/15 2:28 AM:


I got an error. Is that expected? It seems I can't call broadcast() in the 
worker node.

{code}
Traceback (most recent call last):
  File 
/nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/streaming/util.py,
 line 90, in dumps
return bytearray(self.serializer.dumps((func.func, func.deserializers)))
  File 
/nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/serializers.py,
 line 405, in dumps
return cloudpickle.dumps(obj, 2)
  File 
/nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py,
 line 816, in dumps
cp.dump(obj)
  File 
/nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py,
 line 133, in dump
return pickle.Pickler.dump(self, obj)
  File /usr/lib/python2.6/pickle.py, line 224, in dump
self.save(obj)
  File /usr/lib/python2.6/pickle.py, line 286, in save
f(self, obj) # Call unbound method with explicit self
  File /usr/lib/python2.6/pickle.py, line 548, in save_tuple
save(element)
  File /usr/lib/python2.6/pickle.py, line 286, in save
f(self, obj) # Call unbound method with explicit self
  File 
/nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py,
 line 249, in save_function
self.save_function_tuple(obj, modList)
  File 
/nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py,
 line 304, in save_function_tuple
save((code, closure, base_globals))
  File /usr/lib/python2.6/pickle.py, line 286, in save
f(self, obj) # Call unbound method with explicit self
  File /usr/lib/python2.6/pickle.py, line 548, in save_tuple
save(element)
  File /usr/lib/python2.6/pickle.py, line 286, in save
f(self, obj) # Call unbound method with explicit self
  File /usr/lib/python2.6/pickle.py, line 600, in save_list
self._batch_appends(iter(obj))
  File /usr/lib/python2.6/pickle.py, line 636, in _batch_appends
save(tmp[0])
  File /usr/lib/python2.6/pickle.py, line 286, in save
f(self, obj) # Call unbound method with explicit self
  File 
/nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py,
 line 249, in save_function
self.save_function_tuple(obj, modList)
  File 
/nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py,
 line 309, in save_function_tuple
save(f_globals)
  File /usr/lib/python2.6/pickle.py, line 286, in save
f(self, obj) # Call unbound method with explicit self
  File 
/nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py,
 line 174, in save_dict
pickle.Pickler.save_dict(self, obj)
  File /usr/lib/python2.6/pickle.py, line 649, in save_dict
self._batch_setitems(obj.iteritems())
  File /usr/lib/python2.6/pickle.py, line 681, in _batch_setitems
save(v)
  File /usr/lib/python2.6/pickle.py, line 306, in save
rv = reduce(self.proto)
  File 
/nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/context.py,
 line 236, in __getnewargs__
It appears that you are attempting to reference SparkContext from a 
broadcast 
Exception: It appears that you are attempting to reference SparkContext from a 
broadcast variable, action, or transforamtion. SparkContext can only be used on 
the driver, not in code that it run on workers. For more information, see 
SPARK-5063.
{code}


was (Author: heavens...@gmail.com):
I got an error. Is that expected?

{code}
Traceback (most recent call last):
  File 
/nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/streaming/util.py,
 line 90, in dumps
return bytearray(self.serializer.dumps((func.func, func.deserializers)))
  File 
/nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/serializers.py,
 line 405, in dumps
return cloudpickle.dumps(obj, 2)
  File 
/nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py,
 line 816, in dumps
cp.dump(obj)
  File 
/nail/home/yifan/pg/yifan_spark_mr_steps/spark-1.2.1-bin-hadoop2.4/python/pyspark/cloudpickle.py,
 line 133, in dump
return pickle.Pickler.dump(self, obj)
  File /usr/lib/python2.6/pickle.py, line 224, in dump
self.save(obj)
  File /usr/lib/python2.6/pickle.py, line 286, in save
f(self, obj) # Call unbound method with explicit self
  File /usr/lib/python2.6/pickle.py, line 548, in save_tuple
save(element)
  File /usr/lib/python2.6/pickle.py, line 286, in save
f(self, obj) # Call unbound method with explicit self
  File 

[jira] [Created] (SPARK-6406) Launcher backward compatibility issues

2015-03-18 Thread Nishkam Ravi (JIRA)
Nishkam Ravi created SPARK-6406:
---

 Summary: Launcher backward compatibility issues
 Key: SPARK-6406
 URL: https://issues.apache.org/jira/browse/SPARK-6406
 Project: Spark
  Issue Type: Bug
Reporter: Nishkam Ravi
Priority: Blocker


The new launcher library breaks backward compatibility. The 'hadoop' string in the 
Spark assembly should not be mandatory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6394) cleanup BlockManager companion object and improve the getCacheLocs method in DAGScheduler

2015-03-18 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-6394.
---
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5043
[https://github.com/apache/spark/pull/5043]

 cleanup BlockManager companion object and improve the getCacheLocs method in 
 DAGScheduler
 -

 Key: SPARK-6394
 URL: https://issues.apache.org/jira/browse/SPARK-6394
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Wenchen Fan
Priority: Minor
 Fix For: 1.4.0


 The current implementation of getCacheLocs includes searching a HashMap many 
 times; we can avoid this.
 Also, BlockManager.blockIdsToExecutorIds isn't called anywhere, so we can remove 
 it. In addition, we can combine BlockManager.blockIdsToHosts and 
 blockIdsToBlockManagers into a single method in order to remove some 
 unnecessary layers of indirection / collection creation.
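
A tiny sketch of the kind of change described for getCacheLocs (memoize per RDD id instead of probing the HashMap repeatedly); the lookup function is a stand-in, not DAGScheduler code:

{code}
import scala.collection.mutable

// Simplified illustration: compute the locations once per RDD id and reuse them.
class CacheLocations(lookup: Int => Seq[String]) {
  private val cacheLocs = new mutable.HashMap[Int, Seq[String]]

  def getCacheLocs(rddId: Int): Seq[String] =
    cacheLocs.getOrElseUpdate(rddId, lookup(rddId))
}

object CacheLocationsExample extends App {
  val locs = new CacheLocations(rddId => Seq(s"host-${rddId % 2}"))
  println(locs.getCacheLocs(3)) // computed once...
  println(locs.getCacheLocs(3)) // ...then served from the map
}
{code}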



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6394) cleanup BlockManager companion object and improve the getCacheLocs method in DAGScheduler

2015-03-18 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-6394:
--
Assignee: Wenchen Fan

 cleanup BlockManager companion object and improve the getCacheLocs method in 
 DAGScheduler
 -

 Key: SPARK-6394
 URL: https://issues.apache.org/jira/browse/SPARK-6394
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Wenchen Fan
Assignee: Wenchen Fan
Priority: Minor
 Fix For: 1.4.0


 The current implementation of getCacheLocs includes searching a HashMap many 
 times; we can avoid this.
 Also, BlockManager.blockIdsToExecutorIds isn't called anywhere, so we can remove 
 it. In addition, we can combine BlockManager.blockIdsToHosts and 
 blockIdsToBlockManagers into a single method in order to remove some 
 unnecessary layers of indirection / collection creation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-6152) Spark does not support Java 8 compiled Scala classes

2015-03-18 Thread Jonathan Neufeld (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Neufeld updated SPARK-6152:

Comment: was deleted

(was: The exception is raised in 
com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader

It looks like it's this line:
if (this.readShort(6) > 51) {
    throw new IllegalArgumentException();

I checked the result of readShort and found it to be 52 in my environment.

It bears the mark of a compiled class file version compatibility problem to me.)

 Spark does not support Java 8 compiled Scala classes
 

 Key: SPARK-6152
 URL: https://issues.apache.org/jira/browse/SPARK-6152
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1
 Environment: Java 8+
 Scala 2.11
Reporter: Ronald Chen
Priority: Minor

 Spark uses reflectasm to check Scala closures, which fails if the *user 
 defined Scala closures* are compiled to the Java 8 class version.
 The cause is that reflectasm does not support Java 8:
 https://github.com/EsotericSoftware/reflectasm/issues/35
 Workaround:
 Don't compile Scala classes to Java 8; Scala 2.11 neither supports nor 
 requires any Java 8 features.
 Stack trace:
 {code}
 java.lang.IllegalArgumentException
   at 
 com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.init(Unknown
  Source)
   at 
 com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.init(Unknown
  Source)
   at 
 com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.init(Unknown
  Source)
   at 
 org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$getClassReader(ClosureCleaner.scala:41)
   at 
 org.apache.spark.util.ClosureCleaner$.getInnerClasses(ClosureCleaner.scala:84)
   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:107)
   at org.apache.spark.SparkContext.clean(SparkContext.scala:1478)
   at org.apache.spark.rdd.RDD.map(RDD.scala:288)
   at ...my Scala 2.11 compiled to Java 8 code calling into spark
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6146) Support more datatype in SqlParser

2015-03-18 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-6146:

Target Version/s: 1.3.1  (was: 1.4.0)

 Support more datatype in SqlParser
 --

 Key: SPARK-6146
 URL: https://issues.apache.org/jira/browse/SPARK-6146
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Priority: Critical

 Right now, I cannot do 
 {code}
 df.selectExpr("cast(a as bigint)")
 {code}
 because only the following data types are supported in SqlParser
 {code}
 protected lazy val dataType: Parser[DataType] =
 ( STRING ^^^ StringType
 | TIMESTAMP ^^^ TimestampType
 | DOUBLE ^^^ DoubleType
 | fixedDecimalType
 | DECIMAL ^^^ DecimalType.Unlimited
 | DATE ^^^ DateType
 | INT ^^^ IntegerType
 )
 {code}
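
Until the parser accepts more types, one hedged workaround sketch is to cast through the typed Column API instead of an expression string; the sample data below is made up:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.LongType

object CastWorkaround {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cast-workaround").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(Tuple1(1), Tuple1(2), Tuple1(3))).toDF("a")

    // Typed cast sidesteps the SqlParser restriction hit by selectExpr("cast(a as bigint)").
    val casted = df.select(df("a").cast(LongType).as("a_bigint"))
    casted.printSchema()
  }
}
{code}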



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5911) Make Column.cast(to: String) support fixed precision and scale decimal type

2015-03-18 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-5911:

Target Version/s: 1.3.1  (was: 1.4.0)

 Make Column.cast(to: String) support fixed precision and scale decimal type
 ---

 Key: SPARK-5911
 URL: https://issues.apache.org/jira/browse/SPARK-5911
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6146) Support more datatype in SqlParser

2015-03-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368204#comment-14368204
 ] 

Apache Spark commented on SPARK-6146:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/5078

 Support more datatype in SqlParser
 --

 Key: SPARK-6146
 URL: https://issues.apache.org/jira/browse/SPARK-6146
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Priority: Critical

 Right now, I cannot do 
 {code}
 df.selectExpr("cast(a as bigint)")
 {code}
 because only the following data types are supported in SqlParser
 {code}
 protected lazy val dataType: Parser[DataType] =
 ( STRING ^^^ StringType
 | TIMESTAMP ^^^ TimestampType
 | DOUBLE ^^^ DoubleType
 | fixedDecimalType
 | DECIMAL ^^^ DecimalType.Unlimited
 | DATE ^^^ DateType
 | INT ^^^ IntegerType
 )
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5911) Make Column.cast(to: String) support fixed precision and scale decimal type

2015-03-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368205#comment-14368205
 ] 

Apache Spark commented on SPARK-5911:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/5078

 Make Column.cast(to: String) support fixed precision and scale decimal type
 ---

 Key: SPARK-5911
 URL: https://issues.apache.org/jira/browse/SPARK-5911
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5508) Arrays and Maps stored with Hive Parquet Serde may not be able to read by the Parquet support in the Data Souce API

2015-03-18 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368505#comment-14368505
 ] 

Yin Huai edited comment on SPARK-5508 at 3/19/15 5:00 AM:
--

Seems the root cause of this problem is the array element schema name. For 
Hive's Parquet Serde, array element's name is array_element ([see here 
|https://github.com/apache/incubator-parquet-mr/blob/parquet-1.6.0rc3/parquet-hive/parquet-hive-storage-handler/src/main/java/org/apache/hadoop/hive/ql/io/parquet/convert/HiveSchemaConverter.java#L104]).
 While, for us, we use array ([see here| 
https://github.com/apache/spark/blob/v1.3.0/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala#L68]
 and we are following [parquet avro| 
https://github.com/apache/incubator-parquet-mr/blob/master/parquet-avro/src/main/java/parquet/avro/AvroSchemaConverter.java#L131]).


was (Author: yhuai):
Seems the root cause of this problem is the array element schema name. For 
Hive's Parquet Serde, array element's name is array_element (see 
https://github.com/apache/incubator-parquet-mr/blob/parquet-1.6.0rc3/parquet-hive/parquet-hive-storage-handler/src/main/java/org/apache/hadoop/hive/ql/io/parquet/convert/HiveSchemaConverter.java#L104).
 While, for us, we use array (see 
https://github.com/apache/spark/blob/v1.3.0/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala#L68
 and we are following parquet avro 
https://github.com/apache/incubator-parquet-mr/blob/master/parquet-avro/src/main/java/parquet/avro/AvroSchemaConverter.java#L131).

 Arrays and Maps stored with Hive Parquet Serde may not be able to read by the 
 Parquet support in the Data Souce API
 ---

 Key: SPARK-5508
 URL: https://issues.apache.org/jira/browse/SPARK-5508
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
 Environment: mesos, cdh
Reporter: Ayoub Benali
  Labels: hivecontext, parquet

 When the table is saved as parquet, we cannot query a field which is an array 
 of struct after an INSERT statement, as shown below:  
 {noformat}
 scala> val data1 = """{
      | "timestamp": 1422435598,
      | "data_array": [
      | {
      | "field1": 1,
      | "field2": 2
      | }
      | ]
      | }"""
 scala> val data2 = """{
      | "timestamp": 1422435598,
      | "data_array": [
      | {
      | "field1": 3,
      | "field2": 4
      | }
      | ]
      | }"""
 scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil)
 scala> val rdd = hiveContext.jsonRDD(jsonRDD)
 scala> rdd.printSchema
 root
  |-- data_array: array (nullable = true)
  |    |-- element: struct (containsNull = false)
  |    |    |-- field1: integer (nullable = true)
  |    |    |-- field2: integer (nullable = true)
  |-- timestamp: integer (nullable = true)
 scala> rdd.registerTempTable("tmp_table")
 scala> hiveContext.sql("select data.field1 from tmp_table LATERAL VIEW 
 explode(data_array) nestedStuff AS data").collect
 res3: Array[org.apache.spark.sql.Row] = Array([1], [3])
 scala> hiveContext.sql("SET hive.exec.dynamic.partition = true")
 scala> hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
 scala> hiveContext.sql("set parquet.compression=GZIP")
 scala> hiveContext.setConf("spark.sql.parquet.binaryAsString", "true")
 scala> hiveContext.sql("create external table if not exists 
 persisted_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>, 
 timestamp INT) STORED AS PARQUET Location 'hdfs:///test_table'")
 scala> hiveContext.sql("insert into table persisted_table select * from 
 tmp_table").collect
 scala> hiveContext.sql("select data.field1 from persisted_table LATERAL VIEW 
 explode(data_array) nestedStuff AS data").collect
 parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in 
 file hdfs://*/test_table/part-1
   at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
   at 
 parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
   at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
   at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   

[jira] [Commented] (SPARK-5508) Arrays and Maps stored with Hive Parquet Serde may not be able to read by the Parquet support in the Data Souce API

2015-03-18 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368493#comment-14368493
 ] 

Yin Huai commented on SPARK-5508:
-

I tried the following snippet in the spark-shell:
{code}
import sqlContext._
jsonRDD(sc.parallelize("""{"a":[1,2,3,4]}""" :: Nil)).registerTempTable("jt")
sql("set spark.sql.parquet.useDataSourceApi=false")
sql("create table jt_hive_parquet stored as parquet as select * from jt")
sql("set spark.sql.parquet.useDataSourceApi=true")
table("jt_hive_parquet").show
{code}
It seems the array value is null.

A workaround is to set the following and then use HiveTableScan to read 
the data.
{code}
sql("set spark.sql.parquet.useDataSourceApi=false")
sql("set spark.sql.hive.convertMetastoreParquet=false")
{code}
The array will then be read correctly.

The metadata dump and file dump of the file is 
{code}
creator: parquet-mr version 1.6.0rc3 (build 
d4d5a07ec9bd262ca1e93c309f1d7d4a74ebda4c) 

file schema: hive_schema 

a:   OPTIONAL F:1 
.bag:REPEATED F:1 
..array_element: OPTIONAL INT64 R:1 D:3

row group 1: RC:1 TS:86 OFFSET:4 

a:   
.bag:
..array_element:  INT64 UNCOMPRESSED DO:0 FPO:4 SZ:86/86/1.00 VC:4 ENC:RLE,PLAIN
{code}
{code}
row group 0 

a:   
.bag:
..array_element:  INT64 UNCOMPRESSED DO:0 FPO:4 SZ:86/86/1.00 VC:4 ENC:RLE,PLAIN

a.bag.array_element TV=4 RL=1 DL=3

page 0:  DLE:RLE RLE:RLE VLE:PLAIN SZ:45 VC:4

INT64 a.bag.array_element 

*** row group 1 of 1, values 1 to 4 *** 
value 1: R:0 D:3 V:1
value 2: R:1 D:3 V:2
value 3: R:1 D:3 V:3
value 4: R:1 D:3 V:4
{code}

The metadata dump and file dump of a data source parquet file (generated 
through Parquet2Relation2) is 
{code}
creator: parquet-mr version 1.6.0rc3 (build 
d4d5a07ec9bd262ca1e93c309f1d7d4a74ebda4c) 
extra:   org.apache.spark.sql.parquet.row.metadata = 
{"type":"struct","fields":[{"name":"a","type":{"type":"array","elementType":"long","containsNull":true},"nullable":true,"metadata":{}}]}
 

file schema: root 

a:   OPTIONAL F:1 
.bag:REPEATED F:1 
..array: OPTIONAL INT64 R:1 D:3

row group 1: RC:1 TS:86 OFFSET:4 

a:   
.bag:
..array:  INT64 UNCOMPRESSED DO:0 FPO:4 SZ:86/86/1.00 VC:4 ENC:PLAIN,RLE
{code}
{code}
row group 0 

a:   
.bag:
..array:  INT64 UNCOMPRESSED DO:0 FPO:4 SZ:86/86/1.00 VC:4 ENC:PLAIN,RLE

a.bag.array TV=4 RL=1 DL=3

page 0:  DLE:RLE RLE:RLE VLE:PLAIN SZ:45 VC:4

INT64 a.bag.array 

*** row group 1 of 1, values 1 to 4 *** 
value 1: R:0 D:3 V:1
value 2: R:1 D:3 V:2
value 3: R:1 D:3 V:3
value 4: R:1 D:3 V:4
{code}

 Arrays and Maps stored with Hive Parquet Serde may not be able to read by the 
 Parquet support in the Data Souce API
 ---

 Key: SPARK-5508
 URL: https://issues.apache.org/jira/browse/SPARK-5508
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
 Environment: mesos, cdh
Reporter: Ayoub Benali
  Labels: hivecontext, parquet

 When the table is saved as parquet, we cannot query a field which is an array 
 of struct after an INSERT statement, as shown below:  
 {noformat}
 scala> val data1 = """{
      | "timestamp": 1422435598,
      | "data_array": [
      | {
      | "field1": 1,
      | "field2": 2
      | }
      | ]
      | }"""
 scala> val data2 = """{
      | "timestamp": 1422435598,
      | "data_array": [
      | {
      | "field1": 3,
      | "field2": 4
      | }
      | ]
      | }"""
 scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil)
 scala> val rdd = hiveContext.jsonRDD(jsonRDD)
 scala> rdd.printSchema
 root
  |-- data_array: array (nullable = true)
  |    |-- element: struct (containsNull = false)
  |    |    |-- field1: integer (nullable = true)
  |    |    |-- field2: integer (nullable = true)
  |-- timestamp: integer (nullable = true)
 scala> rdd.registerTempTable("tmp_table")
 

[jira] [Updated] (SPARK-6407) Streaming ALS for Collaborative Filtering

2015-03-18 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-6407:

Description: 
Like MLLib's ALS implementation for recommendation, but applied to streaming.
Similar to streaming linear regression and logistic regression, could we apply 
gradient updates to batches of data and reuse the existing MLLib implementation?


  was:Like MLLib's ALS implementation for recommendation, and applying to 
streaming


 Streaming ALS for Collaborative Filtering
 -

 Key: SPARK-6407
 URL: https://issues.apache.org/jira/browse/SPARK-6407
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Felix Cheung
Priority: Minor

 Like MLLib's ALS implementation for recommendation, and applying to streaming.
 Similar to streaming linear regression, logistic regression, could we apply 
 gradient updates to batches of data and reuse existing MLLib implementation?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5508) Arrays and Maps stored with Hive Parquet Serde may not be able to read by the Parquet support in the Data Souce API

2015-03-18 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368505#comment-14368505
 ] 

Yin Huai commented on SPARK-5508:
-

Seems the root cause of this problem is the array element schema name. For 
Hive's Parquet Serde, array element's name is array_element (see 
https://github.com/apache/incubator-parquet-mr/blob/parquet-1.6.0rc3/parquet-hive/parquet-hive-storage-handler/src/main/java/org/apache/hadoop/hive/ql/io/parquet/convert/HiveSchemaConverter.java#L104).
 While, for us, we use array (see 
https://github.com/apache/spark/blob/v1.3.0/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala#L68
 and we are following parquet avro 
https://github.com/apache/incubator-parquet-mr/blob/master/parquet-avro/src/main/java/parquet/avro/AvroSchemaConverter.java#L131).

 Arrays and Maps stored with Hive Parquet Serde may not be able to read by the 
 Parquet support in the Data Souce API
 ---

 Key: SPARK-5508
 URL: https://issues.apache.org/jira/browse/SPARK-5508
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
 Environment: mesos, cdh
Reporter: Ayoub Benali
  Labels: hivecontext, parquet

 When the table is saved as parquet, we cannot query a field which is an array 
 of struct after an INSERT statement, as shown below:  
 {noformat}
 scala> val data1 = """{
      | "timestamp": 1422435598,
      | "data_array": [
      | {
      | "field1": 1,
      | "field2": 2
      | }
      | ]
      | }"""
 scala> val data2 = """{
      | "timestamp": 1422435598,
      | "data_array": [
      | {
      | "field1": 3,
      | "field2": 4
      | }
      | ]
      | }"""
 scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil)
 scala> val rdd = hiveContext.jsonRDD(jsonRDD)
 scala> rdd.printSchema
 root
  |-- data_array: array (nullable = true)
  |    |-- element: struct (containsNull = false)
  |    |    |-- field1: integer (nullable = true)
  |    |    |-- field2: integer (nullable = true)
  |-- timestamp: integer (nullable = true)
 scala> rdd.registerTempTable("tmp_table")
 scala> hiveContext.sql("select data.field1 from tmp_table LATERAL VIEW 
 explode(data_array) nestedStuff AS data").collect
 res3: Array[org.apache.spark.sql.Row] = Array([1], [3])
 scala> hiveContext.sql("SET hive.exec.dynamic.partition = true")
 scala> hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
 scala> hiveContext.sql("set parquet.compression=GZIP")
 scala> hiveContext.setConf("spark.sql.parquet.binaryAsString", "true")
 scala> hiveContext.sql("create external table if not exists 
 persisted_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>, 
 timestamp INT) STORED AS PARQUET Location 'hdfs:///test_table'")
 scala> hiveContext.sql("insert into table persisted_table select * from 
 tmp_table").collect
 scala> hiveContext.sql("select data.field1 from persisted_table LATERAL VIEW 
 explode(data_array) nestedStuff AS data").collect
 parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in 
 file hdfs://*/test_table/part-1
   at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
   at 
 parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
   at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
   at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
   at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797)
   at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797)
   at 
 

[jira] [Commented] (SPARK-6404) Call broadcast() in each interval for spark streaming programs.

2015-03-18 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368409#comment-14368409
 ] 

Saisai Shao commented on SPARK-6404:


Would you please paste your code snippet so we can identify the problem?

 Call broadcast() in each interval for spark streaming programs.
 ---

 Key: SPARK-6404
 URL: https://issues.apache.org/jira/browse/SPARK-6404
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Yifan Wang

 If I understand it correctly, Spark’s broadcast() function will be called 
 only once at the beginning of the batch. For streaming applications that need 
 to run 24/7, it is often necessary to update variables that are shared by 
 broadcast() dynamically. It would be ideal if broadcast() could be called at 
 the beginning of each interval.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6200) Support dialect in SQL

2015-03-18 Thread haiyang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368486#comment-14368486
 ] 

haiyang commented on SPARK-6200:


You are right!

In fact, I haven't added the corresponding code to make DialectManager 
configurable, because I thought that adding a lot of public API would make it 
even more difficult for you to accept this PR :).

I think the implementation that you linked to below is a good one. It's simpler 
than what I provided. But sometimes I may want to give my own dialect a short 
name like sql or hiveql, or load some default dialects with short names when 
Spark SQL starts. I don't see the corresponding code for that in the linked 
implementation. If I misunderstood, please forgive me :). What do you think of 
this idea? Maybe I can leave some comments to make that implementation better :).

Thank you for your attention to my PR.


 Support dialect in SQL
 --

 Key: SPARK-6200
 URL: https://issues.apache.org/jira/browse/SPARK-6200
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: haiyang

 Created a new dialect manager, supporting a dialect command and adding new 
 dialects via SQL statements, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6407) Streaming ALS for Collaborative Filtering

2015-03-18 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-6407:
---

 Summary: Streaming ALS for Collaborative Filtering
 Key: SPARK-6407
 URL: https://issues.apache.org/jira/browse/SPARK-6407
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Felix Cheung
Priority: Minor


Like MLLib's ALS implementation for recommendation, and applying to streaming



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4258) NPE with new Parquet Filters

2015-03-18 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14368528#comment-14368528
 ] 

Yin Huai commented on SPARK-4258:
-

[~liancheng] Does the version that we are currently using have the fix?

 NPE with new Parquet Filters
 

 Key: SPARK-4258
 URL: https://issues.apache.org/jira/browse/SPARK-4258
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Critical
 Fix For: 1.2.0


 {code}
 Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
 Task 0 in stage 21.0 failed 4 times, most recent failure: Lost task 0.3 in 
 stage 21.0 (TID 160, ip-10-0-247-144.us-west-2.compute.internal): 
 java.lang.NullPointerException: 
 parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:206)
 parquet.io.api.Binary$ByteArrayBackedBinary.compareTo(Binary.java:162)
 
 parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:100)
 
 parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47)
 parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
 
 parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:210)
 
 parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47)
 parquet.filter2.predicate.Operators$Or.accept(Operators.java:302)
 
 parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:201)
 
 parquet.filter2.statisticslevel.StatisticsFilter.visit(StatisticsFilter.java:47)
 parquet.filter2.predicate.Operators$And.accept(Operators.java:290)
 
 parquet.filter2.statisticslevel.StatisticsFilter.canDrop(StatisticsFilter.java:52)
 parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:46)
 parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
 
 parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
 
 parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
 
 parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
 
 parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
 {code}
 This occurs when reading parquet data encoded with the older version of the 
 library for TPC-DS query 34.  Will work on coming up with a smaller 
 reproduction



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4485) Add broadcast outer join to optimize left outer join and right outer join

2015-03-18 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-4485:

Target Version/s: 1.4.0  (was: 1.1.0)

 Add broadcast outer join to  optimize left outer join and right outer join
 --

 Key: SPARK-4485
 URL: https://issues.apache.org/jira/browse/SPARK-4485
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: XiaoJing wang
Priority: Minor
   Original Estimate: 0.05h
  Remaining Estimate: 0.05h

 For now, Spark uses broadcast join instead of hash join to optimize {{inner 
 join}} when the size of one side's data does not exceed the 
 {{AUTO_BROADCASTJOIN_THRESHOLD}}.
 However, Spark SQL will perform shuffle operations on each child relation 
 while executing 
 {{left outer join}} and {{right outer join}}. {{outer join}} is also 
 suitable for optimization with broadcast join. 
 We are planning to create a {{BroadcastHashOuterJoin}} to implement the 
 broadcast join for {{left outer join}} and {{right outer join}}.
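
For intuition, a manual version of the pattern such a BroadcastHashOuterJoin would automate: collect the small side, broadcast it, and probe it from the large side without shuffling. The datasets are made up and the small side is assumed to fit on the driver:

{code}
import org.apache.spark.{SparkConf, SparkContext}

object ManualBroadcastLeftOuterJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-left-outer").setMaster("local"))

    val large = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
    val small = Seq((1, "x"), (3, "y"))

    // Broadcast the small side instead of shuffling both sides.
    val smallMap = sc.broadcast(small.toMap)

    // Left outer join without a shuffle: each left row probes the broadcast map.
    val joined = large.map { case (k, v) => (k, v, smallMap.value.get(k)) }
    joined.collect().foreach(println)
  }
}
{code}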



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6374) Add getter for GeneralizedLinearAlgorithm

2015-03-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6374:
-
Assignee: yuhao yang

 Add getter for GeneralizedLinearAlgorithm
 -

 Key: SPARK-6374
 URL: https://issues.apache.org/jira/browse/SPARK-6374
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.1
Reporter: yuhao yang
Assignee: yuhao yang
Priority: Minor
 Fix For: 1.4.0

   Original Estimate: 1h
  Remaining Estimate: 1h

 I find it's better to have getters for numFeatures and addIntercept within 
 GeneralizedLinearAlgorithm during actual usage; otherwise I'll have to get 
 the values through a debugger.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6401) Unable to load a old API input format in Spark streaming

2015-03-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6401:
-
Priority: Minor  (was: Major)

You mean the .mapred. MapReduce API, right?
I think it's not unreasonable to add, but the new APIs have been around for 
so long (Hadoop 1.x) that I don't know if people would be using the old API at 
all. That is, I'd almost say these should be deprecated.

 Unable to load a old API input format in Spark streaming
 

 Key: SPARK-6401
 URL: https://issues.apache.org/jira/browse/SPARK-6401
 Project: Spark
  Issue Type: Improvement
Reporter: Rémy DUBOIS
Priority: Minor

 The fileStream method of the JavaStreamingContext class does not allow using 
 an old API InputFormat.
 This feature exists in Spark batch but not in streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6401) Unable to load a old API input format in Spark streaming

2015-03-18 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-6401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367789#comment-14367789
 ] 

Rémy DUBOIS edited comment on SPARK-6401 at 3/18/15 8:18 PM:
-

Yes I mean the mapred API.

All our input formats are developed in mapred API so it would allow us to avoid 
rewriting them in mapreduce API.
Since the batch API can read from a mapred InputFormat, don't you think it 
would be more consistent to have the same possibility in the streaming API?


was (Author: rdubois):
Yes I mean the mapred API.

All our input formats are developed in mapred API so it would allow us to avoid 
rewriting them in mapreduce API. Or at least, it would allow us to do it 
gradually.
Since the batch API can read from a mapred InputFormat, don't you think it 
would be more consistent to have the same possibility in the streaming API?

 Unable to load a old API input format in Spark streaming
 

 Key: SPARK-6401
 URL: https://issues.apache.org/jira/browse/SPARK-6401
 Project: Spark
  Issue Type: Improvement
Reporter: Rémy DUBOIS
Priority: Minor

 The fileStream method of the JavaStreamingContext class does not allow using 
 an old API InputFormat.
 This feature exists in Spark batch but not in streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6168) Expose some of the collection classes as DeveloperApi

2015-03-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367787#comment-14367787
 ] 

Apache Spark commented on SPARK-6168:
-

User 'mridulm' has created a pull request for this issue:
https://github.com/apache/spark/pull/5084

 Expose some of the collection classes as DeveloperApi
 -

 Key: SPARK-6168
 URL: https://issues.apache.org/jira/browse/SPARK-6168
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Mridul Muralidharan
Priority: Minor

 It would be very useful to mark some of the collection and util classes we 
 have as @DeveloperApi or even @Experimental.
 For example :
 - org.apache.spark.util.BoundedPriorityQueue
 - org.apache.spark.util.collection.Utils
 - org.apache.spark.util.collection.CompactBuffer
 Use cases for these are, for example, to do a treeAggregate and/or treeReduce 
 impl for common APIs within RDD which fail due to limitations of 
 aggregate/reduce at scale (ref: SPARK-6165).
 Of course, they are useful for other things too :-)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6401) Unable to load a old API input format in Spark streaming

2015-03-18 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-6401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367789#comment-14367789
 ] 

Rémy DUBOIS commented on SPARK-6401:


Yes I mean the mapred API.

All our input formats are developed in mapred API so it would allow us to avoid 
rewriting them in mapreduce API. Or at least, it would allow us to do it 
gradually.
Since the batch API can read from a mapred InputFormat, don't you think it 
would be more consistent to have the same possibility in the streaming API?

 Unable to load a old API input format in Spark streaming
 

 Key: SPARK-6401
 URL: https://issues.apache.org/jira/browse/SPARK-6401
 Project: Spark
  Issue Type: Improvement
Reporter: Rémy DUBOIS
Priority: Minor

 The fileStream method of the JavaStreamingContext class does not allow using 
 an old API InputFormat.
 This feature exists in Spark batch but not in streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6404) Call broadcast() in each interval for spark streaming programs.

2015-03-18 Thread Yifan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yifan Wang updated SPARK-6404:
--
Description: If I understand it correctly, Spark’s broadcast() function 
will be called only once at the beginning of the batch. For streaming 
applications that need to run 24/7, it is often necessary to update variables 
that are shared by broadcast() dynamically. It would be ideal if broadcast() could 
be called at the beginning of each interval.

 Call broadcast() in each interval for spark streaming programs.
 ---

 Key: SPARK-6404
 URL: https://issues.apache.org/jira/browse/SPARK-6404
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Yifan Wang

 If I understand it correctly, Spark’s broadcast() function will be called 
 only once at the beginning of the batch. For streaming applications that need 
 to run 24/7, it is often necessary to update variables that are shared by 
 broadcast() dynamically. It would be ideal if broadcast() could be called at 
 the beginning of each interval.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5945) Spark should not retry a stage infinitely on a FetchFailedException

2015-03-18 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367891#comment-14367891
 ] 

Ilya Ganelin commented on SPARK-5945:
-

Hi Imran - I'd be happy to tackle this. Could you please assign it to me? Thank 
you. 

 Spark should not retry a stage infinitely on a FetchFailedException
 ---

 Key: SPARK-5945
 URL: https://issues.apache.org/jira/browse/SPARK-5945
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Imran Rashid

 While investigating SPARK-5928, I noticed some very strange behavior in the 
 way Spark retries stages after a FetchFailedException.  It seems that on a 
 FetchFailedException, instead of simply killing the task and retrying, Spark 
 aborts the stage and retries.  If it just retried the task, the task might 
 fail 4 times and then trigger the usual job killing mechanism.  But by 
 killing the stage instead, the max retry logic is skipped (it looks to me 
 like there is no limit for retries on a stage).
 After a bit of discussion with Kay Ousterhout, it seems the idea is that if a 
 fetch fails, we assume that the block manager we are fetching from has 
 failed, and that it will succeed if we retry the stage w/out that block 
 manager.  In that case, it wouldn't make any sense to retry the task, since 
 it's doomed to fail every time, so we might as well kill the whole stage.  But 
 this raises two questions:
 1) Is it really safe to assume that a FetchFailedException means that the 
 BlockManager has failed, and that it will work if we just try another one?  
 SPARK-5928 shows that there are at least some cases where that assumption is 
 wrong.  Even if we fix that case, this logic seems brittle to the next case 
 we find.  I guess the idea is that this behavior is what gives us the R in 
 RDD ... but it seems like it's not really that robust and maybe should be 
 reconsidered.
 2) Should stages only be retried a limited number of times?  It would be 
 pretty easy to put in a limited number of retries per stage.  Though again, 
 we encounter issues with keeping things resilient.  Theoretically one stage 
 could accumulate many retries caused by failures in different stages further 
 downstream, so we might need to track the cause of each retry as well to 
 still get the desired behavior.
 In general it just seems there is some flakiness in the retry logic.  This is 
 the only reproducible example I have at the moment, but I vaguely recall 
 hitting other cases of strange behavior w/ retries when trying to run long 
 pipelines.  E.g., if one executor is stuck in a GC during a fetch, the fetch 
 fails, but the executor eventually comes back and the stage gets retried 
 again, but the same GC issues happen the second time around, etc.
 Copied from SPARK-5928, here's the example program that can regularly produce 
 a loop of stage failures.  Note that it will only fail from a remote fetch, 
 so it can't be run locally -- I ran with {{MASTER=yarn-client spark-shell 
 --num-executors 2 --executor-memory 4000m}}
 {code}
 val rdd = sc.parallelize(1 to 1e6.toInt, 1).map { ignore =>
   val n = 3e3.toInt
   val arr = new Array[Byte](n)
   // need to make sure the array doesn't compress to something small
   scala.util.Random.nextBytes(arr)
   arr
 }
 rdd.map { x => (1, x) }.groupByKey().count()
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5932) Use consistent naming for byte properties

2015-03-18 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367919#comment-14367919
 ] 

Ilya Ganelin commented on SPARK-5932:
-

[~andrewor14] - I can take this out. Thanks.

 Use consistent naming for byte properties
 -

 Key: SPARK-5932
 URL: https://issues.apache.org/jira/browse/SPARK-5932
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Andrew Or
Assignee: Andrew Or

 This is SPARK-5931's sister issue.
 The naming of existing byte configs is inconsistent. We currently have the 
 following throughout the code base:
 {code}
 spark.reducer.maxMbInFlight // megabytes
 spark.kryoserializer.buffer.mb // megabytes
 spark.shuffle.file.buffer.kb // kilobytes
 spark.broadcast.blockSize // kilobytes
 spark.executor.logs.rolling.size.maxBytes // bytes
 spark.io.compression.snappy.block.size // bytes
 {code}
 Instead, my proposal is to simplify the config name itself and make 
 everything accept a size using the following format: 500b, 2k, 100m, 46g, 
 similar to what we currently use for our memory settings. For instance:
 {code}
 spark.reducer.maxSizeInFlight = 10m
 spark.kryoserializer.buffer = 2m
 spark.shuffle.file.buffer = 10k
 spark.broadcast.blockSize = 20k
 spark.executor.logs.rolling.maxSize = 500b
 spark.io.compression.snappy.blockSize = 200b
 {code}
 All existing configs that are relevant will be deprecated in favor of the new 
 ones. We should do this soon before we keep introducing more byte configs.
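
For illustration, a minimal sketch of how such size strings could be parsed; this is only a sketch of the proposed format, not Spark's actual implementation, and the method name is hypothetical.

{code}
// Hypothetical parser for the proposed size format: a number followed by b/k/m/g.
def parseSizeAsBytes(s: String): Long = {
  val SizePattern = """(?i)(\d+)\s*([bkmg])""".r
  s.trim match {
    case SizePattern(num, unit) =>
      val multiplier = unit.toLowerCase match {
        case "b" => 1L
        case "k" => 1L << 10
        case "m" => 1L << 20
        case "g" => 1L << 30
      }
      num.toLong * multiplier
    case _ =>
      throw new IllegalArgumentException(s"Invalid size string: $s")
  }
}

// parseSizeAsBytes("10m")  == 10485760L
// parseSizeAsBytes("500b") == 500L
{code}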



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5931) Use consistent naming for time properties

2015-03-18 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367915#comment-14367915
 ] 

Ilya Ganelin edited comment on SPARK-5931 at 3/18/15 9:09 PM:
--

[~andrewor14] - I can take this out. Thanks.


was (Author: ilganeli):
@andrewor - I can take this out. Thanks.

 Use consistent naming for time properties
 -

 Key: SPARK-5931
 URL: https://issues.apache.org/jira/browse/SPARK-5931
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Andrew Or
Assignee: Andrew Or

 This is SPARK-5932's sister issue.
 The naming of existing time configs is inconsistent. We currently have the 
 following throughout the code base:
 {code}
 spark.network.timeout // seconds
 spark.executor.heartbeatInterval // milliseconds
 spark.storage.blockManagerSlaveTimeoutMs // milliseconds
 spark.yarn.scheduler.heartbeat.interval-ms // milliseconds
 {code}
 Instead, my proposal is to simplify the config name itself and make 
 everything accept time using the following format: 5s, 2ms, 100us. For 
 instance:
 {code}
 spark.network.timeout = 5s
 spark.executor.heartbeatInterval = 500ms
 spark.storage.blockManagerSlaveTimeout = 100ms
 spark.yarn.scheduler.heartbeatInterval = 400ms
 {code}
 All existing configs that are relevant will be deprecated in favor of the new 
 ones. We should do this soon before we keep introducing more time configs.
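
For illustration, a minimal sketch of how the deprecation could work for one config; this is only a sketch, not Spark's actual implementation, and the method and key names are hypothetical: read the new-style key first, then fall back to the old millisecond-suffixed key and tag its value with "ms".

{code}
// Hypothetical fallback from a deprecated *-Ms key to the new time-string key.
def getTimeConf(conf: Map[String, String], newKey: String, deprecatedMsKey: String, default: String): String = {
  conf.get(newKey)
    .orElse(conf.get(deprecatedMsKey).map(_ + "ms"))
    .getOrElse(default)
}

// getTimeConf(Map("spark.storage.blockManagerSlaveTimeoutMs" -> "45000"),
//             "spark.storage.blockManagerSlaveTimeout",
//             "spark.storage.blockManagerSlaveTimeoutMs",
//             "120s") == "45000ms"
{code}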



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5931) Use consistent naming for time properties

2015-03-18 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14367915#comment-14367915
 ] 

Ilya Ganelin commented on SPARK-5931:
-

@andrewor - I can take this out. Thanks.

 Use consistent naming for time properties
 -

 Key: SPARK-5931
 URL: https://issues.apache.org/jira/browse/SPARK-5931
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Andrew Or
Assignee: Andrew Or

 This is SPARK-5932's sister issue.
 The naming of existing time configs is inconsistent. We currently have the 
 following throughout the code base:
 {code}
 spark.network.timeout // seconds
 spark.executor.heartbeatInterval // milliseconds
 spark.storage.blockManagerSlaveTimeoutMs // milliseconds
 spark.yarn.scheduler.heartbeat.interval-ms // milliseconds
 {code}
 Instead, my proposal is to simplify the config name itself and make 
 everything accept time using the following format: 5s, 2ms, 100us. For 
 instance:
 {code}
 spark.network.timeout = 5s
 spark.executor.heartbeatInterval = 500ms
 spark.storage.blockManagerSlaveTimeout = 100ms
 spark.yarn.scheduler.heartbeatInterval = 400ms
 {code}
 All existing configs that are relevant will be deprecated in favor of the new 
 ones. We should do this soon before we keep introducing more time configs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6364) hashCode and equals for Matrices

2015-03-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14366720#comment-14366720
 ] 

Apache Spark commented on SPARK-6364:
-

User 'MechCoder' has created a pull request for this issue:
https://github.com/apache/spark/pull/5081

 hashCode and equals for Matrices
 

 Key: SPARK-6364
 URL: https://issues.apache.org/jira/browse/SPARK-6364
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Manoj Kumar

 The hashCode implementation should be similar to Vector's, but we may want to 
 reduce the complexity by scanning only the first few nonzeros instead of all of them.
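
For illustration, a minimal sketch of the idea (this is not the actual MLlib implementation; the parameter names are placeholders for a dense, column-major values array): combine the dimensions with the first few nonzero entries so that equal matrices hash equally without scanning every element.

{code}
// Hypothetical sketch: hash the dimensions plus the first few nonzero entries.
def matrixHashCode(numRows: Int, numCols: Int, values: Array[Double]): Int = {
  var result = 31 * numRows + numCols
  var nnzSeen = 0
  var i = 0
  // Stop after a handful of nonzeros to keep hashCode cheap on large matrices.
  while (i < values.length && nnzSeen < 16) {
    val v = values(i)
    if (v != 0.0) {
      val bits = java.lang.Double.doubleToLongBits(v)
      // Fold in both the position and the value so permuted entries hash differently.
      result = 31 * result + i
      result = 31 * result + (bits ^ (bits >>> 32)).toInt
      nnzSeen += 1
    }
    i += 1
  }
  result
}
{code}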



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org