[jira] [Updated] (SPARK-5821) CTAS command failure when you don't have write permission on the parent directory

[ https://issues.apache.org/jira/browse/SPARK-5821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated SPARK-5821:
Summary: CTAS command failure when you don't have write permission on the parent directory
(was: [SQL] CTAS command failure when you don't have write permission on the parent directory)

Key: SPARK-5821
URL: https://issues.apache.org/jira/browse/SPARK-5821
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.0
Reporter: Yanbo Liang

When you run a CTAS command such as CREATE TEMPORARY TABLE jsonTable USING org.apache.spark.sql.json.DefaultSource OPTIONS (path '/a/b/c/d') AS SELECT a, b FROM jt, it fails if you don't have write permission on the directory /a/b/c, whether d is a directory or a file.
[jira] [Updated] (SPARK-5821) JSONRelation should check if delete is successful for the overwrite operation.
[ https://issues.apache.org/jira/browse/SPARK-5821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated SPARK-5821:
Summary: JSONRelation should check if delete is successful for the overwrite operation.
(was: CTAS command failure when you don't have write permission on the parent directory)

Key: SPARK-5821
URL: https://issues.apache.org/jira/browse/SPARK-5821
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.0
Reporter: Yanbo Liang

When you run a CTAS command such as CREATE TEMPORARY TABLE jsonTable USING org.apache.spark.sql.json.DefaultSource OPTIONS (path '/a/b/c/d') AS SELECT a, b FROM jt, it fails if you don't have write permission on the directory /a/b/c, whether d is a directory or a file.
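A minimal Scala sketch of the check this rename asks for, under the assumption that the overwrite path clears the target with Hadoop's FileSystem.delete; prepareOverwrite and its error message are illustrative, not the actual JSONRelation patch:

import java.io.IOException

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Check the boolean returned by FileSystem.delete instead of discarding it,
// so a missing write permission on the parent directory (/a/b/c above)
// surfaces as a clear error rather than a confusing failure later on.
def prepareOverwrite(pathStr: String, hadoopConf: Configuration): Unit = {
  val path = new Path(pathStr)
  val fs = path.getFileSystem(hadoopConf)
  if (fs.exists(path) && !fs.delete(path, true)) {
    throw new IOException(
      s"Unable to clear output directory $path prior to overwriting; " +
        "check write permissions on the parent directory.")
  }
}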
[jira] [Updated] (SPARK-5738) Reuse mutable row for each record at jsonStringToRow
[ https://issues.apache.org/jira/browse/SPARK-5738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated SPARK-5738:
Summary: Reuse mutable row for each record at jsonStringToRow
(was: [SQL] Reuse mutable row for each record at jsonStringToRow)

Key: SPARK-5738
URL: https://issues.apache.org/jira/browse/SPARK-5738
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 1.3.0
Reporter: Yanbo Liang

Other table-scan-like operations, such as ParquetTableScan and HiveTableScan, use a reusable mutable row during serialization to reduce garbage. We should apply the same optimization to JSONRelation#buildScan(): when serializing a JSON string to a row, reuse a mutable row for each record and for inner nested structures instead of creating a new one every time.
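For illustration, here is a minimal sketch of the reuse pattern being described; MutableRow and parsePartition are hypothetical stand-ins, not the actual Catalyst or JSONRelation classes:

// One mutable buffer per partition, overwritten for every record, instead of
// a fresh array allocation per record. This is what cuts the garbage.
final class MutableRow(arity: Int) {
  private val values = new Array[Any](arity)
  def update(i: Int, v: Any): Unit = values(i) = v
  def apply(i: Int): Any = values(i)
}

def parsePartition(
    records: Iterator[Map[String, Any]],
    fields: IndexedSeq[String]): Iterator[MutableRow] = {
  val row = new MutableRow(fields.length) // allocated once, not per record
  records.map { record =>
    var i = 0
    while (i < fields.length) {
      row(i) = record.getOrElse(fields(i), null)
      i += 1
    }
    row // callers must consume or copy the row before calling next()
  }
}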
[jira] [Updated] (SPARK-5821) [SQL] CTAS command failure when you don't have write permission on the parent directory

[ https://issues.apache.org/jira/browse/SPARK-5821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated SPARK-5821:
Target Version/s: 1.3.0

Key: SPARK-5821
URL: https://issues.apache.org/jira/browse/SPARK-5821
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.0
Reporter: Yanbo Liang

When you run a CTAS command such as CREATE TEMPORARY TABLE jsonTable USING org.apache.spark.sql.json.DefaultSource OPTIONS (path '/a/b/c/d') AS SELECT a, b FROM jt, it fails if you don't have write permission on the directory /a/b/c, whether d is a directory or a file.
[jira] [Commented] (SPARK-5825) Failure stopping services when the command line argument is too long

[ https://issues.apache.org/jira/browse/SPARK-5825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321839#comment-14321839 ]

Apache Spark commented on SPARK-5825:

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/4611

Key: SPARK-5825
URL: https://issues.apache.org/jira/browse/SPARK-5825
Project: Spark
Issue Type: Bug
Components: Spark Submit
Reporter: Cheng Hao

Stopping a service in `spark-daemon.sh` confirms the process id by fuzzy-matching the class name; however, this fails if the Java process's argument list is very long (greater than 4,096 characters).
[jira] [Commented] (SPARK-5826) JavaStreamingContext.fileStream causes Configuration NotSerializableException

[ https://issues.apache.org/jira/browse/SPARK-5826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321862#comment-14321862 ]

Littlestar commented on SPARK-5826:

I put some txt files into /testspark/watchdir, and it throws a NullPointerException:

15/02/15 16:18:20 WARN dstream.FileInputDStream: Error finding new files
java.lang.NullPointerException
    at org.apache.spark.streaming.api.java.JavaStreamingContext$$anonfun$fn$3$1.apply(JavaStreamingContext.scala:329)
    at org.apache.spark.streaming.api.java.JavaStreamingContext$$anonfun$fn$3$1.apply(JavaStreamingContext.scala:329)
    at org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$isNewFile(FileInputDStream.scala:215)
    at org.apache.spark.streaming.dstream.FileInputDStream$$anon$3.accept(FileInputDStream.scala:172)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1489)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1523)
    at org.apache.spark.streaming.dstream.FileInputDStream.findNewFiles(FileInputDStream.scala:174)
    at org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:132)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:301)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:301)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:300)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:288)
    at scala.Option.orElse(Option.scala:257)
    at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:285)
    at org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:301)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:301)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:300)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:288)
    at scala.Option.orElse(Option.scala:257)
    at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:285)
    at org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:301)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:301)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:300)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:288)
    at scala.Option.orElse(Option.scala:257)
    at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:285)
    at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
    at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
    at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
    at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
    at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:232)
    at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$2.apply(JobGenerator.scala:230)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:230)
    at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:167)
    at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1$$anonfun$receive$1.applyOrElse(JobGenerator.scala:78)
[jira] [Commented] (SPARK-5813) Spark-ec2: Switch to OracleJDK
[ https://issues.apache.org/jira/browse/SPARK-5813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321764#comment-14321764 ]

Florian Verhein commented on SPARK-5813:

IANAL, but here are my thoughts: the user ends up downloading it from Oracle and accepting the license terms in that process, so as long as they are (or are made) aware, I don't really see a problem. It's just providing a mechanism for them to do this; i.e., it's not a redistribution issue. I think a reasonable solution would be to have OpenJDK as the default, with OracleJDK as an option that the user must specifically request (and the option's documentation indicating that this entails acceptance of a license, etc.).

At least, *the above is true in the case where the user builds their own AMI* (that's the approach I take since it best suits my requirements). With provided AMIs I think this is more complex, because I would assume that is redistribution. I guess that applies to any software that is put on the AMI, actually... so this may be an issue that needs looking at more generally. I don't know how to best approach that case other than adhering to any redistribution terms, including these as part of an EULA for spark-ec2/AMIs or something. But with the work [~nchammas] has done, I suppose the easiest way would be to provide the public AMIs with OpenJDK, and add an option to build ones with OracleJDK if the user is inclined to do this themselves. Hmmm... is this worthwhile?

Key: SPARK-5813
URL: https://issues.apache.org/jira/browse/SPARK-5813
Project: Spark
Issue Type: Improvement
Components: EC2
Reporter: Florian Verhein
Priority: Minor

Currently spark-ec2 uses OpenJDK; however, it is generally recommended to use the Oracle JDK, especially for Hadoop deployments, etc.
[jira] [Created] (SPARK-5826) JavaStreamingContext.fileStream causes Configuration NotSerializableException

Littlestar created SPARK-5826:

Summary: JavaStreamingContext.fileStream causes Configuration NotSerializableException
Key: SPARK-5826
URL: https://issues.apache.org/jira/browse/SPARK-5826
Project: Spark
Issue Type: Bug
Affects Versions: 1.2.1
Reporter: Littlestar
Priority: Minor

org.apache.spark.streaming.api.java.JavaStreamingContext.fileStream(String directory, Class<LongWritable> kClass, Class<Text> vClass, Class<TextInputFormat> fClass, Function<Path, Boolean> filter, boolean newFilesOnly, Configuration conf)

I use JavaStreamingContext.fileStream with 1.3.0/master with a Configuration, but it throws a strange exception:

java.io.NotSerializableException: org.apache.hadoop.conf.Configuration
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
    at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
    at java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:440)
    at org.apache.spark.streaming.DStreamGraph$$anonfun$writeObject$1.apply$mcV$sp(DStreamGraph.scala:177)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1075)
    at org.apache.spark.streaming.DStreamGraph.writeObject(DStreamGraph.scala:172)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
    at org.apache.spark.streaming.CheckpointWriter.write(Checkpoint.scala:184)
    at org.apache.spark.streaming.scheduler.JobGenerator.doCheckpoint(JobGenerator.scala:278)
    at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:169)
    at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1$$anonfun$receive$1.applyOrElse(JobGenerator.scala:78)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1.aroundReceive(JobGenerator.scala:76)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
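The root cause is that org.apache.hadoop.conf.Configuration does not implement java.io.Serializable, so a DStream that captures it directly cannot be checkpointed. A hedged sketch of one workaround pattern seen in Spark 1.x code, wrapping the Configuration in org.apache.spark.SerializableWritable (the ConfHolder class itself is hypothetical):

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SerializableWritable

// Configuration implements Hadoop's Writable, so SerializableWritable can
// carry it through Java serialization even though it is not Serializable.
class ConfHolder(conf: Configuration) extends Serializable {
  private val wrapped = new SerializableWritable(conf)
  def value: Configuration = wrapped.value
}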
[jira] [Commented] (SPARK-5816) Add huge backward compatibility warning in DriverWrapper
[ https://issues.apache.org/jira/browse/SPARK-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321902#comment-14321902 ]

Apache Spark commented on SPARK-5816:

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/4612

Key: SPARK-5816
URL: https://issues.apache.org/jira/browse/SPARK-5816
Project: Spark
Issue Type: Bug
Components: Deploy, Spark Core
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Critical

As of Spark 1.3, we provide backward and forward compatibility in standalone cluster mode through the REST submission gateway. However, submissions still go through the legacy o.a.s.deploy.DriverWrapper, so the semantics of the command line arguments there must not change. For instance, this was broken in commit 20a6013106b56a1a1cc3e8cda092330ffbe77cc3. There is currently no warning against that in the class, so we should add one before it's too late.
[jira] [Updated] (SPARK-5826) JavaStreamingContext.fileStream causes Configuration NotSerializableException

[ https://issues.apache.org/jira/browse/SPARK-5826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Littlestar updated SPARK-5826:
Attachment: TestStream.java

Key: SPARK-5826
URL: https://issues.apache.org/jira/browse/SPARK-5826
Project: Spark
Issue Type: Bug
Components: Streaming
Affects Versions: 1.2.1
Reporter: Littlestar
Priority: Minor
Attachments: TestStream.java

org.apache.spark.streaming.api.java.JavaStreamingContext.fileStream(String directory, Class<LongWritable> kClass, Class<Text> vClass, Class<TextInputFormat> fClass, Function<Path, Boolean> filter, boolean newFilesOnly, Configuration conf)

I use JavaStreamingContext.fileStream with 1.3.0/master with a Configuration, but it throws java.io.NotSerializableException: org.apache.hadoop.conf.Configuration.
[jira] [Commented] (SPARK-5791) [Spark SQL] Poor performance when multiple tables do join operations

[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321777#comment-14321777 ]

Yi Zhou commented on SPARK-5791:

For the same input data set size, it takes about ~2 minutes on Hive on M/R with optimization parameters, but about ~1 hour on Spark SQL.

Key: SPARK-5791
URL: https://issues.apache.org/jira/browse/SPARK-5791
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.2.0
Reporter: Yi Zhou

Spark SQL shows poor performance when multiple tables do join operations.
[jira] [Comment Edited] (SPARK-5820) Example does not work when using SOCKS proxy
[ https://issues.apache.org/jira/browse/SPARK-5820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321776#comment-14321776 ]

Eric O. LEBIGOT (EOL) edited comment on SPARK-5820 at 2/15/15 1:48 AM:

Good question, I should have added more details: I am running locally (and have a local SOCKS proxy connection to a remote host). The JVM uses the proxy settings from my default browser (as indicated in the Java OS X Preferences). I just tried another setting: Use proxy server (Advanced: For all protocols, on and off), Bypass proxy server for local addresses. This does not work either.

was (Author: lebigot):
Good question, I should have added more details: I am running locally (and have a local SOCKS proxy connection to a remote host). The JVM uses the proxy settings from my default browser (as indicated in the Java OS X Preferences). I will try with other settings.

Key: SPARK-5820
URL: https://issues.apache.org/jira/browse/SPARK-5820
Project: Spark
Issue Type: Bug
Affects Versions: 1.2.1
Reporter: Eric O. LEBIGOT (EOL)

When using a SOCKS proxy (on OS X 10.10.2), running even the basic example ./bin/run-example SparkPi 10 fails.

-- Partial log --
15/02/14 23:23:00 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
15/02/14 23:23:00 INFO TaskSchedulerImpl: Cancelling stage 0
15/02/14 23:23:00 INFO Executor: Executor is trying to kill task 1.0 in stage 0.0 (TID 1)
15/02/14 23:23:00 INFO TaskSchedulerImpl: Stage 0 was cancelled
15/02/14 23:23:00 INFO DAGScheduler: Job 0 failed: reduce at SparkPi.scala:35, took 1.920223 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.net.SocketException: Malformed reply from SOCKS server
    at java.net.SocksSocketImpl.readSocksReply(SocksSocketImpl.java:129)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:503)
    at java.net.Socket.connect(Socket.java:579)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
    at sun.net.www.http.HttpClient.New(HttpClient.java:308)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1003)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:951)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850)
    at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:582)
    at org.apache.spark.util.Utils$.fetchFile(Utils.scala:433)
    at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:356)
    at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:353)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:778)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
    at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
    at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
    at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:777)
    at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:353)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
[jira] [Comment Edited] (SPARK-5820) Example does not work when using SOCKS proxy
[ https://issues.apache.org/jira/browse/SPARK-5820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321776#comment-14321776 ]

Eric O. LEBIGOT (EOL) edited comment on SPARK-5820 at 2/15/15 2:20 AM:

Good question, I should have added more details: I am running locally (and have a local SOCKS proxy connection to a remote host). In the Java OS X Preferences, I had set the JVM to use the proxy settings from my default browser (the default), which is to use the system-wide proxy setting (from Apple > System Preferences > Network > Advanced > Proxies). I just tried other settings, to no avail (they all fail with the same error):
- Use proxy server (Advanced: For all protocols, on and off), Bypass proxy server for local addresses.
- Direct connection.
It is strange that even the last one fails.

was (Author: lebigot):
Good question, I should have added more details: I am running locally (and have a local SOCKS proxy connection to a remote host). In the Java OS X Preferences, I had set the JVM to use the proxy settings from my default browser (the default). I just tried other settings, to no avail (they all fail with the same error):
- Use proxy server (Advanced: For all protocols, on and off), Bypass proxy server for local addresses.
- Direct connection.
It is strange that even the last one fails.

Key: SPARK-5820
URL: https://issues.apache.org/jira/browse/SPARK-5820
Project: Spark
Issue Type: Bug
Affects Versions: 1.2.1
Reporter: Eric O. LEBIGOT (EOL)

When using a SOCKS proxy (on OS X 10.10.2), running even the basic example ./bin/run-example SparkPi 10 fails with java.net.SocketException: Malformed reply from SOCKS server.
[jira] [Comment Edited] (SPARK-5820) Example does not work when using SOCKS proxy
[ https://issues.apache.org/jira/browse/SPARK-5820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321776#comment-14321776 ]

Eric O. LEBIGOT (EOL) edited comment on SPARK-5820 at 2/15/15 2:26 AM:

Good question, I should have added more details: I am running locally (and have a local SOCKS proxy connection to a remote host, with localhost bypassed by the proxy, as it should be; the goal is mostly to get access to Google). In the Java OS X Preferences, I had set the JVM to use the proxy settings from my default browser (the default), which is to use the system-wide proxy setting (from Apple > System Preferences > Network > Advanced > Proxies). I just tried the following other JVM settings, to no avail (they all fail with the same error):
- Use proxy server (Advanced: For all protocols, on and off), Bypass proxy server for local addresses.
- Direct connection.
It is strange that even the last one fails.

was (Author: lebigot):
Good question, I should have added more details: I am running locally (and have a local SOCKS proxy connection to a remote host, with localhost bypassed by the proxy, as it should be). In the Java OS X Preferences, I had set the JVM to use the proxy settings from my default browser (the default), which is to use the system-wide proxy setting (from Apple > System Preferences > Network > Advanced > Proxies). I just tried the following other JVM settings, to no avail (they all fail with the same error):
- Use proxy server (Advanced: For all protocols, on and off), Bypass proxy server for local addresses.
- Direct connection.
It is strange that even the last one fails.

Key: SPARK-5820
URL: https://issues.apache.org/jira/browse/SPARK-5820
Project: Spark
Issue Type: Bug
Affects Versions: 1.2.1
Reporter: Eric O. LEBIGOT (EOL)

When using a SOCKS proxy (on OS X 10.10.2), running even the basic example ./bin/run-example SparkPi 10 fails with java.net.SocketException: Malformed reply from SOCKS server.
[jira] [Updated] (SPARK-5738) [SQL] Reuse mutable row for each record at jsonStringToRow
[ https://issues.apache.org/jira/browse/SPARK-5738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated SPARK-5738:
Issue Type: Sub-task (was: Improvement)
Parent: SPARK-3700

Key: SPARK-5738
URL: https://issues.apache.org/jira/browse/SPARK-5738
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 1.3.0
Reporter: Yanbo Liang

Other table-scan-like operations, such as ParquetTableScan and HiveTableScan, use a reusable mutable row during serialization to reduce garbage. We should apply the same optimization to JSONRelation#buildScan(): when serializing a JSON string to a row, reuse a mutable row for each record and for inner nested structures instead of creating a new one every time.
[jira] [Commented] (SPARK-5813) Spark-ec2: Switch to OracleJDK
[ https://issues.apache.org/jira/browse/SPARK-5813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321942#comment-14321942 ]

Sean Owen commented on SPARK-5813:

I kind of misstated this. I think this issue is more fundamentally one of distribution. I don't believe others are entitled to redistribute Oracle's JDK/JRE, so I don't think Spark can provide AMIs that contain the Oracle implementation. Providing tools to help someone build an AMI with the Oracle JDK is different. However, there too I don't think you can hide the license agreement and accept it on the user's behalf, or slip in what you think is an equivalent license agreement process. It's not our call to make.

Dumb question: are AMIs being hosted and redistributed by the Spark project? I wasn't aware of these if so. Whoever does, yes, needs to think about what software licensing terms mean for redistribution. It's perhaps surprising to most people, and an artifact of history, that these OSS licenses kick in almost solely when you distribute, not use, the software!

Anyway: every installer that I've seen that provides the Oracle JDK is a wrapper around their downloader and EULA script. You could embed that process in a script, if you dare. My hunch is that it's not worth the trouble if there's no obvious demand or motivation.

Key: SPARK-5813
URL: https://issues.apache.org/jira/browse/SPARK-5813
Project: Spark
Issue Type: Improvement
Components: EC2
Reporter: Florian Verhein
Priority: Minor

Currently spark-ec2 uses OpenJDK; however, it is generally recommended to use the Oracle JDK, especially for Hadoop deployments, etc.
[jira] [Commented] (SPARK-5480) GraphX pageRank: java.lang.ArrayIndexOutOfBoundsException:
[ https://issues.apache.org/jira/browse/SPARK-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321950#comment-14321950 ]

Sean Owen commented on SPARK-5480:

It doesn't look 100% like the same issue, but have a look at SPARK-1329 too.

Key: SPARK-5480
URL: https://issues.apache.org/jira/browse/SPARK-5480
Project: Spark
Issue Type: Bug
Components: GraphX
Affects Versions: 1.2.0
Environment: Yarn client
Reporter: Stephane Maarek

Running the following code:

val subgraph = graph.subgraph(
  vpred = (id, article) => // working predicate
).cache()
println(s"Subgraph contains ${subgraph.vertices.count} nodes and ${subgraph.edges.count} edges")
val prGraph = subgraph.staticPageRank(5).cache
val titleAndPrGraph = subgraph.outerJoinVertices(prGraph.vertices) {
  (v, title, rank) => (rank.getOrElse(0.0), title)
}
titleAndPrGraph.vertices.top(13) {
  Ordering.by((entry: (VertexId, (Double, _))) => entry._2._1)
}.foreach(t => println(t._2._2._1 + ": " + t._2._1 + ", id: " + t._1))

returns a graph with 5000 nodes and 4000 edges. Then it crashes during the PageRank with the following:

15/01/29 05:51:07 INFO scheduler.TaskSetManager: Starting task 125.0 in stage 39.0 (TID 1808, *HIDDEN, PROCESS_LOCAL, 2059 bytes)
15/01/29 05:51:07 WARN scheduler.TaskSetManager: Lost task 107.0 in stage 39.0 (TID 1794, *HIDDEN): java.lang.ArrayIndexOutOfBoundsException: -1
    at org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap$mcJI$sp.apply$mcJI$sp(GraphXPrimitiveKeyOpenHashMap.scala:64)
    at org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:91)
    at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75)
    at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:110)
    at org.apache.spark.graphx.impl.EdgeRDDImpl$$anonfun$mapEdgePartitions$1.apply(EdgeRDDImpl.scala:108)
    at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
    at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
[jira] [Comment Edited] (SPARK-5820) Example does not work when using SOCKS proxy
[ https://issues.apache.org/jira/browse/SPARK-5820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321776#comment-14321776 ]

Eric O. LEBIGOT (EOL) edited comment on SPARK-5820 at 2/15/15 2:23 AM:

Good question, I should have added more details: I am running locally (and have a local SOCKS proxy connection to a remote host, with localhost bypassed by the proxy, as it should be). In the Java OS X Preferences, I had set the JVM to use the proxy settings from my default browser (the default), which is to use the system-wide proxy setting (from Apple > System Preferences > Network > Advanced > Proxies). I just tried the following other JVM settings, to no avail (they all fail with the same error):
- Use proxy server (Advanced: For all protocols, on and off), Bypass proxy server for local addresses.
- Direct connection.
It is strange that even the last one fails.

was (Author: lebigot):
Good question, I should have added more details: I am running locally (and have a local SOCKS proxy connection to a remote host, with localhost bypassed by the proxy, as it should). In the Java OS X Preferences, I had set the JVM to use the proxy settings from my default browser (the default), which is to use the system-wide proxy setting (from Apple > System Preferences > Network > Advanced > Proxies). I just tried the following other JVM settings, to no avail (they all fail with the same error):
- Use proxy server (Advanced: For all protocols, on and off), Bypass proxy server for local addresses.
- Direct connection.
It is strange that even the last one fails.

Key: SPARK-5820
URL: https://issues.apache.org/jira/browse/SPARK-5820
Project: Spark
Issue Type: Bug
Affects Versions: 1.2.1
Reporter: Eric O. LEBIGOT (EOL)

When using a SOCKS proxy (on OS X 10.10.2), running even the basic example ./bin/run-example SparkPi 10 fails with java.net.SocketException: Malformed reply from SOCKS server.
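Judging from the stack trace above, the failing step is Utils.doFetchFile opening an HttpURLConnection to the driver's local file server, which the JVM then routes through the SOCKS proxy even for localhost. A small hedged diagnostic that reproduces just that fetch path outside Spark (SocksProxyCheck and the default URL are hypothetical):

import java.net.{HttpURLConnection, URL}

object SocksProxyCheck {
  def main(args: Array[String]): Unit = {
    // Same kind of plain HTTP fetch that Utils.doFetchFile performs on executors.
    val target = new URL(args.headOption.getOrElse("http://localhost:8080/"))
    println(s"socksProxyHost=${sys.props.getOrElse("socksProxyHost", "<unset>")}")
    val conn = target.openConnection().asInstanceOf[HttpURLConnection]
    conn.setConnectTimeout(5000)
    // "Malformed reply from SOCKS server" here reproduces the reported bug.
    try println(s"HTTP ${conn.getResponseCode}")
    finally conn.disconnect()
  }
}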
[jira] [Comment Edited] (SPARK-5823) Reuse mutable rows for inner structures when parsing JSON objects
[ https://issues.apache.org/jira/browse/SPARK-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321792#comment-14321792 ]

Yanbo Liang edited comment on SPARK-5823 at 2/15/15 2:57 AM:

Hi [~yhuai], actually I have already implemented reusing a mutable row for inner structures in https://issues.apache.org/jira/browse/SPARK-5738. However, I found that you mentioned extending Spark SQL's JSON support to handle the case where each object in the dataset might have a considerably different schema (https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html). In this scenario, the inner nested mutable-row strategy will not bring much performance improvement, am I right?

was (Author: yanboliang):
Hi [~yhuai], actually I have already implemented reusing a mutable row for inner structures in https://issues.apache.org/jira/browse/SPARK-5738. However, I found that you mentioned extending Spark SQL's JSON support to handle the case where each object in the dataset might have a considerably different schema. In this scenario, the inner nested mutable-row strategy will not bring much performance improvement, am I right?

Key: SPARK-5823
URL: https://issues.apache.org/jira/browse/SPARK-5823
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Yin Huai

With SPARK-5738, we will reuse a mutable row for top-level records when parsing JSON objects. We can do the same thing for inner structures.
[jira] [Commented] (SPARK-5746) INSERT OVERWRITE throws FileNotFoundException when the source and destination point to the same table.
[ https://issues.apache.org/jira/browse/SPARK-5746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321825#comment-14321825 ]

Apache Spark commented on SPARK-5746:

User 'yanbohappy' has created a pull request for this issue:
https://github.com/apache/spark/pull/4610

Key: SPARK-5746
URL: https://issues.apache.org/jira/browse/SPARK-5746
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker

With the newly introduced write support of the data source API, {{JSONRelation}} and {{ParquetRelation2}} both suffer from this bug. The root cause is that we remove the source table before insertion ([here|https://github.com/apache/spark/blob/1ac099e3e00ddb01af8e6e3a84c70f8363f04b5c/sql/core/src/main/scala/org/apache/spark/sql/json/JSONRelation.scala#L112-L121]). The correct solution is to first insert into a temporary folder and then overwrite the source table.
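A hedged Scala sketch of the fix direction the description proposes, writing into a staging directory first and swapping it in only after the query succeeds; overwriteViaTemp is an illustrative helper, not the actual patch:

import java.io.IOException

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Never delete the source before the new data exists: write to a staging
// directory, then delete and rename, so reading and overwriting the same
// table cannot race into a FileNotFoundException.
def overwriteViaTemp(dest: String, hadoopConf: Configuration)(write: Path => Unit): Unit = {
  val destPath = new Path(dest)
  val fs = destPath.getFileSystem(hadoopConf)
  val staging = new Path(destPath.getParent, s".${destPath.getName}-staging-${System.nanoTime()}")
  write(staging)            // run the INSERT OVERWRITE query into the staging dir
  fs.delete(destPath, true) // drop the old data only after the write succeeded
  if (!fs.rename(staging, destPath)) {
    throw new IOException(s"Failed to move $staging to $destPath")
  }
}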
[jira] [Comment Edited] (SPARK-5791) [Spark SQL] Poor performance when multiple tables do join operations

[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321777#comment-14321777 ]

Yi Zhou edited comment on SPARK-5791 at 2/15/15 1:49 AM:

For the same input data set size (e.g., 1 TB), it takes about ~2 minutes on Hive on M/R with optimization parameters, but about ~1 hour on Spark SQL.

was (Author: jameszhouyi):
For the same input data set size, it takes about ~2 minutes on Hive on M/R with optimization parameters, but about ~1 hour on Spark SQL.

Key: SPARK-5791
URL: https://issues.apache.org/jira/browse/SPARK-5791
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.2.0
Reporter: Yi Zhou

Spark SQL shows poor performance when multiple tables do join operations.
[jira] [Commented] (SPARK-5820) Example does not work when using SOCKS proxy
[ https://issues.apache.org/jira/browse/SPARK-5820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321776#comment-14321776 ]

Eric O. LEBIGOT (EOL) commented on SPARK-5820:

Good question, I should have added more details: I am running locally (and have a local SOCKS proxy connection to a remote host). The JVM uses the proxy settings from my default browser (as indicated in the Java OS X Preferences). I will try with other settings.

Key: SPARK-5820
URL: https://issues.apache.org/jira/browse/SPARK-5820
Project: Spark
Issue Type: Bug
Affects Versions: 1.2.1
Reporter: Eric O. LEBIGOT (EOL)

When using a SOCKS proxy (on OS X 10.10.2), running even the basic example ./bin/run-example SparkPi 10 fails with java.net.SocketException: Malformed reply from SOCKS server.
[jira] [Updated] (SPARK-5826) JavaStreamingContext.fileStream causes Configuration NotSerializableException
[ https://issues.apache.org/jira/browse/SPARK-5826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Littlestar updated SPARK-5826: -- Attachment: (was: TestStream.java) JavaStreamingContext.fileStream causes Configuration NotSerializableException Key: SPARK-5826 URL: https://issues.apache.org/jira/browse/SPARK-5826 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Reporter: Littlestar Priority: Minor Attachments: TestStream.java org.apache.spark.streaming.api.java.JavaStreamingContext.fileStream(String directory, Class<LongWritable> kClass, Class<Text> vClass, Class<TextInputFormat> fClass, Function<Path, Boolean> filter, boolean newFilesOnly, Configuration conf) I use JavaStreamingContext.fileStream on 1.3.0/master with a Configuration, but it throws a strange exception. java.io.NotSerializableException: org.apache.hadoop.conf.Configuration at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:440) at org.apache.spark.streaming.DStreamGraph$$anonfun$writeObject$1.apply$mcV$sp(DStreamGraph.scala:177) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1075) at org.apache.spark.streaming.DStreamGraph.writeObject(DStreamGraph.scala:172) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.streaming.CheckpointWriter.write(Checkpoint.scala:184) at
org.apache.spark.streaming.scheduler.JobGenerator.doCheckpoint(JobGenerator.scala:278) at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:169) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1$$anonfun$receive$1.applyOrElse(JobGenerator.scala:78) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1.aroundReceive(JobGenerator.scala:76) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at
[jira] [Commented] (SPARK-5795) api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not be friendly to Java
[ https://issues.apache.org/jira/browse/SPARK-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321848#comment-14321848 ] Littlestar commented on SPARK-5795: --- works for me, thanks. api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not be friendly to Java - Key: SPARK-5795 URL: https://issues.apache.org/jira/browse/SPARK-5795 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Reporter: Littlestar Priority: Critical Attachments: TestStreamCompile.java With import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; the following code can't compile in Java:

JavaPairDStream<Integer, Integer> rs = ...;
rs.saveAsNewAPIHadoopFiles(prefix, "txt", Integer.class, Integer.class, TextOutputFormat.class, jobConf);

but similar code on JavaPairRDD works OK:

JavaPairRDD<String, String> counts = ...;
counts.saveAsNewAPIHadoopFile(out, Text.class, Text.class, TextOutputFormat.class, jobConf);

Maybe the

def saveAsNewAPIHadoopFiles(
    prefix: String,
    suffix: String,
    keyClass: Class[_],
    valueClass: Class[_],
    outputFormatClass: Class[_ <: NewOutputFormat[_, _]],
    conf: Configuration = new Configuration) {
  dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, outputFormatClass, conf)
}

should become

def saveAsNewAPIHadoopFiles[F <: NewOutputFormat[_, _]](
    prefix: String,
    suffix: String,
    keyClass: Class[_],
    valueClass: Class[_],
    outputFormatClass: Class[F],
    conf: Configuration = new Configuration) {
  dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, outputFormatClass, conf)
}

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
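For readers following along: the proposal above replaces an existential bound with a method type parameter. A compile-checked Scala sketch of the pattern, with illustrative stand-in types rather than the Hadoop classes:

{code}
// Stand-ins for Hadoop's generic OutputFormat hierarchy.
trait OutputFormatLike[K, V]
class TextLike[K, V] extends OutputFormatLike[K, V]

object Existential {
  // Java callers must match the raw TextLike.class literal against
  // Class<? extends OutputFormatLike<?, ?>>, which javac typically rejects
  // (the reporter's compile error).
  def save(outputFormatClass: Class[_ <: OutputFormatLike[_, _]]): Unit = ()
}

object Parameterized {
  // With a type parameter, javac infers F = TextLike and the call compiles.
  def save[F <: OutputFormatLike[_, _]](outputFormatClass: Class[F]): Unit = ()
}

object Demo extends App {
  Existential.save(classOf[TextLike[Int, Int]])   // fine from Scala either way
  Parameterized.save(classOf[TextLike[Int, Int]])
}
{code}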
[jira] [Commented] (SPARK-925) Allow ec2 scripts to load default options from a json file
[ https://issues.apache.org/jira/browse/SPARK-925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321810#comment-14321810 ] Nicholas Chammas commented on SPARK-925: Loading config from a file seems like a good thing to have and matches what comparable tools like Ubuntu Juju and MIT StarCluster do. However, I would favor a format other than JSON since JSON doesn't allow you to comment stuff out. I've used tools with JSON-backed config files, and they are super annoying to deal with if you try to alternate invocations between version A with this line uncommented and version B with the same line commented out. YAML seems like a better choice for this task. What do you think [~shayping]? cc [~shivaram] Allow ec2 scripts to load default options from a json file -- Key: SPARK-925 URL: https://issues.apache.org/jira/browse/SPARK-925 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 0.8.0 Reporter: Shay Seng Priority: Minor The option list for the ec2 script can be a little irritating to type in, especially things like the path to the identity-file, region, zone, ami, etc. It would be nice if the ec2 script looked for an options.json file in the following order: (1) PWD, (2) ~/spark-ec2, (3) the same dir as spark_ec2.py. Something like:

def get_defaults_from_options():
    # Check to see if an options.json file exists; if so, load it.
    # However, values in the options.json file may only override values in opts
    # if the opt values are None or '', i.e. command-line options take precedence.
    defaults = {'aws-access-key-id': '', 'aws-secret-access-key': '', 'key-pair': '',
                'identity-file': '', 'region': 'ap-southeast-1', 'zone': '',
                'ami': '', 'slaves': 1, 'instance-type': 'm1.large'}
    # Look for options.json in the directory the cluster was called from.
    # Had to modify the spark_ec2 wrapper script since it mangles the pwd.
    startwd = os.environ['STARTWD']
    if os.path.exists(os.path.join(startwd, 'options.json')):
        optionspath = os.path.join(startwd, 'options.json')
    else:
        optionspath = os.path.join(os.getcwd(), 'options.json')
    try:
        print 'Loading options file:', optionspath
        with open(optionspath) as json_data:
            jdata = json.load(json_data)
            for k in jdata:
                defaults[k] = jdata[k]
    except IOError:
        print 'Warning: options.json file not loaded'
    # Check permissions on identity-file, if defined; otherwise launch will fail late, which is irritating.
    if defaults['identity-file'] != '':
        st = os.stat(defaults['identity-file'])
        user_can_read = bool(st.st_mode & stat.S_IRUSR)
        grp_perms = bool(st.st_mode & stat.S_IRWXG)
        others_perm = bool(st.st_mode & stat.S_IRWXO)
        if not user_can_read:
            print 'No read permission to read', defaults['identity-file']
            sys.exit(1)
        if grp_perms or others_perm:
            print 'Permissions are too open, please chmod 600 file', defaults['identity-file']
            sys.exit(1)
    # If defaults contain an AWS access id or secret key, set them in the environment.
    # Required for use with boto to access the AWS console.
    if defaults['aws-access-key-id'] != '':
        os.environ['AWS_ACCESS_KEY_ID'] = defaults['aws-access-key-id']
    if defaults['aws-secret-access-key'] != '':
        os.environ['AWS_SECRET_ACCESS_KEY'] = defaults['aws-secret-access-key']
    return defaults

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
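To make the YAML point above concrete, here is what such an options file could look like with a commented-out variant, something plain JSON cannot express (file name and values are hypothetical; keys taken from the defaults dict in the ticket):

{code}
# options.yaml
region: ap-southeast-1
instance-type: m1.large
# instance-type: m1.xlarge   # keep the alternative around, commented out
slaves: 1
identity-file: ~/.ssh/spark-ec2.pem
{code}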
[jira] [Commented] (SPARK-5826) JavaStreamingContext.fileStream causes Configuration NotSerializableException
[ https://issues.apache.org/jira/browse/SPARK-5826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321901#comment-14321901 ] Saisai Shao commented on SPARK-5826: Quite curious why DStream#checkpoint() calls {{persist}} internally to cache the RDD, rather than checkpointing the RDD; any comments [~tdas]? Thanks a lot.

{code}
def checkpoint(interval: Duration): DStream[T] = {
  if (isInitialized) {
    throw new UnsupportedOperationException(
      "Cannot change checkpoint interval of an DStream after streaming context has started")
  }
  persist()
  checkpointDuration = interval
  this
}
{code}

JavaStreamingContext.fileStream causes Configuration NotSerializableException Key: SPARK-5826 URL: https://issues.apache.org/jira/browse/SPARK-5826 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Reporter: Littlestar Priority: Minor Attachments: TestStream.java org.apache.spark.streaming.api.java.JavaStreamingContext.fileStream(String directory, Class<LongWritable> kClass, Class<Text> vClass, Class<TextInputFormat> fClass, Function<Path, Boolean> filter, boolean newFilesOnly, Configuration conf) I use JavaStreamingContext.fileStream on 1.3.0/master with a Configuration, but it throws a strange exception. java.io.NotSerializableException: org.apache.hadoop.conf.Configuration at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:440) at org.apache.spark.streaming.DStreamGraph$$anonfun$writeObject$1.apply$mcV$sp(DStreamGraph.scala:177) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1075) at org.apache.spark.streaming.DStreamGraph.writeObject(DStreamGraph.scala:172) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.streaming.CheckpointWriter.write(Checkpoint.scala:184) at org.apache.spark.streaming.scheduler.JobGenerator.doCheckpoint(JobGenerator.scala:278) at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:169) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1$$anonfun$receive$1.applyOrElse(JobGenerator.scala:78) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at
[jira] [Created] (SPARK-5823) Reuse mutable rows for inner structures when parsing JSON objects
Yin Huai created SPARK-5823: --- Summary: Reuse mutable rows for inner structures when parsing JSON objects Key: SPARK-5823 URL: https://issues.apache.org/jira/browse/SPARK-5823 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai With SPARK-5738, we will reuse a mutable row for rows when parsing JSON objects. We can do the same thing for inner structures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
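A rough sketch of the reuse idea, illustrative only (not the Spark SQL internals): keep one mutable row per nesting level and overwrite it for every record, so parsing N objects allocates O(1) row objects rather than O(N).

{code}
// Illustrative sketch of per-level row reuse; MutableRow here is a stand-in,
// not org.apache.spark.sql's internal type.
class MutableRow(size: Int) {
  private val values = new Array[Any](size)
  def update(i: Int, v: Any): Unit = values(i) = v
  def get(i: Int): Any = values(i)
}

object JsonRowReuse {
  private val topRow = new MutableRow(2)   // allocated once
  private val innerRow = new MutableRow(1) // reused for the nested struct

  def parse(record: Map[String, Any]): MutableRow = {
    innerRow(0) = record.get("nested").orNull
    topRow(0) = record.get("a").orNull
    topRow(1) = innerRow
    // The caller must consume (e.g. copy or serialize) the row before the next
    // parse() call, since the same buffers are overwritten each time.
    topRow
  }
}
{code}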
[jira] [Created] (SPARK-5824) CTAS should set null format in hive-0.13.1
Adrian Wang created SPARK-5824: -- Summary: CTAS should set null format in hive-0.13.1 Key: SPARK-5824 URL: https://issues.apache.org/jira/browse/SPARK-5824 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5637) Expose spark_ec2 as a StarCluster Plugin
[ https://issues.apache.org/jira/browse/SPARK-5637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5637. -- Resolution: Won't Fix I think this is simply something that should be done outside the core project. Expose spark_ec2 as a StarCluster Plugin - Key: SPARK-5637 URL: https://issues.apache.org/jira/browse/SPARK-5637 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Alex Rothberg Priority: Minor StarCluster has a lot of features in place for starting EC2 instances, and it would be great to have an option to leverage that as a plugin. See: http://star.mit.edu/cluster/docs/latest/index.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5816) Add huge backward compatibility warning in DriverWrapper
[ https://issues.apache.org/jira/browse/SPARK-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321979#comment-14321979 ] Saisai Shao edited comment on SPARK-5816 at 2/15/15 1:27 PM: - Sorry for the wrong link, I just messed up the JIRA id. Really sorry about this; it seems I cannot delete the comments. was (Author: jerryshao): Sorry for the wrong link, I just messed up the JIRA id. Add huge backward compatibility warning in DriverWrapper Key: SPARK-5816 URL: https://issues.apache.org/jira/browse/SPARK-5816 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Reporter: Andrew Or Assignee: Andrew Or Priority: Critical As of Spark 1.3, we provide backward and forward compatibility in standalone cluster mode through the REST submission gateway. HOWEVER, it still goes through the legacy o.a.s.deploy.DriverWrapper, and the semantics of the command-line arguments there must not change. For instance, this was broken in commit 20a6013106b56a1a1cc3e8cda092330ffbe77cc3. There is currently no warning against that in the class, so we should add one before it's too late. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5820) Example does not work when using SOCKS proxy
[ https://issues.apache.org/jira/browse/SPARK-5820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322022#comment-14322022 ] Eric O. LEBIGOT (EOL) edited comment on SPARK-5820 at 2/15/15 3:25 PM: --- Thank you for investigating this. The proxy server (localhost) does not require authentication. The example (SparkPi) does work when no SOCKS proxy is used (in the OS X Network Preferences). I did not mean to say that my proxy doesn't work for any JVM-based process (I'm not sure how I could test this). :) If you were referring to the three JVM settings that I tried (with the local SOCKS proxy server active), they are simply those of the OS X Java Preferences (Apple > System Preferences > Java > Network Settings). For all JVM proxy settings, ./bin/run-example SparkPi fails in the same way (as in the original report). Here is how to reproduce the problem: 1) Creation of the local SOCKS proxy: ssh -g -N -D 12345 login@remotehost 2) Activation of the proxy in OS X: Apple > System Preferences > Network > Advanced > Proxies, select SOCKS Proxy, set SOCKS Proxy Server to localhost, port 12345. Now, browser connections go through the proxy (this should show in http://www.whatismyip.com/, for example). was (Author: lebigot): Thank you for investigating this. The proxy server (localhost) does not require authentication. The example (SparkPi) does work when no SOCKS proxy is used (in the OS X Network Preferences). I did not mean to say that my proxy doesn't work for any JVM-based process (I'm not sure how I could test this). :) If you were referring to the three JVM settings that I tried (with the local SOCKS proxy server active), they are simply those of the OS X Java Preferences (Apple > System Preferences > Java > Network Settings). For all JVM proxy settings, ./bin/run-example SparkPi fails in the same way (as in the original report). Example does not work when using SOCKS proxy Key: SPARK-5820 URL: https://issues.apache.org/jira/browse/SPARK-5820 Project: Spark Issue Type: Bug Affects Versions: 1.2.1 Reporter: Eric O. LEBIGOT (EOL) When using a SOCKS proxy (on OS X 10.10.2), running even the basic example ./bin/run-example SparkPi 10 fails.
-- Partial log -- 15/02/14 23:23:00 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job 15/02/14 23:23:00 INFO TaskSchedulerImpl: Cancelling stage 0 15/02/14 23:23:00 INFO Executor: Executor is trying to kill task 1.0 in stage 0.0 (TID 1) 15/02/14 23:23:00 INFO TaskSchedulerImpl: Stage 0 was cancelled 15/02/14 23:23:00 INFO DAGScheduler: Job 0 failed: reduce at SparkPi.scala:35, took 1.920223 s Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.net.SocketException: Malformed reply from SOCKS server at java.net.SocksSocketImpl.readSocksReply(SocksSocketImpl.java:129) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:503) at java.net.Socket.connect(Socket.java:579) at sun.net.NetworkClient.doConnect(NetworkClient.java:175) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) at sun.net.www.http.HttpClient.init(HttpClient.java:211) at sun.net.www.http.HttpClient.New(HttpClient.java:308) at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1003) at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:951) at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850) at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:582) at org.apache.spark.util.Utils$.fetchFile(Utils.scala:433) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:356) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:353) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:778) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) at scala.collection.mutable.HashMap.foreach(HashMap.scala:99) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:777) at
[jira] [Commented] (SPARK-5827) Add missing imports in the example of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-5827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321988#comment-14321988 ] Apache Spark commented on SPARK-5827: - User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/4615 Add missing imports in the example of SQLContext Key: SPARK-5827 URL: https://issues.apache.org/jira/browse/SPARK-5827 Project: Spark Issue Type: Documentation Components: SQL Reporter: Takeshi Yamamuro Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5816) Add huge backward compatibility warning in DriverWrapper
[ https://issues.apache.org/jira/browse/SPARK-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321979#comment-14321979 ] Saisai Shao commented on SPARK-5816: Sorry for the wrong link, I just messed up the JIRA id. Add huge backward compatibility warning in DriverWrapper Key: SPARK-5816 URL: https://issues.apache.org/jira/browse/SPARK-5816 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Reporter: Andrew Or Assignee: Andrew Or Priority: Critical As of Spark 1.3, we provide backward and forward compatibility in standalone cluster mode through the REST submission gateway. HOWEVER, it still goes through the legacy o.a.s.deploy.DriverWrapper, and the semantics of the command-line arguments there must not change. For instance, this was broken in commit 20a6013106b56a1a1cc3e8cda092330ffbe77cc3. There is currently no warning against that in the class, so we should add one before it's too late. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5827) Add missing imports in the example of SQLContext
Takeshi Yamamuro created SPARK-5827: --- Summary: Add missing imports in the example of SQLContext Key: SPARK-5827 URL: https://issues.apache.org/jira/browse/SPARK-5827 Project: Spark Issue Type: Documentation Components: SQL Reporter: Takeshi Yamamuro Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5820) Example does not work when using SOCKS proxy
[ https://issues.apache.org/jira/browse/SPARK-5820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322022#comment-14322022 ] Eric O. LEBIGOT (EOL) edited comment on SPARK-5820 at 2/15/15 3:27 PM: --- Thank you for investigating this. The proxy server (localhost) does not require authentication. The example (SparkPi) does work when no SOCKS proxy is used (in the OS X Network Preferences). I did not mean to say that my proxy doesn't work for any JVM-based process (I'm not sure how I could test this). :) If you were referring to the three JVM settings that I tried (with the local SOCKS proxy server active), they are simply those of the OS X Java Preferences (Apple > System Preferences > Java > Network Settings). For all JVM proxy settings, ./bin/run-example SparkPi fails in the same way (as in the original report). Here is how to reproduce the problem: 1) Creation of the local SOCKS proxy: ssh -g -N -D 12345 login@remotehost 2) Activation of the proxy in OS X: Apple > System Preferences > Network > Advanced > Proxies, select SOCKS Proxy, set SOCKS Proxy Server to localhost, port 12345, make sure that Bypassed hosts contains localhost (and 127.0.0.1, for good measure, maybe). Now, connections go through the proxy (this should show in http://www.whatismyip.com/, for example). was (Author: lebigot): Thank you for investigating this. The proxy server (localhost) does not require authentication. The example (SparkPi) does work when no SOCKS proxy is used (in the OS X Network Preferences). I did not mean to say that my proxy doesn't work for any JVM-based process (I'm not sure how I could test this). :) If you were referring to the three JVM settings that I tried (with the local SOCKS proxy server active), they are simply those of the OS X Java Preferences (Apple > System Preferences > Java > Network Settings). For all JVM proxy settings, ./bin/run-example SparkPi fails in the same way (as in the original report). Here is how to reproduce the problem: 1) Creation of the local SOCKS proxy: ssh -g -N -D 12345 login@remotehost 2) Activation of the proxy in OS X: Apple > System Preferences > Network > Advanced > Proxies, select SOCKS Proxy, set SOCKS Proxy Server to localhost, port 12345. Now, browser connections go through the proxy (this should show in http://www.whatismyip.com/, for example). Example does not work when using SOCKS proxy Key: SPARK-5820 URL: https://issues.apache.org/jira/browse/SPARK-5820 Project: Spark Issue Type: Bug Affects Versions: 1.2.1 Reporter: Eric O. LEBIGOT (EOL) When using a SOCKS proxy (on OS X 10.10.2), running even the basic example ./bin/run-example SparkPi 10 fails.
-- Partial log -- 15/02/14 23:23:00 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job 15/02/14 23:23:00 INFO TaskSchedulerImpl: Cancelling stage 0 15/02/14 23:23:00 INFO Executor: Executor is trying to kill task 1.0 in stage 0.0 (TID 1) 15/02/14 23:23:00 INFO TaskSchedulerImpl: Stage 0 was cancelled 15/02/14 23:23:00 INFO DAGScheduler: Job 0 failed: reduce at SparkPi.scala:35, took 1.920223 s Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.net.SocketException: Malformed reply from SOCKS server at java.net.SocksSocketImpl.readSocksReply(SocksSocketImpl.java:129) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:503) at java.net.Socket.connect(Socket.java:579) at sun.net.NetworkClient.doConnect(NetworkClient.java:175) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) at sun.net.www.http.HttpClient.init(HttpClient.java:211) at sun.net.www.http.HttpClient.New(HttpClient.java:308) at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1003) at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:951) at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850) at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:582) at org.apache.spark.util.Utils$.fetchFile(Utils.scala:433) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:356) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:353) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:778) at
[jira] [Comment Edited] (SPARK-5820) Example does not work when using SOCKS proxy
[ https://issues.apache.org/jira/browse/SPARK-5820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322022#comment-14322022 ] Eric O. LEBIGOT (EOL) edited comment on SPARK-5820 at 2/15/15 3:28 PM: --- Thank you for investigating this. The proxy server (localhost) does not require authentication. The example (SparkPi) does work when no SOCKS proxy is used (in the OS X Network Preferences). I did not mean to say that my proxy doesn't work for any JVM-based process (I'm not sure how I could test this). :) If you were referring to the three JVM settings that I tried (with the local SOCKS proxy server active), they are simply those of the OS X Java Preferences (Apple > System Preferences > Java > Network Settings). For all JVM proxy settings, ./bin/run-example SparkPi fails in the same way (as in the original report). Here is how to reproduce the problem: 1) Creation of the local SOCKS proxy: ssh -g -N -D 12345 login@remotehost 2) Activation of the proxy in OS X: Apple > System Preferences > Network > Advanced > Proxies, select SOCKS Proxy, set SOCKS Proxy Server to localhost, port 12345, make sure that Bypassed hosts contains localhost (and 127.0.0.1, for good measure, maybe). Now, connections go through the proxy (this should show in http://www.whatismyip.com/, for example)… and the SparkPi example fails. was (Author: lebigot): Thank you for investigating this. The proxy server (localhost) does not require authentication. The example (SparkPi) does work when no SOCKS proxy is used (in the OS X Network Preferences). I did not mean to say that my proxy doesn't work for any JVM-based process (I'm not sure how I could test this). :) If you were referring to the three JVM settings that I tried (with the local SOCKS proxy server active), they are simply those of the OS X Java Preferences (Apple > System Preferences > Java > Network Settings). For all JVM proxy settings, ./bin/run-example SparkPi fails in the same way (as in the original report). Here is how to reproduce the problem: 1) Creation of the local SOCKS proxy: ssh -g -N -D 12345 login@remotehost 2) Activation of the proxy in OS X: Apple > System Preferences > Network > Advanced > Proxies, select SOCKS Proxy, set SOCKS Proxy Server to localhost, port 12345, make sure that Bypassed hosts contains localhost (and 127.0.0.1, for good measure, maybe). Now, connections go through the proxy (this should show in http://www.whatismyip.com/, for example). Example does not work when using SOCKS proxy Key: SPARK-5820 URL: https://issues.apache.org/jira/browse/SPARK-5820 Project: Spark Issue Type: Bug Affects Versions: 1.2.1 Reporter: Eric O. LEBIGOT (EOL) When using a SOCKS proxy (on OS X 10.10.2), running even the basic example ./bin/run-example SparkPi 10 fails.
-- Partial log -- 15/02/14 23:23:00 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job 15/02/14 23:23:00 INFO TaskSchedulerImpl: Cancelling stage 0 15/02/14 23:23:00 INFO Executor: Executor is trying to kill task 1.0 in stage 0.0 (TID 1) 15/02/14 23:23:00 INFO TaskSchedulerImpl: Stage 0 was cancelled 15/02/14 23:23:00 INFO DAGScheduler: Job 0 failed: reduce at SparkPi.scala:35, took 1.920223 s Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.net.SocketException: Malformed reply from SOCKS server at java.net.SocksSocketImpl.readSocksReply(SocksSocketImpl.java:129) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:503) at java.net.Socket.connect(Socket.java:579) at sun.net.NetworkClient.doConnect(NetworkClient.java:175) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) at sun.net.www.http.HttpClient.init(HttpClient.java:211) at sun.net.www.http.HttpClient.New(HttpClient.java:308) at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1003) at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:951) at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850) at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:582) at org.apache.spark.util.Utils$.fetchFile(Utils.scala:433) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:356) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:353) at
[jira] [Commented] (SPARK-5826) JavaStreamingContext.fileStream causes Configuration NotSerializableException
[ https://issues.apache.org/jira/browse/SPARK-5826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321980#comment-14321980 ] Apache Spark commented on SPARK-5826: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/4612 JavaStreamingContext.fileStream causes Configuration NotSerializableException Key: SPARK-5826 URL: https://issues.apache.org/jira/browse/SPARK-5826 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Reporter: Littlestar Priority: Minor Attachments: TestStream.java org.apache.spark.streaming.api.java.JavaStreamingContext.fileStream(String directory, Class<LongWritable> kClass, Class<Text> vClass, Class<TextInputFormat> fClass, Function<Path, Boolean> filter, boolean newFilesOnly, Configuration conf) I use JavaStreamingContext.fileStream on 1.3.0/master with a Configuration, but it throws a strange exception. java.io.NotSerializableException: org.apache.hadoop.conf.Configuration at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:440) at org.apache.spark.streaming.DStreamGraph$$anonfun$writeObject$1.apply$mcV$sp(DStreamGraph.scala:177) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1075) at org.apache.spark.streaming.DStreamGraph.writeObject(DStreamGraph.scala:172) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at
org.apache.spark.streaming.CheckpointWriter.write(Checkpoint.scala:184) at org.apache.spark.streaming.scheduler.JobGenerator.doCheckpoint(JobGenerator.scala:278) at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:169) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1$$anonfun$receive$1.applyOrElse(JobGenerator.scala:78) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1.aroundReceive(JobGenerator.scala:76) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at
[jira] [Commented] (SPARK-3340) Deprecate ADD_JARS and ADD_FILES
[ https://issues.apache.org/jira/browse/SPARK-3340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322020#comment-14322020 ] Apache Spark commented on SPARK-3340: - User 'azagrebin' has created a pull request for this issue: https://github.com/apache/spark/pull/4616 Deprecate ADD_JARS and ADD_FILES Key: SPARK-3340 URL: https://issues.apache.org/jira/browse/SPARK-3340 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Labels: starter These were introduced before Spark submit even existed. Now that there are many better ways of setting jars and python files through Spark submit, we should deprecate these environment variables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
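To make the migration concrete (paths here are hypothetical): where a user today runs {{ADD_JARS=/path/extra.jar ./bin/spark-shell}}, the spark-submit era equivalent is {{./bin/spark-shell --jars /path/extra.jar}}, and Python files formerly listed in ADD_FILES can be passed with the {{--py-files}} flag. The deprecation warning itself could point at exactly this mapping.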
[jira] [Commented] (SPARK-5820) Example does not work when using SOCKS proxy
[ https://issues.apache.org/jira/browse/SPARK-5820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322022#comment-14322022 ] Eric O. LEBIGOT (EOL) commented on SPARK-5820: -- Thank you for investigating this. The proxy server (localhost) does not require authentication. The example (SparkPi) does work when no SOCKS proxy is used (in the OS X Network Preferences). I did not mean to say that my proxy doesn't work for any JVM-based process (I'm not sure how I could test this). :) If you were referring to the three JVM settings that I tried (with the local SOCKS proxy server active), they are simply those of the OS X Java Preferences (Apple > System Preferences > Java > Network Settings). For all JVM proxy settings, ./bin/run-example SparkPi fails in the same way (as in the original report). Example does not work when using SOCKS proxy Key: SPARK-5820 URL: https://issues.apache.org/jira/browse/SPARK-5820 Project: Spark Issue Type: Bug Affects Versions: 1.2.1 Reporter: Eric O. LEBIGOT (EOL) When using a SOCKS proxy (on OS X 10.10.2), running even the basic example ./bin/run-example SparkPi 10 fails. -- Partial log -- 15/02/14 23:23:00 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job 15/02/14 23:23:00 INFO TaskSchedulerImpl: Cancelling stage 0 15/02/14 23:23:00 INFO Executor: Executor is trying to kill task 1.0 in stage 0.0 (TID 1) 15/02/14 23:23:00 INFO TaskSchedulerImpl: Stage 0 was cancelled 15/02/14 23:23:00 INFO DAGScheduler: Job 0 failed: reduce at SparkPi.scala:35, took 1.920223 s Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.net.SocketException: Malformed reply from SOCKS server at java.net.SocksSocketImpl.readSocksReply(SocksSocketImpl.java:129) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:503) at java.net.Socket.connect(Socket.java:579) at sun.net.NetworkClient.doConnect(NetworkClient.java:175) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) at sun.net.www.http.HttpClient.init(HttpClient.java:211) at sun.net.www.http.HttpClient.New(HttpClient.java:308) at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1003) at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:951) at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850) at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:582) at org.apache.spark.util.Utils$.fetchFile(Utils.scala:433) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:356) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:353) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:778) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) at scala.collection.mutable.HashMap.foreach(HashMap.scala:99) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:777) at
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:353) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
[jira] [Resolved] (SPARK-5827) Add missing imports in the example of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-5827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5827. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4615 [https://github.com/apache/spark/pull/4615] Add missing imports in the example of SQLContext Key: SPARK-5827 URL: https://issues.apache.org/jira/browse/SPARK-5827 Project: Spark Issue Type: Documentation Components: SQL Reporter: Takeshi Yamamuro Priority: Trivial Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5827) Add missing imports in the example of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-5827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5827: - Assignee: Takeshi Yamamuro Add missing imports in the example of SQLContext Key: SPARK-5827 URL: https://issues.apache.org/jira/browse/SPARK-5827 Project: Spark Issue Type: Documentation Components: SQL Reporter: Takeshi Yamamuro Assignee: Takeshi Yamamuro Priority: Trivial Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5820) Example does not work when using SOCKS proxy
[ https://issues.apache.org/jira/browse/SPARK-5820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5820: - Component/s: Examples Priority: Minor (was: Major) Example does not work when using SOCKS proxy Key: SPARK-5820 URL: https://issues.apache.org/jira/browse/SPARK-5820 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 1.2.1 Reporter: Eric O. LEBIGOT (EOL) Priority: Minor When using a SOCKS proxy (on OS X 10.10.2), running even the basic example ./bin/run-example SparkPi 10 fails. -- Partial log -- 15/02/14 23:23:00 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job 15/02/14 23:23:00 INFO TaskSchedulerImpl: Cancelling stage 0 15/02/14 23:23:00 INFO Executor: Executor is trying to kill task 1.0 in stage 0.0 (TID 1) 15/02/14 23:23:00 INFO TaskSchedulerImpl: Stage 0 was cancelled 15/02/14 23:23:00 INFO DAGScheduler: Job 0 failed: reduce at SparkPi.scala:35, took 1.920223 s Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.net.SocketException: Malformed reply from SOCKS server at java.net.SocksSocketImpl.readSocksReply(SocksSocketImpl.java:129) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:503) at java.net.Socket.connect(Socket.java:579) at sun.net.NetworkClient.doConnect(NetworkClient.java:175) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) at sun.net.www.http.HttpClient.init(HttpClient.java:211) at sun.net.www.http.HttpClient.New(HttpClient.java:308) at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1003) at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:951) at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850) at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:582) at org.apache.spark.util.Utils$.fetchFile(Utils.scala:433) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:356) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:353) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:778) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) at scala.collection.mutable.HashMap.foreach(HashMap.scala:99) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:777) at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:353) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) at scala.Option.foreach(Option.scala:245) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at
[jira] [Commented] (SPARK-5745) Allow to use custom TaskMetrics implementation
[ https://issues.apache.org/jira/browse/SPARK-5745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322091#comment-14322091 ] Patrick Wendell commented on SPARK-5745: Hey [~jlewandowski] - TaskMetrics are a mostly internal concept. In fact, there isn't really any nice framework for aggregation internally. We instead have a bunch of manual aggregation in various places. The primary user-facing API we have for aggregated counters is accumulators. Are there features lacking from accumulators that make it difficult for you to use them for your use case? Allow to use custom TaskMetrics implementation -- Key: SPARK-5745 URL: https://issues.apache.org/jira/browse/SPARK-5745 Project: Spark Issue Type: Wish Components: Spark Core Reporter: Jacek Lewandowski Various RDDs can be implemented, and {{TaskMetrics}} provides a great API for collecting metrics and aggregating them. However, some RDDs may want to register custom metrics, and the current implementation doesn't allow for this (for example, the number of rows read, or whatever). I suppose this can be changed without modifying the whole interface - a factory could be used to create the initial {{TaskMetrics}} object, and the default factory could be overridden by the user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
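A sketch of the accumulator route suggested above, assuming the Spark 1.x SparkContext.accumulator API (the path and names here are illustrative):

{code}
import org.apache.spark.{SparkConf, SparkContext}

object RowsReadWithAccumulator {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rows-read"))
    // A named accumulator; the driver reads .value after an action completes.
    val rowsRead = sc.accumulator(0L, "rows read")
    sc.textFile("hdfs:///data/input").foreach { _ =>
      rowsRead += 1L // incremented on executors, merged on the driver
    }
    println("rows read = " + rowsRead.value)
    sc.stop()
  }
}
{code}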
[jira] [Created] (SPARK-5828) Dynamic partition pattern support
Jianshi Huang created SPARK-5828: Summary: Dynamic partition pattern support Key: SPARK-5828 URL: https://issues.apache.org/jira/browse/SPARK-5828 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Jianshi Huang Hi, HCatalog allows you to specify the pattern of paths for partitions, which will be used by dynamic partition loading. https://cwiki.apache.org/confluence/display/Hive/HCatalog+DynamicPartitions#HCatalogDynamicPartitions-ExternalTables Can we have a similar feature in Spark SQL? Thanks, -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
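For reference, the HCatalog feature linked above is driven by a job/table property. The sketch below shows the rough shape from a Spark SQL angle; hiveContext is assumed to exist, nothing equivalent is implemented in Spark yet (that is this request), and the property name and pattern syntax are recalled from the linked wiki page, so treat them as an assumption to verify:

{code}
// Hypothetical sketch only: reusing HCatalog's custom dynamic-partition
// path pattern property; '${year}/${month}/${day}' is the path layout.
hiveContext.sql(
  """ALTER TABLE web_logs SET TBLPROPERTIES (
    |  'hcat.dynamic.partitioning.custom.pattern' = '${year}/${month}/${day}'
    |)""".stripMargin)
{code}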
[jira] [Resolved] (SPARK-5502) User guide for isotonic regression
[ https://issues.apache.org/jira/browse/SPARK-5502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5502. -- Resolution: Fixed Fix Version/s: 1.3.0 User guide for isotonic regression -- Key: SPARK-5502 URL: https://issues.apache.org/jira/browse/SPARK-5502 Project: Spark Issue Type: Documentation Components: Documentation, MLlib Reporter: Xiangrui Meng Assignee: Martin Zapletal Fix For: 1.3.0 Add user guide to docs/mllib-regression.md with code examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
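Since the resolved ticket's scope was a guide with code examples, such an example presumably resembles the following, based on the MLlib 1.3 IsotonicRegression API over (label, feature, weight) triples; sc is an existing SparkContext and the data values are made up:

{code}
import org.apache.spark.mllib.regression.IsotonicRegression

// (label, feature, weight) triples; the model fits a monotonically
// non-decreasing function of the feature to the labels.
val data = sc.parallelize(Seq(
  (1.0, 1.0, 1.0), (2.0, 2.0, 1.0), (1.5, 3.0, 1.0), (3.0, 4.0, 1.0)))

val model = new IsotonicRegression().setIsotonic(true).run(data)
println(model.predict(2.5)) // linearly interpolates between fitted boundaries
{code}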
[jira] [Updated] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-1867: - Component/s: Spark Core Spark Documentation Error causes java.lang.IllegalStateException: unread block data --- Key: SPARK-1867 URL: https://issues.apache.org/jira/browse/SPARK-1867 Project: Spark Issue Type: Bug Components: Spark Core Reporter: sam I've employed two System Administrators on a contract basis (for quite a bit of money), and both contractors have independently hit the following exception. What we are doing is: 1. Installing Spark 0.9.1 according to the documentation on the website, along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. 2. Building a fat jar with a Spark app with sbt, then trying to run it on the cluster. I've also included code snippets and sbt deps at the bottom. When I've Googled this, there seem to be two somewhat vague responses: a) Mismatching spark versions on nodes/user code b) Need to add more jars to the SparkConf Now I know that (b) is not the problem, having successfully run the same code on other clusters while only including one jar (it's a fat jar). But I have no idea how to check for (a) - it appears Spark doesn't have any version checks or anything - it would be nice if it checked versions and threw a mismatching version exception: "you have user code using version X and node Y has version Z". I would be very grateful for advice on this. The exception: Exception in thread main org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to java.lang.IllegalStateException: unread block data [duplicate 59] My code snippet:

val conf = new SparkConf()
  .setMaster(clusterMaster)
  .setAppName(appName)
  .setSparkHome(sparkHome)
  .setJars(SparkContext.jarOfClass(this.getClass))

println("count = " + new SparkContext(conf).textFile(someHdfsPath).count())

My SBT dependencies:

// relevant
"org.apache.spark" % "spark-core_2.10" % "0.9.1",
"org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0",

// standard, probably unrelated
"com.github.seratch" %% "awscala" % "[0.2,)",
"org.scalacheck" %% "scalacheck" % "1.10.1" % "test",
"org.specs2" %% "specs2" % "1.14" % "test",
"org.scala-lang" % "scala-reflect" % "2.10.3",
"org.scalaz" %% "scalaz-core" % "7.0.5",
"net.minidev" % "json-smart" % "1.2"

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5669) Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS
[ https://issues.apache.org/jira/browse/SPARK-5669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5669: - Assignee: Sean Owen Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS Key: SPARK-5669 URL: https://issues.apache.org/jira/browse/SPARK-5669 Project: Spark Issue Type: Bug Components: Build Reporter: Sean Owen Assignee: Sean Owen Priority: Blocker Fix For: 1.3.0 Sorry for Blocker, but it's a license issue. The Spark assembly includes the following, from JBLAS: {code} lib/ lib/static/ lib/static/Mac OS X/ lib/static/Mac OS X/x86_64/ lib/static/Mac OS X/x86_64/libjblas_arch_flavor.jnilib lib/static/Mac OS X/x86_64/sse3/ lib/static/Mac OS X/x86_64/sse3/libjblas.jnilib lib/static/Windows/ lib/static/Windows/x86/ lib/static/Windows/x86/libgfortran-3.dll lib/static/Windows/x86/libgcc_s_dw2-1.dll lib/static/Windows/x86/jblas_arch_flavor.dll lib/static/Windows/x86/sse3/ lib/static/Windows/x86/sse3/jblas.dll lib/static/Windows/amd64/ lib/static/Windows/amd64/libgfortran-3.dll lib/static/Windows/amd64/jblas.dll lib/static/Windows/amd64/libgcc_s_sjlj-1.dll lib/static/Windows/amd64/jblas_arch_flavor.dll lib/static/Linux/ lib/static/Linux/i386/ lib/static/Linux/i386/sse3/ lib/static/Linux/i386/sse3/libjblas.so lib/static/Linux/i386/libjblas_arch_flavor.so lib/static/Linux/amd64/ lib/static/Linux/amd64/sse3/ lib/static/Linux/amd64/sse3/libjblas.so lib/static/Linux/amd64/libjblas_arch_flavor.so {code} Unfortunately the libgfortran and libgcc libraries included for Windows are not licensed in a way that's compatible with Spark and the AL2 -- LGPL at least. It's easy to exclude them. I'm not clear what it does to running on Windows; I assume it can still work but the libs would have to be made available locally and put on the shared library path manually. I don't think there's a package manager as in Linux that would make it easily available. I'm not able to test on Windows. If it doesn't work, the follow-up question is whether that means JBLAS has to be removed on the double, or treated as a known issue for 1.3.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
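For context on "It's easy to exclude them": in an sbt-assembly build, the exclusion could be expressed as a merge-strategy rule that discards the offending native libraries by path. A hedged sketch under that assumption (Spark's actual fix was made in its own build, not with this exact snippet):
{code}
// build.sbt sketch, assuming sbt-assembly >= 0.12: drop the LGPL-licensed
// Windows natives (see the jar listing above) while keeping the rest of JBLAS.
assemblyMergeStrategy in assembly := {
  case PathList("lib", "static", "Windows", _*) => MergeStrategy.discard
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
{code}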
[jira] [Updated] (SPARK-5669) Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS
[ https://issues.apache.org/jira/browse/SPARK-5669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5669: - Target Version/s: 1.3.0, 1.1.2, 1.2.2 Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS Key: SPARK-5669 URL: https://issues.apache.org/jira/browse/SPARK-5669 Project: Spark Issue Type: Bug Components: Build Reporter: Sean Owen Assignee: Sean Owen Priority: Blocker Fix For: 1.3.0 Sorry for Blocker, but it's a license issue. The Spark assembly includes the following, from JBLAS: {code} lib/ lib/static/ lib/static/Mac OS X/ lib/static/Mac OS X/x86_64/ lib/static/Mac OS X/x86_64/libjblas_arch_flavor.jnilib lib/static/Mac OS X/x86_64/sse3/ lib/static/Mac OS X/x86_64/sse3/libjblas.jnilib lib/static/Windows/ lib/static/Windows/x86/ lib/static/Windows/x86/libgfortran-3.dll lib/static/Windows/x86/libgcc_s_dw2-1.dll lib/static/Windows/x86/jblas_arch_flavor.dll lib/static/Windows/x86/sse3/ lib/static/Windows/x86/sse3/jblas.dll lib/static/Windows/amd64/ lib/static/Windows/amd64/libgfortran-3.dll lib/static/Windows/amd64/jblas.dll lib/static/Windows/amd64/libgcc_s_sjlj-1.dll lib/static/Windows/amd64/jblas_arch_flavor.dll lib/static/Linux/ lib/static/Linux/i386/ lib/static/Linux/i386/sse3/ lib/static/Linux/i386/sse3/libjblas.so lib/static/Linux/i386/libjblas_arch_flavor.so lib/static/Linux/amd64/ lib/static/Linux/amd64/sse3/ lib/static/Linux/amd64/sse3/libjblas.so lib/static/Linux/amd64/libjblas_arch_flavor.so {code} Unfortunately the libgfortran and libgcc libraries included for Windows are not licensed in a way that's compatible with Spark and the AL2 -- LGPL at least. It's easy to exclude them. I'm not clear what it does to running on Windows; I assume it can still work but the libs would have to be made available locally and put on the shared library path manually. I don't think there's a package manager as in Linux that would make it easily available. I'm not able to test on Windows. If it doesn't work, the follow-up question is whether that means JBLAS has to be removed on the double, or treated as a known issue for 1.3.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5669) Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS
[ https://issues.apache.org/jira/browse/SPARK-5669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322084#comment-14322084 ] Sean Owen commented on SPARK-5669: -- Since the follow-up change is to remove JBLAS, and that's covered in SPARK-5814, shall we track the remaining work there? Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS Key: SPARK-5669 URL: https://issues.apache.org/jira/browse/SPARK-5669 Project: Spark Issue Type: Bug Components: Build Reporter: Sean Owen Assignee: Sean Owen Priority: Blocker Fix For: 1.3.0 Sorry for Blocker, but it's a license issue. The Spark assembly includes the following, from JBLAS: {code} lib/ lib/static/ lib/static/Mac OS X/ lib/static/Mac OS X/x86_64/ lib/static/Mac OS X/x86_64/libjblas_arch_flavor.jnilib lib/static/Mac OS X/x86_64/sse3/ lib/static/Mac OS X/x86_64/sse3/libjblas.jnilib lib/static/Windows/ lib/static/Windows/x86/ lib/static/Windows/x86/libgfortran-3.dll lib/static/Windows/x86/libgcc_s_dw2-1.dll lib/static/Windows/x86/jblas_arch_flavor.dll lib/static/Windows/x86/sse3/ lib/static/Windows/x86/sse3/jblas.dll lib/static/Windows/amd64/ lib/static/Windows/amd64/libgfortran-3.dll lib/static/Windows/amd64/jblas.dll lib/static/Windows/amd64/libgcc_s_sjlj-1.dll lib/static/Windows/amd64/jblas_arch_flavor.dll lib/static/Linux/ lib/static/Linux/i386/ lib/static/Linux/i386/sse3/ lib/static/Linux/i386/sse3/libjblas.so lib/static/Linux/i386/libjblas_arch_flavor.so lib/static/Linux/amd64/ lib/static/Linux/amd64/sse3/ lib/static/Linux/amd64/sse3/libjblas.so lib/static/Linux/amd64/libjblas_arch_flavor.so {code} Unfortunately the libgfortran and libgcc libraries included for Windows are not licensed in a way that's compatible with Spark and the AL2 -- LGPL at least. It's easy to exclude them. I'm not clear what it does to running on Windows; I assume it can still work but the libs would have to be made available locally and put on the shared library path manually. I don't think there's a package manager as in Linux that would make it easily available. I'm not able to test on Windows. If it doesn't work, the follow-up question is whether that means JBLAS has to be removed on the double, or treated as a known issue for 1.3.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5826) JavaStreamingContext.fileStream cause Configuration NotSerializableException
[ https://issues.apache.org/jira/browse/SPARK-5826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5826: --- Priority: Critical (was: Minor) JavaStreamingContext.fileStream cause Configuration NotSerializableException Key: SPARK-5826 URL: https://issues.apache.org/jira/browse/SPARK-5826 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Reporter: Littlestar Priority: Critical Attachments: TestStream.java org.apache.spark.streaming.api.java.JavaStreamingContext.fileStream(String directory, Class<LongWritable> kClass, Class<Text> vClass, Class<TextInputFormat> fClass, Function<Path, Boolean> filter, boolean newFilesOnly, Configuration conf) I use JavaStreamingContext.fileStream on 1.3.0/master with a Configuration, but it throws a strange exception. java.io.NotSerializableException: org.apache.hadoop.conf.Configuration at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:440) at org.apache.spark.streaming.DStreamGraph$$anonfun$writeObject$1.apply$mcV$sp(DStreamGraph.scala:177) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1075) at org.apache.spark.streaming.DStreamGraph.writeObject(DStreamGraph.scala:172) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.streaming.CheckpointWriter.write(Checkpoint.scala:184) at
org.apache.spark.streaming.scheduler.JobGenerator.doCheckpoint(JobGenerator.scala:278) at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:169) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1$$anonfun$receive$1.applyOrElse(JobGenerator.scala:78) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1.aroundReceive(JobGenerator.scala:76) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at
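For background on why this surfaces in CheckpointWriter: checkpointing serializes the DStream graph with plain Java serialization, and org.apache.hadoop.conf.Configuration is a Hadoop Writable but not java.io.Serializable, so any DStream that holds a raw conf breaks the write. A common pattern inside Spark for this situation is to hold the conf behind org.apache.spark.SerializableWritable; the sketch below illustrates that pattern only and is not the actual patch for this ticket:
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SerializableWritable

// Illustrative sketch: keep the Configuration wrapped so that Java
// serialization round-trips it through Hadoop's Writable machinery.
class CheckpointSafeConf(conf: Configuration) extends Serializable {
  private val wrapped = new SerializableWritable(conf)
  def value: Configuration = wrapped.value
}
{code}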
[jira] [Commented] (SPARK-5669) Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS
[ https://issues.apache.org/jira/browse/SPARK-5669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322082#comment-14322082 ] Xiangrui Meng commented on SPARK-5669: -- PR #4453 resolves this issue for master and branch-1.3. Spark assembly includes incompatibly licensed libgfortran, libgcc code via JBLAS Key: SPARK-5669 URL: https://issues.apache.org/jira/browse/SPARK-5669 Project: Spark Issue Type: Bug Components: Build Reporter: Sean Owen Assignee: Sean Owen Priority: Blocker Fix For: 1.3.0 Sorry for Blocker, but it's a license issue. The Spark assembly includes the following, from JBLAS: {code} lib/ lib/static/ lib/static/Mac OS X/ lib/static/Mac OS X/x86_64/ lib/static/Mac OS X/x86_64/libjblas_arch_flavor.jnilib lib/static/Mac OS X/x86_64/sse3/ lib/static/Mac OS X/x86_64/sse3/libjblas.jnilib lib/static/Windows/ lib/static/Windows/x86/ lib/static/Windows/x86/libgfortran-3.dll lib/static/Windows/x86/libgcc_s_dw2-1.dll lib/static/Windows/x86/jblas_arch_flavor.dll lib/static/Windows/x86/sse3/ lib/static/Windows/x86/sse3/jblas.dll lib/static/Windows/amd64/ lib/static/Windows/amd64/libgfortran-3.dll lib/static/Windows/amd64/jblas.dll lib/static/Windows/amd64/libgcc_s_sjlj-1.dll lib/static/Windows/amd64/jblas_arch_flavor.dll lib/static/Linux/ lib/static/Linux/i386/ lib/static/Linux/i386/sse3/ lib/static/Linux/i386/sse3/libjblas.so lib/static/Linux/i386/libjblas_arch_flavor.so lib/static/Linux/amd64/ lib/static/Linux/amd64/sse3/ lib/static/Linux/amd64/sse3/libjblas.so lib/static/Linux/amd64/libjblas_arch_flavor.so {code} Unfortunately the libgfortran and libgcc libraries included for Windows are not licensed in a way that's compatible with Spark and the AL2 -- LGPL at least. It's easy to exclude them. I'm not clear what it does to running on Windows; I assume it can still work but the libs would have to be made available locally and put on the shared library path manually. I don't think there's a package manager as in Linux that would make it easily available. I'm not able to test on Windows. If it doesn't work, the follow-up question is whether that means JBLAS has to be removed on the double, or treated as a known issue for 1.3.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5815) Deprecate SVDPlusPlus APIs that expose DoubleMatrix from JBLAS
[ https://issues.apache.org/jira/browse/SPARK-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5815: - Assignee: Sean Owen (was: Xiangrui Meng) Deprecate SVDPlusPlus APIs that expose DoubleMatrix from JBLAS -- Key: SPARK-5815 URL: https://issues.apache.org/jira/browse/SPARK-5815 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Sean Owen It is generally bad to expose types defined in a 3rd-party package in Spark public APIs. We should deprecate those methods in SVDPlusPlus and replace them in the next release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
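Mechanically, "deprecate and replace" here means tagging the DoubleMatrix-exposing signatures with Scala's @deprecated and steering callers to a variant built on standard types. A sketch of the pattern only; the methods below are invented for illustration and are not the real SVDPlusPlus API:
{code}
object DeprecationSketch {
  // Stands in for a GraphX method that leaked org.jblas.DoubleMatrix.
  @deprecated("exposes a third-party matrix type; use the Array[Double]-based variant", "1.3.0")
  def legacyFactors(data: Array[Double], cols: Int): Array[Array[Double]] =
    factors(data, cols)

  // Replacement with standard types only; same behavior, no jblas in the signature.
  def factors(data: Array[Double], cols: Int): Array[Array[Double]] =
    data.grouped(cols).toArray
}
{code}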
[jira] [Updated] (SPARK-5816) Add huge backward compatibility warning in DriverWrapper
[ https://issues.apache.org/jira/browse/SPARK-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5816: - Affects Version/s: 1.3.0 Add huge backward compatibility warning in DriverWrapper Key: SPARK-5816 URL: https://issues.apache.org/jira/browse/SPARK-5816 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Affects Versions: 1.3.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical As of Spark 1.3, we provide backward and forward compatibility in standalone cluster mode through the REST submission gateway. HOWEVER, it nevertheless goes through the legacy o.a.s.deploy.DriverWrapper, and the semantics of the command line arguments there must not change. For instance, this was broken in commit 20a6013106b56a1a1cc3e8cda092330ffbe77cc3. There is currently no warning against that in the class and so we should add one before it's too late. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
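Since the ticket is literally about adding a comment, here is a sketch of the shape such a warning might take at the top of the argument parsing; the wording and the argument order shown are illustrative assumptions, not the text of the eventual commit:
{code}
// Sketch only -- illustrative wording, not the actual DriverWrapper source.
//
// WARNING: DriverWrapper is the legacy entry point that the REST submission
// gateway still launches, so the number, order, and meaning of the
// command-line arguments parsed here are a backward/forward compatibility
// contract. Do NOT change them (see the regression introduced in commit
// 20a6013); only append new optional arguments at the end.
object DriverWrapperContractSketch {
  def main(args: Array[String]): Unit = args.toList match {
    case workerUrl :: userJar :: mainClass :: extraArgs => // contract above
      ()
    case _ =>
      System.err.println("Usage: DriverWrapper <workerUrl> <userJar> <driverMainClass> [options]")
      sys.exit(-1)
  }
}
{code}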
[jira] [Commented] (SPARK-5795) api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not friendly to java
[ https://issues.apache.org/jira/browse/SPARK-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321761#comment-14321761 ] Apache Spark commented on SPARK-5795: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4608 api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not friendly to java - Key: SPARK-5795 URL: https://issues.apache.org/jira/browse/SPARK-5795 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Reporter: Littlestar Priority: Critical Attachments: TestStreamCompile.java import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; the following code can't compile in Java: JavaPairDStream<Integer, Integer> rs = ...; rs.saveAsNewAPIHadoopFiles("prefix", "txt", Integer.class, Integer.class, TextOutputFormat.class, jobConf); but similar code in JavaPairRDD works OK: JavaPairRDD<String, String> counts = ...; counts.saveAsNewAPIHadoopFile("out", Text.class, Text.class, TextOutputFormat.class, jobConf); maybe the def saveAsNewAPIHadoopFiles( prefix: String, suffix: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[_ <: NewOutputFormat[_, _]], conf: Configuration = new Configuration) { dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, outputFormatClass, conf) } => def saveAsNewAPIHadoopFiles[F <: NewOutputFormat[_, _]]( prefix: String, suffix: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[F], conf: Configuration = new Configuration) { dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, outputFormatClass, conf) } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5826) JavaStreamingContext.fileStream cause Configuration NotSerializableException
[ https://issues.apache.org/jira/browse/SPARK-5826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321849#comment-14321849 ] Littlestar edited comment on SPARK-5826 at 2/15/15 7:51 AM: testcode upload. It throws an Exception every 2 seconds. 15/02/15 15:50:35 ERROR actor.OneForOneStrategy: org.apache.hadoop.conf.Configuration java.io.NotSerializableException: org.apache.hadoop.conf.Configuration at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:440) at org.apache.spark.streaming.DStreamGraph$$anonfun$writeObject$1.apply$mcV$sp(DStreamGraph.scala:177) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1075) at org.apache.spark.streaming.DStreamGraph.writeObject(DStreamGraph.scala:172) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.streaming.CheckpointWriter.write(Checkpoint.scala:184) at org.apache.spark.streaming.scheduler.JobGenerator.doCheckpoint(JobGenerator.scala:278) at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:169) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1$$anonfun$receive$1.applyOrElse(JobGenerator.scala:78) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1.aroundReceive(JobGenerator.scala:76) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) was (Author: cnstar9988): testcode upload. !TestStream.java! JavaStreamingContext.fileStream cause Configuration NotSerializableException Key: SPARK-5826 URL:
[jira] [Commented] (SPARK-5824) CTAS should set null format in hive-0.13.1
[ https://issues.apache.org/jira/browse/SPARK-5824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321806#comment-14321806 ] Apache Spark commented on SPARK-5824: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/4609 CTAS should set null format in hive-0.13.1 -- Key: SPARK-5824 URL: https://issues.apache.org/jira/browse/SPARK-5824 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5795) api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not friendly to java
[ https://issues.apache.org/jira/browse/SPARK-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321848#comment-14321848 ] Littlestar edited comment on SPARK-5795 at 2/15/15 8:05 AM: I merged pull/4608 and rebuilt; it works for me, thanks. was (Author: cnstar9988): works for me, thanks. api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not friendly to java - Key: SPARK-5795 URL: https://issues.apache.org/jira/browse/SPARK-5795 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Reporter: Littlestar Priority: Critical Attachments: TestStreamCompile.java import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; the following code can't compile in Java: JavaPairDStream<Integer, Integer> rs = ...; rs.saveAsNewAPIHadoopFiles("prefix", "txt", Integer.class, Integer.class, TextOutputFormat.class, jobConf); but similar code in JavaPairRDD works OK: JavaPairRDD<String, String> counts = ...; counts.saveAsNewAPIHadoopFile("out", Text.class, Text.class, TextOutputFormat.class, jobConf); maybe the def saveAsNewAPIHadoopFiles( prefix: String, suffix: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[_ <: NewOutputFormat[_, _]], conf: Configuration = new Configuration) { dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, outputFormatClass, conf) } => def saveAsNewAPIHadoopFiles[F <: NewOutputFormat[_, _]]( prefix: String, suffix: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[F], conf: Configuration = new Configuration) { dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, outputFormatClass, conf) } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5826) JavaStreamingContext.fileStream cause Configuration NotSerializableException
[ https://issues.apache.org/jira/browse/SPARK-5826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Littlestar updated SPARK-5826: -- Attachment: TestStream.java testcode upload. !TestStream.java! JavaStreamingContext.fileStream cause Configuration NotSerializableException Key: SPARK-5826 URL: https://issues.apache.org/jira/browse/SPARK-5826 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Reporter: Littlestar Priority: Minor Attachments: TestStream.java org.apache.spark.streaming.api.java.JavaStreamingContext.fileStream(String directory, Class<LongWritable> kClass, Class<Text> vClass, Class<TextInputFormat> fClass, Function<Path, Boolean> filter, boolean newFilesOnly, Configuration conf) I use JavaStreamingContext.fileStream on 1.3.0/master with a Configuration, but it throws a strange exception. java.io.NotSerializableException: org.apache.hadoop.conf.Configuration at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:440) at org.apache.spark.streaming.DStreamGraph$$anonfun$writeObject$1.apply$mcV$sp(DStreamGraph.scala:177) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1075) at org.apache.spark.streaming.DStreamGraph.writeObject(DStreamGraph.scala:172) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.streaming.CheckpointWriter.write(Checkpoint.scala:184) at
org.apache.spark.streaming.scheduler.JobGenerator.doCheckpoint(JobGenerator.scala:278) at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:169) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1$$anonfun$receive$1.applyOrElse(JobGenerator.scala:78) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1.aroundReceive(JobGenerator.scala:76) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at
[jira] [Updated] (SPARK-5795) api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not friendly to java
[ https://issues.apache.org/jira/browse/SPARK-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5795: - Priority: Critical (was: Minor) Yes, I've seen the same problem and been meaning to do something about it. It makes you do this to use {{JavaPairDStream}}: https://github.com/OryxProject/oryx/blob/master/oryx-lambda/src/main/java/com/cloudera/oryx/lambda/BatchLayer.java#L187 So basically, this is how it's declared now: {code} def saveAsNewAPIHadoopFiles( prefix: String, suffix: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[_ <: NewOutputFormat[_, _]], conf: Configuration = new Configuration) { dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, outputFormatClass, conf) } {code} but this works, and is how it works in {{JavaPairRDD}}: {code} def saveAsNewAPIHadoopFiles[F <: NewOutputFormat[_, _]]( prefix: String, suffix: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[F], conf: Configuration = new Configuration) { dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, outputFormatClass, conf) } {code} I worry about an API change of course, but I think the current API isn't directly callable, so it seems OK to change. For a simple demo, try compiling this: {code} JavaPairDStream<IntWritable, Text> pds = null; pds.saveAsNewAPIHadoopFiles("", "", IntWritable.class, IntWritable.class, SequenceFileOutputFormat.class); {code} The change above makes it work. I'll open a PR. I bumped the priority based on my understanding of the issue. api.java.JavaPairDStream.saveAsNewAPIHadoopFiles may not friendly to java - Key: SPARK-5795 URL: https://issues.apache.org/jira/browse/SPARK-5795 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Reporter: Littlestar Priority: Critical Attachments: TestStreamCompile.java import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; the following code can't compile in Java: JavaPairDStream<Integer, Integer> rs = ...; rs.saveAsNewAPIHadoopFiles("prefix", "txt", Integer.class, Integer.class, TextOutputFormat.class, jobConf); but similar code in JavaPairRDD works OK: JavaPairRDD<String, String> counts = ...; counts.saveAsNewAPIHadoopFile("out", Text.class, Text.class, TextOutputFormat.class, jobConf); maybe the def saveAsNewAPIHadoopFiles( prefix: String, suffix: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[_ <: NewOutputFormat[_, _]], conf: Configuration = new Configuration) { dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, outputFormatClass, conf) } => def saveAsNewAPIHadoopFiles[F <: NewOutputFormat[_, _]]( prefix: String, suffix: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[F], conf: Configuration = new Configuration) { dstream.saveAsNewAPIHadoopFiles(prefix, suffix, keyClass, valueClass, outputFormatClass, conf) } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5826) JavaStreamingContext.fileStream cause Configuration NotSerializableException
[ https://issues.apache.org/jira/browse/SPARK-5826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Littlestar updated SPARK-5826: -- Component/s: Streaming JavaStreamingContext.fileStream cause Configuration NotSerializableException Key: SPARK-5826 URL: https://issues.apache.org/jira/browse/SPARK-5826 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Reporter: Littlestar Priority: Minor org.apache.spark.streaming.api.java.JavaStreamingContext.fileStream(String directory, Class<LongWritable> kClass, Class<Text> vClass, Class<TextInputFormat> fClass, Function<Path, Boolean> filter, boolean newFilesOnly, Configuration conf) I use JavaStreamingContext.fileStream on 1.3.0/master with a Configuration, but it throws a strange exception. java.io.NotSerializableException: org.apache.hadoop.conf.Configuration at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:440) at org.apache.spark.streaming.DStreamGraph$$anonfun$writeObject$1.apply$mcV$sp(DStreamGraph.scala:177) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1075) at org.apache.spark.streaming.DStreamGraph.writeObject(DStreamGraph.scala:172) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.streaming.CheckpointWriter.write(Checkpoint.scala:184) at org.apache.spark.streaming.scheduler.JobGenerator.doCheckpoint(JobGenerator.scala:278) at
org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:169) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1$$anonfun$receive$1.applyOrElse(JobGenerator.scala:78) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1.aroundReceive(JobGenerator.scala:76) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at
[jira] [Updated] (SPARK-5746) Check invalid cases for the write path of data source API
[ https://issues.apache.org/jira/browse/SPARK-5746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-5746: Summary: Check invalid cases for the write path of data source API (was: INSERT OVERWRITE throws FileNotFoundException when the source and destination point to the same table.) Check invalid cases for the write path of data source API - Key: SPARK-5746 URL: https://issues.apache.org/jira/browse/SPARK-5746 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker With the newly introduced write support of data source API, {{JSONRelation}} and {{ParquetRelation2}} both suffer this bug. The root cause is that we removed the source table before insertion ([here|https://github.com/apache/spark/blob/1ac099e3e00ddb01af8e6e3a84c70f8363f04b5c/sql/core/src/main/scala/org/apache/spark/sql/json/JSONRelation.scala#L112-L121]). The correct solution should be first insert into a temporary folder, and then overwrite the source table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5746) Check invalid cases for the write path of data source API
[ https://issues.apache.org/jira/browse/SPARK-5746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-5746: Description: Right now, with the newly introduced write support of data source API, {{JSONRelation}} and {{ParquetRelation2}} both delete data first when the save mode is overwrite ([here|https://github.com/apache/spark/blob/1ac099e3e00ddb01af8e6e3a84c70f8363f04b5c/sql/core/src/main/scala/org/apache/spark/sql/json/JSONRelation.scala#L112-L121]) and this behavior introduces issues when the destination table is an input table of the query. For example {code} INSERT OVERWRITE t SELECT * FROM t {code} We need to add an analysis rule to check cases that are invalid for the write path of data source API. was: With the newly introduced write support of data source API, {{JSONRelation}} and {{ParquetRelation2}} both suffer this bug. The root cause is that we removed the source table before insertion ([here|https://github.com/apache/spark/blob/1ac099e3e00ddb01af8e6e3a84c70f8363f04b5c/sql/core/src/main/scala/org/apache/spark/sql/json/JSONRelation.scala#L112-L121]). The correct solution should be first insert into a temporary folder, and then overwrite the source table. Check invalid cases for the write path of data source API - Key: SPARK-5746 URL: https://issues.apache.org/jira/browse/SPARK-5746 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker Right now, with the newly introduced write support of data source API, {{JSONRelation}} and {{ParquetRelation2}} both delete data first when the save mode is overwrite ([here|https://github.com/apache/spark/blob/1ac099e3e00ddb01af8e6e3a84c70f8363f04b5c/sql/core/src/main/scala/org/apache/spark/sql/json/JSONRelation.scala#L112-L121]) and this behavior introduces issues when the destination table is an input table of the query. For example {code} INSERT OVERWRITE t SELECT * FROM t {code} We need to add an analysis rule to check cases that are invalid for the write path of data source API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
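To make "an analysis rule to check invalid cases" concrete, here is a toy model of the check, under my reading that the rule walks the plan and rejects an overwrite whose target relation also appears as an input. Catalyst's real types are richer; everything below is invented for illustration:
{code}
// Toy model of the proposed analysis check -- not Catalyst code.
sealed trait Plan { def children: Seq[Plan] }
case class Relation(name: String) extends Plan { def children = Nil }
case class Project(child: Plan) extends Plan { def children = Seq(child) }
case class InsertOverwrite(target: Relation, query: Plan) extends Plan {
  def children = Seq(query)
}

def collectRelations(p: Plan): Seq[Relation] = p match {
  case r: Relation => Seq(r)
  case other       => other.children.flatMap(collectRelations)
}

// Rejects e.g. INSERT OVERWRITE t SELECT * FROM t: deleting t's data first
// would destroy the very input the query still needs to read.
def checkWritePath(plan: Plan): Unit = plan match {
  case InsertOverwrite(target, query) if collectRelations(query).contains(target) =>
    throw new IllegalArgumentException(
      s"Cannot overwrite table ${target.name} that is also being read from")
  case _ => ()
}
{code}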
[jira] [Commented] (SPARK-5746) Check invalid cases for the write path of data source API
[ https://issues.apache.org/jira/browse/SPARK-5746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322273#comment-14322273 ] Apache Spark commented on SPARK-5746: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/4617 Check invalid cases for the write path of data source API - Key: SPARK-5746 URL: https://issues.apache.org/jira/browse/SPARK-5746 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker Right now, with the newly introduced write support of data source API, {{JSONRelation}} and {{ParquetRelation2}} both delete data first when the save mode is overwrite ([here|https://github.com/apache/spark/blob/1ac099e3e00ddb01af8e6e3a84c70f8363f04b5c/sql/core/src/main/scala/org/apache/spark/sql/json/JSONRelation.scala#L112-L121]) and this behavior introduces issues when the destination table is an input table of the query. For example {code} INSERT OVERWRITE t SELECT * FROM t {code} We need to add an analysis rule to check cases that are invalid for the write path of data source API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5823) Reuse mutable rows for inner structures when parsing JSON objects
[ https://issues.apache.org/jira/browse/SPARK-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321792#comment-14321792 ] Yanbo Liang commented on SPARK-5823: Hi [~yhuai] Actually, I have implemented reusing the mutable row for inner structures at https://issues.apache.org/jira/browse/SPARK-5738. However, I found that you have mentioned extending Spark SQL's JSON support to handle the case where each object in the dataset might have a considerably different schema. In that scenario, the inner nested mutable row strategy will not yield much performance improvement, am I right? Reuse mutable rows for inner structures when parsing JSON objects - Key: SPARK-5823 URL: https://issues.apache.org/jira/browse/SPARK-5823 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai With SPARK-5738, we will reuse a mutable row for rows when parsing JSON objects. We can do the same thing for inner structures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
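The optimization under discussion, in miniature: rather than allocating a fresh row object per JSON record (and per nested struct inside it), the parser keeps one mutable buffer per schema position and overwrites it for each record. A toy sketch of the idea under that assumption; it is not the JSONRelation code:
{code}
// Toy illustration of row reuse while scanning records -- not Spark SQL code.
// One buffer is allocated up front and overwritten per record, so a scan of
// N records creates O(1) row objects instead of O(N), reducing GC pressure.
final class ReusableRow(arity: Int) {
  val values = new Array[Any](arity)
  def update(i: Int, v: Any): Unit = values(i) = v
}

def parseAll(records: Iterator[Map[String, Any]],
             fields: IndexedSeq[String]): Iterator[ReusableRow] = {
  val row = new ReusableRow(fields.length) // reused across all records
  records.map { rec =>
    var i = 0
    while (i < fields.length) { row(i) = rec.getOrElse(fields(i), null); i += 1 }
    row // callers must consume the row before advancing -- the price of reuse
  }
}
{code}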
[jira] [Commented] (SPARK-5821) [SQL] CTAS command failure when your don't have write permission of the parent directory
[ https://issues.apache.org/jira/browse/SPARK-5821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321824#comment-14321824 ] Apache Spark commented on SPARK-5821: - User 'yanbohappy' has created a pull request for this issue: https://github.com/apache/spark/pull/4610 [SQL] CTAS command failure when your don't have write permission of the parent directory Key: SPARK-5821 URL: https://issues.apache.org/jira/browse/SPARK-5821 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Yanbo Liang When you run CTAS command such as CREATE TEMPORARY TABLE jsonTable USING org.apache.spark.sql.json.DefaultSource OPTIONS ( path /a/b/c/d ) AS SELECT a, b FROM jt, you will run into failure if you don't have write permission for directory /a/b/c whether d is a directory or file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
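One way this failure manifests: Hadoop's FileSystem.delete reports a permission problem by returning false rather than throwing, so a CTAS overwrite can proceed against a directory it never managed to clear. A hedged sketch of a guard using the plain FileSystem API; this is illustrative only, not the code in the PR above:
{code}
import java.io.IOException
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: before overwriting, verify that clearing the target succeeded
// instead of assuming the delete worked.
def prepareOverwrite(pathStr: String, conf: Configuration): Unit = {
  val path = new Path(pathStr)
  val fs = FileSystem.get(path.toUri, conf)
  if (fs.exists(path) && !fs.delete(path, /* recursive = */ true)) {
    throw new IOException(s"Unable to clear output directory $path prior to writing to it")
  }
}
{code}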
[jira] [Comment Edited] (SPARK-5820) Example does not work when using SOCKS proxy
[ https://issues.apache.org/jira/browse/SPARK-5820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321776#comment-14321776 ] Eric O. LEBIGOT (EOL) edited comment on SPARK-5820 at 2/15/15 1:50 AM: --- Good question, I should have added more details: I am running locally (and have a local SOCKS proxy connection to a remote host). In the Java OS X Preferences, I had set the JVM to use the proxy settings from my default browser (the default). I just tried other settings, to no avail (they all fail with the same error): - Use proxy server (Advanced: For all protocols, on and off), Bypass proxy server for local addresses. - Direct connection. It is strange that even the last one fails. was (Author: lebigot): Good question, I should have added more details: I am running locally (and have a local SOCKS proxy connection to a remote host). The JVM uses the proxy settings from my default browser (as indicated in the Java OS X Preferences). I just tried another setting: Use proxy server (Advanced: For all protocols, on and off), Bypass proxy server for local addresses. This does not work either. Example does not work when using SOCKS proxy Key: SPARK-5820 URL: https://issues.apache.org/jira/browse/SPARK-5820 Project: Spark Issue Type: Bug Affects Versions: 1.2.1 Reporter: Eric O. LEBIGOT (EOL) When using a SOCKS proxy (on OS X 10.10.2), running even the basic example ./bin/run-example SparkPi 10 fails. -- Partial log -- 15/02/14 23:23:00 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job 15/02/14 23:23:00 INFO TaskSchedulerImpl: Cancelling stage 0 15/02/14 23:23:00 INFO Executor: Executor is trying to kill task 1.0 in stage 0.0 (TID 1) 15/02/14 23:23:00 INFO TaskSchedulerImpl: Stage 0 was cancelled 15/02/14 23:23:00 INFO DAGScheduler: Job 0 failed: reduce at SparkPi.scala:35, took 1.920223 s Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.net.SocketException: Malformed reply from SOCKS server at java.net.SocksSocketImpl.readSocksReply(SocksSocketImpl.java:129) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:503) at java.net.Socket.connect(Socket.java:579) at sun.net.NetworkClient.doConnect(NetworkClient.java:175) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) at sun.net.www.http.HttpClient.init(HttpClient.java:211) at sun.net.www.http.HttpClient.New(HttpClient.java:308) at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1003) at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:951) at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850) at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:582) at org.apache.spark.util.Utils$.fetchFile(Utils.scala:433) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:356) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:353) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:778) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) at
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) at scala.collection.mutable.HashMap.foreach(HashMap.scala:99) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:777) at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:353) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) at
[jira] [Comment Edited] (SPARK-5820) Example does not work when using SOCKS proxy
[ https://issues.apache.org/jira/browse/SPARK-5820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321776#comment-14321776 ] Eric O. LEBIGOT (EOL) edited comment on SPARK-5820 at 2/15/15 2:23 AM: --- Good question, I should have added more details: I am running locally (and have a local SOCKS proxy connection to a remote host, with localhost bypassed by the proxy, as it should). In the Java OS X Preferences, I had set the JVM to use the proxy settings from my default browser (the default), which is to use the system-wide proxy setting (from Apple > System Preferences > Network > Advanced > Proxies). I just tried the following other JVM settings, to no avail (they all fail with the same error): - Use proxy server (Advanced: For all protocols, on and off), Bypass proxy server for local addresses. - Direct connection. It is strange that even the last one fails. was (Author: lebigot): Good question, I should have added more details: I am running locally (and have a local SOCKS proxy connection to a remote host). In the Java OS X Preferences, I had set the JVM to use the proxy settings from my default browser (the default), which is to use the system-wide proxy setting (from Apple > System Preferences > Network > Advanced > Proxies). I just tried the following other JVM settings, to no avail (they all fail with the same error): - Use proxy server (Advanced: For all protocols, on and off), Bypass proxy server for local addresses. - Direct connection. It is strange that even the last one fails. Example does not work when using SOCKS proxy Key: SPARK-5820 URL: https://issues.apache.org/jira/browse/SPARK-5820 Project: Spark Issue Type: Bug Affects Versions: 1.2.1 Reporter: Eric O. LEBIGOT (EOL) When using a SOCKS proxy (on OS X 10.10.2), running even the basic example ./bin/run-example SparkPi 10 fails.
-- Partial log -- 15/02/14 23:23:00 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job 15/02/14 23:23:00 INFO TaskSchedulerImpl: Cancelling stage 0 15/02/14 23:23:00 INFO Executor: Executor is trying to kill task 1.0 in stage 0.0 (TID 1) 15/02/14 23:23:00 INFO TaskSchedulerImpl: Stage 0 was cancelled 15/02/14 23:23:00 INFO DAGScheduler: Job 0 failed: reduce at SparkPi.scala:35, took 1.920223 s Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.net.SocketException: Malformed reply from SOCKS server at java.net.SocksSocketImpl.readSocksReply(SocksSocketImpl.java:129) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:503) at java.net.Socket.connect(Socket.java:579) at sun.net.NetworkClient.doConnect(NetworkClient.java:175) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) at sun.net.www.http.HttpClient.init(HttpClient.java:211) at sun.net.www.http.HttpClient.New(HttpClient.java:308) at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1003) at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:951) at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:850) at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:582) at org.apache.spark.util.Utils$.fetchFile(Utils.scala:433) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:356) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:353) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:778) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) at scala.collection.mutable.HashMap.foreach(HashMap.scala:99) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:777) at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:353) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at
[jira] [Commented] (SPARK-5813) Spark-ec2: Switch to OracleJDK
[ https://issues.apache.org/jira/browse/SPARK-5813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322208#comment-14322208 ] Florian Verhein commented on SPARK-5813: Good point. I think you're right re: scripting away - I understand it's sometimes done by sysadmins/ops to automate their installation processes in-house, but that is a different situation. Thanks for that. spark_ec2 works by looking up an existing AMI and using it to instantiate EC2 instances. I don't know who currently maintains these. Spark-ec2: Switch to OracleJDK -- Key: SPARK-5813 URL: https://issues.apache.org/jira/browse/SPARK-5813 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Florian Verhein Priority: Minor Currently using OpenJDK; however, it is generally recommended to use the Oracle JDK, esp. for Hadoop deployments, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4865) Include temporary tables in SHOW TABLES
[ https://issues.apache.org/jira/browse/SPARK-4865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322309#comment-14322309 ] Apache Spark commented on SPARK-4865: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/4618 Include temporary tables in SHOW TABLES --- Key: SPARK-4865 URL: https://issues.apache.org/jira/browse/SPARK-4865 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Misha Chernetsov Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
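For a concrete picture of the gap (my reading of the summary, since the ticket body is empty): tables registered with registerTempTable live only in the in-memory catalog, so a SHOW TABLES backed by the Hive metastore does not report them. A sketch of the expected behavior using 1.2-era API names, assuming an existing SQLContext called sqlContext:
{code}
// Illustrative sketch, 1.2-era APIs assumed.
case class Person(name: String, age: Int)
val people = sc.parallelize(Seq(Person("alice", 30)))

import sqlContext.createSchemaRDD // 1.2-era implicit RDD -> SchemaRDD conversion
people.registerTempTable("people")

// Before the fix this listed only metastore tables; with the fix, the
// temporary table "people" should appear as well.
sqlContext.sql("SHOW TABLES").collect().foreach(println)
{code}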
[jira] [Commented] (SPARK-5826) JavaStreamingContext.fileStream cause Configuration NotSerializableException
[ https://issues.apache.org/jira/browse/SPARK-5826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322317#comment-14322317 ] Littlestar commented on SPARK-5826: --- I merged pull/4612 and rebuilt; it works for me, no exception, thanks. JavaStreamingContext.fileStream cause Configuration NotSerializableException Key: SPARK-5826 URL: https://issues.apache.org/jira/browse/SPARK-5826 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Reporter: Littlestar Priority: Critical Attachments: TestStream.java org.apache.spark.streaming.api.java.JavaStreamingContext.fileStream(String directory, Class<LongWritable> kClass, Class<Text> vClass, Class<TextInputFormat> fClass, Function<Path, Boolean> filter, boolean newFilesOnly, Configuration conf) I use JavaStreamingContext.fileStream on 1.3.0/master with a Configuration, but it throws a strange exception. java.io.NotSerializableException: org.apache.hadoop.conf.Configuration at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:440) at org.apache.spark.streaming.DStreamGraph$$anonfun$writeObject$1.apply$mcV$sp(DStreamGraph.scala:177) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1075) at org.apache.spark.streaming.DStreamGraph.writeObject(DStreamGraph.scala:172) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at
org.apache.spark.streaming.CheckpointWriter.write(Checkpoint.scala:184) at org.apache.spark.streaming.scheduler.JobGenerator.doCheckpoint(JobGenerator.scala:278) at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:169) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1$$anonfun$receive$1.applyOrElse(JobGenerator.scala:78) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1.aroundReceive(JobGenerator.scala:76) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at
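For context on the failure above: org.apache.hadoop.conf.Configuration does not implement java.io.Serializable, so Java serialization of any object graph that references it, such as a checkpointed DStream graph, fails this way. A minimal standalone sketch that reproduces the same exception (an illustration only, not the fix from pull/4612):
{noformat}
import java.io.ByteArrayOutputStream;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;

import org.apache.hadoop.conf.Configuration;

public class ConfSerializationDemo {
    public static void main(String[] args) throws Exception {
        ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream());
        try {
            // Configuration is not Serializable, so this write fails with
            // java.io.NotSerializableException: org.apache.hadoop.conf.Configuration
            out.writeObject(new Configuration());
        } catch (NotSerializableException e) {
            System.out.println("reproduced: " + e);
        }
    }
}
{noformat}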
[jira] [Created] (SPARK-5829) JavaStreamingContext.fileStream run task repeated empty when no more new files
Littlestar created SPARK-5829: - Summary: JavaStreamingContext.fileStream run task repeated empty when no more new files Key: SPARK-5829 URL: https://issues.apache.org/jira/browse/SPARK-5829 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Environment: spark master (1.3.0) with SPARK-5826 patch. Reporter: Littlestar spark master (1.3.0) with SPARK-5826 patch. JavaStreamingContext.fileStream runs repeated empty tasks when there are no more new files. Reproduce: 1. mkdir /testspark/watchdir on HDFS. 2. Run the app. 3. Put some text files into /testspark/watchdir. Every 30 seconds the Spark log indicates that a new sub-task runs, and /testspark/resultdir/ gets a new directory with empty files. Even when no new files are added, it still runs a new task with an empty RDD.
{noformat}
package my.test.hadoop.spark;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class TestStream {
    @SuppressWarnings({ "serial", "resource" })
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("TestStream");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(30));
        jssc.checkpoint("/testspark/checkpointdir");
        Configuration jobConf = new Configuration();
        jobConf.set("my.test.fields", "fields");
        JavaPairDStream<Integer, Integer> is = jssc.fileStream("/testspark/watchdir",
                LongWritable.class, Text.class, TextInputFormat.class,
                new Function<Path, Boolean>() {
                    @Override
                    public Boolean call(Path v1) throws Exception {
                        return true;
                    }
                }, true, jobConf)
            .mapToPair(new PairFunction<Tuple2<LongWritable, Text>, Integer, Integer>() {
                @Override
                public Tuple2<Integer, Integer> call(Tuple2<LongWritable, Text> arg0) throws Exception {
                    return new Tuple2<Integer, Integer>(1, 1);
                }
            });
        JavaPairDStream<Integer, Integer> rs = is.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer arg0, Integer arg1) throws Exception {
                return arg0 + arg1;
            }
        });
        rs.checkpoint(Durations.seconds(60));
        rs.saveAsNewAPIHadoopFiles("/testspark/resultdir/output", "suffix",
                Integer.class, Integer.class, TextOutputFormat.class);
        jssc.start();
        jssc.awaitTermination();
    }
}
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
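If the goal on the user side is simply to avoid the empty output directories, one hypothetical workaround (a sketch only, not a fix for the underlying batch scheduling) is to replace the rs.saveAsNewAPIHadoopFiles(...) call in the reproduction above with foreachRDD and skip batches that produced no records:
{noformat}
// Additional imports assumed: org.apache.spark.api.java.JavaPairRDD,
// org.apache.spark.streaming.Time
rs.foreachRDD(new Function2<JavaPairRDD<Integer, Integer>, Time, Void>() {
    @Override
    public Void call(JavaPairRDD<Integer, Integer> rdd, Time time) throws Exception {
        if (rdd.count() > 0) { // only write batches that actually contain records
            rdd.saveAsNewAPIHadoopFile("/testspark/resultdir/output-" + time.milliseconds(),
                    Integer.class, Integer.class, TextOutputFormat.class);
        }
        return null;
    }
});
{noformat}
Note that rdd.count() still forces a job for every batch; this only suppresses the empty output directories.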
[jira] [Commented] (SPARK-5829) JavaStreamingContext.fileStream run task repeated empty when no more new files
[ https://issues.apache.org/jira/browse/SPARK-5829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322321#comment-14322321 ] Littlestar commented on SPARK-5829: --- When I add new files into /testspark/watchdir, it runs a new task with good output. When no new files are added, it still runs a new task with an empty RDD every 30 seconds. (I think there is a bug when no new files are found.) JavaStreamingContext.fileStream run task repeated empty when no more new files -- Key: SPARK-5829 URL: https://issues.apache.org/jira/browse/SPARK-5829 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Environment: spark master (1.3.0) with SPARK-5826 patch. Reporter: Littlestar -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5829) JavaStreamingContext.fileStream run task loop repeated empty when no more new files found
[ https://issues.apache.org/jira/browse/SPARK-5829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Littlestar updated SPARK-5829: -- Summary: JavaStreamingContext.fileStream run task loop repeated empty when no more new files found (was: JavaStreamingContext.fileStream run task repeated empty when no more new files) JavaStreamingContext.fileStream run task loop repeated empty when no more new files found -- Key: SPARK-5829 URL: https://issues.apache.org/jira/browse/SPARK-5829 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Environment: spark master (1.3.0) with SPARK-5826 patch. Reporter: Littlestar -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5829) JavaStreamingContext.fileStream run task loop repeated empty when no more new files found
[ https://issues.apache.org/jira/browse/SPARK-5829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Littlestar updated SPARK-5829: -- Priority: Minor (was: Major) JavaStreamingContext.fileStream run task loop repeated empty when no more new files found -- Key: SPARK-5829 URL: https://issues.apache.org/jira/browse/SPARK-5829 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Environment: spark master (1.3.0) with SPARK-5826 patch. Reporter: Littlestar Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5796) Do not transform data on a last estimator in Pipeline
[ https://issues.apache.org/jira/browse/SPARK-5796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5796. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4590 [https://github.com/apache/spark/pull/4590] Do not transform data on a last estimator in Pipeline - Key: SPARK-5796 URL: https://issues.apache.org/jira/browse/SPARK-5796 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.3.0 Reporter: Peter Rudenko Priority: Minor Fix For: 1.3.0 If it's the last estimator in a Pipeline, there's no need to transform the data, since there's no next stage that would consume it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
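For readers unfamiliar with the pipeline-fitting loop this optimizes, here is a minimal sketch with hypothetical Estimator/Transformer interfaces (Spark's real Pipeline is Scala and richer than this): each fitted stage transforms the data only to feed the next stage, so the transform can be skipped once the last stage has been fitted.
{noformat}
import java.util.ArrayList;
import java.util.List;

interface Transformer { List<double[]> transform(List<double[]> data); }
interface Estimator { Transformer fit(List<double[]> data); }

class PipelineSketch {
    static List<Transformer> fit(List<Estimator> stages, List<double[]> data) {
        List<Transformer> fitted = new ArrayList<>();
        for (int i = 0; i < stages.size(); i++) {
            Transformer t = stages.get(i).fit(data);
            fitted.add(t);
            if (i < stages.size() - 1) {
                // Transform only to feed the next stage; after the last stage
                // nothing downstream would consume the output, so skip it.
                data = t.transform(data);
            }
        }
        return fitted;
    }
}
{noformat}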
[jira] [Resolved] (SPARK-5815) Deprecate SVDPlusPlus APIs that expose DoubleMatrix from JBLAS
[ https://issues.apache.org/jira/browse/SPARK-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5815. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4614 [https://github.com/apache/spark/pull/4614] Deprecate SVDPlusPlus APIs that expose DoubleMatrix from JBLAS -- Key: SPARK-5815 URL: https://issues.apache.org/jira/browse/SPARK-5815 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Sean Owen Fix For: 1.3.0 It is generally bad to expose types defined in a 3rd-party package in Spark public APIs. We should deprecate those methods in SVDPlusPlus and replace them in the next release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
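The deprecation pattern being applied here is the standard one; below is a hedged Java sketch with hypothetical method names (the real SVDPlusPlus API is Scala): keep the old method that leaks the JBLAS type, mark it deprecated, and steer callers to a replacement built on standard types.
{noformat}
import org.jblas.DoubleMatrix;

class ApiDeprecationSketch {
    /** @deprecated exposes the third-party JBLAS type; use {@link #runStandard()}. */
    @Deprecated
    public DoubleMatrix runJblas() {
        return DoubleMatrix.eye(2); // placeholder result
    }

    /** Replacement that exposes only standard types (values in column-major order). */
    public double[] runStandard() {
        return runJblas().toArray();
    }
}
{noformat}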
[jira] [Resolved] (SPARK-5769) Set params in constructor and setParams() in Python ML pipeline API
[ https://issues.apache.org/jira/browse/SPARK-5769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5769. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4564 [https://github.com/apache/spark/pull/4564] Set params in constructor and setParams() in Python ML pipeline API --- Key: SPARK-5769 URL: https://issues.apache.org/jira/browse/SPARK-5769 Project: Spark Issue Type: New Feature Components: ML, PySpark Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.3.0 As discussed in the design doc of SPARK-4586, we want to make Python users happy (no setters/getters) while keeping a low maintenance cost by forcing keyword arguments in the constructor and in setParams. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3866) Clean up python/run-tests problems
[ https://issues.apache.org/jira/browse/SPARK-3866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3866. --- Resolution: Fixed It looks like all of this issue's subtasks have been resolved, so I'm going to mark this as fixed. Clean up python/run-tests problems -- Key: SPARK-3866 URL: https://issues.apache.org/jira/browse/SPARK-3866 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Environment: Mac OS X 10.9.5, Python 2.7.8, Java 1.8.0_20 Reporter: Tomohiko K. Labels: pyspark, testing Attachments: unit-tests.log This is an overhaul issue to remove problems encountered when running ./python/run-tests at commit a85f24accd3266e0f97ee04d03c22b593d99c062. It has sub-tasks for the various kinds of issues. Test output is contained in the attached file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4705) Driver retries in yarn-cluster mode always fail if event logging is enabled
[ https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322344#comment-14322344 ] Twinkle Sachdeva commented on SPARK-4705: - Hi, +1. I will upload the screenshot with these changes. Thanks, Twinkle Driver retries in yarn-cluster mode always fail if event logging is enabled --- Key: SPARK-4705 URL: https://issues.apache.org/jira/browse/SPARK-4705 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Marcelo Vanzin Attachments: Screen Shot 2015-02-10 at 6.27.49 pm.png yarn-cluster mode will retry running the driver in certain failure modes. If event logging is enabled, the retry will most probably fail, because: {noformat} Exception in thread "Driver" java.io.IOException: Log directory hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003 already exists! at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129) at org.apache.spark.util.FileLogger.start(FileLogger.scala:115) at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74) at org.apache.spark.SparkContext.<init>(SparkContext.scala:353) {noformat} The event log path should be more unique. Or perhaps retries of the same app should clean up the old logs first. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
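Both remedies suggested in the report can be sketched against the HDFS FileSystem API; this is a hypothetical illustration (names are invented), not Spark's eventual fix:
{noformat}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class EventLogDirSketch {
    static Path prepareLogDir(String base, String appId, int attempt, Configuration conf)
            throws IOException {
        // Option 1: make the path unique per attempt so retries never collide.
        Path dir = new Path(base, appId + "_attempt" + attempt);
        FileSystem fs = dir.getFileSystem(conf);
        // Option 2: if one path must be reused, clean up logs left by a failed attempt.
        if (fs.exists(dir)) {
            fs.delete(dir, true); // recursive delete of the stale directory
        }
        fs.mkdirs(dir);
        return dir;
    }
}
{noformat}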
[jira] [Comment Edited] (SPARK-5829) JavaStreamingContext.fileStream run task loop repeated empty when no more new files found
[ https://issues.apache.org/jira/browse/SPARK-5829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322408#comment-14322408 ] Littlestar edited comment on SPARK-5829 at 2/16/15 6:05 AM: FileInputDStream.scala
{noformat}
override def compute(validTime: Time): Option[RDD[(K, V)]] = {
  // Find new files
  val newFiles = findNewFiles(validTime.milliseconds)
  logInfo("New files at time " + validTime + ":\n" + newFiles.mkString("\n"))
  batchTimeToSelectedFiles += ((validTime, newFiles))
  recentlySelectedFiles ++= newFiles
  // maybe a check for newFiles.size > 0 here can avoid this problem??
  Some(filesToRDD(newFiles))
}
{noformat}
Thanks. JavaStreamingContext.fileStream run task loop repeated empty when no more new files found -- Key: SPARK-5829 URL: https://issues.apache.org/jira/browse/SPARK-5829 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Environment: spark master (1.3.0) with SPARK-5826 patch. Reporter: Littlestar Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5830) Don't create unnecessary directory for local root dir
Weizhong created SPARK-5830: --- Summary: Don't create unnecessary directory for local root dir Key: SPARK-5830 URL: https://issues.apache.org/jira/browse/SPARK-5830 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Weizhong Priority: Minor Currently an unnecessary directory is created for the local root directory, and it is not deleted after the application exits. For example: before, the tmp dir was created as /tmp/spark-UUID; now the tmp dir is created as /tmp/spark-UUID/spark-UUID, so the dir /tmp/spark-UUID is never deleted as a local root directory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
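The reported behavior can be illustrated with a small standalone sketch (paths and the cleanup mechanism are illustrative only, not Spark's code): if only the inner directory is registered for cleanup, the outer spark-UUID directory survives the application.
{noformat}
import java.io.File;
import java.nio.file.Files;

class NestedTmpDirSketch {
    public static void main(String[] args) throws Exception {
        File root = Files.createTempDirectory("spark-").toFile();                 // /tmp/spark-UUID
        File inner = Files.createTempDirectory(root.toPath(), "spark-").toFile(); // /tmp/spark-UUID/spark-UUID
        inner.deleteOnExit(); // only the inner dir is cleaned up on exit...
        System.out.println("leftover root dir: " + root); // ...so the root remains
    }
}
{noformat}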
[jira] [Commented] (SPARK-5830) Don't create unnecessary directory for local root dir
[ https://issues.apache.org/jira/browse/SPARK-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322428#comment-14322428 ] Apache Spark commented on SPARK-5830: - User 'Sephiroth-Lin' has created a pull request for this issue: https://github.com/apache/spark/pull/4620 Don't create unnecessary directory for local root dir - Key: SPARK-5830 URL: https://issues.apache.org/jira/browse/SPARK-5830 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Weizhong Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5831) When checkpoint file size is bigger than 10, then delete them
[ https://issues.apache.org/jira/browse/SPARK-5831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322433#comment-14322433 ] Apache Spark commented on SPARK-5831: - User 'XuTingjun' has created a pull request for this issue: https://github.com/apache/spark/pull/4621 When checkpoint file size is bigger than 10, then delete them - Key: SPARK-5831 URL: https://issues.apache.org/jira/browse/SPARK-5831 Project: Spark Issue Type: Improvement Components: Streaming Reporter: meiyoula Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
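The issue text is terse, but a plausible reading of the title is a retention policy: once more than 10 checkpoint files exist, delete the oldest ones. A hypothetical sketch of such a policy over an HDFS directory (illustrative names, not the contents of pull request 4621):
{noformat}
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class CheckpointRetentionSketch {
    static void pruneOldCheckpoints(Path dir, int keep, Configuration conf) throws IOException {
        FileSystem fs = dir.getFileSystem(conf);
        FileStatus[] files = fs.listStatus(dir);
        if (files.length <= keep) {
            return; // nothing to prune
        }
        // Sort oldest first, then delete everything beyond the newest `keep` files.
        Arrays.sort(files, Comparator.comparingLong(FileStatus::getModificationTime));
        for (int i = 0; i < files.length - keep; i++) {
            fs.delete(files[i].getPath(), false);
        }
    }
}
{noformat}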
[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322332#comment-14322332 ] Manoj Kumar commented on SPARK-5016: [~mengxr] Can you please clarify a few things? 1. How should the BreezeData be keyed in order to effect parallelization across the k Gaussians (considering the fact that it is a soft assignment)? 2. Even if we are able to do so, there are a few lines of code corresponding to the log-likelihood computation, as pointed out by [~tgaloppo], which are interdependent. How can that be done? GaussianMixtureEM should distribute matrix inverse for large numFeatures, k --- Key: SPARK-5016 URL: https://issues.apache.org/jira/browse/SPARK-5016 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley If numFeatures or k are large, GMM EM should distribute the matrix inverse computation for Gaussian initialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
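On question 1, one simple shape for parallelizing across the k Gaussians is to key the distributed job by Gaussian index rather than by data point, e.g. distributing the k covariance inversions themselves. A hedged sketch of that idea using JBLAS (an illustration of the issue title only, not a design for the EM soft-assignment or log-likelihood updates):
{noformat}
import java.util.List;

import org.apache.spark.api.java.JavaSparkContext;
import org.jblas.DoubleMatrix;
import org.jblas.Solve;

class GmmInitSketch {
    // Invert each of the k covariance matrices as a k-partition cluster job
    // instead of inverting them serially on the driver.
    static List<DoubleMatrix> distributeInverses(JavaSparkContext sc, List<DoubleMatrix> covs) {
        return sc.parallelize(covs, covs.size())
                 .map(c -> Solve.solve(c, DoubleMatrix.eye(c.rows))) // solve C * X = I, i.e. X = C^-1
                 .collect();
    }
}
{noformat}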