[jira] [Created] (SPARK-1945) Add full Java examples in MLlib docs
Matei Zaharia created SPARK-1945: Summary: Add full Java examples in MLlib docs Key: SPARK-1945 URL: https://issues.apache.org/jira/browse/SPARK-1945 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Reporter: Matei Zaharia Right now some of the Java tabs only say the following: All of MLlib’s methods use Java-friendly types, so you can import and call them there the same way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the Spark Java API uses a separate JavaRDD class. You can convert a Java RDD to a Scala one by calling .rdd() on your JavaRDD object. Would be nice to translate the Scala code into Java instead. Also, a few pages (most notably the Matrix one) don't have Java examples at all. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-1153) Generalize VertexId in GraphX so that UUIDs can be used as vertex IDs.
[ https://issues.apache.org/jira/browse/SPARK-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14010846#comment-14010846 ] npanj edited comment on SPARK-1153 at 5/28/14 6:48 AM: --- An alternative approach that I have been using: 1. Use a preprocessing step that maps each UUID to a Long. 2. Build the graph based on Longs. For the mapping in step 1: - Rank your UUIDs. - Some kind of hash function? For 1, GraphX can provide a tool to generate the map. I would like to hear how others are building graphs out of non-Long node types. was (Author: npanj): An alternative approach, that I have been using: 1 Use a preprocessing step that maps UUID to an Long. 2. Build graph based on Longs For Mapping in step 1: - Rank your uuids. - some kind of has function? For 1, graphx can provide a tool to generate map. I will like to hear how others are building graphs out of non-Long node types Generalize VertexId in GraphX so that UUIDs can be used as vertex IDs. -- Key: SPARK-1153 URL: https://issues.apache.org/jira/browse/SPARK-1153 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 0.9.0 Reporter: Deepak Nulu Currently, {{VertexId}} is a type-synonym for {{Long}}. I would like to be able to use {{UUID}} as the vertex ID type because the data I want to process with GraphX uses that type for its primary keys. Others might have a different type for their primary keys. Generalizing {{VertexId}} (with a type class) will help in such cases.
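The preprocessing step described in the comment can be illustrated without GraphX. The helper below is hypothetical (not part of GraphX or Spark): it assigns each distinct UUID a dense Long id in arrival order, which could then serve as the vertex ID when building the graph over Longs. A distributed version would need something like zipWithUniqueId instead of a single shared map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical preprocessing helper (not part of GraphX): assigns each
// distinct UUID a dense Long id so the graph itself can be built over Longs.
public class VertexIdMapper {
    private final Map<UUID, Long> ids = new HashMap<>();
    private long next = 0L;

    // Returns a stable Long id for the UUID, allocating the next one if unseen.
    public long idFor(UUID u) {
        return ids.computeIfAbsent(u, k -> next++);
    }

    public static void main(String[] args) {
        VertexIdMapper mapper = new VertexIdMapper();
        UUID a = UUID.randomUUID();
        UUID b = UUID.randomUUID();
        System.out.println(mapper.idFor(a)); // 0
        System.out.println(mapper.idFor(b)); // 1
        System.out.println(mapper.idFor(a)); // 0 again: the mapping is stable
    }
}
```

This is the simplest of the two schemes mentioned above (dense ranking rather than hashing); hashing avoids the shared map but risks collisions in the 64-bit space.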
[jira] [Created] (SPARK-1946) Submit stage after executors have been registered
Zhihui created SPARK-1946: - Summary: Submit stage after executors have been registered Key: SPARK-1946 URL: https://issues.apache.org/jira/browse/SPARK-1946 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Zhihui Because creating the TaskSetManager and registering executors are asynchronous, in most situations early stages' tasks run without preferred locality. A simple workaround is to sleep for a few seconds in the application so that executors have enough time to register. A better way is to make the DAGScheduler submit a stage only after enough executors have been registered, controlled by configuration properties:
# submit the stage only after the ratio of successfully registered executors reaches this value; default value 0
spark.executor.registeredRatio = 0.8
# whether or not registeredRatio has been reached, submit the stage after maxRegisteredWaitingTime (milliseconds); default value 1
spark.executor.maxRegisteredWaitingTime = 5000
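The proposal above amounts to a small predicate: submit once the registered fraction reaches the ratio, or once the waiting time elapses, whichever comes first. The sketch below is illustrative only (the class is invented here, and the two property names are proposals in this ticket, not released Spark configuration):

```java
// A sketch (not Spark's actual DAGScheduler code) of the gating logic the
// ticket proposes. The property names spark.executor.registeredRatio and
// spark.executor.maxRegisteredWaitingTime are only proposals in this issue.
public class StageGate {
    public final double registeredRatio; // proposed spark.executor.registeredRatio
    public final long maxWaitMillis;     // proposed spark.executor.maxRegisteredWaitingTime
    public final long startMillis;       // when the application started waiting

    public StageGate(double registeredRatio, long maxWaitMillis, long startMillis) {
        this.registeredRatio = registeredRatio;
        this.maxWaitMillis = maxWaitMillis;
        this.startMillis = startMillis;
    }

    // The stage may be submitted once the fraction of registered executors
    // reaches the ratio, or once the maximum waiting time has elapsed.
    public boolean maySubmit(int registered, int expected, long nowMillis) {
        boolean ratioReached = expected > 0
                && (double) registered / expected >= registeredRatio;
        boolean timedOut = nowMillis - startMillis >= maxWaitMillis;
        return ratioReached || timedOut;
    }

    public static void main(String[] args) {
        StageGate gate = new StageGate(0.8, 5000, 0);
        System.out.println(gate.maySubmit(3, 10, 1000)); // false: 30% registered, still waiting
        System.out.println(gate.maySubmit(8, 10, 1000)); // true: ratio reached
        System.out.println(gate.maySubmit(3, 10, 6000)); // true: timed out
    }
}
```

The timeout arm matters: without it, a cluster that never reaches the ratio (e.g. lost nodes) would block the stage forever.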
[jira] [Updated] (SPARK-1946) Submit stage after executors have been registered
[ https://issues.apache.org/jira/browse/SPARK-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihui updated SPARK-1946: -- Attachment: Spark Task Scheduler Optimization Proposal.pptx Submit stage after executors have been registered - Key: SPARK-1946 URL: https://issues.apache.org/jira/browse/SPARK-1946 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Zhihui Attachments: Spark Task Scheduler Optimization Proposal.pptx Because creating the TaskSetManager and registering executors are asynchronous, in most situations early stages' tasks run without preferred locality. A simple workaround is to sleep for a few seconds in the application so that executors have enough time to register. A better way is to make the DAGScheduler submit a stage only after enough executors have been registered, controlled by configuration properties:
# submit the stage only after the ratio of successfully registered executors reaches this value; default value 0
spark.executor.registeredRatio = 0.8
# whether or not registeredRatio has been reached, submit the stage after maxRegisteredWaitingTime (milliseconds); default value 1
spark.executor.maxRegisteredWaitingTime = 5000
[jira] [Updated] (SPARK-1946) Submit stage after executors have been registered
[ https://issues.apache.org/jira/browse/SPARK-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihui updated SPARK-1946: -- Description: Because creating the TaskSetManager and registering executors are asynchronous, in most situations early stages' tasks run without preferred locality. A simple workaround is to sleep for a few seconds in the application so that executors have enough time to register. A better way is to make the DAGScheduler submit a stage only after enough executors have been registered, controlled by configuration properties:
\# submit the stage only after the ratio of successfully registered executors reaches this value; default value 0
spark.executor.registeredRatio = 0.8
\# whether or not registeredRatio has been reached, submit the stage after maxRegisteredWaitingTime (milliseconds); default value 1
spark.executor.maxRegisteredWaitingTime = 5000
was: Because creating TaskSetManager and registering executors are asynchronous, in most situation, early stages' tasks run without preferred locality. A simple solution is sleeping few seconds in application, so that executors have enough time to register. A better way is to make DAGScheduler submit stage after a few of executors have been registered by configuration properties. # submit stage only after successfully registered executors arrived the ratio, default value 0 spark.executor.registeredRatio = 0.8 # whatever registeredRatio is arrived, submit stage after the maxRegisteredWaitingTime(millisecond), default value 1 spark.executor.maxRegisteredWaitingTime = 5000
Submit stage after executors have been registered - Key: SPARK-1946 URL: https://issues.apache.org/jira/browse/SPARK-1946 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Zhihui Attachments: Spark Task Scheduler Optimization Proposal.pptx Because creating the TaskSetManager and registering executors are asynchronous, in most situations early stages' tasks run without preferred locality. A simple workaround is to sleep for a few seconds in the application so that executors have enough time to register. A better way is to make the DAGScheduler submit a stage only after enough executors have been registered, controlled by configuration properties:
\# submit the stage only after the ratio of successfully registered executors reaches this value; default value 0
spark.executor.registeredRatio = 0.8
\# whether or not registeredRatio has been reached, submit the stage after maxRegisteredWaitingTime (milliseconds); default value 1
spark.executor.maxRegisteredWaitingTime = 5000
[jira] [Commented] (SPARK-1495) support leftsemijoin for sparkSQL
[ https://issues.apache.org/jira/browse/SPARK-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14010932#comment-14010932 ] Adrian Wang commented on SPARK-1495: Another PR has been submitted: [https://github.com/apache/spark/pull/837]. support leftsemijoin for sparkSQL - Key: SPARK-1495 URL: https://issues.apache.org/jira/browse/SPARK-1495 Project: Spark Issue Type: Improvement Components: SQL Reporter: Adrian Wang Fix For: 1.1.0 I created GitHub PR #395 for this issue: [https://github.com/apache/spark/pull/395]. As marmbrus comments there, one design question is which of the following is better: 1. multiple operators that handle different kinds of joins, letting the planner pick the correct one; 2. putting the switching logic inside the operator, as is done here.
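For readers unfamiliar with the operation being requested: a left semi join keeps each left row at most once, if and only if it has at least one match on the right; right rows contribute nothing but their existence. A minimal stand-alone illustration of those semantics (this is not Spark SQL's operator, just the contract it must satisfy):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LeftSemiJoin {
    // Left-semi-join on plain keys: keep each left row that has at least one
    // match on the right; matching right rows contribute only their existence.
    public static <K> List<K> leftSemiJoin(List<K> left, List<K> right) {
        Set<K> rightKeys = new HashSet<>(right);
        List<K> out = new ArrayList<>();
        for (K row : left) {
            if (rightKeys.contains(row)) {
                out.add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> left = Arrays.asList(1, 2, 3);
        List<Integer> right = Arrays.asList(2, 2, 3, 4);
        // 2 matches two right rows but is emitted only once.
        System.out.println(leftSemiJoin(left, right)); // [2, 3]
    }
}
```

The no-duplication property is what distinguishes a semi join from an inner join followed by projecting the left columns, and is the behavior either design option in the comment has to implement.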
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14010937#comment-14010937 ] Sean Owen commented on SPARK-1518: -- Re: versioning one more time, really supporting a bunch of versions may get costly. It's already tricky to manage two builds, times YARN-or-not, times Hive-or-not, times 4 flavors of Hadoop. I doubt the assemblies are yet problem-free in all cases. In practice it looks like one generic Hadoop 1, one Hadoop 2, and one CDH 4 release are produced, and one set of Maven artifacts. (PS: again, I am not sure Spark should contain a CDH-specific distribution, realizing it's really a proxy for a particular Hadoop combo. The same goes for a MapR profile, which is really for vendors to maintain.) That means right now you can't build a Spark app for anything but Hadoop 1.x with Maven, without installing it yourself, and there's not an official distro for anything but two major Hadoop versions. Support for niche versions isn't really there or promised anyway, and fleshing it out may prove pretty burdensome. There is no suggested action here; if anything I suggest that the right thing is to add Maven artifacts with classifiers, add a few binary artifacts, and subtract a few vendor artifacts, but this is a different action. Spark master doesn't compile against hadoop-common trunk Key: SPARK-1518 URL: https://issues.apache.org/jira/browse/SPARK-1518 Project: Spark Issue Type: Bug Reporter: Marcelo Vanzin Assignee: Colin Patrick McCabe Priority: Critical FSDataOutputStream::sync() has disappeared from trunk in Hadoop; FileLogger.scala is calling it. I've changed it locally to hsync() so I can compile the code, but haven't checked yet whether those are equivalent. hsync() seems to have been there forever, so it hopefully works with all versions Spark cares about.
[jira] [Commented] (SPARK-1947) Child of SumDistinct or Average should be widened to prevent overflows the same as Sum.
[ https://issues.apache.org/jira/browse/SPARK-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14010945#comment-14010945 ] Takuya Ueshin commented on SPARK-1947: -- PRed: https://github.com/apache/spark/pull/902 Child of SumDistinct or Average should be widened to prevent overflows the same as Sum. --- Key: SPARK-1947 URL: https://issues.apache.org/jira/browse/SPARK-1947 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin Child of {{SumDistinct}} or {{Average}} should be widened to prevent overflows the same as {{Sum}}.
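The overflow this ticket guards against is easy to demonstrate: summing two Ints can wrap around even though each operand fits, while widening the operands first preserves the exact value. A stand-alone illustration of the principle (not Spark SQL code; the method names here are invented for the example):

```java
public class WidenSum {
    // Accumulating in the narrow type wraps around on overflow.
    public static int narrowSum(int a, int b) {
        return a + b;
    }

    // Widening the operands first keeps the mathematically exact value;
    // this mirrors why the child of the aggregate should be widened.
    public static long widenedSum(int a, int b) {
        return (long) a + (long) b;
    }

    public static void main(String[] args) {
        int big = Integer.MAX_VALUE;
        System.out.println(narrowSum(big, big));  // -2: silent overflow
        System.out.println(widenedSum(big, big)); // 4294967294: exact
    }
}
```

Average needs the same treatment as Sum because it accumulates a running sum internally before dividing.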
[jira] [Created] (SPARK-1948) Scalac crashes when building Spark in IntelliJ IDEA
Cheng Lian created SPARK-1948: - Summary: Scalac crashes when building Spark in IntelliJ IDEA Key: SPARK-1948 URL: https://issues.apache.org/jira/browse/SPARK-1948 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0 Reporter: Cheng Lian Priority: Minor After [commit 0be8b45|https://github.com/apache/spark/commit/0be8b45], the master branch fails to compile within IntelliJ IDEA and causes {{scalac}} to crash, although building Spark with SBT is OK. This issue is not blocking, but it's annoying since it prevents developers from debugging Spark within IDEA. I can't figure out the exact cause; I have only nailed it down to this commit by binary search. Maybe I should file a bug with IDEA instead? How to reproduce: # Check out [commit 0be8b45|https://github.com/apache/spark/commit/0be8b45] # Run {{sbt/sbt clean gen-idea}} under the Spark source directory # Open the project in IntelliJ IDEA # Build the project The {{scalac}} crash report is attached.
[jira] [Updated] (SPARK-1948) Scalac crashes when building Spark in IntelliJ IDEA
[ https://issues.apache.org/jira/browse/SPARK-1948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-1948: -- Attachment: scalac-crash.log Scalac crashes when building Spark in IntelliJ IDEA --- Key: SPARK-1948 URL: https://issues.apache.org/jira/browse/SPARK-1948 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0 Reporter: Cheng Lian Priority: Minor Attachments: scalac-crash.log After [commit 0be8b45|https://github.com/apache/spark/commit/0be8b45], the master branch fails to compile within IntelliJ IDEA and causes {{scalac}} to crash, although building Spark with SBT is OK. This issue is not blocking, but it's annoying since it prevents developers from debugging Spark within IDEA. I can't figure out the exact cause; I have only nailed it down to this commit by binary search. Maybe I should file a bug with IDEA instead? How to reproduce: # Check out [commit 0be8b45|https://github.com/apache/spark/commit/0be8b45] # Run {{sbt/sbt clean gen-idea}} under the Spark source directory # Open the project in IntelliJ IDEA # Build the project The {{scalac}} crash report is attached.
[jira] [Updated] (SPARK-1951) spark on yarn can't start
[ https://issues.apache.org/jira/browse/SPARK-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-1951: --- Description: {{HADOOP_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit --archives /input/lbs/recommend/toona/spark/conf toona-assembly.jar 20140521}} throws an exception:
{code}
Exception in thread "main" java.io.FileNotFoundException: File file:/input/lbs/recommend/toona/spark/conf does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:402)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
    at org.apache.spark.deploy.yarn.ClientBase$class.org$apache$spark$deploy$yarn$ClientBase$$copyRemoteFile(ClientBase.scala:162)
    at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4$$anonfun$apply$2.apply(ClientBase.scala:237)
    at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4$$anonfun$apply$2.apply(ClientBase.scala:232)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4.apply(ClientBase.scala:232)
    at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4.apply(ClientBase.scala:230)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:230)
    at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:39)
    at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:74)
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:96)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:186)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}
{{HADOOP_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit --archives hdfs://10dian72:8020/input/lbs/recommend/toona/spark/conf toona-assembly.jar 20140521}} works.
was: {{HADOOP_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit --archives /input/lbs/recommend/toona/spark/conf toona-assembly.jar 20140521}}throw an exception:
{code}
Exception in thread "main" java.io.FileNotFoundException: File file:/input/lbs/recommend/toona/spark/conf does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:402)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
    at org.apache.spark.deploy.yarn.ClientBase$class.org$apache$spark$deploy$yarn$ClientBase$$copyRemoteFile(ClientBase.scala:162)
    at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4$$anonfun$apply$2.apply(ClientBase.scala:237)
    at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4$$anonfun$apply$2.apply(ClientBase.scala:232)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4.apply(ClientBase.scala:232)
    at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4.apply(ClientBase.scala:230)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:230)
    at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:39) at
[jira] [Created] (SPARK-1950) spark on yarn can't start
Guoqiang Li created SPARK-1950: -- Summary: spark on yarn can't start Key: SPARK-1950 URL: https://issues.apache.org/jira/browse/SPARK-1950 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Guoqiang Li Priority: Blocker {{HADOOP_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit --archives /input/lbs/recommend/toona/spark/conf toona-assembly.jar 20140521}} throws an exception:
{code}
Exception in thread "main" java.io.FileNotFoundException: File file:/input/lbs/recommend/toona/spark/conf does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:402)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
    at org.apache.spark.deploy.yarn.ClientBase$class.org$apache$spark$deploy$yarn$ClientBase$$copyRemoteFile(ClientBase.scala:162)
    at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4$$anonfun$apply$2.apply(ClientBase.scala:237)
    at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4$$anonfun$apply$2.apply(ClientBase.scala:232)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4.apply(ClientBase.scala:232)
    at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4.apply(ClientBase.scala:230)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:230)
    at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:39)
    at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:74)
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:96)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:186)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}
{{HADOOP_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit --archives hdfs://10dian72:8020/input/lbs/recommend/toona/spark/conf toona-assembly.jar 20140521}} works.
[jira] [Commented] (SPARK-1950) spark on yarn can't start
[ https://issues.apache.org/jira/browse/SPARK-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011330#comment-14011330 ] Sean Owen commented on SPARK-1950: -- (Looks like you opened this twice? https://issues.apache.org/jira/browse/SPARK-1951 ) spark on yarn can't start -- Key: SPARK-1950 URL: https://issues.apache.org/jira/browse/SPARK-1950 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Guoqiang Li Priority: Blocker {{HADOOP_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit --archives /input/lbs/recommend/toona/spark/conf toona-assembly.jar 20140521}} throws an exception:
{code}
Exception in thread "main" java.io.FileNotFoundException: File file:/input/lbs/recommend/toona/spark/conf does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:402)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
    at org.apache.spark.deploy.yarn.ClientBase$class.org$apache$spark$deploy$yarn$ClientBase$$copyRemoteFile(ClientBase.scala:162)
    at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4$$anonfun$apply$2.apply(ClientBase.scala:237)
    at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4$$anonfun$apply$2.apply(ClientBase.scala:232)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4.apply(ClientBase.scala:232)
    at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4.apply(ClientBase.scala:230)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:230)
    at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:39)
    at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:74)
    at org.apache.spark.deploy.yarn.Client.run(Client.scala:96)
    at org.apache.spark.deploy.yarn.Client$.main(Client.scala:186)
    at org.apache.spark.deploy.yarn.Client.main(Client.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}
{{HADOOP_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit --archives hdfs://10dian72:8020/input/lbs/recommend/toona/spark/conf toona-assembly.jar 20140521}} works.
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011350#comment-14011350 ] Colin Patrick McCabe commented on SPARK-1518: - bq. Re: versioning one more time, really supporting a bunch of versions may get costly. It's already tricky to manage two builds times YARN-or-not, Hive-or-not, times 4 flavors of Hadoop. I doubt the assemblies are yet problem-free in all cases. I think in this particular case, we can use reflection to support both Hadoop 1.X and newer stuff. bq. I am not sure Spark should contain a CDH-specific distribution? realizing it's really a proxy for a particular Hadoop combo. Same goes for a MapR profile, which is really for vendors to maintain) I agree 100%. We should keep vendor stuff out of the Apache release. Vendors can create their own build setups (that's what they get paid to do, after all.) bq. There is no suggested action here; if anything I suggest that the right thing is to add Maven artifacts with classifiers, add a few binary artifacts, subtract a few vendor artifacts, but this is a different action. If you have some ideas for how to improve the Maven build, it could be worth creating a JIRA. I think you're right that we need to make it more flexible so that people can build against more versions without editing the pom. It might be helpful to look at how HBase handles this in its {{pom.xml}} files. Spark master doesn't compile against hadoop-common trunk Key: SPARK-1518 URL: https://issues.apache.org/jira/browse/SPARK-1518 Project: Spark Issue Type: Bug Reporter: Marcelo Vanzin Assignee: Colin Patrick McCabe Priority: Critical FSDataOutputStream::sync() has disappeared from trunk in Hadoop; FileLogger.scala is calling it. I've changed it locally to hsync() so I can compile the code, but haven't checked yet whether those are equivalent. hsync() seems to have been there forever, so it hopefully works with all versions Spark cares about. 
[jira] [Updated] (SPARK-1916) SparkFlumeEvent with body bigger than 1020 bytes are not read properly
[ https://issues.apache.org/jira/browse/SPARK-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1916: --- Assignee: David Lemieux SparkFlumeEvent with body bigger than 1020 bytes are not read properly -- Key: SPARK-1916 URL: https://issues.apache.org/jira/browse/SPARK-1916 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0 Reporter: David Lemieux Assignee: David Lemieux Attachments: SPARK-1916.diff The readExternal implementation on SparkFlumeEvent reads only the first 1020 bytes of the actual body when streaming data from Flume. This means that any event sent to Spark via Flume is processed properly if the body is small, but fails if the body is bigger than 1020 bytes. Considering that the default max size for a Flume Avro event is 32K, the implementation should be updated to read more. The following is related: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-using-Flume-body-size-limitation-tt6127.html
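The general shape of this class of bug can be shown without Spark or Flume. The sketch below is illustrative only (it is not SparkFlumeEvent's actual readExternal code, and the class and method names are invented): a read bounded by a small fixed buffer truncates large bodies, whereas allocating for the declared body length and using readFully consumes every byte.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class ReadBody {
    // Truncating pattern: a read bounded by `limit` stops after at most
    // `limit` bytes even when the body is longer — the symptom described
    // in the ticket (illustrative only; not SparkFlumeEvent's code).
    public static byte[] readAtMost(InputStream in, int limit) throws IOException {
        byte[] buf = new byte[limit];
        int n = in.read(buf);
        return Arrays.copyOf(buf, Math.max(n, 0));
    }

    // Full read: size the buffer from the declared body length and let
    // readFully loop until every byte has arrived.
    public static byte[] readAll(DataInputStream in, int length) throws IOException {
        byte[] body = new byte[length];
        in.readFully(body);
        return body;
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[32 * 1024]; // 32K, Flume's default max event size
        Arrays.fill(payload, (byte) 7);
        byte[] truncated = readAtMost(new ByteArrayInputStream(payload), 1020);
        byte[] full = readAll(new DataInputStream(new ByteArrayInputStream(payload)), payload.length);
        System.out.println(truncated.length); // 1020
        System.out.println(full.length);      // 32768
    }
}
```

DataInput.readFully throws EOFException if the stream ends early, so a short body is detected rather than silently padded.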
[jira] [Commented] (SPARK-1836) REPL $outer type mismatch causes lookup() and equals() problems
[ https://issues.apache.org/jira/browse/SPARK-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011477#comment-14011477 ] Michael Armbrust commented on SPARK-1836: - Yeah, I think it's likely they are related. We can re-open this one later if fixing the other one doesn't solve your issue. REPL $outer type mismatch causes lookup() and equals() problems --- Key: SPARK-1836 URL: https://issues.apache.org/jira/browse/SPARK-1836 Project: Spark Issue Type: Bug Affects Versions: 0.9.0 Reporter: Michael Malak Anand Avati partially traced the cause to the REPL wrapping classes in $outer classes. There are at least two major symptoms:
1. equals(): In the REPL, equals() (required in custom classes used as a key for groupByKey) seems to have to be written using isInstanceOf[] instead of the canonical match{}.
Spark Shell (equals uses match{}):
{noformat}
class C(val s: String) extends Serializable {
  override def equals(o: Any) = o match {
    case that: C => that.s == s
    case _ => false
  }
}
val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x); val b = bos.toByteArray(); out.close; bos.close
val y = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)
res: Boolean = false
{noformat}
Spark Shell (equals uses isInstanceOf[]):
{noformat}
class C(val s: String) extends Serializable {
  override def equals(o: Any) = if (o.isInstanceOf[C]) (o.asInstanceOf[C].s == s) else false
}
val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x); val b = bos.toByteArray(); out.close; bos.close
val y = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)
res: Boolean = true
{noformat}
Scala Shell (equals uses match{}):
{noformat}
class C(val s: String) extends Serializable {
  override def equals(o: Any) = o match {
    case that: C => that.s == s
    case _ => false
  }
}
val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x); val b = bos.toByteArray(); out.close; bos.close
val y = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)
res: Boolean = true
{noformat}
2. lookup():
{noformat}
class C(val s: String) extends Serializable {
  override def equals(o: Any) = if (o.isInstanceOf[C]) o.asInstanceOf[C].s == s else false
  override def hashCode = s.hashCode
  override def toString = s
}
val r = sc.parallelize(Array((new C("a"), 11), (new C("a"), 12)))
r.lookup(new C("a"))
<console>:17: error: type mismatch;
 found   : C
 required: C
       r.lookup(new C("a"))
                ^
{noformat}
See http://mail-archives.apache.org/mod_mbox/spark-dev/201405.mbox/%3C1400019424.80629.YahooMailNeo%40web160801.mail.bf1.yahoo.com%3E
[jira] [Commented] (SPARK-1199) Type mismatch in Spark shell when using case class defined in shell
[ https://issues.apache.org/jira/browse/SPARK-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011492#comment-14011492 ] Michael Malak commented on SPARK-1199: -- See also additional test cases in https://issues.apache.org/jira/browse/SPARK-1836 which has now been marked as a duplicate. Type mismatch in Spark shell when using case class defined in shell --- Key: SPARK-1199 URL: https://issues.apache.org/jira/browse/SPARK-1199 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: Andrew Kerr Priority: Critical Fix For: 1.1.0 Define a class in the shell:
{code}
case class TestClass(a: String)
{code}
and an RDD
{code}
val data = sc.parallelize(Seq("a")).map(TestClass(_))
{code}
define a function on it and map over the RDD
{code}
def itemFunc(a: TestClass): TestClass = a
data.map(itemFunc)
{code}
Error:
{code}
<console>:19: error: type mismatch;
 found   : TestClass => TestClass
 required: TestClass => ?
       data.map(itemFunc)
{code}
Similarly with a mapPartitions:
{code}
def partitionFunc(a: Iterator[TestClass]): Iterator[TestClass] = a
data.mapPartitions(partitionFunc)
{code}
{code}
<console>:19: error: type mismatch;
 found   : Iterator[TestClass] => Iterator[TestClass]
 required: Iterator[TestClass] => Iterator[?]
Error occurred in an application involving default arguments.
       data.mapPartitions(partitionFunc)
{code}
The behavior is the same whether in local mode or on a cluster. This isn't specific to RDDs. A Scala collection in the Spark shell has the same problem.
{code}
scala> Seq(TestClass("foo")).map(itemFunc)
<console>:15: error: type mismatch;
 found   : TestClass => TestClass
 required: TestClass => ?
       Seq(TestClass("foo")).map(itemFunc)
                             ^
{code}
When run in the Scala console (not the Spark shell) there are no type mismatch errors.
[jira] [Resolved] (SPARK-1836) REPL $outer type mismatch causes lookup() and equals() problems
[ https://issues.apache.org/jira/browse/SPARK-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Malak resolved SPARK-1836. -- Resolution: Duplicate REPL $outer type mismatch causes lookup() and equals() problems --- Key: SPARK-1836 URL: https://issues.apache.org/jira/browse/SPARK-1836 Project: Spark Issue Type: Bug Affects Versions: 0.9.0 Reporter: Michael Malak Anand Avati partially traced the cause to the REPL wrapping classes in $outer classes. There are at least two major symptoms: 1. equals(): In the REPL, equals() (required in custom classes used as a key for groupByKey) seems to have to be written using isInstanceOf[] instead of the canonical match{}. Spark Shell (equals uses match{}):
{noformat}
class C(val s: String) extends Serializable {
  override def equals(o: Any) = o match {
    case that: C => that.s == s
    case _ => false
  }
}
val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x); val b = bos.toByteArray(); out.close; bos.close
val y = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)
res: Boolean = false
{noformat}
Spark Shell (equals uses isInstanceOf[]):
{noformat}
class C(val s: String) extends Serializable {
  override def equals(o: Any) = if (o.isInstanceOf[C]) o.asInstanceOf[C].s == s else false
}
val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x); val b = bos.toByteArray(); out.close; bos.close
val y = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)
res: Boolean = true
{noformat}
Scala Shell (equals uses match{}):
{noformat}
class C(val s: String) extends Serializable {
  override def equals(o: Any) = o match {
    case that: C => that.s == s
    case _ => false
  }
}
val x = new C("a")
val bos = new java.io.ByteArrayOutputStream()
val out = new java.io.ObjectOutputStream(bos)
out.writeObject(x); val b = bos.toByteArray(); out.close; bos.close
val y = new java.io.ObjectInputStream(new java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
x.equals(y)
res: Boolean = true
{noformat}
2. lookup():
{noformat}
class C(val s: String) extends Serializable {
  override def equals(o: Any) = if (o.isInstanceOf[C]) o.asInstanceOf[C].s == s else false
  override def hashCode = s.hashCode
  override def toString = s
}
val r = sc.parallelize(Array((new C("a"), 11), (new C("a"), 12)))
r.lookup(new C("a"))
<console>:17: error: type mismatch;
 found   : C
 required: C
       r.lookup(new C("a"))
                ^
{noformat}
See http://mail-archives.apache.org/mod_mbox/spark-dev/201405.mbox/%3C1400019424.80629.YahooMailNeo%40web160801.mail.bf1.yahoo.com%3E -- This message was sent by Atlassian JIRA (v6.2#6252)
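The "Scala Shell" control case above condenses into a standalone script with a small round-trip helper (the helper name is ours, not from the issue); outside the REPL, the canonical match-based equals survives serialization:

```scala
import java.io._

// Match-based equals, as in the report's transcripts.
class C(val s: String) extends Serializable {
  override def equals(o: Any): Boolean = o match {
    case that: C => that.s == s
    case _       => false
  }
  override def hashCode: Int = s.hashCode
}

// Serialize and deserialize through in-memory buffers.
def roundTrip[T <: Serializable](x: T): T = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(x); oos.close()
  new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
    .readObject().asInstanceOf[T]
}

val x = new C("a")
assert(x == roundTrip(x)) // holds outside the Spark shell; the report shows it failing inside
```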
[jira] [Resolved] (SPARK-1936) Add apache header and remove author tags
[ https://issues.apache.org/jira/browse/SPARK-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1936. -- Resolution: Won't Fix We should not change these files' license headers because they're files we've modified from the Scala interpreter. We mention that we use modified versions of these in our LICENSE, but we can't misrepresent the original copyright. Add apache header and remove author tags Key: SPARK-1936 URL: https://issues.apache.org/jira/browse/SPARK-1936 Project: Spark Issue Type: Bug Reporter: Devaraj K Priority: Minor These below files don’t have apache header and contain author tags. {code:xml} spark\repl\src\main\scala\org\apache\spark\repl\SparkExprTyper.scala spark\repl\src\main\scala\org\apache\spark\repl\SparkILoop.scala spark\repl\src\main\scala\org\apache\spark\repl\SparkILoopInit.scala spark\repl\src\main\scala\org\apache\spark\repl\SparkIMain.scala spark\repl\src\main\scala\org\apache\spark\repl\SparkImports.scala spark\repl\src\main\scala\org\apache\spark\repl\SparkJLineCompletion.scala spark\repl\src\main\scala\org\apache\spark\repl\SparkJLineReader.scala spark\repl\src\main\scala\org\apache\spark\repl\SparkMemberHandlers.scala {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1790) Update EC2 scripts to support r3 instance types
[ https://issues.apache.org/jira/browse/SPARK-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011548#comment-14011548 ] Matei Zaharia commented on SPARK-1790: -- Thanks Sujeet! Just post here when you have a pull request to fix it. Update EC2 scripts to support r3 instance types --- Key: SPARK-1790 URL: https://issues.apache.org/jira/browse/SPARK-1790 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 0.9.0, 1.0.0, 0.9.1 Reporter: Matei Zaharia Labels: Starter These were recently added by Amazon as a cheaper high-memory option -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1790) Update EC2 scripts to support r3 instance types
[ https://issues.apache.org/jira/browse/SPARK-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1790: - Labels: Starter (was: starter) Update EC2 scripts to support r3 instance types --- Key: SPARK-1790 URL: https://issues.apache.org/jira/browse/SPARK-1790 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 0.9.0, 1.0.0, 0.9.1 Reporter: Matei Zaharia Labels: Starter These were recently added by Amazon as a cheaper high-memory option -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1952) slf4j version conflicts with pig
[ https://issues.apache.org/jira/browse/SPARK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Compton updated SPARK-1952: Description: Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when they register a jar containing Spark. The error appears to be related to org.slf4j.spi.LocationAwareLogger.log. {code} Caused by: java.lang.RuntimeException: Could not resolve error that occured when launching map reduce job: java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598) at java.lang.Thread.dispatchUncaughtException(Thread.java:1874) {code} To reproduce: compile Spark via ```$ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt assembly``` and register the resulting jar into a pig script. E.g. ``` REGISTER /usr/share/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar; data0 = LOAD 'data' USING PigStorage(); ttt = LIMIT data0 10; DUMP ttt; ``` The Spark-1.0 jar includes some slf4j dependencies that were not present in 0.9.1 ``` rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar | grep -i slf | grep LocationAware 3259 Mon Mar 25 21:49:34 PDT 2013 org/apache/commons/logging/impl/SLF4JLocationAwareLog.class 455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class 479 Fri Dec 13 16:44:40 PST 2013 parquet/org/slf4j/spi/LocationAwareLogger.class ``` vs. 
``` rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf spark-assembly-0.9.1-hadoop0.20.2-cdh3u3.jar | grep -i slf | grep LocationAware 455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class ``` was: Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when they register a jar containing Spark. The error appears to be related to org.slf4j.spi.LocationAwareLogger.log. ``` Caused by: java.lang.RuntimeException: Could not resolve error that occured when launching map reduce job: java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598) at java.lang.Thread.dispatchUncaughtException(Thread.java:1874) ``` To reproduce: compile Spark via ```$ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt assembly``` and register the resulting jar into a pig script. E.g. ``` REGISTER /usr/share/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar; data0 = LOAD 'data' USING PigStorage(); ttt = LIMIT data0 10; DUMP ttt; ``` The Spark-1.0 jar includes some slf4j dependencies that were not present in 0.9.1 ``` rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar | grep -i slf | grep LocationAware 3259 Mon Mar 25 21:49:34 PDT 2013 org/apache/commons/logging/impl/SLF4JLocationAwareLog.class 455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class 479 Fri Dec 13 16:44:40 PST 2013 parquet/org/slf4j/spi/LocationAwareLogger.class ``` vs. 
``` rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf spark-assembly-0.9.1-hadoop0.20.2-cdh3u3.jar | grep -i slf | grep LocationAware 455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class ``` slf4j version conflicts with pig Key: SPARK-1952 URL: https://issues.apache.org/jira/browse/SPARK-1952 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: pig 12.1 on Cloudera Hadoop, CDH3 Reporter: Ryan Compton Labels: pig, slf4j Fix For: 1.0.0 Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when they register a jar containing Spark. The error appears to be related to org.slf4j.spi.LocationAwareLogger.log. {code} Caused by: java.lang.RuntimeException: Could not resolve error that occured when launching map reduce job: java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598) at java.lang.Thread.dispatchUncaughtException(Thread.java:1874) {code} To reproduce: compile Spark via ```$ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt assembly``` and
[jira] [Updated] (SPARK-1952) slf4j version conflicts with pig
[ https://issues.apache.org/jira/browse/SPARK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Compton updated SPARK-1952: Description: Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when they register a jar containing Spark. The error appears to be related to org.slf4j.spi.LocationAwareLogger.log. {code} Caused by: java.lang.RuntimeException: Could not resolve error that occured when launching map reduce job: java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598) at java.lang.Thread.dispatchUncaughtException(Thread.java:1874) {code} To reproduce: compile Spark via $ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt assembly and register the resulting jar into a pig script. E.g. {code} REGISTER /usr/share/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar; data0 = LOAD 'data' USING PigStorage(); ttt = LIMIT data0 10; DUMP ttt; {code} The Spark-1.0 jar includes some slf4j dependencies that were not present in 0.9.1 {code} rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar | grep -i slf | grep LocationAware 3259 Mon Mar 25 21:49:34 PDT 2013 org/apache/commons/logging/impl/SLF4JLocationAwareLog.class 455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class 479 Fri Dec 13 16:44:40 PST 2013 parquet/org/slf4j/spi/LocationAwareLogger.class {code} vs. 
{code} rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf spark-assembly-0.9.1-hadoop0.20.2-cdh3u3.jar | grep -i slf | grep LocationAware 455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class {code} was: Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when they register a jar containing Spark. The error appears to be related to org.slf4j.spi.LocationAwareLogger.log. {code} Caused by: java.lang.RuntimeException: Could not resolve error that occured when launching map reduce job: java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598) at java.lang.Thread.dispatchUncaughtException(Thread.java:1874) {code} To reproduce: compile Spark via ```$ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt assembly``` and register the resulting jar into a pig script. E.g. {code} REGISTER /usr/share/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar; data0 = LOAD 'data' USING PigStorage(); ttt = LIMIT data0 10; DUMP ttt; {code} The Spark-1.0 jar includes some slf4j dependencies that were not present in 0.9.1 {code} rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar | grep -i slf | grep LocationAware 3259 Mon Mar 25 21:49:34 PDT 2013 org/apache/commons/logging/impl/SLF4JLocationAwareLog.class 455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class 479 Fri Dec 13 16:44:40 PST 2013 parquet/org/slf4j/spi/LocationAwareLogger.class {code} vs. 
{code} rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf spark-assembly-0.9.1-hadoop0.20.2-cdh3u3.jar | grep -i slf | grep LocationAware 455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class {code} slf4j version conflicts with pig Key: SPARK-1952 URL: https://issues.apache.org/jira/browse/SPARK-1952 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: pig 12.1 on Cloudera Hadoop, CDH3 Reporter: Ryan Compton Labels: pig, slf4j Fix For: 1.0.0 Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when they register a jar containing Spark. The error appears to be related to org.slf4j.spi.LocationAwareLogger.log. {code} Caused by: java.lang.RuntimeException: Could not resolve error that occured when launching map reduce job: java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598) at java.lang.Thread.dispatchUncaughtException(Thread.java:1874) {code} To reproduce: compile Spark via $
[jira] [Updated] (SPARK-1952) slf4j version conflicts with pig
[ https://issues.apache.org/jira/browse/SPARK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Compton updated SPARK-1952: Description: Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when they register a jar containing Spark. The error appears to be related to org.slf4j.spi.LocationAwareLogger.log. {code} Caused by: java.lang.RuntimeException: Could not resolve error that occured when launching map reduce job: java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598) at java.lang.Thread.dispatchUncaughtException(Thread.java:1874) {code} To reproduce: compile Spark via ```$ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt assembly``` and register the resulting jar into a pig script. E.g. {code} REGISTER /usr/share/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar; data0 = LOAD 'data' USING PigStorage(); ttt = LIMIT data0 10; DUMP ttt; {code} The Spark-1.0 jar includes some slf4j dependencies that were not present in 0.9.1 {code} rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar | grep -i slf | grep LocationAware 3259 Mon Mar 25 21:49:34 PDT 2013 org/apache/commons/logging/impl/SLF4JLocationAwareLog.class 455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class 479 Fri Dec 13 16:44:40 PST 2013 parquet/org/slf4j/spi/LocationAwareLogger.class {code} vs. 
{code} rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf spark-assembly-0.9.1-hadoop0.20.2-cdh3u3.jar | grep -i slf | grep LocationAware 455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class {code} was: Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when they register a jar containing Spark. The error appears to be related to org.slf4j.spi.LocationAwareLogger.log. {code} Caused by: java.lang.RuntimeException: Could not resolve error that occured when launching map reduce job: java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598) at java.lang.Thread.dispatchUncaughtException(Thread.java:1874) {code} To reproduce: compile Spark via ```$ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt assembly``` and register the resulting jar into a pig script. E.g. ``` REGISTER /usr/share/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar; data0 = LOAD 'data' USING PigStorage(); ttt = LIMIT data0 10; DUMP ttt; ``` The Spark-1.0 jar includes some slf4j dependencies that were not present in 0.9.1 ``` rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar | grep -i slf | grep LocationAware 3259 Mon Mar 25 21:49:34 PDT 2013 org/apache/commons/logging/impl/SLF4JLocationAwareLog.class 455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class 479 Fri Dec 13 16:44:40 PST 2013 parquet/org/slf4j/spi/LocationAwareLogger.class ``` vs. 
``` rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf spark-assembly-0.9.1-hadoop0.20.2-cdh3u3.jar | grep -i slf | grep LocationAware 455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class ``` slf4j version conflicts with pig Key: SPARK-1952 URL: https://issues.apache.org/jira/browse/SPARK-1952 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: pig 12.1 on Cloudera Hadoop, CDH3 Reporter: Ryan Compton Labels: pig, slf4j Fix For: 1.0.0 Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when they register a jar containing Spark. The error appears to be related to org.slf4j.spi.LocationAwareLogger.log. {code} Caused by: java.lang.RuntimeException: Could not resolve error that occured when launching map reduce job: java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598) at java.lang.Thread.dispatchUncaughtException(Thread.java:1874) {code} To reproduce: compile Spark via ```$ SPARK_HADOOP_VERSION=0.20.2-cdh3u4
[jira] [Created] (SPARK-1954) Make it easier to get Spark on YARN code to compile in IntelliJ
Sandy Ryza created SPARK-1954: - Summary: Make it easier to get Spark on YARN code to compile in IntelliJ Key: SPARK-1954 URL: https://issues.apache.org/jira/browse/SPARK-1954 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Sandy Ryza When loading a project through a Maven pom, IntelliJ allows switching on profiles, but, to my knowledge, doesn't provide a way to set arbitrary properties. To get Spark-on-YARN code to compile in IntelliJ, I need to manually change the hadoop.version in the root pom.xml to 2.2.0 or higher. This is very cumbersome when switching branches. It would be really helpful to add a profile that sets the Hadoop version that IntelliJ can switch on. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1955) VertexRDD can incorrectly assume index sharing
Ankur Dave created SPARK-1955: - Summary: VertexRDD can incorrectly assume index sharing Key: SPARK-1955 URL: https://issues.apache.org/jira/browse/SPARK-1955 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 0.9.0, 1.0.0, 0.9.1 Reporter: Ankur Dave Assignee: Ankur Dave Priority: Minor Many VertexRDD operations (diff, leftJoin, innerJoin) can use a fast zip join if both operands are VertexRDDs sharing the same index (i.e., one operand is derived from the other). However, this check is implemented by matching on the operand type and using the fast join strategy if it is a VertexRDD. When the two VertexRDDs have the same partitioner but different indexes, this is fine, because each VertexPartition will detect the index mismatch and fall back to the slow but correct local join strategy. However, when they have different numbers of partitions or different partition functions, an exception or even silently incorrect results can occur. For example:
{code}
// Construct VertexRDDs with different numbers of partitions
val a = VertexRDD(sc.parallelize(List((0L, 1), (1L, 2)), 1))
val b = VertexRDD(sc.parallelize(List((0L, 5)), 8))
// Try to join them. Appears to work...
val c = a.innerJoin(b) { (vid, x, y) => x + y }
// ... but then fails with java.lang.IllegalArgumentException:
// Can't zip RDDs with unequal numbers of partitions
c.collect
{code}
{code}
import org.apache.spark._
// Construct VertexRDDs with different partition functions
val a = VertexRDD(sc.parallelize(List((0L, 1), (1L, 2))).partitionBy(new HashPartitioner(2)))
val bVerts = sc.parallelize(List((1L, 5)))
val b = VertexRDD(bVerts.partitionBy(new RangePartitioner(2, bVerts)))
// Try to join them. We expect (1L, 7).
val c = a.innerJoin(b) { (vid, x, y) => x + y }
// Silent failure: we get an empty set!
c.collect
{code}
VertexRDD should check equality of partitioners before using the fast zip join. If the partitioners are different, the two datasets should be automatically co-partitioned. 
-- This message was sent by Atlassian JIRA (v6.2#6252)
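The proposed fix reduces to a guard on partitioner equality. A minimal standalone model of that guard (names are illustrative, not the actual GraphX internals):

```scala
// Model of the proposed check: take the fast zip-join path only when both
// sides agree on a defined partitioner; otherwise co-partition one side first.
// `HashPart` stands in for a Spark Partitioner; structural equality of case
// classes mirrors HashPartitioner.equals comparing partition counts.
case class HashPart(numPartitions: Int)

def canZipJoin(a: Option[HashPart], b: Option[HashPart]): Boolean =
  a.isDefined && a == b

assert(canZipJoin(Some(HashPart(2)), Some(HashPart(2))))  // same partitioner: fast path
assert(!canZipJoin(Some(HashPart(2)), Some(HashPart(8)))) // mismatched counts: repartition
assert(!canZipJoin(None, Some(HashPart(2))))              // unknown partitioning: repartition
```

Matching on the operand type alone, as the current code does, answers "is it a VertexRDD?" when the question that matters is "do the partitioners match?".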
[jira] [Resolved] (SPARK-1501) Assertions in Graph.apply test are never executed
[ https://issues.apache.org/jira/browse/SPARK-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-1501. --- Resolution: Fixed Assignee: William Benton Assertions in Graph.apply test are never executed - Key: SPARK-1501 URL: https://issues.apache.org/jira/browse/SPARK-1501 Project: Spark Issue Type: Test Components: GraphX Affects Versions: 1.0.0 Reporter: William Benton Assignee: William Benton Priority: Minor Labels: test The current Graph.apply test in GraphSuite contains assertions within an RDD transformation. These never execute because the transformation never executes. I have a (trivial) patch to fix this by collecting the graph triplets first. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1952) slf4j version conflicts with pig
[ https://issues.apache.org/jira/browse/SPARK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011711#comment-14011711 ] Matei Zaharia commented on SPARK-1952: -- Ryan, do you know what SLF4J version Pig needs? It might be possible to just build Spark with an older one for this release. Also, did you build your Spark version with Hive? That might be bringing in these dependencies. slf4j version conflicts with pig Key: SPARK-1952 URL: https://issues.apache.org/jira/browse/SPARK-1952 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: pig 12.1 on Cloudera Hadoop, CDH3 Reporter: Ryan Compton Labels: pig, slf4j Fix For: 1.0.0 Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when they register a jar containing Spark. The error appears to be related to org.slf4j.spi.LocationAwareLogger.log. {code} Caused by: java.lang.RuntimeException: Could not resolve error that occured when launching map reduce job: java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598) at java.lang.Thread.dispatchUncaughtException(Thread.java:1874) {code} To reproduce: compile Spark via $ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt assembly and register the resulting jar into a pig script. E.g. 
{code} REGISTER /usr/share/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar; data0 = LOAD 'data' USING PigStorage(); ttt = LIMIT data0 10; DUMP ttt; {code} The Spark-1.0 jar includes some slf4j dependencies that were not present in 0.9.1 {code} rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar | grep -i slf | grep LocationAware 3259 Mon Mar 25 21:49:34 PDT 2013 org/apache/commons/logging/impl/SLF4JLocationAwareLog.class 455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class 479 Fri Dec 13 16:44:40 PST 2013 parquet/org/slf4j/spi/LocationAwareLogger.class {code} vs. {code} rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf spark-assembly-0.9.1-hadoop0.20.2-cdh3u3.jar | grep -i slf | grep LocationAware 455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
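For projects that consume Spark as a library dependency (rather than registering the assembly jar, as the Pig script here does), one stop-gap for this kind of conflict is to exclude Spark's slf4j artifacts so the consumer's own version wins. A hypothetical build.sbt fragment, with coordinates assumed rather than taken from the issue:

```scala
// Hypothetical build.sbt fragment: drop the slf4j copies Spark would bring
// in transitively, so the host application's slf4j stays on the classpath.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" excludeAll (
  ExclusionRule(organization = "org.slf4j")
)
```

This does not help the assembly-jar case, where the classes are already bundled; there, rebuilding the assembly against a compatible slf4j (as Matei suggests below for the release) is the likelier path.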
[jira] [Updated] (SPARK-1952) slf4j version conflicts with pig
[ https://issues.apache.org/jira/browse/SPARK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1952: --- Target Version/s: 1.0.1 (was: 1.0.0) slf4j version conflicts with pig Key: SPARK-1952 URL: https://issues.apache.org/jira/browse/SPARK-1952 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: pig 12.1 on Cloudera Hadoop, CDH3 Reporter: Ryan Compton Labels: pig, slf4j Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when they register a jar containing Spark. The error appears to be related to org.slf4j.spi.LocationAwareLogger.log. {code} Caused by: java.lang.RuntimeException: Could not resolve error that occured when launching map reduce job: java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598) at java.lang.Thread.dispatchUncaughtException(Thread.java:1874) {code} To reproduce: compile Spark via $ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt assembly and register the resulting jar into a pig script. E.g. {code} REGISTER /usr/share/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar; data0 = LOAD 'data' USING PigStorage(); ttt = LIMIT data0 10; DUMP ttt; {code} The Spark-1.0 jar includes some slf4j dependencies that were not present in 0.9.1 {code} rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar | grep -i slf | grep LocationAware 3259 Mon Mar 25 21:49:34 PDT 2013 org/apache/commons/logging/impl/SLF4JLocationAwareLog.class 455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class 479 Fri Dec 13 16:44:40 PST 2013 parquet/org/slf4j/spi/LocationAwareLogger.class {code} vs. 
{code} rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf spark-assembly-0.9.1-hadoop0.20.2-cdh3u3.jar | grep -i slf | grep LocationAware 455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1956) Enable shuffle consolidation by default
[ https://issues.apache.org/jira/browse/SPARK-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011741#comment-14011741 ] Mridul Muralidharan commented on SPARK-1956: shuffle consolidation MUST NOT be enabled - whether by default, or intentionally. In 1.0, it is very badly broken - we have a whole litany of fixes for it, before it was reasonably stable. Current plan is to contribute most of these back in 1.1 timeframe. Enable shuffle consolidation by default --- Key: SPARK-1956 URL: https://issues.apache.org/jira/browse/SPARK-1956 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.0.0 Reporter: Sandy Ryza The only drawbacks are on ext3, and most everyone has ext4 at this point. I think it's better to aim the default at the common case. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1952) slf4j version conflicts with pig
[ https://issues.apache.org/jira/browse/SPARK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1952: --- Fix Version/s: (was: 1.0.0) slf4j version conflicts with pig Key: SPARK-1952 URL: https://issues.apache.org/jira/browse/SPARK-1952 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: pig 12.1 on Cloudera Hadoop, CDH3 Reporter: Ryan Compton Labels: pig, slf4j Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when they register a jar containing Spark. The error appears to be related to org.slf4j.spi.LocationAwareLogger.log. {code} Caused by: java.lang.RuntimeException: Could not resolve error that occured when launching map reduce job: java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598) at java.lang.Thread.dispatchUncaughtException(Thread.java:1874) {code} To reproduce: compile Spark via $ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt assembly and register the resulting jar into a pig script. E.g. {code} REGISTER /usr/share/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar; data0 = LOAD 'data' USING PigStorage(); ttt = LIMIT data0 10; DUMP ttt; {code} The Spark-1.0 jar includes some slf4j dependencies that were not present in 0.9.1 {code} rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar | grep -i slf | grep LocationAware 3259 Mon Mar 25 21:49:34 PDT 2013 org/apache/commons/logging/impl/SLF4JLocationAwareLog.class 455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class 479 Fri Dec 13 16:44:40 PST 2013 parquet/org/slf4j/spi/LocationAwareLogger.class {code} vs. 
{code} rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf spark-assembly-0.9.1-hadoop0.20.2-cdh3u3.jar | grep -i slf | grep LocationAware 455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1916) SparkFlumeEvent with body bigger than 1020 bytes are not read properly
[ https://issues.apache.org/jira/browse/SPARK-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1916. Resolution: Fixed Fix Version/s: 0.9.2 1.0.1 Issue resolved by pull request 865 [https://github.com/apache/spark/pull/865] SparkFlumeEvent with body bigger than 1020 bytes are not read properly -- Key: SPARK-1916 URL: https://issues.apache.org/jira/browse/SPARK-1916 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0 Reporter: David Lemieux Assignee: David Lemieux Fix For: 1.0.1, 0.9.2 Attachments: SPARK-1916.diff The readExternal implementation on SparkFlumeEvent will read only the first 1020 bytes of the actual body when streaming data from flume. This means that any event sent to Spark via Flume will be processed properly if the body is small, but will fail if the body is bigger than 1020. Considering that the default max size for a Flume Avro Event is 32K, the implementation should be updated to read more. The following is related : http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-using-Flume-body-size-limitation-tt6127.html -- This message was sent by Atlassian JIRA (v6.2#6252)
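The fix amounts to honoring the declared body length when deserializing instead of stopping at a fixed buffer. A hedged sketch of the idea (not the actual SparkFlumeEvent code; names are ours):

```scala
import java.io._

// Sketch: read a length-prefixed body in full. `readFully` loops until all
// `len` bytes arrive, so bodies larger than any single read's buffer
// (e.g. the ~1020-byte case in the report) come back intact.
def writeBody(out: ObjectOutput, body: Array[Byte]): Unit = {
  out.writeInt(body.length)
  out.write(body)
}

def readBody(in: ObjectInput): Array[Byte] = {
  val len  = in.readInt()
  val body = new Array[Byte](len)
  in.readFully(body)
  body
}

// Round-trip a 32K payload, the default max size of a Flume Avro event body.
val bos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bos)
val payload = Array.fill[Byte](32 * 1024)(7)
writeBody(oos, payload); oos.flush()
val got = readBody(new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray)))
assert(got.length == 32 * 1024)
```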
[jira] [Updated] (SPARK-1576) Passing of JAVA_OPTS to YARN on command line
[ https://issues.apache.org/jira/browse/SPARK-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1576: --- Fix Version/s: (was: 0.9.1) 0.9.2 Passing of JAVA_OPTS to YARN on command line Key: SPARK-1576 URL: https://issues.apache.org/jira/browse/SPARK-1576 Project: Spark Issue Type: Improvement Affects Versions: 0.9.0, 1.0.0, 0.9.1 Reporter: Nishkam Ravi Fix For: 0.9.0, 1.0.0, 0.9.2 Attachments: SPARK-1576.patch JAVA_OPTS can be passed by using either env variables (i.e., SPARK_JAVA_OPTS) or as config vars (after Patrick's recent change). It would be good to allow the user to pass them on command line as well to restrict scope to single application invocation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1849) Broken UTF-8 encoded data gets character replacements and thus can't be fixed
[ https://issues.apache.org/jira/browse/SPARK-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1849: --- Fix Version/s: (was: 0.9.1) 0.9.2 Broken UTF-8 encoded data gets character replacements and thus can't be fixed --- Key: SPARK-1849 URL: https://issues.apache.org/jira/browse/SPARK-1849 Project: Spark Issue Type: Bug Reporter: Harry Brundage Fix For: 1.0.0, 0.9.2 Attachments: encoding_test I'm trying to process a file which isn't valid UTF-8 data inside hadoop using Spark via {{sc.textFile()}}. Is this possible, and if not, is this a bug that we should fix? It looks like {{HadoopRDD}} uses {{org.apache.hadoop.io.Text.toString}} on all the data it ever reads, which I believe replaces invalid UTF-8 byte sequences with the UTF-8 replacement character, \uFFFD. Some example code mimicking what {{sc.textFile}} does underneath:
{code}
scala> sc.textFile(path).collect()(0)
res8: String = ?pple

scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).map(pair => pair._2.toString).collect()(0).getBytes()
res9: Array[Byte] = Array(-17, -65, -67, 112, 112, 108, 101)

scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).map(pair => pair._2.getBytes).collect()(0)
res10: Array[Byte] = Array(-60, 112, 112, 108, 101)
{code}
In the above example, the first two snippets show the string representation and the byte representation of the example line of text. The string shows a question mark for the replacement character, and the bytes reveal that the replacement character has been swapped in by {{Text.toString}}. The third snippet shows what happens if you call {{getBytes}} on the {{Text}} object which comes back from hadoop land: we get the real bytes in the file out. Now, I think this is a bug, though you may disagree.
The text inside my file is perfectly valid iso-8859-1 encoded bytes, which I would like to be able to rescue and re-encode into UTF-8, because I want my application to be smart like that. I think Spark should give me the raw broken string so I can re-encode it, but I can't get at the original bytes in order to guess at what the source encoding might be, because they have already been replaced. I'm dealing with data from some CDN access logs which are, to put it nicely, diversely encoded, but I think this is a use case Spark should fully support. So my suggested fix, on which I'd like some guidance, is to change {{textFile}} to spit out broken strings by not using {{Text}}'s UTF-8 encoding. Further compounding this issue is that my application is actually in PySpark, but we can talk about how bytes fly through to Scala land after this if we agree that this is an issue at all. -- This message was sent by Atlassian JIRA (v6.2#6252)
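The replacement behavior is easy to demonstrate outside Hadoop; this plain-Python sketch mirrors the byte arrays in the report (EF BF BD, i.e. -17 -65 -67, is the UTF-8 encoding of \uFFFD).

```python
# Plain-Python illustration (not Spark/Hadoop code) of the lossy decode
# described above. 0xC4 is "Ä" in iso-8859-1 but starts an invalid byte
# sequence in UTF-8 when followed by "p".
raw = b"\xc4pple"  # the real bytes on disk, per the third snippet

# What Text.toString effectively does: decode as UTF-8, substituting
# U+FFFD for invalid sequences. The 0xC4 byte is gone for good.
lossy = raw.decode("utf-8", errors="replace")
print(lossy)  # '\ufffdpple'

# Re-encoding the lossy string yields the bytes from the second snippet:
# EF BF BD (== -17, -65, -67 as signed bytes) is UTF-8 for U+FFFD.
print(list(lossy.encode("utf-8")))  # [239, 191, 189, 112, 112, 108, 101]

# With the raw bytes still available, the data is rescuable:
print(raw.decode("iso-8859-1"))  # 'Äpple'
```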
[jira] [Updated] (SPARK-1759) sbt/sbt package fail cause by directory
[ https://issues.apache.org/jira/browse/SPARK-1759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1759: --- Fix Version/s: (was: 0.9.1) 0.9.2 sbt/sbt package fail cause by directory --- Key: SPARK-1759 URL: https://issues.apache.org/jira/browse/SPARK-1759 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0, 0.9.1 Environment: ubuntu14.04 Reporter: Jian Pan Fix For: 0.9.2 Original Estimate: 1h Remaining Estimate: 1h
1. Create a project named simpleApp
{noformat}
$ cd simpleApp
$ find .
.
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/simpleApp.scala
$ ~/Software/spark-0.9.1/sbt/sbt
awk: fatal: cannot open file `./project/build.properties' for reading (No such file or directory)
Attempting to fetch sbt
/home/jpan/Software/spark-0.9.1/sbt/sbt: line 35: /sbt/sbt-launch-.jar: No such file or directory
/home/jpan/Software/spark-0.9.1/sbt/sbt: line 35: /sbt/sbt-launch-.jar: No such file or directory
Our attempt to download sbt locally to /sbt/sbt-launch-.jar failed. Please install sbt manually from http://www.scala-sbt.org/
{noformat}
It failed because sbt uses a relative path. -- This message was sent by Atlassian JIRA (v6.2#6252)
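The failure above comes from the wrapper resolving the launcher jar relative to the caller's working directory. A minimal sketch of the usual fix, assuming nothing about Spark's actual script beyond what the error output shows: resolve paths against the script's own location instead.

```shell
# Sketch only (not Spark's actual sbt/sbt script): compute the directory the
# script itself lives in, then build the launcher path from it, so invoking
# the script from another project's directory still finds the jar.
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
JAR="${SCRIPT_DIR}/sbt/sbt-launch.jar"
echo "launcher resolved to: ${JAR}"
```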
[jira] [Resolved] (SPARK-1712) ParallelCollectionRDD operations hanging forever without any error messages
[ https://issues.apache.org/jira/browse/SPARK-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1712. -- Resolution: Fixed ParallelCollectionRDD operations hanging forever without any error messages Key: SPARK-1712 URL: https://issues.apache.org/jira/browse/SPARK-1712 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Environment: Linux Ubuntu 14.04, a single spark node; standalone mode. Reporter: Piotr Kołaczkowski Assignee: Guoqiang Li Priority: Blocker Attachments: executor.jstack.txt, master.jstack.txt, repl.jstack.txt, spark-hang.png, worker.jstack.txt conf/spark-defaults.conf
{code}
spark.akka.frameSize 5
spark.default.parallelism 1
{code}
{noformat}
scala> val collection = (1 to 100).map(i => ("foo" + i, i)).toVector
collection: Vector[(String, Int)] = Vector((foo1,1), (foo2,2), (foo3,3), (foo4,4), (foo5,5), (foo6,6), (foo7,7), (foo8,8), (foo9,9), (foo10,10), (foo11,11), (foo12,12), (foo13,13), (foo14,14), (foo15,15), (foo16,16), (foo17,17), (foo18,18), (foo19,19), (foo20,20), (foo21,21), (foo22,22), (foo23,23), (foo24,24), (foo25,25), (foo26,26), (foo27,27), (foo28,28), (foo29,29), (foo30,30), (foo31,31), (foo32,32), (foo33,33), (foo34,34), (foo35,35), (foo36,36), (foo37,37), (foo38,38), (foo39,39), (foo40,40), (foo41,41), (foo42,42), (foo43,43), (foo44,44), (foo45,45), (foo46,46), (foo47,47), (foo48,48), (foo49,49), (foo50,50), (foo51,51), (foo52,52), (foo53,53), (foo54,54), (foo55,55), (foo56,56), (foo57,57), (foo58,58), (foo59,59), (foo60,60), (foo61,61), (foo62,62), (foo63,63), (foo64,64), (foo...

scala> val rdd = sc.parallelize(collection)
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> rdd.first
res4: (String, Int) = (foo1,1)

scala> rdd.map(_._2).sum // nothing happens
{noformat}
CPU and I/O idle.
Memory usage reported by the JVM, after a manually triggered GC: repl: 216 MB / 2 GB, executor: 67 MB / 2 GB, worker: 6 MB / 128 MB, master: 6 MB / 128 MB. No errors found in the worker's stderr/stdout. It works fine with 700,000 elements, and it takes about 1 second to process the request and calculate the sum. With 700,000 items the spark executor memory doesn't even exceed 300 MB out of the 2 GB available. It fails with 800,000 items. Multiple parallelized collections of 700,000 items at the same time in the same session work fine. -- This message was sent by Atlassian JIRA (v6.2#6252)
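The eventual fix, per the later comments on this issue, was a frame size check. As a hypothetical, non-Spark sketch of that kind of guard: reject a serialized result that exceeds the configured frame size instead of letting the oversized message vanish and the job hang. All names here are illustrative, not Spark's actual API.

```python
# Hypothetical sketch (illustrative names, not Spark's API) of a frame-size
# check: fail loudly when a serialized result exceeds spark.akka.frameSize
# instead of letting the oversized Akka message be dropped silently, which
# is what makes the job hang with no error.

def check_frame_size(serialized_result, frame_size_mb):
    limit = frame_size_mb * 1024 * 1024
    if len(serialized_result) > limit:
        raise RuntimeError(
            f"serialized result is {len(serialized_result)} bytes, which "
            f"exceeds the {limit}-byte frame size; increase the frame size "
            f"or reduce the result size")

check_frame_size(b"x" * 1024, 5)  # 1 KiB result, 5 MB frame: fine
try:
    check_frame_size(b"x" * (6 * 1024 * 1024), 5)  # 6 MiB result: rejected
except RuntimeError as e:
    print("rejected:", e)
```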
[jira] [Updated] (SPARK-1712) ParallelCollectionRDD operations hanging forever without any error messages
[ https://issues.apache.org/jira/browse/SPARK-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1712: - Priority: Major (was: Blocker) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1817) RDD zip erroneous when partitions do not divide RDD count
[ https://issues.apache.org/jira/browse/SPARK-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1817: - Priority: Minor (was: Blocker) RDD zip erroneous when partitions do not divide RDD count - Key: SPARK-1817 URL: https://issues.apache.org/jira/browse/SPARK-1817 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0, 1.0.0 Reporter: Michael Malak Assignee: Kan Zhang Priority: Minor Example:
{noformat}
scala> sc.parallelize(1L to 2L, 4).zip(sc.parallelize(11 to 12, 4)).collect
res1: Array[(Long, Int)] = Array((2,11))
{noformat}
But more generally, it happens whenever the number of partitions does not evenly divide the total number of elements in the RDD. See https://groups.google.com/forum/#!msg/spark-users/demrmjHFnoc/Ek3ijiXHr2MJ -- This message was sent by Atlassian JIRA (v6.2#6252)
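The behavior is easy to mimic outside Spark: zip pairs the two RDDs partition by partition, so if corresponding partitions hold different numbers of elements, the per-partition zip silently drops the leftovers. The partition layouts below are assumptions chosen for illustration, not the exact slices Spark computes.

```python
# Plain-Python model (not Spark code) of zip's partition-wise pairing.
# The partition layouts are assumptions for illustration: two RDDs holding
# the same two elements' worth of data, split into 4 partitions differently.
left = [[], [1], [], [2]]     # e.g. how 1L to 2L might be sliced 4 ways
right = [[], [], [11], [12]]  # e.g. how 11 to 12 might be sliced 4 ways

# Partition-wise zip: elements without a partner in the same partition are
# silently dropped, and the surviving pairs can be mismatched.
pairs = [p for l, r in zip(left, right) for p in zip(l, r)]
print(pairs)  # [(2, 12)] -- one mismatched pair instead of [(1, 11), (2, 12)]
```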
[jira] [Updated] (SPARK-1817) RDD zip erroneous when partitions do not divide RDD count
[ https://issues.apache.org/jira/browse/SPARK-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1817: - Priority: Major (was: Minor) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1712) ParallelCollectionRDD operations hanging forever without any error messages
[ https://issues.apache.org/jira/browse/SPARK-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1712: - Fix Version/s: 1.0.1 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1952) slf4j version conflicts with pig
[ https://issues.apache.org/jira/browse/SPARK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011789#comment-14011789 ] Ryan Compton commented on SPARK-1952: - Pig depends on slf4j 1.6.1:
{code}
rfcompton@node19 /d/t/c/pig-0.12.1 cat ivy/libraries.properties | grep 4j
log4j.version=1.2.16
slf4j-api.version=1.6.1
slf4j-log4j12.version=1.6.1
{code}
I don't use Hive, and according to http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html#hive-support it's not packaged with Spark by default, so I don't think it's Hive. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1952) slf4j version conflicts with pig
[ https://issues.apache.org/jira/browse/SPARK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011806#comment-14011806 ] Patrick Wendell commented on SPARK-1952: Hm, unfortunately I don't see any obvious culprits here. The slf4j version had only a small bump in Spark 1.0, from 1.7.2 to 1.7.5; I don't think it would have radically changed the classes that are included in the jar. The parquet slf4j stuff is expected: since parquet shades slf4j, it will have its own copy of slf4j sitting around, but this shouldn't conflict at all. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1952) slf4j version conflicts with pig
[ https://issues.apache.org/jira/browse/SPARK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011808#comment-14011808 ] Patrick Wendell commented on SPARK-1952: [~rcompton] - what if you modify the spark build and downgrade slf4j to 1.7.2 as a debugging step... does that fix it? -- This message was sent by Atlassian JIRA (v6.2#6252)
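That debugging step could look something like the following hypothetical sbt fragment, which pins slf4j back to 1.7.2 in the assembly. Spark's actual build may set the version through a property instead; adjust whichever mechanism the build uses.

```scala
// Hypothetical sbt fragment for the suggested debugging step: force slf4j
// back to 1.7.2 in the assembly. Spark's real build may pin the version
// via a build property instead; this only sketches the idea.
dependencyOverrides += "org.slf4j" % "slf4j-api" % "1.7.2"
```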
[jira] [Commented] (SPARK-1954) Make it easier to get Spark on YARN code to compile in IntelliJ
[ https://issues.apache.org/jira/browse/SPARK-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011810#comment-14011810 ] Patrick Wendell commented on SPARK-1954: Have you tried running sbt/sbt gen-idea with SPARK_YARN=true and SPARK_HADOOP_VERSION=2.2.0? Make it easier to get Spark on YARN code to compile in IntelliJ --- Key: SPARK-1954 URL: https://issues.apache.org/jira/browse/SPARK-1954 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Sandy Ryza When loading a project through a Maven pom, IntelliJ allows switching on profiles, but, to my knowledge, doesn't provide a way to set arbitrary properties. To get Spark-on-YARN code to compile in IntelliJ, I need to manually change the hadoop.version in the root pom.xml to 2.2.0 or higher. This is very cumbersome when switching branches. It would be really helpful to add a profile that sets the Hadoop version that IntelliJ can switch on. -- This message was sent by Atlassian JIRA (v6.2#6252)
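The requested change could look something like the following pom fragment; the profile id and version are placeholders, not an agreed-on convention.

```xml
<!-- Hypothetical profile (id and version are placeholders): lets IntelliJ
     switch the Hadoop version on via its Maven Profiles panel instead of
     hand-editing hadoop.version in the root pom. -->
<profile>
  <id>hadoop-2.2</id>
  <properties>
    <hadoop.version>2.2.0</hadoop.version>
  </properties>
</profile>
```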
[jira] [Resolved] (SPARK-1950) spark on yarn can't start
[ https://issues.apache.org/jira/browse/SPARK-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1950. Resolution: Duplicate spark on yarn can't start -- Key: SPARK-1950 URL: https://issues.apache.org/jira/browse/SPARK-1950 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Guoqiang Li Priority: Blocker {{HADOOP_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit --archives /input/lbs/recommend/toona/spark/conf toona-assembly.jar 20140521}} throws an exception: {code} Exception in thread main java.io.FileNotFoundException: File file:/input/lbs/recommend/toona/spark/conf does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:402) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289) at org.apache.spark.deploy.yarn.ClientBase$class.org$apache$spark$deploy$yarn$ClientBase$$copyRemoteFile(ClientBase.scala:162) at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4$$anonfun$apply$2.apply(ClientBase.scala:237) at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4$$anonfun$apply$2.apply(ClientBase.scala:232) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4.apply(ClientBase.scala:232) at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4.apply(ClientBase.scala:230) at scala.collection.immutable.List.foreach(List.scala:318) at 
org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:230) at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:39) at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:74) at org.apache.spark.deploy.yarn.Client.run(Client.scala:96) at org.apache.spark.deploy.yarn.Client$.main(Client.scala:186) at org.apache.spark.deploy.yarn.Client.main(Client.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} {{HADOOP_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit --archives hdfs://10dian72:8020/input/lbs/recommend/toona/spark/conf toona-assembly.jar 20140521}} works. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1712) ParallelCollectionRDD operations hanging forever without any error messages
[ https://issues.apache.org/jira/browse/SPARK-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011830#comment-14011830 ] Matei Zaharia commented on SPARK-1712: -- Merged the frame size check into 0.9.2 as well as 1.0.1 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1957) Pluggable disk store for BlockManager
Raymond Liu created SPARK-1957: -- Summary: Pluggable disk store for BlockManager Key: SPARK-1957 URL: https://issues.apache.org/jira/browse/SPARK-1957 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Raymond Liu As the first step toward the goal of SPARK-1733, support a pluggable disk store to allow different disk storage backends to be plugged into the BlockManager's DiskStore layer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011897#comment-14011897 ] Sean Owen commented on SPARK-1518: -- "they write their app against the Spark APIs in Maven central (they can do this no matter which cluster they want to run on)" Yeah, this is the issue. OK, if I compile against Spark artifacts as a runtime dependency and submit an app to the cluster, it should be OK no matter what build of Spark is running. The binding from Spark to Hadoop is hidden from the app. I am thinking of the case where I want to build an app that is a client of Spark -- embedding it. Then I am including the client of Hadoop, for example. I have to match my cluster then, and there is no Hadoop 2 Spark artifact. Am I missing something big here? That's my premise for why there would ever be a need for different artifacts. It's the same use case as in Sandy's blog: http://blog.cloudera.com/blog/2014/04/how-to-run-a-simple-apache-spark-app-in-cdh-5/ Spark master doesn't compile against hadoop-common trunk Key: SPARK-1518 URL: https://issues.apache.org/jira/browse/SPARK-1518 Project: Spark Issue Type: Bug Reporter: Marcelo Vanzin Assignee: Colin Patrick McCabe Priority: Critical FSDataOutputStream::sync() has disappeared from trunk in Hadoop; FileLogger.scala is calling it. I've changed it locally to hsync() so I can compile the code, but haven't checked yet whether those are equivalent. hsync() seems to have been there forever, so it hopefully works with all versions Spark cares about. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1957) Pluggable disk store for BlockManager
[ https://issues.apache.org/jira/browse/SPARK-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Liu updated SPARK-1957: --- Issue Type: Sub-task (was: New Feature) Parent: SPARK-1733 Pluggable disk store for BlockManager - Key: SPARK-1957 URL: https://issues.apache.org/jira/browse/SPARK-1957 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Raymond Liu As the first step toward the goal of SPARK-1733, support a pluggable disk store to allow different disk storage backends to be plugged into the BlockManager's DiskStore layer. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk
[ https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011942#comment-14011942 ] Matei Zaharia commented on SPARK-1518: -- Sean, the model for linking to Hadoop has been that users also add a dependency on hadoop-client if they want to access HDFS for the past few releases. See http://spark.apache.org/docs/latest/scala-programming-guide.html#linking-with-spark for example. This model is there because Hadoop itself has decided to create the hadoop-client Maven artifact as a way to get apps to link to it. It works for all the recent versions of Hadoop as far as I know -- users don't have to link against a custom-built Spark for their distro. Regarding binary builds on apache.org, we want users to be able to start using Spark as conveniently as possible on any distribution. It is the goal of the Apache project to have people use Apache Spark as easily as possible. Spark master doesn't compile against hadoop-common trunk Key: SPARK-1518 URL: https://issues.apache.org/jira/browse/SPARK-1518 Project: Spark Issue Type: Bug Reporter: Marcelo Vanzin Assignee: Colin Patrick McCabe Priority: Critical FSDataOutputStream::sync() has disappeared from trunk in Hadoop; FileLogger.scala is calling it. I've changed it locally to hsync() so I can compile the code, but haven't checked yet whether those are equivalent. hsync() seems to have been there forever, so it hopefully works with all versions Spark cares about. -- This message was sent by Atlassian JIRA (v6.2#6252)
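The linking model Matei describes (the app depends on Spark plus a hadoop-client matching the cluster, rather than on a distro-specific Spark build) can be sketched as a Maven fragment. This is an illustrative sketch only; the version numbers below are examples, not values prescribed by this thread:

```xml
<!-- Hedged sketch: versions are examples only. Pick the hadoop-client
     version that matches the cluster you will deploy against. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.2.0</version>
</dependency>
```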
[jira] [Commented] (SPARK-1112) When spark.akka.frameSize > 10, task results bigger than 10MiB block execution
[ https://issues.apache.org/jira/browse/SPARK-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011968#comment-14011968 ] Kevin (Sangwoo) Kim commented on SPARK-1112: Hi all, I'm very new to Spark and doing some tests, and I've experienced a similar issue. (tested with Spark Shell, 0.9.1, r3.8xlarge instance on EC2 - 32 core / 244GiB MEM) I was trying to broadcast 700MB of data and Spark hangs when I run the collect() method on the data. Here are the strange things: 1) when I tried val userInfo = sc.textFile("file:///spark/logs/user_sign_up2.csv").map{line => val split = line.split(","); (split(1), split)} it runs well. 2) when I tried val userInfo = sc.textFile("file:///spark/logs/user_sign_up2.csv").map{line => val split = line.split(","); (split(1), split(5))} Spark hangs. 3) when I slightly control the data size using the sample() method or cutting the data file, it runs well. Our team investigated logs from the master and worker, and we found the worker finished all tasks but the master couldn't retrieve the result from a task whose result size was larger than 10MB. We tried the workaround of setting spark.akka.frameSize to 9, and it works like a charm. I guess it might be hard to reproduce the issue; please contact me if there's a need for testing or logs. Thanks! When spark.akka.frameSize > 10, task results bigger than 10MiB block execution -- Key: SPARK-1112 URL: https://issues.apache.org/jira/browse/SPARK-1112 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: Guillaume Pitel Priority: Blocker Fix For: 0.9.2 When I set the spark.akka.frameSize to something over 10, the messages sent from the executors to the driver completely block the execution if the message is bigger than 10MiB and smaller than the frameSize (if it's above the frameSize, it's ok) Workaround is to set the spark.akka.frameSize to 10. In this case, since 0.8.1, the blockManager deals with the data to be sent. 
It seems slower than akka direct message though. The configuration seems to be correctly read (see actorSystemConfig.txt), so I don't see where the 10MiB could come from -- This message was sent by Atlassian JIRA (v6.2#6252)
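The 10 MiB ceiling described above is simple arithmetic. A toy check (not Spark code; the class and method names here are made up for illustration) of whether a serialized task result fits in an Akka frame of a given size:

```java
public class FrameSizeCheck {
    // Toy illustration of the reported limit: a serialized task result only
    // fits in a direct Akka message if it is at most frameSizeMB mebibytes.
    // Results above that are what the BlockManager path (0.8.1+) handles.
    static boolean fitsInFrame(long resultBytes, int frameSizeMB) {
        return resultBytes <= (long) frameSizeMB * 1024 * 1024;
    }

    public static void main(String[] args) {
        System.out.println(fitsInFrame(5L * 1024 * 1024, 10));  // small result fits
        System.out.println(fitsInFrame(11L * 1024 * 1024, 10)); // over the cap
    }
}
```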
[jira] [Comment Edited] (SPARK-1112) When spark.akka.frameSize > 10, task results bigger than 10MiB block execution
[ https://issues.apache.org/jira/browse/SPARK-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011968#comment-14011968 ] Kevin (Sangwoo) Kim edited comment on SPARK-1112 at 5/29/14 2:01 AM: - Hi all, I'm very new to Spark and doing some tests, and I've experienced a similar issue. (tested with Spark Shell, 0.9.1, r3.8xlarge instance on EC2 - 32 core / 244GiB MEM) I was trying to broadcast 700MB of data and Spark hangs when I run the collect() method on the data. Here are the strange things: 1) when I tried {code}val userInfo = sc.textFile("file:///spark/logs/user_sign_up2.csv").map{line => val split = line.split(","); (split(1), split)}{code} it runs well. 2) when I tried {code}val userInfo = sc.textFile("file:///spark/logs/user_sign_up2.csv").map{line => val split = line.split(","); (split(1), split(5))} {code} Spark hangs. 3) when I slightly control the data size using the sample() method or cutting the data file, it runs well. Our team investigated logs from the master and worker, and we found the worker finished all tasks but the master couldn't retrieve the result from a task whose result size was larger than 10MB. We tried the workaround of setting spark.akka.frameSize to 9, and it works like a charm. I guess it might be hard to reproduce the issue; please contact me if there's a need for testing or logs. Thanks! was (Author: swkimme): Hi all, I'm very new to Spark and doing some tests, and I've experienced a similar issue. (tested with Spark Shell, 0.9.1, r3.8xlarge instance on EC2 - 32 core / 244GiB MEM) I was trying to broadcast 700MB of data and Spark hangs when I run the collect() method on the data. Here are the strange things: 1) when I tried val userInfo = sc.textFile("file:///spark/logs/user_sign_up2.csv").map{line => val split = line.split(","); (split(1), split)} it runs well. 2) when I tried val userInfo = sc.textFile("file:///spark/logs/user_sign_up2.csv").map{line => val split = line.split(","); (split(1), split(5))} Spark hangs. 
3) when I slightly control the data size using the sample() method or cutting the data file, it runs well. Our team investigated logs from the master and worker, and we found the worker finished all tasks but the master couldn't retrieve the result from a task whose result size was larger than 10MB. We tried the workaround of setting spark.akka.frameSize to 9, and it works like a charm. I guess it might be hard to reproduce the issue; please contact me if there's a need for testing or logs. Thanks! When spark.akka.frameSize > 10, task results bigger than 10MiB block execution -- Key: SPARK-1112 URL: https://issues.apache.org/jira/browse/SPARK-1112 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: Guillaume Pitel Priority: Blocker Fix For: 0.9.2 When I set the spark.akka.frameSize to something over 10, the messages sent from the executors to the driver completely block the execution if the message is bigger than 10MiB and smaller than the frameSize (if it's above the frameSize, it's ok) Workaround is to set the spark.akka.frameSize to 10. In this case, since 0.8.1, the blockManager deals with the data to be sent. It seems slower than akka direct message though. The configuration seems to be correctly read (see actorSystemConfig.txt), so I don't see where the 10MiB could come from -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1112) When spark.akka.frameSize > 10, task results bigger than 10MiB block execution
[ https://issues.apache.org/jira/browse/SPARK-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1112: - Priority: Critical (was: Blocker) When spark.akka.frameSize > 10, task results bigger than 10MiB block execution -- Key: SPARK-1112 URL: https://issues.apache.org/jira/browse/SPARK-1112 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: Guillaume Pitel Priority: Critical Fix For: 0.9.2 When I set the spark.akka.frameSize to something over 10, the messages sent from the executors to the driver completely block the execution if the message is bigger than 10MiB and smaller than the frameSize (if it's above the frameSize, it's ok) Workaround is to set the spark.akka.frameSize to 10. In this case, since 0.8.1, the blockManager deals with the data to be sent. It seems slower than akka direct message though. The configuration seems to be correctly read (see actorSystemConfig.txt), so I don't see where the 10MiB could come from -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1112) When spark.akka.frameSize > 10, task results bigger than 10MiB block execution
[ https://issues.apache.org/jira/browse/SPARK-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011978#comment-14011978 ] Matei Zaharia commented on SPARK-1112: -- I'm curious, why did you want to make the frameSize this big -- are the tasks themselves also big or just the results? There might be other buffers in Akka that can't be made bigger than this. It's possible that this changed in a newer Akka version (because larger frame sizes used to work before). When spark.akka.frameSize > 10, task results bigger than 10MiB block execution -- Key: SPARK-1112 URL: https://issues.apache.org/jira/browse/SPARK-1112 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: Guillaume Pitel Priority: Critical Fix For: 0.9.2 When I set the spark.akka.frameSize to something over 10, the messages sent from the executors to the driver completely block the execution if the message is bigger than 10MiB and smaller than the frameSize (if it's above the frameSize, it's ok) Workaround is to set the spark.akka.frameSize to 10. In this case, since 0.8.1, the blockManager deals with the data to be sent. It seems slower than akka direct message though. The configuration seems to be correctly read (see actorSystemConfig.txt), so I don't see where the 10MiB could come from -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1112) When spark.akka.frameSize > 10, task results bigger than 10MiB block execution
[ https://issues.apache.org/jira/browse/SPARK-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011993#comment-14011993 ] Kevin (Sangwoo) Kim commented on SPARK-1112: [~matei] I found that the default of spark.akka.frameSize is 10 from the config document, http://spark.apache.org/docs/0.9.1/configuration.html and just tried slightly larger and smaller values (11 and 9). I ran the collect() method on the userInfo and it might contain large data. (edited the first comment.) When spark.akka.frameSize > 10, task results bigger than 10MiB block execution -- Key: SPARK-1112 URL: https://issues.apache.org/jira/browse/SPARK-1112 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: Guillaume Pitel Priority: Critical Fix For: 0.9.2 When I set the spark.akka.frameSize to something over 10, the messages sent from the executors to the driver completely block the execution if the message is bigger than 10MiB and smaller than the frameSize (if it's above the frameSize, it's ok) Workaround is to set the spark.akka.frameSize to 10. In this case, since 0.8.1, the blockManager deals with the data to be sent. It seems slower than akka direct message though. The configuration seems to be correctly read (see actorSystemConfig.txt), so I don't see where the 10MiB could come from -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-1112) When spark.akka.frameSize > 10, task results bigger than 10MiB block execution
[ https://issues.apache.org/jira/browse/SPARK-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011968#comment-14011968 ] Kevin (Sangwoo) Kim edited comment on SPARK-1112 at 5/29/14 2:50 AM: - Hi all, I'm very new to Spark and doing some tests, and I've experienced a similar issue. (tested with Spark Shell, 0.9.1, r3.8xlarge instance on EC2 - 32 core / 244GiB MEM) I was trying to broadcast 700MB of data and Spark hangs when I run the collect() method on the data. Here are the strange things: 1) when I tried {code}val userInfo = sc.textFile("file:///spark/logs/user_sign_up2.csv").map{line => val split = line.split(","); (split(1), split)} val userInfoMap = userInfo.collectAsMap {code} it runs well. 2) when I tried {code}val userInfo = sc.textFile("file:///spark/logs/user_sign_up2.csv").map{line => val split = line.split(","); (split(1), split(5))} val userInfoMap = userInfo.collectAsMap {code} Spark hangs. 3) when I slightly control the data size using the sample() method or cutting the data file, it runs well. Our team investigated logs from the master and worker, and we found the worker finished all tasks but the master couldn't retrieve the result from a task whose result size was larger than 10MB. We tried the workaround of setting spark.akka.frameSize to 9, and it works like a charm. I guess it might be hard to reproduce the issue; please contact me if there's a need for testing or logs. Thanks! was (Author: swkimme): Hi all, I'm very new to Spark and doing some tests, and I've experienced a similar issue. (tested with Spark Shell, 0.9.1, r3.8xlarge instance on EC2 - 32 core / 244GiB MEM) I was trying to broadcast 700MB of data and Spark hangs when I run the collect() method on the data. Here are the strange things: 1) when I tried {code}val userInfo = sc.textFile("file:///spark/logs/user_sign_up2.csv").map{line => val split = line.split(","); (split(1), split)}{code} it runs well. 
2) when I tried {code}val userInfo = sc.textFile("file:///spark/logs/user_sign_up2.csv").map{line => val split = line.split(","); (split(1), split(5))} {code} Spark hangs. 3) when I slightly control the data size using the sample() method or cutting the data file, it runs well. Our team investigated logs from the master and worker, and we found the worker finished all tasks but the master couldn't retrieve the result from a task whose result size was larger than 10MB. We tried the workaround of setting spark.akka.frameSize to 9, and it works like a charm. I guess it might be hard to reproduce the issue; please contact me if there's a need for testing or logs. Thanks! When spark.akka.frameSize > 10, task results bigger than 10MiB block execution -- Key: SPARK-1112 URL: https://issues.apache.org/jira/browse/SPARK-1112 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0 Reporter: Guillaume Pitel Priority: Critical Fix For: 0.9.2 When I set the spark.akka.frameSize to something over 10, the messages sent from the executors to the driver completely block the execution if the message is bigger than 10MiB and smaller than the frameSize (if it's above the frameSize, it's ok) Workaround is to set the spark.akka.frameSize to 10. In this case, since 0.8.1, the blockManager deals with the data to be sent. It seems slower than akka direct message though. The configuration seems to be correctly read (see actorSystemConfig.txt), so I don't see where the 10MiB could come from -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1958) Calling .collect() on a SchemaRDD should call executeCollect() on the underlying query plan.
Michael Armbrust created SPARK-1958: --- Summary: Calling .collect() on a SchemaRDD should call executeCollect() on the underlying query plan. Key: SPARK-1958 URL: https://issues.apache.org/jira/browse/SPARK-1958 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Michael Armbrust Fix For: 1.1.0 In some cases (like LIMIT) executeCollect() makes optimizations that execute().collect() will not. -- This message was sent by Atlassian JIRA (v6.2#6252)
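A toy model of the distinction in this ticket (hypothetical classes, not Spark SQL's actual internals): an executeCollect()-style path can push the LIMIT into collection and stop early, while an execute()-then-collect() path drains the full iterator before truncating. The rowsTouched counter is instrumentation added here to make the difference observable:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a LIMIT node in a query plan.
class ToyLimitPlan {
    final int limit;
    final List<Integer> rows;
    int rowsTouched = 0; // how many input rows were materialized

    ToyLimitPlan(int limit, List<Integer> rows) {
        this.limit = limit;
        this.rows = rows;
    }

    // execute().collect() analogue: materializes every row, then truncates.
    List<Integer> executeThenCollect() {
        List<Integer> all = new ArrayList<>();
        for (Iterator<Integer> it = rows.iterator(); it.hasNext(); ) {
            all.add(it.next());
            rowsTouched++;
        }
        return all.subList(0, Math.min(limit, all.size()));
    }

    // executeCollect() analogue: stops as soon as the limit is reached.
    List<Integer> executeCollect() {
        List<Integer> out = new ArrayList<>();
        for (Integer row : rows) {
            if (out.size() == limit) break;
            out.add(row);
            rowsTouched++;
        }
        return out;
    }
}
```

Both paths return the same rows; only the amount of work differs, which is why delegating collect() to executeCollect() is the optimization the ticket asks for.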
[jira] [Created] (SPARK-1959) String NULL is interpreted as null value
Cheng Lian created SPARK-1959: - Summary: String NULL is interpreted as null value Key: SPARK-1959 URL: https://issues.apache.org/jira/browse/SPARK-1959 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Cheng Lian The {{HiveTableScan}} operator unwraps the string NULL (case insensitive) into null values even if the column type is {{STRING}}. To reproduce the bug, we use {{sql/hive/src/test/resources/groupby_groupingid.txt}} as test input, copied to {{/tmp/groupby_groupingid.txt}}. Hive session: {code} hive> CREATE TABLE test_null(key INT, value STRING); hive> LOAD DATA LOCAL INPATH '/tmp/groupby_groupingid.txt' INTO TABLE test_null; hive> SELECT * FROM test_null WHERE value IS NOT NULL; ... OK 1 NULL 1 1 2 2 3 3 3 NULL 4 5 1 NULL 1 1 2 2 3 3 3 NULL 4 5 {code} We can see that the {{NULL}} cells in the original input file are interpreted as the string {{NULL}} in Hive. Spark SQL session ({{sbt/sbt hive/console}}): {code} scala> hql("CREATE TABLE test_null(key INT, value STRING)") scala> hql("LOAD DATA LOCAL INPATH '/tmp/groupby_groupingid.txt' INTO TABLE test_null") scala> hql("SELECT * FROM test_null WHERE value IS NOT NULL").foreach(println) ... [1,1] [2,2] [3,3] [4,5] {code} As we can see, the string {{NULL}} is interpreted as null values in Spark SQL. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1901) Standalone worker updates executor's state ahead of executor process exit
[ https://issues.apache.org/jira/browse/SPARK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1901: --- Fix Version/s: (was: 1.0.0) 1.0.1 Standalone worker updates executor's state ahead of executor process exit --- Key: SPARK-1901 URL: https://issues.apache.org/jira/browse/SPARK-1901 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 0.9.0 Environment: spark-1.0 rc10 Reporter: Zhen Peng Assignee: Zhen Peng Fix For: 1.0.1 The standalone worker updates the executor's state prematurely, leaving the resource status inconsistent until the executor process has really died. In our cluster, we found this situation may cause newly submitted applications to be removed by the Master when launching an executor fails. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1954) Make it easier to get Spark on YARN code to compile in IntelliJ
[ https://issues.apache.org/jira/browse/SPARK-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012027#comment-14012027 ] Sandy Ryza commented on SPARK-1954: --- Cool. Your suggestion does appear to work. Make it easier to get Spark on YARN code to compile in IntelliJ --- Key: SPARK-1954 URL: https://issues.apache.org/jira/browse/SPARK-1954 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Sandy Ryza When loading a project through a Maven pom, IntelliJ allows switching on profiles, but, to my knowledge, doesn't provide a way to set arbitrary properties. To get Spark-on-YARN code to compile in IntelliJ, I need to manually change the hadoop.version in the root pom.xml to 2.2.0 or higher. This is very cumbersome when switching branches. It would be really helpful to add a profile that sets the Hadoop version that IntelliJ can switch on. -- This message was sent by Atlassian JIRA (v6.2#6252)
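The profile Sandy asks for might look like the fragment below for the root pom. This is a hedged sketch only: the profile id and hadoop.version are illustrative, not the values this ticket settled on. IntelliJ's Maven panel can then toggle the profile without manual edits to the pom:

```xml
<!-- Hypothetical profile sketch: id and version are examples only. -->
<profile>
  <id>hadoop-2.2</id>
  <properties>
    <hadoop.version>2.2.0</hadoop.version>
  </properties>
</profile>
```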
[jira] [Commented] (SPARK-1959) String NULL is interpreted as null value
[ https://issues.apache.org/jira/browse/SPARK-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012031#comment-14012031 ] Cheng Lian commented on SPARK-1959: --- The problematic line should be [this one|https://github.com/apache/spark/blob/master/sql%2Fhive%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fsql%2Fhive%2FhiveOperators.scala#L154]. I wonder under what circumstances Hive would return a Java string {{NULL}} to represent a null value. Is it safe to simply remove this line? [~marmbrus] String NULL is interpreted as null value -- Key: SPARK-1959 URL: https://issues.apache.org/jira/browse/SPARK-1959 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Cheng Lian The {{HiveTableScan}} operator unwraps the string NULL (case insensitive) into null values even if the column type is {{STRING}}. To reproduce the bug, we use {{sql/hive/src/test/resources/groupby_groupingid.txt}} as test input, copied to {{/tmp/groupby_groupingid.txt}}. Hive session: {code} hive> CREATE TABLE test_null(key INT, value STRING); hive> LOAD DATA LOCAL INPATH '/tmp/groupby_groupingid.txt' INTO TABLE test_null; hive> SELECT * FROM test_null WHERE value IS NOT NULL; ... OK 1 NULL 1 1 2 2 3 3 3 NULL 4 5 {code} We can see that the {{NULL}} cells in the original input file are interpreted as the string {{NULL}} in Hive. Spark SQL session ({{sbt/sbt hive/console}}): {code} scala> hql("CREATE TABLE test_null(key INT, value STRING)") scala> hql("LOAD DATA LOCAL INPATH '/tmp/groupby_groupingid.txt' INTO TABLE test_null") scala> hql("SELECT * FROM test_null WHERE value IS NOT NULL").foreach(println) ... [1,1] [2,2] [3,3] [4,5] {code} As we can see, the string {{NULL}} is interpreted as null values in Spark SQL. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1952) slf4j version conflicts with pig
[ https://issues.apache.org/jira/browse/SPARK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012040#comment-14012040 ] Patrick Wendell commented on SPARK-1952: So I think the issue here is simply that Spark depends on slf4j 1.7.X, pig depends on slf4j 1.6.X, and those aren't compatible. If you look, it's complaining about the function signature of that log() method, which changed between 1.6 and 1.7. Further compounding things, Pig uses commons logging, so it's logging things through (commons logging -> slf4j). http://grepcode.com/file/repo1.maven.org/maven2/org.slf4j/slf4j-api/1.6.1/org/slf4j/spi/LocationAwareLogger.java#LocationAwareLogger.log%28org.slf4j.Marker%2Cjava.lang.String%2Cint%2Cjava.lang.String%2Cjava.lang.Object%5B%5D%2Cjava.lang.Throwable%29 http://grepcode.com/file/repo1.maven.org/maven2/org.slf4j/slf4j-api/1.7.5/org/slf4j/spi/LocationAwareLogger.java#LocationAwareLogger.log%28org.slf4j.Marker%2Cjava.lang.String%2Cint%2Cjava.lang.String%2Cjava.lang.Object%5B%5D%2Cjava.lang.Throwable%29 The Spark code actually doesn't use any new APIs that aren't in slf4j 1.6, so I could see how this worked in 0.9.0. I think the problem here is that Spark 1.0 is now pulling in jul-to-slf4j 1.7.X and that _does_ use newer APIs in SLF4J 1.7. So I'd remove this from the Spark 1.0 build and see if that works (we have an explicit dependency on that). Basically, try to produce a Spark assembly without SLF4JLocationAwareLog.class. I think that should work if I'm understanding this correctly. slf4j version conflicts with pig Key: SPARK-1952 URL: https://issues.apache.org/jira/browse/SPARK-1952 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: pig 12.1 on Cloudera Hadoop, CDH3 Reporter: Ryan Compton Labels: pig, slf4j Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when they register a jar containing Spark. 
The error appears to be related to org.slf4j.spi.LocationAwareLogger.log. {code} Caused by: java.lang.RuntimeException: Could not resolve error that occured when launching map reduce job: java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598) at java.lang.Thread.dispatchUncaughtException(Thread.java:1874) {code} To reproduce: compile Spark via $ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt assembly and register the resulting jar into a pig script. E.g. {code} REGISTER /usr/share/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar; data0 = LOAD 'data' USING PigStorage(); ttt = LIMIT data0 10; DUMP ttt; {code} The Spark-1.0 jar includes some slf4j dependencies that were not present in 0.9.1 {code} rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar | grep -i slf | grep LocationAware 3259 Mon Mar 25 21:49:34 PDT 2013 org/apache/commons/logging/impl/SLF4JLocationAwareLog.class 455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class 479 Fri Dec 13 16:44:40 PST 2013 parquet/org/slf4j/spi/LocationAwareLogger.class {code} vs. {code} rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf spark-assembly-0.9.1-hadoop0.20.2-cdh3u3.jar | grep -i slf | grep LocationAware 455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1954) Make it easier to get Spark on YARN code to compile in IntelliJ
[ https://issues.apache.org/jira/browse/SPARK-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1954: --- Component/s: Build Make it easier to get Spark on YARN code to compile in IntelliJ --- Key: SPARK-1954 URL: https://issues.apache.org/jira/browse/SPARK-1954 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.0 Reporter: Sandy Ryza When loading a project through a Maven pom, IntelliJ allows switching on profiles, but, to my knowledge, doesn't provide a way to set arbitrary properties. To get Spark-on-YARN code to compile in IntelliJ, I need to manually change the hadoop.version in the root pom.xml to 2.2.0 or higher. This is very cumbersome when switching branches. It would be really helpful to add a profile that sets the Hadoop version that IntelliJ can switch on. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1913) Parquet table column pruning error caused by filter pushdown
[ https://issues.apache.org/jira/browse/SPARK-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-1913: Assignee: Cheng Lian Parquet table column pruning error caused by filter pushdown Key: SPARK-1913 URL: https://issues.apache.org/jira/browse/SPARK-1913 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Environment: mac os 10.9.2 Reporter: Chen Chao Assignee: Cheng Lian When scanning Parquet tables, attributes referenced only in predicates that are pushed down are not passed to the `ParquetTableScan` operator, and this causes an exception. Verified in the {{sbt hive/console}}: {code} loadTestTable("src") table("src").saveAsParquetFile("src.parquet") parquetFile("src.parquet").registerAsTable("src_parquet") hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println) {code} Exception: {code} parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/scratch/rxin/spark/src.parquet/part-r-2.parquet at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:177) at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:122) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at 
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717) at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: java.lang.IllegalArgumentException: Column key does not exist. at parquet.filter.ColumnRecordFilter$1.bind(ColumnRecordFilter.java:51) at org.apache.spark.sql.parquet.ComparisonFilter.bind(ParquetFilters.scala:306) at parquet.io.FilteredRecordReader.init(FilteredRecordReader.java:46) at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:74) at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:110) at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172) ... 28 more {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1913) Parquet table column pruning error caused by filter pushdown
[ https://issues.apache.org/jira/browse/SPARK-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-1913. - Resolution: Fixed Parquet table column pruning error caused by filter pushdown Key: SPARK-1913 URL: https://issues.apache.org/jira/browse/SPARK-1913 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Environment: mac os 10.9.2 Reporter: Chen Chao Assignee: Cheng Lian When scanning Parquet tables, attributes referenced only in predicates that are pushed down are not passed to the `ParquetTableScan` operator, and this causes an exception. Verified in the {{sbt hive/console}}: {code} loadTestTable("src") table("src").saveAsParquetFile("src.parquet") parquetFile("src.parquet").registerAsTable("src_parquet") hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println) {code} Exception: {code} parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/scratch/rxin/spark/src.parquet/part-r-2.parquet at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:177) at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:122) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at 
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717) at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: java.lang.IllegalArgumentException: Column key does not exist. at parquet.filter.ColumnRecordFilter$1.bind(ColumnRecordFilter.java:51) at org.apache.spark.sql.parquet.ComparisonFilter.bind(ParquetFilters.scala:306) at parquet.io.FilteredRecordReader.init(FilteredRecordReader.java:46) at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:74) at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:110) at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172) ... 28 more {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1960) EOFException when file size 0 exists when use sc.sequenceFile[K,V](path)
[ https://issues.apache.org/jira/browse/SPARK-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eunsu Yun updated SPARK-1960:

Description: sc.sequenceFile[K,V] throws java.io.EOFException if a file of size 0 exists among the input files. I also tested sc.textFile() under the same conditions and it does not throw EOFException.

{code}
val text = sc.sequenceFile[Long, String]("data-gz/*.dat.gz")
val result = text.filter(filterValid)
result.saveAsTextFile("data-out/")
{code}

{code}
java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:197)
	at java.io.DataInputStream.readFully(DataInputStream.java:169)
	at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1845)
	at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1810)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1759)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1773)
	at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:49)
	at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:64)
	at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:156)
	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
	at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
	at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:33)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
	..
{code}

Summary: EOFException when file size 0 exists when use sc.sequenceFile[K,V](path) (was: EOFException when 0 size file exists when use sc.sequenceFile[K,V](path))

EOFException when file size 0 exists when use sc.sequenceFile[K,V](path)

Key: SPARK-1960 URL: https://issues.apache.org/jira/browse/SPARK-1960 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Eunsu Yun
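Until the behavior is fixed in Spark Core, one workaround sketch (an assumption on my part, not from the report) is to expand the glob on the driver and drop zero-length files before handing the paths to sc.sequenceFile, since SequenceFile.Reader fails while reading the header of an empty file. filterValid and the paths below come from the report; Hadoop input formats accept comma-separated path lists.

{code}
import org.apache.hadoop.fs.{FileSystem, Path}

// Workaround sketch: list the matching files ourselves and skip
// zero-length ones, which make SequenceFile.Reader throw EOFException.
val fs = FileSystem.get(sc.hadoopConfiguration)
val nonEmptyPaths = fs
  .globStatus(new Path("data-gz/*.dat.gz"))
  .filter(_.getLen > 0)          // drop empty (0-byte) files
  .map(_.getPath.toString)

if (nonEmptyPaths.nonEmpty) {
  val text = sc.sequenceFile[Long, String](nonEmptyPaths.mkString(","))
  val result = text.filter(filterValid)
  result.saveAsTextFile("data-out/")
}
{code}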