[jira] [Created] (SPARK-1945) Add full Java examples in MLlib docs

2014-05-28 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1945:


 Summary: Add full Java examples in MLlib docs
 Key: SPARK-1945
 URL: https://issues.apache.org/jira/browse/SPARK-1945
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Reporter: Matei Zaharia


Right now some of the Java tabs only say the following:

All of MLlib’s methods use Java-friendly types, so you can import and call 
them there the same way you do in Scala. The only caveat is that the methods 
take Scala RDD objects, while the Spark Java API uses a separate JavaRDD class. 
You can convert a Java RDD to a Scala one by calling .rdd() on your JavaRDD 
object.

It would be nice to translate the Scala code into Java instead.

Also, a few pages (most notably the Matrix one) don't have Java examples at all.
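
For reference, a minimal sketch of the boundary the quoted paragraph describes, written here in Scala (assuming a spark-shell {{sc}} and the MLlib 1.0 API; illustrative only, not proposed doc text): a JavaRDD wraps a Scala RDD, and .rdd is the conversion the Java tab currently asks readers to call by hand.

{code}
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// A JavaRDD is a thin wrapper around a Scala RDD; .rdd exposes the wrapped RDD.
val points = sc.parallelize(Seq(LabeledPoint(1.0, Vectors.dense(0.5)),
                                LabeledPoint(0.0, Vectors.dense(-0.5))))
val javaPoints: JavaRDD[LabeledPoint] = JavaRDD.fromRDD(points)

// MLlib entry points take RDD[LabeledPoint], so Java callers pass javaPoints.rdd().
val model = LinearRegressionWithSGD.train(javaPoints.rdd, 10)
{code}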



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1153) Generalize VertexId in GraphX so that UUIDs can be used as vertex IDs.

2014-05-28 Thread npanj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14010846#comment-14010846
 ] 

npanj edited comment on SPARK-1153 at 5/28/14 6:48 AM:
---

An alternative approach that I have been using:
1. Use a preprocessing step that maps each UUID to a Long.
2. Build the graph based on the Longs.

For the mapping in step 1:
- Rank your UUIDs.
- Use some kind of hash function?

For option 1, GraphX could provide a tool to generate the map.

I would like to hear how others are building graphs out of non-Long node types.
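
To make option 1 above concrete, a minimal sketch (assuming a spark-shell {{sc}} and GraphX as of 1.0; the names are illustrative, not a proposed GraphX tool) that assigns each UUID a unique Long with zipWithUniqueId and builds the graph on those Longs:

{code}
import org.apache.spark.graphx.{Edge, Graph}

// 1. Preprocess: give every UUID a unique Long id.
val uuidVertices = sc.parallelize(Seq("uuid-a", "uuid-b"))
val uuidToId = uuidVertices.distinct().zipWithUniqueId()              // RDD[(String, Long)]

// 2. Build the graph on the Long ids, keeping the UUID as the vertex attribute.
val vertices = uuidToId.map { case (uuid, id) => (id, uuid) }
val uuidEdges = sc.parallelize(Seq(("uuid-a", "uuid-b")))             // edges expressed on UUIDs
val edges = uuidEdges
  .join(uuidToId)                                                     // (srcUuid, (dstUuid, srcId))
  .map { case (_, (dstUuid, srcId)) => (dstUuid, srcId) }
  .join(uuidToId)                                                     // (dstUuid, (srcId, dstId))
  .map { case (_, (srcId, dstId)) => Edge(srcId, dstId, 1) }
val graph = Graph(vertices, edges)
{code}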





was (Author: npanj):
An alternative approach, that I have been using: 
1 Use a preprocessing step that maps UUID to an Long.
2. Build graph based on Longs

For Mapping in step 1:
- Rank your uuids.
- some kind of has function?

For 1, graphx can provide a tool to generate map.

I will like to hear how others are building graphs out of non-Long node types




 Generalize VertexId in GraphX so that UUIDs can be used as vertex IDs.
 --

 Key: SPARK-1153
 URL: https://issues.apache.org/jira/browse/SPARK-1153
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 0.9.0
Reporter: Deepak Nulu

 Currently, {{VertexId}} is a type-synonym for {{Long}}. I would like to be 
 able to use {{UUID}} as the vertex ID type because the data I want to process 
 with GraphX uses that type for its primay-keys. Others might have a different 
 type for their primary-keys. Generalizing {{VertexId}} (with a type class) 
 will help in such cases.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1946) Submit stage after executors have been registered

2014-05-28 Thread Zhihui (JIRA)
Zhihui created SPARK-1946:
-

 Summary: Submit stage after executors have been registered
 Key: SPARK-1946
 URL: https://issues.apache.org/jira/browse/SPARK-1946
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Zhihui


Because creating the TaskSetManager and registering executors are asynchronous, in 
most situations early stages' tasks run without preferred locality.

A simple solution is to sleep a few seconds in the application so that executors 
have enough time to register.

A better way is to make the DAGScheduler submit stages only after a certain 
fraction of executors have been registered, controlled by configuration properties.

# Submit stages only after the ratio of successfully registered executors has 
reached this value; default value 0
spark.executor.registeredRatio = 0.8

# Even if registeredRatio has not been reached, submit stages after 
maxRegisteredWaitingTime (milliseconds); default value 1
spark.executor.maxRegisteredWaitingTime = 5000
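
A rough sketch of how an application would opt in under this proposal (the two property names are the ones proposed above, not existing Spark settings):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("locality-aware-submission")
  .set("spark.executor.registeredRatio", "0.8")            // wait until 80% of executors have registered
  .set("spark.executor.maxRegisteredWaitingTime", "5000")  // but stop waiting after 5 seconds
val sc = new SparkContext(conf)
{code}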




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1946) Submit stage after executors have been registered

2014-05-28 Thread Zhihui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihui updated SPARK-1946:
--

Attachment: Spark Task Scheduler Optimization Proposal.pptx

 Submit stage after executors have been registered
 -

 Key: SPARK-1946
 URL: https://issues.apache.org/jira/browse/SPARK-1946
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Zhihui
 Attachments: Spark Task Scheduler Optimization Proposal.pptx


 Because creating TaskSetManager and registering executors are asynchronous, 
 in most situation, early stages' tasks run without preferred locality.
 A simple solution is sleeping few seconds in application, so that executors 
 have enough time to register.
 A better way is to make DAGScheduler submit stage after a few of executors 
 have been registered by configuration properties.
 # submit stage only after successfully registered executors arrived the 
 ratio, default value 0
 spark.executor.registeredRatio = 0.8
 # whatever registeredRatio is arrived, submit stage after the 
 maxRegisteredWaitingTime(millisecond), default value 1
 spark.executor.maxRegisteredWaitingTime = 5000



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1946) Submit stage after executors have been registered

2014-05-28 Thread Zhihui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihui updated SPARK-1946:
--

Description: 
Because creating TaskSetManager and registering executors are asynchronous, in 
most situation, early stages' tasks run without preferred locality.

A simple solution is sleeping few seconds in application, so that executors 
have enough time to register.

A better way is to make DAGScheduler submit stage after a few of executors have 
been registered by configuration properties.

\# submit stage only after successfully registered executors arrived the ratio, 
default value 0
spark.executor.registeredRatio = 0.8

\# whatever registeredRatio is arrived, submit stage after the 
maxRegisteredWaitingTime(millisecond), default value 1
spark.executor.maxRegisteredWaitingTime = 5000


  was:
Because creating TaskSetManager and registering executors are asynchronous, in 
most situation, early stages' tasks run without preferred locality.

A simple solution is sleeping few seconds in application, so that executors 
have enough time to register.

A better way is to make DAGScheduler submit stage after a few of executors have 
been registered by configuration properties.

# submit stage only after successfully registered executors arrived the ratio, 
default value 0
spark.executor.registeredRatio = 0.8

# whatever registeredRatio is arrived, submit stage after the 
maxRegisteredWaitingTime(millisecond), default value 1
spark.executor.maxRegisteredWaitingTime = 5000



 Submit stage after executors have been registered
 -

 Key: SPARK-1946
 URL: https://issues.apache.org/jira/browse/SPARK-1946
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Zhihui
 Attachments: Spark Task Scheduler Optimization Proposal.pptx


 Because creating TaskSetManager and registering executors are asynchronous, 
 in most situation, early stages' tasks run without preferred locality.
 A simple solution is sleeping few seconds in application, so that executors 
 have enough time to register.
 A better way is to make DAGScheduler submit stage after a few of executors 
 have been registered by configuration properties.
 \# submit stage only after successfully registered executors arrived the 
 ratio, default value 0
 spark.executor.registeredRatio = 0.8
 \# whatever registeredRatio is arrived, submit stage after the 
 maxRegisteredWaitingTime(millisecond), default value 1
 spark.executor.maxRegisteredWaitingTime = 5000



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1495) support leftsemijoin for sparkSQL

2014-05-28 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14010932#comment-14010932
 ] 

Adrian Wang commented on SPARK-1495:


Another PR [https://github.com/apache/spark/pull/837] submitted.

 support leftsemijoin for sparkSQL
 -

 Key: SPARK-1495
 URL: https://issues.apache.org/jira/browse/SPARK-1495
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Adrian Wang
 Fix For: 1.1.0


 I created Github PR #395 for this 
 issue.[https://github.com/apache/spark/pull/395]
 As marmbrus comments there, one design question is which of the following is 
 better:
 1. multiple operators that handle different kinds of joins, letting the 
 planner pick the correct one
 2. putting the switching logic inside of the operator as is done here



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk

2014-05-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14010937#comment-14010937
 ] 

Sean Owen commented on SPARK-1518:
--

Re: versioning one more time, really supporting a bunch of versions may get 
costly. It's already tricky to manage two builds, times YARN-or-not, times 
Hive-or-not, times 4 flavors of Hadoop. I doubt the assemblies are yet 
problem-free in all cases.

In practice it looks like one generic Hadoop 1, Hadoop 2, and CDH 4 release is 
produced, plus one set of Maven artifacts. (PS: again, I am not sure Spark should 
contain a CDH-specific distribution, realizing it's really a proxy for a 
particular Hadoop combo. The same goes for a MapR profile, which is really for 
vendors to maintain.) That means right now you can't build a Spark app for 
anything but Hadoop 1.x with Maven without installing it yourself, and there's 
no official distro for anything but two major Hadoop versions. Support for 
niche versions isn't really there or promised anyway, and fleshing it out 
may become pretty burdensome.

There is no suggested action here; if anything I suggest that the right thing 
is to add Maven artifacts with classifiers, add a few binary artifacts, 
subtract a few vendor artifacts, but this is a different action.

 Spark master doesn't compile against hadoop-common trunk
 

 Key: SPARK-1518
 URL: https://issues.apache.org/jira/browse/SPARK-1518
 Project: Spark
  Issue Type: Bug
Reporter: Marcelo Vanzin
Assignee: Colin Patrick McCabe
Priority: Critical

 FSDataOutputStream::sync() has disappeared from trunk in Hadoop; 
 FileLogger.scala is calling it.
 I've changed it locally to hsync() so I can compile the code, but haven't 
 checked yet whether those are equivalent. hsync() seems to have been there 
 forever, so it hopefully works with all versions Spark cares about.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1947) Child of SumDistinct or Average should be widened to prevent overflows the same as Sum.

2014-05-28 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14010945#comment-14010945
 ] 

Takuya Ueshin commented on SPARK-1947:
--

PRed: https://github.com/apache/spark/pull/902

 Child of SumDistinct or Average should be widened to prevent overflows the 
 same as Sum.
 ---

 Key: SPARK-1947
 URL: https://issues.apache.org/jira/browse/SPARK-1947
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Takuya Ueshin

 Child of {{SumDistinct}} or {{Average}} should be widened to prevent 
 overflows the same as {{Sum}}.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1948) Scalac crashes when building Spark in IntelliJ IDEA

2014-05-28 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-1948:
-

 Summary: Scalac crashes when building Spark in IntelliJ IDEA
 Key: SPARK-1948
 URL: https://issues.apache.org/jira/browse/SPARK-1948
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.0
Reporter: Cheng Lian
Priority: Minor


After [commit 0be8b45|https://github.com/apache/spark/commit/0be8b45], the 
master branch fails to compile within IntelliJ IDEA and causes {{scalac}} to 
crash. But building Spark with SBT is OK. This issue is not blocking, but it's 
annoying since it prevents developers from debugging Spark within IDEA.

I can't figure out the exact reason; I only nailed it down to this commit by 
binary searching. Maybe I should file a bug against IDEA instead?

How to reproduce:

# Checkout [commit 0be8b45|https://github.com/apache/spark/commit/0be8b45]
# Run {{sbt/sbt clean gen-idea}} under Spark source directory
# Open the project in IntelliJ IDEA
# Build the project

The {{scalac}} crash report is attached.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1948) Scalac crashes when building Spark in IntelliJ IDEA

2014-05-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-1948:
--

Attachment: scalac-crash.log

 Scalac crashes when building Spark in IntelliJ IDEA
 ---

 Key: SPARK-1948
 URL: https://issues.apache.org/jira/browse/SPARK-1948
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.0
Reporter: Cheng Lian
Priority: Minor
 Attachments: scalac-crash.log


 After [commit 0be8b45|https://github.com/apache/spark/commit/0be8b45], the 
 master branch fails to compile within IntelliJ IDEA and causes {{scalac}} to 
 crash. But building Spark with SBT is OK. This issue is not blocking, but 
 it's annoying since it prevents developers from debugging Spark within IDEA.
 I can't figure out the exact reason, only nailed down to this commit with 
 binary searching. Maybe I should fire a bug issue to IDEA instead?
 How to reproduce:
 # Checkout [commit 0be8b45|https://github.com/apache/spark/commit/0be8b45]
 # Run {{sbt/sbt clean gen-idea}} under Spark source directory
 # Open the project in IntelliJ IDEA
 # Build the project
 The {{scalac}} crash report is attached.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1951) spark on yarn can't start

2014-05-28 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-1951:
---

Description: 
{{HADOOP_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit  --archives 
/input/lbs/recommend/toona/spark/conf  toona-assembly.jar 20140521}} throws an 
exception:
{code}
Exception in thread main java.io.FileNotFoundException: File 
file:/input/lbs/recommend/toona/spark/conf does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:402)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
at 
org.apache.spark.deploy.yarn.ClientBase$class.org$apache$spark$deploy$yarn$ClientBase$$copyRemoteFile(ClientBase.scala:162)
at 
org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4$$anonfun$apply$2.apply(ClientBase.scala:237)
at 
org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4$$anonfun$apply$2.apply(ClientBase.scala:232)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 
org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4.apply(ClientBase.scala:232)
at 
org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4.apply(ClientBase.scala:230)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:230)
at 
org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:39)
at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:74)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:96)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:186)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}

{{HADOOP_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit  --archives 
hdfs://10dian72:8020/input/lbs/recommend/toona/spark/conf  toona-assembly.jar 
20140521}} works.

  was:
{{HADOOP_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit  --archives 
/input/lbs/recommend/toona/spark/conf  toona-assembly.jar 20140521}}throw an 
exception:
{code}
Exception in thread main java.io.FileNotFoundException: File 
file:/input/lbs/recommend/toona/spark/conf does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:402)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
at 
org.apache.spark.deploy.yarn.ClientBase$class.org$apache$spark$deploy$yarn$ClientBase$$copyRemoteFile(ClientBase.scala:162)
at 
org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4$$anonfun$apply$2.apply(ClientBase.scala:237)
at 
org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4$$anonfun$apply$2.apply(ClientBase.scala:232)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 
org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4.apply(ClientBase.scala:232)
at 
org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4.apply(ClientBase.scala:230)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:230)
at 
org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:39)
at 

[jira] [Created] (SPARK-1950) spark on yarn can't start

2014-05-28 Thread Guoqiang Li (JIRA)
Guoqiang Li created SPARK-1950:
--

 Summary: spark on yarn can't start 
 Key: SPARK-1950
 URL: https://issues.apache.org/jira/browse/SPARK-1950
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Guoqiang Li
Priority: Blocker


{{HADOOP_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit  --archives 
/input/lbs/recommend/toona/spark/conf  toona-assembly.jar 20140521}} throws an 
exception:
{code}
Exception in thread main java.io.FileNotFoundException: File 
file:/input/lbs/recommend/toona/spark/conf does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:402)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
at 
org.apache.spark.deploy.yarn.ClientBase$class.org$apache$spark$deploy$yarn$ClientBase$$copyRemoteFile(ClientBase.scala:162)
at 
org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4$$anonfun$apply$2.apply(ClientBase.scala:237)
at 
org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4$$anonfun$apply$2.apply(ClientBase.scala:232)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 
org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4.apply(ClientBase.scala:232)
at 
org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4.apply(ClientBase.scala:230)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:230)
at 
org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:39)
at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:74)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:96)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:186)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}

{{HADOOP_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit  --archives 
hdfs://10dian72:8020/input/lbs/recommend/toona/spark/conf  toona-assembly.jar 
20140521}} works.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1950) spark on yarn can't start

2014-05-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011330#comment-14011330
 ] 

Sean Owen commented on SPARK-1950:
--

(Looks like you opened this twice? 
https://issues.apache.org/jira/browse/SPARK-1951 )

 spark on yarn can't start 
 --

 Key: SPARK-1950
 URL: https://issues.apache.org/jira/browse/SPARK-1950
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Guoqiang Li
Priority: Blocker

 {{HADOOP_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit  --archives 
 /input/lbs/recommend/toona/spark/conf  toona-assembly.jar 20140521}}throw an 
 exception:
 {code}
 Exception in thread main java.io.FileNotFoundException: File 
 file:/input/lbs/recommend/toona/spark/conf does not exist
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
   at 
 org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:402)
   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
   at 
 org.apache.spark.deploy.yarn.ClientBase$class.org$apache$spark$deploy$yarn$ClientBase$$copyRemoteFile(ClientBase.scala:162)
   at 
 org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4$$anonfun$apply$2.apply(ClientBase.scala:237)
   at 
 org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4$$anonfun$apply$2.apply(ClientBase.scala:232)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
   at 
 org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4.apply(ClientBase.scala:232)
   at 
 org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4.apply(ClientBase.scala:230)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at 
 org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:230)
   at 
 org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:39)
   at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:74)
   at org.apache.spark.deploy.yarn.Client.run(Client.scala:96)
   at org.apache.spark.deploy.yarn.Client$.main(Client.scala:186)
   at org.apache.spark.deploy.yarn.Client.main(Client.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 {code}
 {{HADOOP_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit  --archives 
 hdfs://10dian72:8020/input/lbs/recommend/toona/spark/conf  toona-assembly.jar 
 20140521}} work.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk

2014-05-28 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011350#comment-14011350
 ] 

Colin Patrick McCabe commented on SPARK-1518:
-

bq. Re: versioning one more time, really supporting a bunch of versions may get 
costly. It's already tricky to manage two builds times YARN-or-not, 
Hive-or-not, times 4 flavors of Hadoop. I doubt the assemblies are yet 
problem-free in all cases.

I think in this particular case, we can use reflection to support both Hadoop 
1.X and newer stuff.
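
For illustration, a minimal sketch of that reflection approach (the helper itself is hypothetical; only FSDataOutputStream's hsync()/sync() method names are real):

{code}
import org.apache.hadoop.fs.FSDataOutputStream

// Prefer hsync() where it exists (Hadoop 2.x / trunk), fall back to sync() on Hadoop 1.x.
def flushToDisk(out: FSDataOutputStream): Unit = {
  val m =
    try out.getClass.getMethod("hsync")
    catch { case _: NoSuchMethodException => out.getClass.getMethod("sync") }
  m.invoke(out)
}
{code}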

bq. I am not sure Spark should contain a CDH-specific distribution? realizing 
it's really a proxy for a particular Hadoop combo. Same goes for a MapR 
profile, which is really for vendors to maintain)

I agree 100%.  We should keep vendor stuff out of the Apache release.  Vendors 
can create their own build setups (that's what they get paid to do, after all.)

bq. There is no suggested action here; if anything I suggest that the right 
thing is to add Maven artifacts with classifiers, add a few binary artifacts, 
subtract a few vendor artifacts, but this is a different action.

If you have some ideas for how to improve the Maven build, it could be worth 
creating a JIRA.  I think you're right that we need to make it more flexible so 
that people can build against more versions without editing the pom.  It might 
be helpful to look at how HBase handles this in its {{pom.xml}} files.

 Spark master doesn't compile against hadoop-common trunk
 

 Key: SPARK-1518
 URL: https://issues.apache.org/jira/browse/SPARK-1518
 Project: Spark
  Issue Type: Bug
Reporter: Marcelo Vanzin
Assignee: Colin Patrick McCabe
Priority: Critical

 FSDataOutputStream::sync() has disappeared from trunk in Hadoop; 
 FileLogger.scala is calling it.
 I've changed it locally to hsync() so I can compile the code, but haven't 
 checked yet whether those are equivalent. hsync() seems to have been there 
 forever, so it hopefully works with all versions Spark cares about.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1916) SparkFlumeEvent with body bigger than 1020 bytes are not read properly

2014-05-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1916:
---

Assignee: David Lemieux

 SparkFlumeEvent with body bigger than 1020 bytes are not read properly
 --

 Key: SPARK-1916
 URL: https://issues.apache.org/jira/browse/SPARK-1916
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.9.0
Reporter: David Lemieux
Assignee: David Lemieux
 Attachments: SPARK-1916.diff


 The readExternal implementation on SparkFlumeEvent will read only the first 
 1020 bytes of the actual body when streaming data from Flume.
 This means that any event sent to Spark via Flume will be processed properly 
 if the body is small, but will fail if the body is bigger than 1020 bytes.
 Considering that the default max size for a Flume Avro event is 32K, the 
 implementation should be updated to read the full body.
 The following is related: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-using-Flume-body-size-limitation-tt6127.html
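
 For illustration, a sketch of the kind of change described above (the variable names are illustrative, not the attached patch): read the declared body length in full rather than doing a single bounded read.

 {code}
 import java.io.ObjectInput

 def readBody(in: ObjectInput): Array[Byte] = {
   val bodyLength = in.readInt()        // length written by the corresponding writeExternal
   val body = new Array[Byte](bodyLength)
   in.readFully(body)                   // keeps reading until all bodyLength bytes arrive
   body
 }
 {code}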



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1836) REPL $outer type mismatch causes lookup() and equals() problems

2014-05-28 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011477#comment-14011477
 ] 

Michael Armbrust commented on SPARK-1836:
-

Yeah, I think it's likely they are related. We can re-open this one later if 
fixing the other one doesn't solve your issue.

 REPL $outer type mismatch causes lookup() and equals() problems
 ---

 Key: SPARK-1836
 URL: https://issues.apache.org/jira/browse/SPARK-1836
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Michael Malak

 Anand Avati partially traced the cause to REPL wrapping classes in $outer 
 classes. There are at least two major symptoms:
 1. equals()
 =
 In REPL equals() (required in custom classes used as a key for groupByKey) 
 seems to have to be written using instanceOf[] instead of the canonical 
 match{}
 Spark Shell (equals uses match{}):
 {noformat}
 class C(val s:String) extends Serializable {
   override def equals(o: Any) = o match {
     case that: C => that.s == s
     case _ => false
   }
 }
 val x = new C("a")
 val bos = new java.io.ByteArrayOutputStream()
 val out = new java.io.ObjectOutputStream(bos)
 out.writeObject(x);
 val b = bos.toByteArray();
 out.close
 bos.close
 val y = new java.io.ObjectInputStream(new 
 java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
 x.equals(y)
 res: Boolean = false
 {noformat}
 Spark Shell (equals uses isInstanceOf[]):
 {noformat}
 class C(val s:String) extends Serializable {
   override def equals(o: Any) = if (o.isInstanceOf[C]) (o.asInstanceOf[C].s == 
 s) else false
 }
 val x = new C("a")
 val bos = new java.io.ByteArrayOutputStream()
 val out = new java.io.ObjectOutputStream(bos)
 out.writeObject(x);
 val b = bos.toByteArray();
 out.close
 bos.close
 val y = new java.io.ObjectInputStream(new 
 java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
 x.equals(y)
 res: Boolean = true
 {noformat}
 Scala Shell (equals uses match{}):
 {noformat}
 class C(val s:String) extends Serializable {
   override def equals(o: Any) = o match {
     case that: C => that.s == s
     case _ => false
   }
 }
 val x = new C("a")
 val bos = new java.io.ByteArrayOutputStream()
 val out = new java.io.ObjectOutputStream(bos)
 out.writeObject(x);
 val b = bos.toByteArray();
 out.close
 bos.close
 val y = new java.io.ObjectInputStream(new 
 java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
 x.equals(y)
 res: Boolean = true
 {noformat}
 2. lookup()
 =
 {noformat}
 class C(val s:String) extends Serializable {
   override def equals(o: Any) = if (o.isInstanceOf[C]) o.asInstanceOf[C].s == 
 s else false
   override def hashCode = s.hashCode
   override def toString = s
 }
 val r = sc.parallelize(Array((new C("a"),11),(new C("a"),12)))
 r.lookup(new C("a"))
 <console>:17: error: type mismatch;
  found   : C
  required: C
    r.lookup(new C("a"))
             ^
 {noformat}
 See
 http://mail-archives.apache.org/mod_mbox/spark-dev/201405.mbox/%3C1400019424.80629.YahooMailNeo%40web160801.mail.bf1.yahoo.com%3E



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1199) Type mismatch in Spark shell when using case class defined in shell

2014-05-28 Thread Michael Malak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011492#comment-14011492
 ] 

Michael Malak commented on SPARK-1199:
--

See also additional test cases in 
https://issues.apache.org/jira/browse/SPARK-1836 which has now been marked as a 
duplicate.

 Type mismatch in Spark shell when using case class defined in shell
 ---

 Key: SPARK-1199
 URL: https://issues.apache.org/jira/browse/SPARK-1199
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Andrew Kerr
Priority: Critical
 Fix For: 1.1.0


 Define a class in the shell:
 {code}
 case class TestClass(a:String)
 {code}
 and an RDD
 {code}
 val data = sc.parallelize(Seq("a")).map(TestClass(_))
 {code}
 define a function on it and map over the RDD
 {code}
 def itemFunc(a:TestClass):TestClass = a
 data.map(itemFunc)
 {code}
 Error:
 {code}
 <console>:19: error: type mismatch;
  found   : TestClass => TestClass
  required: TestClass => ?
    data.map(itemFunc)
 {code}
 Similarly with a mapPartitions:
 {code}
 def partitionFunc(a:Iterator[TestClass]):Iterator[TestClass] = a
 data.mapPartitions(partitionFunc)
 {code}
 {code}
 <console>:19: error: type mismatch;
  found   : Iterator[TestClass] => Iterator[TestClass]
  required: Iterator[TestClass] => Iterator[?]
 Error occurred in an application involving default arguments.
   data.mapPartitions(partitionFunc)
 {code}
 The behavior is the same whether in local mode or on a cluster.
 This isn't specific to RDDs. A Scala collection in the Spark shell has the 
 same problem.
 {code}
 scala> Seq(TestClass("foo")).map(itemFunc)
 <console>:15: error: type mismatch;
  found   : TestClass => TestClass
  required: TestClass => ?
    Seq(TestClass("foo")).map(itemFunc)
    ^
 {code}
 When run in the Scala console (not the Spark shell) there are no type 
 mismatch errors.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1836) REPL $outer type mismatch causes lookup() and equals() problems

2014-05-28 Thread Michael Malak (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Malak resolved SPARK-1836.
--

Resolution: Duplicate

 REPL $outer type mismatch causes lookup() and equals() problems
 ---

 Key: SPARK-1836
 URL: https://issues.apache.org/jira/browse/SPARK-1836
 Project: Spark
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Michael Malak

 Anand Avati partially traced the cause to REPL wrapping classes in $outer 
 classes. There are at least two major symptoms:
 1. equals()
 =
 In REPL equals() (required in custom classes used as a key for groupByKey) 
 seems to have to be written using instanceOf[] instead of the canonical 
 match{}
 Spark Shell (equals uses match{}):
 {noformat}
 class C(val s:String) extends Serializable {
   override def equals(o: Any) = o match {
     case that: C => that.s == s
     case _ => false
   }
 }
 val x = new C("a")
 val bos = new java.io.ByteArrayOutputStream()
 val out = new java.io.ObjectOutputStream(bos)
 out.writeObject(x);
 val b = bos.toByteArray();
 out.close
 bos.close
 val y = new java.io.ObjectInputStream(new 
 java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
 x.equals(y)
 res: Boolean = false
 {noformat}
 Spark Shell (equals uses isInstanceOf[]):
 {noformat}
 class C(val s:String) extends Serializable {
   override def equals(o: Any) = if (o.isInstanceOf[C]) (o.asInstanceOf[C].s == 
 s) else false
 }
 val x = new C("a")
 val bos = new java.io.ByteArrayOutputStream()
 val out = new java.io.ObjectOutputStream(bos)
 out.writeObject(x);
 val b = bos.toByteArray();
 out.close
 bos.close
 val y = new java.io.ObjectInputStream(new 
 java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
 x.equals(y)
 res: Boolean = true
 {noformat}
 Scala Shell (equals uses match{}):
 {noformat}
 class C(val s:String) extends Serializable {
   override def equals(o: Any) = o match {
     case that: C => that.s == s
     case _ => false
   }
 }
 val x = new C("a")
 val bos = new java.io.ByteArrayOutputStream()
 val out = new java.io.ObjectOutputStream(bos)
 out.writeObject(x);
 val b = bos.toByteArray();
 out.close
 bos.close
 val y = new java.io.ObjectInputStream(new 
 java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
 x.equals(y)
 res: Boolean = true
 {noformat}
 2. lookup()
 =
 {noformat}
 class C(val s:String) extends Serializable {
   override def equals(o: Any) = if (o.isInstanceOf[C]) o.asInstanceOf[C].s == 
 s else false
   override def hashCode = s.hashCode
   override def toString = s
 }
 val r = sc.parallelize(Array((new C("a"),11),(new C("a"),12)))
 r.lookup(new C("a"))
 <console>:17: error: type mismatch;
  found   : C
  required: C
    r.lookup(new C("a"))
             ^
 {noformat}
 See
 http://mail-archives.apache.org/mod_mbox/spark-dev/201405.mbox/%3C1400019424.80629.YahooMailNeo%40web160801.mail.bf1.yahoo.com%3E



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1936) Add apache header and remove author tags

2014-05-28 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-1936.
--

Resolution: Won't Fix

We should not change these files' license headers because they're files we've 
modified from the Scala interpreter. We mention that we use modified versions 
of these in our LICENSE, but we can't misrepresent the original copyright.

 Add apache header and remove author tags
 

 Key: SPARK-1936
 URL: https://issues.apache.org/jira/browse/SPARK-1936
 Project: Spark
  Issue Type: Bug
Reporter: Devaraj K
Priority: Minor

 The files below don't have the Apache license header and contain author tags.
 {code:xml}
 spark\repl\src\main\scala\org\apache\spark\repl\SparkExprTyper.scala
 spark\repl\src\main\scala\org\apache\spark\repl\SparkILoop.scala
 spark\repl\src\main\scala\org\apache\spark\repl\SparkILoopInit.scala
 spark\repl\src\main\scala\org\apache\spark\repl\SparkIMain.scala
 spark\repl\src\main\scala\org\apache\spark\repl\SparkImports.scala
 spark\repl\src\main\scala\org\apache\spark\repl\SparkJLineCompletion.scala
 spark\repl\src\main\scala\org\apache\spark\repl\SparkJLineReader.scala
 spark\repl\src\main\scala\org\apache\spark\repl\SparkMemberHandlers.scala
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1790) Update EC2 scripts to support r3 instance types

2014-05-28 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011548#comment-14011548
 ] 

Matei Zaharia commented on SPARK-1790:
--

Thanks Sujeet! Just post here when you have a pull request to fix it.

 Update EC2 scripts to support r3 instance types
 ---

 Key: SPARK-1790
 URL: https://issues.apache.org/jira/browse/SPARK-1790
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 0.9.0, 1.0.0, 0.9.1
Reporter: Matei Zaharia
  Labels: Starter

 These were recently added by Amazon as a cheaper high-memory option



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1790) Update EC2 scripts to support r3 instance types

2014-05-28 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-1790:
-

Labels: Starter  (was: starter)

 Update EC2 scripts to support r3 instance types
 ---

 Key: SPARK-1790
 URL: https://issues.apache.org/jira/browse/SPARK-1790
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 0.9.0, 1.0.0, 0.9.1
Reporter: Matei Zaharia
  Labels: Starter

 These were recently added by Amazon as a cheaper high-memory option



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1952) slf4j version conflicts with pig

2014-05-28 Thread Ryan Compton (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Compton updated SPARK-1952:


Description: 
Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when 
they register a jar containing Spark. The error appears to be related to 
org.slf4j.spi.LocationAwareLogger.log.

{code}
Caused by: java.lang.RuntimeException: Could not resolve error that
occured when launching map reduce job: java.lang.NoSuchMethodError:
org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598)
at java.lang.Thread.dispatchUncaughtException(Thread.java:1874)
{code}

To reproduce: compile Spark via ```$ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt 
assembly``` and register the resulting jar into a pig script. E.g.

```
REGISTER 
/usr/share/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar;
data0 = LOAD 'data' USING PigStorage();
ttt = LIMIT data0 10;
DUMP ttt;
```

The Spark-1.0 jar includes some slf4j dependencies that were not present in 
0.9.1

```
rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf 
spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar | grep -i slf | grep 
LocationAware
  3259 Mon Mar 25 21:49:34 PDT 2013 
org/apache/commons/logging/impl/SLF4JLocationAwareLog.class
   455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
   479 Fri Dec 13 16:44:40 PST 2013 
parquet/org/slf4j/spi/LocationAwareLogger.class
```

vs.

```
rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf 
spark-assembly-0.9.1-hadoop0.20.2-cdh3u3.jar | grep -i slf | grep 
LocationAware
   455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
```


  was:
Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when 
they register a jar containing Spark. The error appears to be related to 
org.slf4j.spi.LocationAwareLogger.log.

```
Caused by: java.lang.RuntimeException: Could not resolve error that
occured when launching map reduce job: java.lang.NoSuchMethodError:
org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598)
at java.lang.Thread.dispatchUncaughtException(Thread.java:1874)
```

To reproduce: compile Spark via ```$ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt 
assembly``` and register the resulting jar into a pig script. E.g.

```
REGISTER 
/usr/share/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar;
data0 = LOAD 'data' USING PigStorage();
ttt = LIMIT data0 10;
DUMP ttt;
```

The Spark-1.0 jar includes some slf4j dependencies that were not present in 
0.9.1

```
rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf 
spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar | grep -i slf | grep 
LocationAware
  3259 Mon Mar 25 21:49:34 PDT 2013 
org/apache/commons/logging/impl/SLF4JLocationAwareLog.class
   455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
   479 Fri Dec 13 16:44:40 PST 2013 
parquet/org/slf4j/spi/LocationAwareLogger.class
```

vs.

```
rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf 
spark-assembly-0.9.1-hadoop0.20.2-cdh3u3.jar | grep -i slf | grep 
LocationAware
   455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
```



 slf4j version conflicts with pig
 

 Key: SPARK-1952
 URL: https://issues.apache.org/jira/browse/SPARK-1952
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: pig 12.1 on Cloudera Hadoop, CDH3
Reporter: Ryan Compton
  Labels: pig, slf4j
 Fix For: 1.0.0


 Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when 
 they register a jar containing Spark. The error appears to be related to 
 org.slf4j.spi.LocationAwareLogger.log.
 {code}
 Caused by: java.lang.RuntimeException: Could not resolve error that
 occured when launching map reduce job: java.lang.NoSuchMethodError:
 org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598)
 at java.lang.Thread.dispatchUncaughtException(Thread.java:1874)
 {code}
 To reproduce: compile Spark via ```$ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 
 sbt/sbt assembly``` and 

[jira] [Updated] (SPARK-1952) slf4j version conflicts with pig

2014-05-28 Thread Ryan Compton (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Compton updated SPARK-1952:


Description: 
Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when 
they register a jar containing Spark. The error appears to be related to 
org.slf4j.spi.LocationAwareLogger.log.

{code}
Caused by: java.lang.RuntimeException: Could not resolve error that
occured when launching map reduce job: java.lang.NoSuchMethodError:
org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598)
at java.lang.Thread.dispatchUncaughtException(Thread.java:1874)
{code}

To reproduce: compile Spark via $ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt 
assembly and register the resulting jar into a pig script. E.g.

{code}
REGISTER 
/usr/share/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar;
data0 = LOAD 'data' USING PigStorage();
ttt = LIMIT data0 10;
DUMP ttt;
{code}
The Spark-1.0 jar includes some slf4j dependencies that were not present in 
0.9.1

{code}
rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf 
spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar | grep -i slf | grep 
LocationAware
  3259 Mon Mar 25 21:49:34 PDT 2013 
org/apache/commons/logging/impl/SLF4JLocationAwareLog.class
   455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
   479 Fri Dec 13 16:44:40 PST 2013 
parquet/org/slf4j/spi/LocationAwareLogger.class
{code}

vs.

{code}
rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf 
spark-assembly-0.9.1-hadoop0.20.2-cdh3u3.jar | grep -i slf | grep 
LocationAware
   455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
{code}


  was:
Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when 
they register a jar containing Spark. The error appears to be related to 
org.slf4j.spi.LocationAwareLogger.log.

{code}
Caused by: java.lang.RuntimeException: Could not resolve error that
occured when launching map reduce job: java.lang.NoSuchMethodError:
org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598)
at java.lang.Thread.dispatchUncaughtException(Thread.java:1874)
{code}

To reproduce: compile Spark via ```$ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt 
assembly``` and register the resulting jar into a pig script. E.g.

{code}
REGISTER 
/usr/share/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar;
data0 = LOAD 'data' USING PigStorage();
ttt = LIMIT data0 10;
DUMP ttt;
{code}
The Spark-1.0 jar includes some slf4j dependencies that were not present in 
0.9.1

{code}
rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf 
spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar | grep -i slf | grep 
LocationAware
  3259 Mon Mar 25 21:49:34 PDT 2013 
org/apache/commons/logging/impl/SLF4JLocationAwareLog.class
   455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
   479 Fri Dec 13 16:44:40 PST 2013 
parquet/org/slf4j/spi/LocationAwareLogger.class
{code}

vs.

{code}
rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf 
spark-assembly-0.9.1-hadoop0.20.2-cdh3u3.jar | grep -i slf | grep 
LocationAware
   455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
{code}



 slf4j version conflicts with pig
 

 Key: SPARK-1952
 URL: https://issues.apache.org/jira/browse/SPARK-1952
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: pig 12.1 on Cloudera Hadoop, CDH3
Reporter: Ryan Compton
  Labels: pig, slf4j
 Fix For: 1.0.0


 Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when 
 they register a jar containing Spark. The error appears to be related to 
 org.slf4j.spi.LocationAwareLogger.log.
 {code}
 Caused by: java.lang.RuntimeException: Could not resolve error that
 occured when launching map reduce job: java.lang.NoSuchMethodError:
 org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598)
 at java.lang.Thread.dispatchUncaughtException(Thread.java:1874)
 {code}
 To reproduce: compile Spark via $ 

[jira] [Updated] (SPARK-1952) slf4j version conflicts with pig

2014-05-28 Thread Ryan Compton (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Compton updated SPARK-1952:


Description: 
Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when 
they register a jar containing Spark. The error appears to be related to 
org.slf4j.spi.LocationAwareLogger.log.

{code}
Caused by: java.lang.RuntimeException: Could not resolve error that
occured when launching map reduce job: java.lang.NoSuchMethodError:
org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598)
at java.lang.Thread.dispatchUncaughtException(Thread.java:1874)
{code}

To reproduce: compile Spark via ```$ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt 
assembly``` and register the resulting jar into a pig script. E.g.

{code}
REGISTER 
/usr/share/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar;
data0 = LOAD 'data' USING PigStorage();
ttt = LIMIT data0 10;
DUMP ttt;
{code}
The Spark-1.0 jar includes some slf4j dependencies that were not present in 
0.9.1

{code}
rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf 
spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar | grep -i slf | grep 
LocationAware
  3259 Mon Mar 25 21:49:34 PDT 2013 
org/apache/commons/logging/impl/SLF4JLocationAwareLog.class
   455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
   479 Fri Dec 13 16:44:40 PST 2013 
parquet/org/slf4j/spi/LocationAwareLogger.class
{code}

vs.

{code}
rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf 
spark-assembly-0.9.1-hadoop0.20.2-cdh3u3.jar | grep -i slf | grep 
LocationAware
   455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
{code}


  was:
Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when 
they register a jar containing Spark. The error appears to be related to 
org.slf4j.spi.LocationAwareLogger.log.

{code}
Caused by: java.lang.RuntimeException: Could not resolve error that
occured when launching map reduce job: java.lang.NoSuchMethodError:
org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598)
at java.lang.Thread.dispatchUncaughtException(Thread.java:1874)
{code}

To reproduce: compile Spark via ```$ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt 
assembly``` and register the resulting jar into a pig script. E.g.

```
REGISTER 
/usr/share/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar;
data0 = LOAD 'data' USING PigStorage();
ttt = LIMIT data0 10;
DUMP ttt;
```

The Spark-1.0 jar includes some slf4j dependencies that were not present in 
0.9.1

```
rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf 
spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar | grep -i slf | grep 
LocationAware
  3259 Mon Mar 25 21:49:34 PDT 2013 
org/apache/commons/logging/impl/SLF4JLocationAwareLog.class
   455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
   479 Fri Dec 13 16:44:40 PST 2013 
parquet/org/slf4j/spi/LocationAwareLogger.class
```

vs.

```
rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf 
spark-assembly-0.9.1-hadoop0.20.2-cdh3u3.jar | grep -i slf | grep 
LocationAware
   455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
```



 slf4j version conflicts with pig
 

 Key: SPARK-1952
 URL: https://issues.apache.org/jira/browse/SPARK-1952
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: pig 12.1 on Cloudera Hadoop, CDH3
Reporter: Ryan Compton
  Labels: pig, slf4j
 Fix For: 1.0.0


 Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when 
 they register a jar containing Spark. The error appears to be related to 
 org.slf4j.spi.LocationAwareLogger.log.
 {code}
 Caused by: java.lang.RuntimeException: Could not resolve error that
 occured when launching map reduce job: java.lang.NoSuchMethodError:
 org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598)
 at java.lang.Thread.dispatchUncaughtException(Thread.java:1874)
 {code}
 To reproduce: compile Spark via ```$ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 
 

[jira] [Created] (SPARK-1954) Make it easier to get Spark on YARN code to compile in IntelliJ

2014-05-28 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-1954:
-

 Summary: Make it easier to get Spark on YARN code to compile in 
IntelliJ
 Key: SPARK-1954
 URL: https://issues.apache.org/jira/browse/SPARK-1954
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Sandy Ryza


When loading a project through a Maven pom, IntelliJ allows switching on 
profiles, but, to my knowledge, doesn't provide a way to set arbitrary 
properties. 

To get Spark-on-YARN code to compile in IntelliJ, I need to manually change the 
hadoop.version in the root pom.xml to 2.2.0 or higher.  This is very cumbersome 
when switching branches.

It would be really helpful to add a profile that sets the Hadoop version that 
IntelliJ can switch on.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1955) VertexRDD can incorrectly assume index sharing

2014-05-28 Thread Ankur Dave (JIRA)
Ankur Dave created SPARK-1955:
-

 Summary: VertexRDD can incorrectly assume index sharing
 Key: SPARK-1955
 URL: https://issues.apache.org/jira/browse/SPARK-1955
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 0.9.0, 1.0.0, 0.9.1
Reporter: Ankur Dave
Assignee: Ankur Dave
Priority: Minor


Many VertexRDD operations (diff, leftJoin, innerJoin) can use a fast zip join 
if both operands are VertexRDDs sharing the same index (i.e., one operand is 
derived from the other). However, this check is implemented by matching on the 
operand type and using the fast join strategy if it is a VertexRDD.

When the two VertexRDDs have the same partitioner but different indexes, this 
is fine, because each VertexPartition will detect the index mismatch and fall 
back to the slow but correct local join strategy.

However, when they have different numbers of partitions or different partition 
functions, an exception or even silently incorrect results can occur.

For example:

{code}
// Construct VertexRDDs with different numbers of partitions
val a = VertexRDD(sc.parallelize(List((0L, 1), (1L, 2)), 1))
val b = VertexRDD(sc.parallelize(List((0L, 5)), 8))
// Try to join them. Appears to work...
val c = a.innerJoin(b) { (vid, x, y) => x + y }
// ... but then fails with java.lang.IllegalArgumentException: Can't zip RDDs 
with unequal numbers of partitions
c.collect
{code}

{code}
import org.apache.spark._
// Construct VertexRDDs with different partition functions
val a = VertexRDD(sc.parallelize(List((0L, 1), (1L, 2))).partitionBy(new 
HashPartitioner(2)))
val bVerts = sc.parallelize(List((1L, 5)))
val b = VertexRDD(bVerts.partitionBy(new RangePartitioner(2, bVerts)))
// Try to join them. We expect (1L, 7).
val c = a.innerJoin(b) { (vid, x, y) => x + y }
// Silent failure: we get an empty set!
c.collect
{code}

VertexRDD should check equality of partitioners before using the fast zip join. 
If the partitioners are different, the two datasets should be automatically 
co-partitioned.
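
A minimal workaround sketch for the meantime (not the proposed fix itself): explicitly 
re-partition the second dataset with the first VertexRDD's partitioner before joining, so 
the fast zip join sees matching partitioning. The object name, local master and fallback 
partitioner below are assumptions made to keep the example self-contained.

{code}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.graphx._

object CoPartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("copartition-sketch").setMaster("local[2]"))
    val a = VertexRDD(sc.parallelize(List((0L, 1), (1L, 2))).partitionBy(new HashPartitioner(2)))
    val bVerts = sc.parallelize(List((1L, 5)))
    // Reuse a's partitioner (falling back to a HashPartitioner of the same size)
    // so both operands share the same partitioning before the zip join.
    val b = VertexRDD(bVerts.partitionBy(
      a.partitioner.getOrElse(new HashPartitioner(a.partitions.length))))
    val c = a.innerJoin(b) { (vid, x, y) => x + y }
    println(c.collect().mkString(", "))  // expected: (1,7)
    sc.stop()
  }
}
{code}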



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1501) Assertions in Graph.apply test are never executed

2014-05-28 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave resolved SPARK-1501.
---

Resolution: Fixed
  Assignee: William Benton

 Assertions in Graph.apply test are never executed
 -

 Key: SPARK-1501
 URL: https://issues.apache.org/jira/browse/SPARK-1501
 Project: Spark
  Issue Type: Test
  Components: GraphX
Affects Versions: 1.0.0
Reporter: William Benton
Assignee: William Benton
Priority: Minor
  Labels: test

 The current Graph.apply test in GraphSuite contains assertions within an RDD 
 transformation.  These never execute because the transformation never 
 executes.  I have a (trivial) patch to fix this by collecting the graph 
 triplets first.
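 For readers less familiar with the pitfall, here is a small self-contained sketch (plain 
 RDDs rather than the actual GraphSuite code; the object name is made up): assertions 
 buried inside a transformation never fire unless the RDD is materialized, while 
 collecting first runs them eagerly on the driver.
 {code}
 import org.apache.spark.{SparkConf, SparkContext}

 object LazyAssertionPitfall {
   def main(args: Array[String]): Unit = {
     val sc = new SparkContext(new SparkConf().setAppName("lazy-assert").setMaster("local[2]"))
     val data = sc.parallelize(Seq(1, 2, 3))
     // Never evaluated: no action follows this lazy transformation.
     data.map { x => assert(x > 0); x }
     // Evaluated here, on the driver, so a failing assertion would fail the test.
     data.collect().foreach(x => assert(x > 0))
     sc.stop()
   }
 }
 {code}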



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1952) slf4j version conflicts with pig

2014-05-28 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011711#comment-14011711
 ] 

Matei Zaharia commented on SPARK-1952:
--

Ryan, do you know what SLF4J version Pig needs? It might be possible to just 
build Spark with an older one for this release.

Also, did you build your Spark version with Hive? That might be bringing in 
these dependencies.

 slf4j version conflicts with pig
 

 Key: SPARK-1952
 URL: https://issues.apache.org/jira/browse/SPARK-1952
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: pig 12.1 on Cloudera Hadoop, CDH3
Reporter: Ryan Compton
  Labels: pig, slf4j
 Fix For: 1.0.0


 Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when 
 they register a jar containing Spark. The error appears to be related to 
 org.slf4j.spi.LocationAwareLogger.log.
 {code}
 Caused by: java.lang.RuntimeException: Could not resolve error that
 occured when launching map reduce job: java.lang.NoSuchMethodError:
 org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598)
 at java.lang.Thread.dispatchUncaughtException(Thread.java:1874)
 {code}
 To reproduce: compile Spark via $ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt 
 assembly and register the resulting jar into a pig script. E.g.
 {code}
 REGISTER 
 /usr/share/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar;
 data0 = LOAD 'data' USING PigStorage();
 ttt = LIMIT data0 10;
 DUMP ttt;
 {code}
 The Spark-1.0 jar includes some slf4j dependencies that were not present in 
 0.9.1
 {code}
 rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf 
 spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar | grep -i slf | grep 
 LocationAware
   3259 Mon Mar 25 21:49:34 PDT 2013 
 org/apache/commons/logging/impl/SLF4JLocationAwareLog.class
455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
479 Fri Dec 13 16:44:40 PST 2013 
 parquet/org/slf4j/spi/LocationAwareLogger.class
 {code}
 vs.
 {code}
 rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf 
 spark-assembly-0.9.1-hadoop0.20.2-cdh3u3.jar | grep -i slf | grep 
 LocationAware
455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1952) slf4j version conflicts with pig

2014-05-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1952:
---

Target Version/s: 1.0.1  (was: 1.0.0)

 slf4j version conflicts with pig
 

 Key: SPARK-1952
 URL: https://issues.apache.org/jira/browse/SPARK-1952
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: pig 12.1 on Cloudera Hadoop, CDH3
Reporter: Ryan Compton
  Labels: pig, slf4j

 Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when 
 they register a jar containing Spark. The error appears to be related to 
 org.slf4j.spi.LocationAwareLogger.log.
 {code}
 Caused by: java.lang.RuntimeException: Could not resolve error that
 occured when launching map reduce job: java.lang.NoSuchMethodError:
 org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598)
 at java.lang.Thread.dispatchUncaughtException(Thread.java:1874)
 {code}
 To reproduce: compile Spark via $ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt 
 assembly and register the resulting jar into a pig script. E.g.
 {code}
 REGISTER 
 /usr/share/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar;
 data0 = LOAD 'data' USING PigStorage();
 ttt = LIMIT data0 10;
 DUMP ttt;
 {code}
 The Spark-1.0 jar includes some slf4j dependencies that were not present in 
 0.9.1
 {code}
 rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf 
 spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar | grep -i slf | grep 
 LocationAware
   3259 Mon Mar 25 21:49:34 PDT 2013 
 org/apache/commons/logging/impl/SLF4JLocationAwareLog.class
455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
479 Fri Dec 13 16:44:40 PST 2013 
 parquet/org/slf4j/spi/LocationAwareLogger.class
 {code}
 vs.
 {code}
 rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf 
 spark-assembly-0.9.1-hadoop0.20.2-cdh3u3.jar | grep -i slf | grep 
 LocationAware
455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1956) Enable shuffle consolidation by default

2014-05-28 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011741#comment-14011741
 ] 

Mridul Muralidharan commented on SPARK-1956:


shuffle consolidation MUST NOT be enabled - whether by default or 
intentionally.
In 1.0 it is very badly broken - we needed a whole litany of fixes for it 
before it became reasonably stable.

Current plan is to contribute most of these back in 1.1 timeframe.

 Enable shuffle consolidation by default
 ---

 Key: SPARK-1956
 URL: https://issues.apache.org/jira/browse/SPARK-1956
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Affects Versions: 1.0.0
Reporter: Sandy Ryza

 The only drawbacks are on ext3, and most everyone has ext4 at this point.  I 
 think it's better to aim the default at the common case.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1952) slf4j version conflicts with pig

2014-05-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1952:
---

Fix Version/s: (was: 1.0.0)

 slf4j version conflicts with pig
 

 Key: SPARK-1952
 URL: https://issues.apache.org/jira/browse/SPARK-1952
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: pig 12.1 on Cloudera Hadoop, CDH3
Reporter: Ryan Compton
  Labels: pig, slf4j

 Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when 
 they register a jar containing Spark. The error appears to be related to 
 org.slf4j.spi.LocationAwareLogger.log.
 {code}
 Caused by: java.lang.RuntimeException: Could not resolve error that
 occured when launching map reduce job: java.lang.NoSuchMethodError:
 org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598)
 at java.lang.Thread.dispatchUncaughtException(Thread.java:1874)
 {code}
 To reproduce: compile Spark via $ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt 
 assembly and register the resulting jar into a pig script. E.g.
 {code}
 REGISTER 
 /usr/share/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar;
 data0 = LOAD 'data' USING PigStorage();
 ttt = LIMIT data0 10;
 DUMP ttt;
 {code}
 The Spark-1.0 jar includes some slf4j dependencies that were not present in 
 0.9.1
 {code}
 rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf 
 spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar | grep -i slf | grep 
 LocationAware
   3259 Mon Mar 25 21:49:34 PDT 2013 
 org/apache/commons/logging/impl/SLF4JLocationAwareLog.class
455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
479 Fri Dec 13 16:44:40 PST 2013 
 parquet/org/slf4j/spi/LocationAwareLogger.class
 {code}
 vs.
 {code}
 rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf 
 spark-assembly-0.9.1-hadoop0.20.2-cdh3u3.jar | grep -i slf | grep 
 LocationAware
455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1916) SparkFlumeEvent with body bigger than 1020 bytes are not read properly

2014-05-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1916.


   Resolution: Fixed
Fix Version/s: 0.9.2
   1.0.1

Issue resolved by pull request 865
[https://github.com/apache/spark/pull/865]

 SparkFlumeEvent with body bigger than 1020 bytes are not read properly
 --

 Key: SPARK-1916
 URL: https://issues.apache.org/jira/browse/SPARK-1916
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.9.0
Reporter: David Lemieux
Assignee: David Lemieux
 Fix For: 1.0.1, 0.9.2

 Attachments: SPARK-1916.diff


 The readExternal implementation on SparkFlumeEvent will read only the first 
 1020 bytes of the actual body when streaming data from flume.
 This means that any event sent to Spark via Flume will be processed properly 
 if the body is small, but will fail if the body is bigger than 1020.
 Considering that the default max size for a Flume Avro Event is 32K, the 
 implementation should be updated to read more.
 The following is related : 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-using-Flume-body-size-limitation-tt6127.html
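 A rough sketch of the kind of length-prefixed read the description calls for (purely 
 illustrative, not the patch attached to this issue; the class name is made up): write the 
 body length explicitly and use readFully so the whole body comes back regardless of size.
 {code}
 import java.io.{Externalizable, ObjectInput, ObjectOutput}

 class LengthPrefixedPayload(var body: Array[Byte]) extends Externalizable {
   def this() = this(Array.empty[Byte])  // no-arg constructor required by Externalizable

   def writeExternal(out: ObjectOutput): Unit = {
     out.writeInt(body.length)
     out.write(body)
   }

   def readExternal(in: ObjectInput): Unit = {
     val len = in.readInt()
     val buf = new Array[Byte](len)
     in.readFully(buf)  // keeps reading until all `len` bytes have arrived
     body = buf
   }
 }
 {code}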



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1576) Passing of JAVA_OPTS to YARN on command line

2014-05-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1576:
---

Fix Version/s: (was: 0.9.1)
   0.9.2

 Passing of JAVA_OPTS to YARN on command line
 

 Key: SPARK-1576
 URL: https://issues.apache.org/jira/browse/SPARK-1576
 Project: Spark
  Issue Type: Improvement
Affects Versions: 0.9.0, 1.0.0, 0.9.1
Reporter: Nishkam Ravi
 Fix For: 0.9.0, 1.0.0, 0.9.2

 Attachments: SPARK-1576.patch


 JAVA_OPTS can be passed by using either env variables (i.e., SPARK_JAVA_OPTS) 
 or as config vars (after Patrick's recent change). It would be good to allow 
 the user to pass them on the command line as well, to restrict their scope to 
 a single application invocation.
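 For context, a sketch of the config-var route mentioned above (assuming the Spark 
 1.0-style keys; the option strings are placeholders, not recommendations):
 {code}
 import org.apache.spark.SparkConf

 // Per-application JVM options via configuration rather than SPARK_JAVA_OPTS.
 val conf = new SparkConf()
   .setAppName("java-opts-example")
   .set("spark.driver.extraJavaOptions", "-XX:+PrintGCDetails")
   .set("spark.executor.extraJavaOptions", "-XX:+PrintGCDetails -Dsome.key=value")
 {code}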



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1849) Broken UTF-8 encoded data gets character replacements and thus can't be fixed

2014-05-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1849:
---

Fix Version/s: (was: 0.9.1)
   0.9.2

 Broken UTF-8 encoded data gets character replacements and thus can't be 
 fixed
 ---

 Key: SPARK-1849
 URL: https://issues.apache.org/jira/browse/SPARK-1849
 Project: Spark
  Issue Type: Bug
Reporter: Harry Brundage
 Fix For: 1.0.0, 0.9.2

 Attachments: encoding_test


 I'm trying to process a file which isn't valid UTF-8 data inside hadoop using 
 Spark via {{sc.textFile()}}. Is this possible, and if not, is this a bug that 
 we should fix? It looks like {{HadoopRDD}} uses 
 {{org.apache.hadoop.io.Text.toString}} on all the data it ever reads, which I 
 believe replaces invalid UTF-8 byte sequences with the UTF-8 replacement 
 character, \uFFFD. Some example code mimicking what {{sc.textFile}} does 
 underneath:
 {code}
 scala> sc.textFile(path).collect()(0)
 res8: String = ?pple
 scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], 
 classOf[Text]).map(pair => pair._2.toString).collect()(0).getBytes()
 res9: Array[Byte] = Array(-17, -65, -67, 112, 112, 108, 101)
 scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], 
 classOf[Text]).map(pair => pair._2.getBytes).collect()(0)
 res10: Array[Byte] = Array(-60, 112, 112, 108, 101)
 {code}
 In the above example, the first two snippets show the string representation 
 and byte representation of the example line of text. The string shows a 
 question mark for the replacement character and the bytes reveal the 
 replacement character has been swapped in by {{Text.toString}}. The third 
 snippet shows what happens if you call {{getBytes}} on the {{Text}} object 
 which comes back from hadoop land: we get the real bytes in the file out.
 Now, I think this is a bug, though you may disagree. The text inside my file 
 is perfectly valid iso-8859-1 encoded bytes, which I would like to be able to 
 rescue and re-encode into UTF-8, because I want my application to be smart 
 like that. I think Spark should give me the raw broken string so I can 
 re-encode, but I can't get at the original bytes in order to guess at what 
 the source encoding might be, as they have already been replaced. I'm dealing 
 with data from some CDN access logs which are, to put it nicely, diversely 
 encoded, but which I think are a use case Spark should fully support. So my 
 suggested fix, on which I'd like some guidance, is to change {{textFile}} to 
 spit out broken strings by not using {{Text}}'s UTF-8 encoding.
 Further compounding this issue is that my application is actually in PySpark, 
 but we can talk about how bytes fly through to Scala land after this if we 
 agree that this is an issue at all. 
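 A workaround sketch along the lines hinted at above (assumes a Spark shell where {{sc}} 
 and {{path}} are already defined; ISO-8859-1 is a guess for this particular data): pull 
 the raw bytes out of each {{Text}} record instead of going through {{toString}}, then 
 decode with whatever charset the data actually uses.
 {code}
 import org.apache.hadoop.io.{LongWritable, Text}
 import org.apache.hadoop.mapred.TextInputFormat
 import java.util.Arrays

 // Copy only the valid portion of Text's backing array (getBytes may be padded),
 // then decode with the real source charset instead of UTF-8.
 val rawLines = sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
   .map { case (_, text) => Arrays.copyOfRange(text.getBytes, 0, text.getLength) }
 val decoded = rawLines.map(bytes => new String(bytes, "ISO-8859-1"))
 {code}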



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1759) sbt/sbt package fail cause by directory

2014-05-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1759:
---

Fix Version/s: (was: 0.9.1)
   0.9.2

 sbt/sbt package fail cause by directory
 ---

 Key: SPARK-1759
 URL: https://issues.apache.org/jira/browse/SPARK-1759
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0, 0.9.1
 Environment: ubuntu14.04
Reporter: Jian Pan
 Fix For: 0.9.2

   Original Estimate: 1h
  Remaining Estimate: 1h

 1.create a project named simpleApp
 $cd simpleApp
 $ find .
 .
 ./simple.sbt
 ./src
 ./src/main
 ./src/main/scala
 ./src/main/scala/simpleApp.scala
 $ ~/Software/spark-0.9.1/sbt/sbt 
 awk: fatal: cannot open file `./project/build.properties' for reading (No 
 such file or directory)
 Attempting to fetch sbt
 /home/jpan/Software/spark-0.9.1/sbt/sbt: line 35: /sbt/sbt-launch-.jar: No 
 such file or directory
 /home/jpan/Software/spark-0.9.1/sbt/sbt: line 35: /sbt/sbt-launch-.jar: No 
 such file or directory
 Our attempt to download sbt locally to /sbt/sbt-launch-.jar failed. Please 
 install sbt manually from http://www.scala-sbt.org/
 It failed because sbt uses a relative path.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1712) ParallelCollectionRDD operations hanging forever without any error messages

2014-05-28 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-1712.
--

Resolution: Fixed

 ParallelCollectionRDD operations hanging forever without any error messages 
 

 Key: SPARK-1712
 URL: https://issues.apache.org/jira/browse/SPARK-1712
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
 Environment: Linux Ubuntu 14.04, a single spark node; standalone mode.
Reporter: Piotr Kołaczkowski
Assignee: Guoqiang Li
Priority: Blocker
 Attachments: executor.jstack.txt, master.jstack.txt, repl.jstack.txt, 
 spark-hang.png, worker.jstack.txt


  conf/spark-defaults.conf
 {code}
 spark.akka.frameSize 5
 spark.default.parallelism 1
 {code}
 {noformat}
 scala> val collection = (1 to 100).map(i => ("foo" + i, i)).toVector
 collection: Vector[(String, Int)] = Vector((foo1,1), (foo2,2), (foo3,3), 
 (foo4,4), (foo5,5), (foo6,6), (foo7,7), (foo8,8), (foo9,9), (foo10,10), 
 (foo11,11), (foo12,12), (foo13,13), (foo14,14), (foo15,15), (foo16,16), 
 (foo17,17), (foo18,18), (foo19,19), (foo20,20), (foo21,21), (foo22,22), 
 (foo23,23), (foo24,24), (foo25,25), (foo26,26), (foo27,27), (foo28,28), 
 (foo29,29), (foo30,30), (foo31,31), (foo32,32), (foo33,33), (foo34,34), 
 (foo35,35), (foo36,36), (foo37,37), (foo38,38), (foo39,39), (foo40,40), 
 (foo41,41), (foo42,42), (foo43,43), (foo44,44), (foo45,45), (foo46,46), 
 (foo47,47), (foo48,48), (foo49,49), (foo50,50), (foo51,51), (foo52,52), 
 (foo53,53), (foo54,54), (foo55,55), (foo56,56), (foo57,57), (foo58,58), 
 (foo59,59), (foo60,60), (foo61,61), (foo62,62), (foo63,63), (foo64,64), 
 (foo...
 scala> val rdd = sc.parallelize(collection)
 rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at 
 parallelize at <console>:24
 scala> rdd.first
 res4: (String, Int) = (foo1,1)
 scala> rdd.map(_._2).sum
 // nothing happens
 {noformat}
 CPU and I/O idle. 
 Memory usage reported by JVM, after manually triggered GC:
 repl: 216 MB / 2 GB
 executor: 67 MB / 2 GB
 worker: 6 MB / 128 MB
 master: 6 MB / 128 MB
 No errors found in worker's stderr/stdout. 
 It works fine with 700,000 elements and then it takes about 1 second to 
 process the request and calculate the sum. With 700,000 items the spark 
 executor memory doesn't even exceed 300 MB out of 2GB available. It fails 
 with 800,000 items.
 Multiple parallelized collections of size 700,000 items at the same time in 
 the same session work fine.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1712) ParallelCollectionRDD operations hanging forever without any error messages

2014-05-28 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-1712:
-

Priority: Major  (was: Blocker)

 ParallelCollectionRDD operations hanging forever without any error messages 
 

 Key: SPARK-1712
 URL: https://issues.apache.org/jira/browse/SPARK-1712
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
 Environment: Linux Ubuntu 14.04, a single spark node; standalone mode.
Reporter: Piotr Kołaczkowski
Assignee: Guoqiang Li
 Attachments: executor.jstack.txt, master.jstack.txt, repl.jstack.txt, 
 spark-hang.png, worker.jstack.txt


  conf/spark-defaults.conf
 {code}
 spark.akka.frameSize 5
 spark.default.parallelism 1
 {code}
 {noformat}
 scala> val collection = (1 to 100).map(i => ("foo" + i, i)).toVector
 collection: Vector[(String, Int)] = Vector((foo1,1), (foo2,2), (foo3,3), 
 (foo4,4), (foo5,5), (foo6,6), (foo7,7), (foo8,8), (foo9,9), (foo10,10), 
 (foo11,11), (foo12,12), (foo13,13), (foo14,14), (foo15,15), (foo16,16), 
 (foo17,17), (foo18,18), (foo19,19), (foo20,20), (foo21,21), (foo22,22), 
 (foo23,23), (foo24,24), (foo25,25), (foo26,26), (foo27,27), (foo28,28), 
 (foo29,29), (foo30,30), (foo31,31), (foo32,32), (foo33,33), (foo34,34), 
 (foo35,35), (foo36,36), (foo37,37), (foo38,38), (foo39,39), (foo40,40), 
 (foo41,41), (foo42,42), (foo43,43), (foo44,44), (foo45,45), (foo46,46), 
 (foo47,47), (foo48,48), (foo49,49), (foo50,50), (foo51,51), (foo52,52), 
 (foo53,53), (foo54,54), (foo55,55), (foo56,56), (foo57,57), (foo58,58), 
 (foo59,59), (foo60,60), (foo61,61), (foo62,62), (foo63,63), (foo64,64), 
 (foo...
 scala> val rdd = sc.parallelize(collection)
 rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at 
 parallelize at <console>:24
 scala> rdd.first
 res4: (String, Int) = (foo1,1)
 scala> rdd.map(_._2).sum
 // nothing happens
 {noformat}
 CPU and I/O idle. 
 Memory usage reported by JVM, after manually triggered GC:
 repl: 216 MB / 2 GB
 executor: 67 MB / 2 GB
 worker: 6 MB / 128 MB
 master: 6 MB / 128 MB
 No errors found in worker's stderr/stdout. 
 It works fine with 700,000 elements and then it takes about 1 second to 
 process the request and calculate the sum. With 700,000 items the spark 
 executor memory doesn't even exceed 300 MB out of 2GB available. It fails 
 with 800,000 items.
 Multiple parallelized collections of size 700,000 items at the same time in 
 the same session work fine.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1817) RDD zip erroneous when partitions do not divide RDD count

2014-05-28 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-1817:
-

Priority: Minor  (was: Blocker)

 RDD zip erroneous when partitions do not divide RDD count
 -

 Key: SPARK-1817
 URL: https://issues.apache.org/jira/browse/SPARK-1817
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0, 1.0.0
Reporter: Michael Malak
Assignee: Kan Zhang
Priority: Minor

 Example:
 scala> sc.parallelize(1L to 2L,4).zip(sc.parallelize(11 to 12,4)).collect
 res1: Array[(Long, Int)] = Array((2,11))
 But more generally, it's whenever the number of partitions does not evenly 
 divide the total number of elements in the RDD.
 See https://groups.google.com/forum/#!msg/spark-users/demrmjHFnoc/Ek3ijiXHr2MJ
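 Until this is fixed, a hedged workaround sketch (assumes Spark 1.0, where 
 RDD.zipWithIndex exists, and a spark-shell session): index both sides explicitly and 
 join on the index, rather than relying on zip's assumption that partitions line up 
 element for element.
 {code}
 val left  = sc.parallelize(1L to 2L, 4)
 val right = sc.parallelize(11 to 12, 4)
 // Key each element by its global index, join on that key, then restore order.
 val zipped = left.zipWithIndex.map(_.swap)
   .join(right.zipWithIndex.map(_.swap))
   .sortByKey()
   .values
 // zipped.collect() == Array((1L, 11), (2L, 12))
 {code}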



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1817) RDD zip erroneous when partitions do not divide RDD count

2014-05-28 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-1817:
-

Priority: Major  (was: Minor)

 RDD zip erroneous when partitions do not divide RDD count
 -

 Key: SPARK-1817
 URL: https://issues.apache.org/jira/browse/SPARK-1817
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0, 1.0.0
Reporter: Michael Malak
Assignee: Kan Zhang

 Example:
 scala> sc.parallelize(1L to 2L,4).zip(sc.parallelize(11 to 12,4)).collect
 res1: Array[(Long, Int)] = Array((2,11))
 But more generally, it's whenever the number of partitions does not evenly 
 divide the total number of elements in the RDD.
 See https://groups.google.com/forum/#!msg/spark-users/demrmjHFnoc/Ek3ijiXHr2MJ



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1712) ParallelCollectionRDD operations hanging forever without any error messages

2014-05-28 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-1712:
-

Fix Version/s: 1.0.1

 ParallelCollectionRDD operations hanging forever without any error messages 
 

 Key: SPARK-1712
 URL: https://issues.apache.org/jira/browse/SPARK-1712
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
 Environment: Linux Ubuntu 14.04, a single spark node; standalone mode.
Reporter: Piotr Kołaczkowski
Assignee: Guoqiang Li
 Fix For: 1.0.1

 Attachments: executor.jstack.txt, master.jstack.txt, repl.jstack.txt, 
 spark-hang.png, worker.jstack.txt


  conf/spark-defaults.conf
 {code}
 spark.akka.frameSize 5
 spark.default.parallelism 1
 {code}
 {noformat}
 scala> val collection = (1 to 100).map(i => ("foo" + i, i)).toVector
 collection: Vector[(String, Int)] = Vector((foo1,1), (foo2,2), (foo3,3), 
 (foo4,4), (foo5,5), (foo6,6), (foo7,7), (foo8,8), (foo9,9), (foo10,10), 
 (foo11,11), (foo12,12), (foo13,13), (foo14,14), (foo15,15), (foo16,16), 
 (foo17,17), (foo18,18), (foo19,19), (foo20,20), (foo21,21), (foo22,22), 
 (foo23,23), (foo24,24), (foo25,25), (foo26,26), (foo27,27), (foo28,28), 
 (foo29,29), (foo30,30), (foo31,31), (foo32,32), (foo33,33), (foo34,34), 
 (foo35,35), (foo36,36), (foo37,37), (foo38,38), (foo39,39), (foo40,40), 
 (foo41,41), (foo42,42), (foo43,43), (foo44,44), (foo45,45), (foo46,46), 
 (foo47,47), (foo48,48), (foo49,49), (foo50,50), (foo51,51), (foo52,52), 
 (foo53,53), (foo54,54), (foo55,55), (foo56,56), (foo57,57), (foo58,58), 
 (foo59,59), (foo60,60), (foo61,61), (foo62,62), (foo63,63), (foo64,64), 
 (foo...
 scala> val rdd = sc.parallelize(collection)
 rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at 
 parallelize at <console>:24
 scala> rdd.first
 res4: (String, Int) = (foo1,1)
 scala> rdd.map(_._2).sum
 // nothing happens
 {noformat}
 CPU and I/O idle. 
 Memory usage reported by JVM, after manually triggered GC:
 repl: 216 MB / 2 GB
 executor: 67 MB / 2 GB
 worker: 6 MB / 128 MB
 master: 6 MB / 128 MB
 No errors found in worker's stderr/stdout. 
 It works fine with 700,000 elements and then it takes about 1 second to 
 process the request and calculate the sum. With 700,000 items the spark 
 executor memory doesn't even exceed 300 MB out of 2GB available. It fails 
 with 800,000 items.
 Multiple parallelized collections of size 700,000 items at the same time in 
 the same session work fine.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1952) slf4j version conflicts with pig

2014-05-28 Thread Ryan Compton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011789#comment-14011789
 ] 

Ryan Compton commented on SPARK-1952:
-

Pig depends on slf4j 1.6.1

{code}
rfcompton@node19 /d/t/c/pig-0.12.1 cat ivy/libraries.properties | grep 4j
log4j.version=1.2.16
slf4j-api.version=1.6.1
slf4j-log4j12.version=1.6.1
{code}

I don't use Hive, and according to 
http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html#hive-support
 it's not packaged with Spark by default so I don't think it's Hive.





 slf4j version conflicts with pig
 

 Key: SPARK-1952
 URL: https://issues.apache.org/jira/browse/SPARK-1952
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: pig 12.1 on Cloudera Hadoop, CDH3
Reporter: Ryan Compton
  Labels: pig, slf4j

 Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when 
 they register a jar containing Spark. The error appears to be related to 
 org.slf4j.spi.LocationAwareLogger.log.
 {code}
 Caused by: java.lang.RuntimeException: Could not resolve error that
 occured when launching map reduce job: java.lang.NoSuchMethodError:
 org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598)
 at java.lang.Thread.dispatchUncaughtException(Thread.java:1874)
 {code}
 To reproduce: compile Spark via $ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt 
 assembly and register the resulting jar into a pig script. E.g.
 {code}
 REGISTER 
 /usr/share/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar;
 data0 = LOAD 'data' USING PigStorage();
 ttt = LIMIT data0 10;
 DUMP ttt;
 {code}
 The Spark-1.0 jar includes some slf4j dependencies that were not present in 
 0.9.1
 {code}
 rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf 
 spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar | grep -i slf | grep 
 LocationAware
   3259 Mon Mar 25 21:49:34 PDT 2013 
 org/apache/commons/logging/impl/SLF4JLocationAwareLog.class
455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
479 Fri Dec 13 16:44:40 PST 2013 
 parquet/org/slf4j/spi/LocationAwareLogger.class
 {code}
 vs.
 {code}
 rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf 
 spark-assembly-0.9.1-hadoop0.20.2-cdh3u3.jar | grep -i slf | grep 
 LocationAware
455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1952) slf4j version conflicts with pig

2014-05-28 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011806#comment-14011806
 ] 

Patrick Wendell commented on SPARK-1952:


Hm, unfortunately I don't see any obvious culprits here. The slf4j version had 
only a small bump in Spark 1.0, to 1.7.5 from 1.7.2, so I don't think it would 
have radically changed the classes that are included in the jar. The parquet 
slf4j stuff is expected: since parquet shades slf4j, it will have its own copy 
of slf4j sitting around, but this shouldn't conflict at all.

 slf4j version conflicts with pig
 

 Key: SPARK-1952
 URL: https://issues.apache.org/jira/browse/SPARK-1952
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: pig 12.1 on Cloudera Hadoop, CDH3
Reporter: Ryan Compton
  Labels: pig, slf4j

 Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when 
 they register a jar containing Spark. The error appears to be related to 
 org.slf4j.spi.LocationAwareLogger.log.
 {code}
 Caused by: java.lang.RuntimeException: Could not resolve error that
 occured when launching map reduce job: java.lang.NoSuchMethodError:
 org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598)
 at java.lang.Thread.dispatchUncaughtException(Thread.java:1874)
 {code}
 To reproduce: compile Spark via $ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt 
 assembly and register the resulting jar into a pig script. E.g.
 {code}
 REGISTER 
 /usr/share/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar;
 data0 = LOAD 'data' USING PigStorage();
 ttt = LIMIT data0 10;
 DUMP ttt;
 {code}
 The Spark-1.0 jar includes some slf4j dependencies that were not present in 
 0.9.1
 {code}
 rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf 
 spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar | grep -i slf | grep 
 LocationAware
   3259 Mon Mar 25 21:49:34 PDT 2013 
 org/apache/commons/logging/impl/SLF4JLocationAwareLog.class
455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
479 Fri Dec 13 16:44:40 PST 2013 
 parquet/org/slf4j/spi/LocationAwareLogger.class
 {code}
 vs.
 {code}
 rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf 
 spark-assembly-0.9.1-hadoop0.20.2-cdh3u3.jar | grep -i slf | grep 
 LocationAware
455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1952) slf4j version conflicts with pig

2014-05-28 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011808#comment-14011808
 ] 

Patrick Wendell commented on SPARK-1952:


[~rcompton] - what if you modify the spark build and downgrade slf4j to 1.7.2 
as a debugging step... does that fix it?

 slf4j version conflicts with pig
 

 Key: SPARK-1952
 URL: https://issues.apache.org/jira/browse/SPARK-1952
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: pig 12.1 on Cloudera Hadoop, CDH3
Reporter: Ryan Compton
  Labels: pig, slf4j

 Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when 
 they register a jar containing Spark. The error appears to be related to 
 org.slf4j.spi.LocationAwareLogger.log.
 {code}
 Caused by: java.lang.RuntimeException: Could not resolve error that
 occured when launching map reduce job: java.lang.NoSuchMethodError:
 org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598)
 at java.lang.Thread.dispatchUncaughtException(Thread.java:1874)
 {code}
 To reproduce: compile Spark via $ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt 
 assembly and register the resulting jar into a pig script. E.g.
 {code}
 REGISTER 
 /usr/share/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar;
 data0 = LOAD 'data' USING PigStorage();
 ttt = LIMIT data0 10;
 DUMP ttt;
 {code}
 The Spark-1.0 jar includes some slf4j dependencies that were not present in 
 0.9.1
 {code}
 rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf 
 spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar | grep -i slf | grep 
 LocationAware
   3259 Mon Mar 25 21:49:34 PDT 2013 
 org/apache/commons/logging/impl/SLF4JLocationAwareLog.class
455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
479 Fri Dec 13 16:44:40 PST 2013 
 parquet/org/slf4j/spi/LocationAwareLogger.class
 {code}
 vs.
 {code}
 rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf 
 spark-assembly-0.9.1-hadoop0.20.2-cdh3u3.jar | grep -i slf | grep 
 LocationAware
455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1954) Make it easier to get Spark on YARN code to compile in IntelliJ

2014-05-28 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011810#comment-14011810
 ] 

Patrick Wendell commented on SPARK-1954:


Have you tried running sbt/sbt gen-idea with SPARK_YARN=true and 
SPARK_HADOOP_VERSION=2.2.0?

 Make it easier to get Spark on YARN code to compile in IntelliJ
 ---

 Key: SPARK-1954
 URL: https://issues.apache.org/jira/browse/SPARK-1954
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Sandy Ryza

 When loading a project through a Maven pom, IntelliJ allows switching on 
 profiles, but, to my knowledge, doesn't provide a way to set arbitrary 
 properties. 
 To get Spark-on-YARN code to compile in IntelliJ, I need to manually change 
 the hadoop.version in the root pom.xml to 2.2.0 or higher.  This is very 
 cumbersome when switching branches.
 It would be really helpful to add a profile that sets the Hadoop version that 
 IntelliJ can switch on.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1950) spark on yarn can't start

2014-05-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1950.


Resolution: Duplicate

 spark on yarn can't start 
 --

 Key: SPARK-1950
 URL: https://issues.apache.org/jira/browse/SPARK-1950
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Guoqiang Li
Priority: Blocker

 {{HADOOP_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit  --archives 
 /input/lbs/recommend/toona/spark/conf  toona-assembly.jar 20140521}} throws an 
 exception:
 {code}
 Exception in thread main java.io.FileNotFoundException: File 
 file:/input/lbs/recommend/toona/spark/conf does not exist
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
   at 
 org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:402)
   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
   at 
 org.apache.spark.deploy.yarn.ClientBase$class.org$apache$spark$deploy$yarn$ClientBase$$copyRemoteFile(ClientBase.scala:162)
   at 
 org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4$$anonfun$apply$2.apply(ClientBase.scala:237)
   at 
 org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4$$anonfun$apply$2.apply(ClientBase.scala:232)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
   at 
 org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4.apply(ClientBase.scala:232)
   at 
 org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$4.apply(ClientBase.scala:230)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at 
 org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:230)
   at 
 org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:39)
   at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:74)
   at org.apache.spark.deploy.yarn.Client.run(Client.scala:96)
   at org.apache.spark.deploy.yarn.Client$.main(Client.scala:186)
   at org.apache.spark.deploy.yarn.Client.main(Client.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 {code}
 {{HADOOP_CONF_DIR=/etc/hadoop/conf ./bin/spark-submit  --archives 
 hdfs://10dian72:8020/input/lbs/recommend/toona/spark/conf  toona-assembly.jar 
 20140521}} works.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1712) ParallelCollectionRDD operations hanging forever without any error messages

2014-05-28 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011830#comment-14011830
 ] 

Matei Zaharia commented on SPARK-1712:
--

Merged the frame size check into 0.9.2 as well as 1.0.1

 ParallelCollectionRDD operations hanging forever without any error messages 
 

 Key: SPARK-1712
 URL: https://issues.apache.org/jira/browse/SPARK-1712
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
 Environment: Linux Ubuntu 14.04, a single spark node; standalone mode.
Reporter: Piotr Kołaczkowski
Assignee: Guoqiang Li
 Fix For: 0.9.2, 1.0.1

 Attachments: executor.jstack.txt, master.jstack.txt, repl.jstack.txt, 
 spark-hang.png, worker.jstack.txt


  conf/spark-defaults.conf
 {code}
 spark.akka.frameSize 5
 spark.default.parallelism 1
 {code}
 {noformat}
 scala> val collection = (1 to 100).map(i => ("foo" + i, i)).toVector
 collection: Vector[(String, Int)] = Vector((foo1,1), (foo2,2), (foo3,3), 
 (foo4,4), (foo5,5), (foo6,6), (foo7,7), (foo8,8), (foo9,9), (foo10,10), 
 (foo11,11), (foo12,12), (foo13,13), (foo14,14), (foo15,15), (foo16,16), 
 (foo17,17), (foo18,18), (foo19,19), (foo20,20), (foo21,21), (foo22,22), 
 (foo23,23), (foo24,24), (foo25,25), (foo26,26), (foo27,27), (foo28,28), 
 (foo29,29), (foo30,30), (foo31,31), (foo32,32), (foo33,33), (foo34,34), 
 (foo35,35), (foo36,36), (foo37,37), (foo38,38), (foo39,39), (foo40,40), 
 (foo41,41), (foo42,42), (foo43,43), (foo44,44), (foo45,45), (foo46,46), 
 (foo47,47), (foo48,48), (foo49,49), (foo50,50), (foo51,51), (foo52,52), 
 (foo53,53), (foo54,54), (foo55,55), (foo56,56), (foo57,57), (foo58,58), 
 (foo59,59), (foo60,60), (foo61,61), (foo62,62), (foo63,63), (foo64,64), 
 (foo...
 scala> val rdd = sc.parallelize(collection)
 rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at 
 parallelize at <console>:24
 scala> rdd.first
 res4: (String, Int) = (foo1,1)
 scala> rdd.map(_._2).sum
 // nothing happens
 {noformat}
 CPU and I/O idle. 
 Memory usage reported by JVM, after manually triggered GC:
 repl: 216 MB / 2 GB
 executor: 67 MB / 2 GB
 worker: 6 MB / 128 MB
 master: 6 MB / 128 MB
 No errors found in worker's stderr/stdout. 
 It works fine with 700,000 elements and then it takes about 1 second to 
 process the request and calculate the sum. With 700,000 items the spark 
 executor memory doesn't even exceed 300 MB out of 2GB available. It fails 
 with 800,000 items.
 Multiple parallelized collections of size 700,000 items at the same time in 
 the same session work fine.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1957) Pluggable disk store for BlockManager

2014-05-28 Thread Raymond Liu (JIRA)
Raymond Liu created SPARK-1957:
--

 Summary: Pluggable disk store for BlockManager
 Key: SPARK-1957
 URL: https://issues.apache.org/jira/browse/SPARK-1957
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Raymond Liu


As the first step toward the goal of SPARK-1733, support a pluggable disk store 
to allow different disk storage to be plugged into the BlockManager's DiskStore 
layer.
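
Purely as an illustration of what "pluggable" might mean here, a made-up interface sketch 
(names are hypothetical, not the actual SPARK-1733/SPARK-1957 design): the BlockManager's 
disk layer would program against a small trait, and alternative storage backends would 
implement it.

{code}
// Hypothetical backend interface; string keys and byte-array values are simplifications.
trait DiskStoreBackend {
  def put(blockId: String, bytes: Array[Byte]): Unit
  def get(blockId: String): Option[Array[Byte]]
  def remove(blockId: String): Boolean
  def contains(blockId: String): Boolean
}
{code}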



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk

2014-05-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011897#comment-14011897
 ] 

Sean Owen commented on SPARK-1518:
--

"they write their app against the Spark APIs in Maven central (they can do 
this no matter which cluster they want to run on)" 

Yeah this is the issue. OK, if I compile against Spark artifacts as a runtime 
dependency and submit an app to the cluster, it should be OK no matter what 
build of Spark is running. The binding from Spark to Hadoop is hidden from the 
app.

I am thinking of the case where I want to build an app that is a client of 
Spark -- embedding it. Then I am including the client of Hadoop, for example. I 
have to match my cluster then, and there is no Hadoop 2 Spark artifact.

Am I missing something big here? That's my premise about why there would ever 
be a need for different artifacts. It's the same use case as in Sandy's blog: 
http://blog.cloudera.com/blog/2014/04/how-to-run-a-simple-apache-spark-app-in-cdh-5/

 Spark master doesn't compile against hadoop-common trunk
 

 Key: SPARK-1518
 URL: https://issues.apache.org/jira/browse/SPARK-1518
 Project: Spark
  Issue Type: Bug
Reporter: Marcelo Vanzin
Assignee: Colin Patrick McCabe
Priority: Critical

 FSDataOutputStream::sync() has disappeared from trunk in Hadoop; 
 FileLogger.scala is calling it.
 I've changed it locally to hsync() so I can compile the code, but haven't 
 checked yet whether those are equivalent. hsync() seems to have been there 
 forever, so it hopefully works with all versions Spark cares about.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1957) Pluggable disk store for BlockManager

2014-05-28 Thread Raymond Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Liu updated SPARK-1957:
---

Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-1733

 Pluggable disk store for BlockManager
 -

 Key: SPARK-1957
 URL: https://issues.apache.org/jira/browse/SPARK-1957
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Raymond Liu

 As the first step toward the goal of SPARK-1733, support a pluggable disk 
 store to allow different disk storage to be plugged into the BlockManager's 
 DiskStore layer.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1518) Spark master doesn't compile against hadoop-common trunk

2014-05-28 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011942#comment-14011942
 ] 

Matei Zaharia commented on SPARK-1518:
--

Sean, the model for linking to Hadoop has been that users also add a dependency 
on hadoop-client if they want to access HDFS for the past few releases. See 
http://spark.apache.org/docs/latest/scala-programming-guide.html#linking-with-spark
 for example. This model is there because Hadoop itself has decided to create 
the hadoop-client Maven artifact as a way to get apps to link to it. It works 
for all the recent versions of Hadoop as far as I know -- users don't have to 
link against a custom-built Spark for their distro.
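
In sbt terms, the linking model described above looks roughly like the following (the 
versions are placeholders, not recommendations):

{code}
// build.sbt sketch: depend on the generic Spark artifact from Maven Central
// plus whichever hadoop-client matches the target cluster.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.0.0",
  "org.apache.hadoop" % "hadoop-client" % "2.2.0"
)
{code}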

Regarding binary builds on apache.org, we want users to be able to start using 
Spark as conveniently as possible on any distribution. It is the goal of the 
Apache project to have people use Apache Spark as easily as possible.

 Spark master doesn't compile against hadoop-common trunk
 

 Key: SPARK-1518
 URL: https://issues.apache.org/jira/browse/SPARK-1518
 Project: Spark
  Issue Type: Bug
Reporter: Marcelo Vanzin
Assignee: Colin Patrick McCabe
Priority: Critical

 FSDataOutputStream::sync() has disappeared from trunk in Hadoop; 
 FileLogger.scala is calling it.
 I've changed it locally to hsync() so I can compile the code, but haven't 
 checked yet whether those are equivalent. hsync() seems to have been there 
 forever, so it hopefully works with all versions Spark cares about.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1112) When spark.akka.frameSize > 10, task results bigger than 10MiB block execution

2014-05-28 Thread Kevin (Sangwoo) Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011968#comment-14011968
 ] 

Kevin (Sangwoo) Kim commented on SPARK-1112:


Hi all, 

I'm very new to Spark and doing some tests, I've experienced similar issue.
(tested with Spark Shell, 0.9.1, r3.8xlarge instance on EC2 - 32 core / 244GiB 
MEM)

I was trying to broadcast 700MB of data and Spark hangs when I run collect() 
method for the data. 

Here are the strange things:
1) when I tried 
val userInfo = sc.textFile("file:///spark/logs/user_sign_up2.csv").map{line 
=> val split = line.split(","); (split(1), split)}
it runs well.
2) when I tried 
val userInfo = sc.textFile("file:///spark/logs/user_sign_up2.csv").map{line 
=> val split = line.split(","); (split(1), split(5))} 
Spark hangs.
3) when I slightly control the data size using sample() method or cutting the 
data file, it runs well. 

Our team investigated logs from the master and worker, and we found the worker 
finished all tasks but the master couldn't retrieve the result from a task 
whose result size was larger than 10MB.

We tried to apply the workaround setting spark.akka.frameSize to 9, it works 
like a charm.

I guess it might be hard to reproduce the issue; please contact me if there's a 
need for testing or getting logs. 

Thanks!

 When spark.akka.frameSize > 10, task results bigger than 10MiB block execution
 --

 Key: SPARK-1112
 URL: https://issues.apache.org/jira/browse/SPARK-1112
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Guillaume Pitel
Priority: Blocker
 Fix For: 0.9.2


 When I set the spark.akka.frameSize to something over 10, the messages sent 
 from the executors to the driver completely block the execution if the 
 message is bigger than 10MiB and smaller than the frameSize (if it's above 
 the frameSize, it's ok)
 Workaround is to set the spark.akka.frameSize to 10. In this case, since 
 0.8.1, the blockManager deal with  the data to be sent. It seems slower than 
 akka direct message though.
 The configuration seems to be correctly read (see actorSystemConfig.txt), so 
 I don't see where the 10MiB could come from 
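 For reference, a minimal sketch of how the frame size is usually raised (the value is in 
 MB and the figure below is arbitrary; though as noted elsewhere in this thread, other 
 Akka buffers may still cap what actually works):
 {code}
 import org.apache.spark.{SparkConf, SparkContext}

 val conf = new SparkConf()
   .setAppName("frame-size-demo")
   .set("spark.akka.frameSize", "64")  // MB; the 0.9.x default is 10
 val sc = new SparkContext(conf)
 {code}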



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1112) When spark.akka.frameSize > 10, task results bigger than 10MiB block execution

2014-05-28 Thread Kevin (Sangwoo) Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011968#comment-14011968
 ] 

Kevin (Sangwoo) Kim edited comment on SPARK-1112 at 5/29/14 2:01 AM:
-

Hi all, 

I'm very new to Spark and doing some tests, I've experienced similar issue.
(tested with Spark Shell, 0.9.1, r3.8xlarge instance on EC2 - 32 core / 244GiB 
MEM)

I was trying to broadcast 700MB of data and Spark hangs when I run collect() 
method for the data. 

Here are the strange things:
1) when I tried 
{code}val userInfo = 
sc.textFile("file:///spark/logs/user_sign_up2.csv").map{line => val split = 
line.split(","); (split(1), split)}{code}
it runs well.
2) when I tried 
{code}val userInfo = 
sc.textFile("file:///spark/logs/user_sign_up2.csv").map{line => val split = 
line.split(","); (split(1), split(5))} {code}
Spark hangs.
3) when I slightly control the data size using sample() method or cutting the 
data file, it runs well. 

Our team investigated logs from the master and worker, and we found the worker 
finished all tasks but the master couldn't retrieve the result from a task 
whose result size was larger than 10MB.

We tried to apply the workaround setting spark.akka.frameSize to 9, it works 
like a charm.

I guess it might be hard to reproduce the issue; please contact me if there's a 
need for testing or getting logs. 

Thanks!


was (Author: swkimme):
Hi all, 

I'm very new to Spark and doing some tests, I've experienced similar issue.
(tested with Spark Shell, 0.9.1, r3.8xlarge instance on EC2 - 32 core / 244GiB 
MEM)

I was trying to broadcast 700MB of data and Spark hangs when I run collect() 
method for the data. 

Here are the strange things:
1) when I tried 
val userInfo = sc.textFile("file:///spark/logs/user_sign_up2.csv").map{line 
=> val split = line.split(","); (split(1), split)}
it runs well.
2) when I tried 
val userInfo = sc.textFile("file:///spark/logs/user_sign_up2.csv").map{line 
=> val split = line.split(","); (split(1), split(5))} 
Spark hangs.
3) when I slightly control the data size using sample() method or cutting the 
data file, it runs well. 

Our team investigated logs from the master and worker, and we found the worker 
finished all tasks but the master couldn't retrieve the result from a task 
whose result size was larger than 10MB.

We tried to apply the workaround setting spark.akka.frameSize to 9, it works 
like a charm.

I guess it might be hard to reproduce the issue; please contact me if there's a 
need for testing or getting logs. 

Thanks!

 When spark.akka.frameSize > 10, task results bigger than 10MiB block execution
 --

 Key: SPARK-1112
 URL: https://issues.apache.org/jira/browse/SPARK-1112
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Guillaume Pitel
Priority: Blocker
 Fix For: 0.9.2


 When I set the spark.akka.frameSize to something over 10, the messages sent 
 from the executors to the driver completely block the execution if the 
 message is bigger than 10MiB and smaller than the frameSize (if it's above 
 the frameSize, it's ok)
 Workaround is to set the spark.akka.frameSize to 10. In this case, since 
 0.8.1, the blockManager deal with  the data to be sent. It seems slower than 
 akka direct message though.
 The configuration seems to be correctly read (see actorSystemConfig.txt), so 
 I don't see where the 10MiB could come from 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1112) When spark.akka.frameSize > 10, task results bigger than 10MiB block execution

2014-05-28 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-1112:
-

Priority: Critical  (was: Blocker)

 When spark.akka.frameSize > 10, task results bigger than 10MiB block execution
 --

 Key: SPARK-1112
 URL: https://issues.apache.org/jira/browse/SPARK-1112
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Guillaume Pitel
Priority: Critical
 Fix For: 0.9.2


 When I set the spark.akka.frameSize to something over 10, the messages sent 
 from the executors to the driver completely block the execution if the 
 message is bigger than 10MiB and smaller than the frameSize (if it's above 
 the frameSize, it's ok)
 Workaround is to set the spark.akka.frameSize to 10. In this case, since 
 0.8.1, the blockManager deal with  the data to be sent. It seems slower than 
 akka direct message though.
 The configuration seems to be correctly read (see actorSystemConfig.txt), so 
 I don't see where the 10MiB could come from 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1112) When spark.akka.frameSize > 10, task results bigger than 10MiB block execution

2014-05-28 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011978#comment-14011978
 ] 

Matei Zaharia commented on SPARK-1112:
--

I'm curious, why did you want to make the frameSize this big -- are the tasks 
themselves also big or just the results? There might be other buffers in Akka 
that can't be made bigger than this. It's possible that this changed in a newer 
Akka version (because larger frame sizes used to work before).

 When spark.akka.frameSize > 10, task results bigger than 10MiB block execution
 --

 Key: SPARK-1112
 URL: https://issues.apache.org/jira/browse/SPARK-1112
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Guillaume Pitel
Priority: Critical
 Fix For: 0.9.2


 When I set the spark.akka.frameSize to something over 10, the messages sent 
 from the executors to the driver completely block the execution if the 
 message is bigger than 10MiB and smaller than the frameSize (if it's above 
 the frameSize, it's ok)
 Workaround is to set the spark.akka.frameSize to 10. In this case, since 
 0.8.1, the blockManager deal with  the data to be sent. It seems slower than 
 akka direct message though.
 The configuration seems to be correctly read (see actorSystemConfig.txt), so 
 I don't see where the 10MiB could come from 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1112) When spark.akka.frameSize > 10, task results bigger than 10MiB block execution

2014-05-28 Thread Kevin (Sangwoo) Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011993#comment-14011993
 ] 

Kevin (Sangwoo) Kim commented on SPARK-1112:


[~matei]
I found the default of spark.akka.frameSize is 10 in the config document, 
http://spark.apache.org/docs/0.9.1/configuration.html
and just tried slightly larger and smaller values (11 and 9).

I called the collect() method on the userInfo RDD, and it might contain large 
data. (I edited the first comment.)



 When spark.akka.frameSize > 10, task results bigger than 10MiB block execution
 --

 Key: SPARK-1112
 URL: https://issues.apache.org/jira/browse/SPARK-1112
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Guillaume Pitel
Priority: Critical
 Fix For: 0.9.2


 When I set the spark.akka.frameSize to something over 10, the messages sent 
 from the executors to the driver completely block the execution if the 
 message is bigger than 10MiB and smaller than the frameSize (if it's above 
 the frameSize, it's ok)
 Workaround is to set the spark.akka.frameSize to 10. In this case, since 
 0.8.1, the blockManager deals with the data to be sent. It seems slower than 
 Akka direct messages, though.
 The configuration seems to be correctly read (see actorSystemConfig.txt), so 
 I don't see where the 10MiB could come from.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1112) When spark.akka.frameSize > 10, task results bigger than 10MiB block execution

2014-05-28 Thread Kevin (Sangwoo) Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011968#comment-14011968
 ] 

Kevin (Sangwoo) Kim edited comment on SPARK-1112 at 5/29/14 2:50 AM:
-

Hi all, 

I'm very new to Spark and doing some tests, and I've experienced a similar issue.
(Tested with the Spark shell, 0.9.1, on an r3.8xlarge instance on EC2 - 32 cores / 
244GiB MEM)

I was trying to broadcast 700MB of data, and Spark hangs when I run the collect() 
method on the data. 

Here are the strange things:
1) when I tried 
{code}val userInfo = 
sc.textFile("file:///spark/logs/user_sign_up2.csv").map{line => val split = 
line.split(","); (split(1), split)}
val userInfoMap = userInfo.collectAsMap
{code}
it runs well.
2) when I tried 
{code}val userInfo = 
sc.textFile("file:///spark/logs/user_sign_up2.csv").map{line => val split = 
line.split(","); (split(1), split(5))} 
val userInfoMap = userInfo.collectAsMap
{code}
Spark hangs.
3) when I slightly reduce the data size using the sample() method or by cutting 
the data file, it runs well. 

Our team investigated the logs from the master and the workers; we found that the 
workers finished all tasks, but the master couldn't retrieve the result from any 
task whose result size was larger than 10MB.

We tried to apply the workaround of setting spark.akka.frameSize to 9, and it works 
like a charm.

I guess it might be hard to reproduce the issue; please contact me if there's a 
need for testing or for getting logs. 

Thanks!


was (Author: swkimme):
Hi all, 

I'm very new to Spark and doing some tests, and I've experienced a similar issue.
(Tested with the Spark shell, 0.9.1, on an r3.8xlarge instance on EC2 - 32 cores / 
244GiB MEM)

I was trying to broadcast 700MB of data, and Spark hangs when I run the collect() 
method on the data. 

Here are the strange things:
1) when I tried 
{code}val userInfo = 
sc.textFile("file:///spark/logs/user_sign_up2.csv").map{line => val split = 
line.split(","); (split(1), split)}{code}
it runs well.
2) when I tried 
{code}val userInfo = 
sc.textFile("file:///spark/logs/user_sign_up2.csv").map{line => val split = 
line.split(","); (split(1), split(5))} {code}
Spark hangs.
3) when I slightly reduce the data size using the sample() method or by cutting 
the data file, it runs well. 

Our team investigated the logs from the master and the workers; we found that the 
workers finished all tasks, but the master couldn't retrieve the result from any 
task whose result size was larger than 10MB.

We tried to apply the workaround of setting spark.akka.frameSize to 9, and it works 
like a charm.

I guess it might be hard to reproduce the issue; please contact me if there's a 
need for testing or for getting logs. 

Thanks!

 When spark.akka.frameSize > 10, task results bigger than 10MiB block execution
 --

 Key: SPARK-1112
 URL: https://issues.apache.org/jira/browse/SPARK-1112
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Guillaume Pitel
Priority: Critical
 Fix For: 0.9.2


 When I set the spark.akka.frameSize to something over 10, the messages sent 
 from the executors to the driver completely block the execution if the 
 message is bigger than 10MiB and smaller than the frameSize (if it's above 
 the frameSize, it's ok)
 Workaround is to set the spark.akka.frameSize to 10. In this case, since 
 0.8.1, the blockManager deals with the data to be sent. It seems slower than 
 Akka direct messages, though.
 The configuration seems to be correctly read (see actorSystemConfig.txt), so 
 I don't see where the 10MiB could come from.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1958) Calling .collect() on a SchemaRDD should call executeCollect() on the underlying query plan.

2014-05-28 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-1958:
---

 Summary: Calling .collect() on a SchemaRDD should call 
executeCollect() on the underlying query plan.
 Key: SPARK-1958
 URL: https://issues.apache.org/jira/browse/SPARK-1958
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Michael Armbrust
 Fix For: 1.1.0


In some cases (like LIMIT) executeCollect() makes optimizations that 
execute().collect() will not.
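
A rough sketch of what the proposed change could look like; the member names below (such as queryExecution.executedPlan) are assumptions for illustration, not a quote of the Spark SQL source:

{code}
// Hypothetical sketch: have SchemaRDD.collect() delegate to the physical
// plan's executeCollect(), so optimizations such as LIMIT short-circuiting
// are applied instead of a plain execute().collect().
override def collect(): Array[Row] =
  queryExecution.executedPlan.executeCollect()
{code}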



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1959) String NULL is interpreted as null value

2014-05-28 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-1959:
-

 Summary: String NULL is interpreted as null value
 Key: SPARK-1959
 URL: https://issues.apache.org/jira/browse/SPARK-1959
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Cheng Lian


The {{HiveTableScan}} operator unwraps string NULL (case insensitive) into 
null values even if the column type is {{STRING}}.

To reproduce the bug, we use 
{{sql/hive/src/test/resources/groupby_groupingid.txt}} as test input, copied to 
{{/tmp/groupby_groupingid.txt}}.

Hive session:

{code}
hive> CREATE TABLE test_null(key INT, value STRING);
hive> LOAD DATA LOCAL INPATH '/tmp/groupby_groupingid.txt' INTO table test_null;
hive> SELECT * FROM test_null WHERE value IS NOT NULL;
...
OK
1   NULL
1   1
2   2
3   3
3   NULL
4   5
1   NULL
1   1
2   2
3   3
3   NULL
4   5
{code}

We can see that the {{NULL}} cells in the original input file are interpreted 
as string {{NULL}} in Hive.

Spark SQL session ({{sbt/sbt hive/console}}):

{code}
scala> hql("CREATE TABLE test_null(key INT, value STRING)")
scala> hql("LOAD DATA LOCAL INPATH '/tmp/groupby_groupingid.txt' INTO table 
test_null")
scala> hql("SELECT * FROM test_null WHERE value IS NOT NULL").foreach(println)
...
[1,1]
[2,2]
[3,3]
[4,5]
{code}

As we can see, string {{NULL}} is interpreted as null values in Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1901) Standalone worker updates executor's state ahead of executor process exit

2014-05-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1901:
---

Fix Version/s: (was: 1.0.0)
   1.0.1

 Standalone worker updates executor's state ahead of executor process exit
 ---

 Key: SPARK-1901
 URL: https://issues.apache.org/jira/browse/SPARK-1901
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 0.9.0
 Environment: spark-1.0 rc10
Reporter: Zhen Peng
Assignee: Zhen Peng
 Fix For: 1.0.1


 Standalone worker updates the executor's state prematurely, leaving the resource 
 status in an inconsistent state until the executor process has really exited.
 In our cluster, we found this situation may cause newly submitted applications 
 to be removed by the Master because launching their executors fails.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1954) Make it easier to get Spark on YARN code to compile in IntelliJ

2014-05-28 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012027#comment-14012027
 ] 

Sandy Ryza commented on SPARK-1954:
---

Cool.  Your suggestion does appear to work.

 Make it easier to get Spark on YARN code to compile in IntelliJ
 ---

 Key: SPARK-1954
 URL: https://issues.apache.org/jira/browse/SPARK-1954
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Sandy Ryza

 When loading a project through a Maven pom, IntelliJ allows switching on 
 profiles, but, to my knowledge, doesn't provide a way to set arbitrary 
 properties. 
 To get Spark-on-YARN code to compile in IntelliJ, I need to manually change 
 the hadoop.version in the root pom.xml to 2.2.0 or higher.  This is very 
 cumbersome when switching branches.
 It would be really helpful to add a profile that sets the Hadoop version that 
 IntelliJ can switch on.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1959) String NULL is interpreted as null value

2014-05-28 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012031#comment-14012031
 ] 

Cheng Lian commented on SPARK-1959:
---

The problematic line should be [this 
one|https://github.com/apache/spark/blob/master/sql%2Fhive%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fsql%2Fhive%2FhiveOperators.scala#L154].
 I wonder under what circumstances Hive would return a Java string {{NULL}} 
to represent a null value. Is it safe to simply remove this line? [~marmbrus]
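
For readers without the source open, the behavior being questioned is roughly of the following shape; this is an illustrative sketch, not the actual code at the line linked above:

{code}
// Hypothetical illustration: a STRING cell whose text is NULL (any case)
// gets turned into a SQL null while Hive values are unwrapped.
def unwrapSuspect(value: Any): Any = value match {
  case s: String if s.equalsIgnoreCase("NULL") => null
  case other => other
}
{code}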

 String NULL is interpreted as null value
 --

 Key: SPARK-1959
 URL: https://issues.apache.org/jira/browse/SPARK-1959
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Cheng Lian

 The {{HiveTableScan}} operator unwraps string NULL (case insensitive) into 
 null values even if the column type is {{STRING}}.
 To reproduce the bug, we use 
 {{sql/hive/src/test/resources/groupby_groupingid.txt}} as test input, copied 
 to {{/tmp/groupby_groupingid.txt}}.
 Hive session:
 {code}
 hive> CREATE TABLE test_null(key INT, value STRING);
 hive> LOAD DATA LOCAL INPATH '/tmp/groupby_groupingid.txt' INTO table 
 test_null;
 hive> SELECT * FROM test_null WHERE value IS NOT NULL;
 ...
 OK
 1   NULL
 1   1
 2   2
 3   3
 3   NULL
 4   5
 {code}
 We can see that the {{NULL}} cells in the original input file are interpreted 
 as string {{NULL}} in Hive.
 Spark SQL session ({{sbt/sbt hive/console}}):
 {code}
 scala> hql("CREATE TABLE test_null(key INT, value STRING)")
 scala> hql("LOAD DATA LOCAL INPATH '/tmp/groupby_groupingid.txt' INTO table 
 test_null")
 scala> hql("SELECT * FROM test_null WHERE value IS NOT NULL").foreach(println)
 ...
 [1,1]
 [2,2]
 [3,3]
 [4,5]
 {code}
 As we can see, string {{NULL}} is interpreted as null values in Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1952) slf4j version conflicts with pig

2014-05-28 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012040#comment-14012040
 ] 

Patrick Wendell commented on SPARK-1952:


So I think the issue here is simply that Spark depends on slf4j 1.7.X, Pig 
depends on slf4j 1.6.X, and those aren't compatible. If you look, it's 
complaining about the signature of that log() method, which changed between 
1.6 and 1.7. Further compounding things, Pig uses commons logging, so it's 
logging things through (commons logging -> slf4j).

http://grepcode.com/file/repo1.maven.org/maven2/org.slf4j/slf4j-api/1.6.1/org/slf4j/spi/LocationAwareLogger.java#LocationAwareLogger.log%28org.slf4j.Marker%2Cjava.lang.String%2Cint%2Cjava.lang.String%2Cjava.lang.Object%5B%5D%2Cjava.lang.Throwable%29

http://grepcode.com/file/repo1.maven.org/maven2/org.slf4j/slf4j-api/1.7.5/org/slf4j/spi/LocationAwareLogger.java#LocationAwareLogger.log%28org.slf4j.Marker%2Cjava.lang.String%2Cint%2Cjava.lang.String%2Cjava.lang.Object%5B%5D%2Cjava.lang.Throwable%29

The Spark code actually doesn't use any new APIs that aren't in slf4j 1.6, so 
I could see how this worked in 0.9.0.

I think the problem here is that Spark 1.0 is now pulling in jul-to-slf4j 1.7.X, 
and that _does_ use newer APIs in SLF4J 1.7. So I'd remove this from the Spark 
1.0 build and see if that works (we have an explicit dependency on it). 
Basically, try to produce a Spark assembly without SLF4JLocationAwareLog.class.

I think that should work if I'm understanding this correctly.
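
A sketch of the same idea from the downstream side, assuming an sbt-based build; the coordinates and the exclusion below are illustrative, and the fix Patrick describes is in Spark's own build rather than here:

{code}
// Hypothetical build.sbt fragment: keep jul-to-slf4j off the classpath that
// Pig ends up seeing when the Spark assembly jar is registered.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" excludeAll(
  ExclusionRule(organization = "org.slf4j", name = "jul-to-slf4j")
)
{code}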

 slf4j version conflicts with pig
 

 Key: SPARK-1952
 URL: https://issues.apache.org/jira/browse/SPARK-1952
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: pig 12.1 on Cloudera Hadoop, CDH3
Reporter: Ryan Compton
  Labels: pig, slf4j

 Upgrading from Spark-0.9.1 to Spark-1.0.0 causes all Pig scripts to fail when 
 they register a jar containing Spark. The error appears to be related to 
 org.slf4j.spi.LocationAwareLogger.log.
 {code}
 Caused by: java.lang.RuntimeException: Could not resolve error that
 occured when launching map reduce job: java.lang.NoSuchMethodError:
 org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$JobControlThreadExceptionHandler.uncaughtException(MapReduceLauncher.java:598)
 at java.lang.Thread.dispatchUncaughtException(Thread.java:1874)
 {code}
 To reproduce: compile Spark via $ SPARK_HADOOP_VERSION=0.20.2-cdh3u4 sbt/sbt 
 assembly and register the resulting jar into a pig script. E.g.
 {code}
 REGISTER 
 /usr/share/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar;
 data0 = LOAD 'data' USING PigStorage();
 ttt = LIMIT data0 10;
 DUMP ttt;
 {code}
 The Spark-1.0 jar includes some slf4j dependencies that were not present in 
 0.9.1
 {code}
 rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf 
 spark-assembly-1.0.0-SNAPSHOT-hadoop0.20.2-cdh3u4.jar | grep -i slf | grep 
 LocationAware
   3259 Mon Mar 25 21:49:34 PDT 2013 
 org/apache/commons/logging/impl/SLF4JLocationAwareLog.class
455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
479 Fri Dec 13 16:44:40 PST 2013 
 parquet/org/slf4j/spi/LocationAwareLogger.class
 {code}
 vs.
 {code}
 rfcompton@node19 /u/s/o/s/a/t/scala-2.10 jar tvf 
 spark-assembly-0.9.1-hadoop0.20.2-cdh3u3.jar | grep -i slf | grep 
 LocationAware
455 Mon Mar 25 21:49:22 PDT 2013 org/slf4j/spi/LocationAwareLogger.class
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1954) Make it easier to get Spark on YARN code to compile in IntelliJ

2014-05-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1954:
---

Component/s: Build

 Make it easier to get Spark on YARN code to compile in IntelliJ
 ---

 Key: SPARK-1954
 URL: https://issues.apache.org/jira/browse/SPARK-1954
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.0
Reporter: Sandy Ryza

 When loading a project through a Maven pom, IntelliJ allows switching on 
 profiles, but, to my knowledge, doesn't provide a way to set arbitrary 
 properties. 
 To get Spark-on-YARN code to compile in IntelliJ, I need to manually change 
 the hadoop.version in the root pom.xml to 2.2.0 or higher.  This is very 
 cumbersome when switching branches.
 It would be really helpful to add a profile that sets the Hadoop version that 
 IntelliJ can switch on.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1913) Parquet table column pruning error caused by filter pushdown

2014-05-28 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-1913:


Assignee: Cheng Lian

 Parquet table column pruning error caused by filter pushdown
 

 Key: SPARK-1913
 URL: https://issues.apache.org/jira/browse/SPARK-1913
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
 Environment: mac os 10.9.2
Reporter: Chen Chao
Assignee: Cheng Lian

 When scanning Parquet tables, attributes referenced only in predicates that 
 are pushed down are not passed to the `ParquetTableScan` operator, which causes 
 an exception. Verified in the {{sbt hive/console}}:
 {code}
 loadTestTable("src")
 table("src").saveAsParquetFile("src.parquet")
 parquetFile("src.parquet").registerAsTable("src_parquet")
 hql("SELECT value FROM src_parquet WHERE key  10").collect().foreach(println)
 {code}
 Exception
 {code}
 parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in 
 file file:/scratch/rxin/spark/src.parquet/part-r-2.parquet
   at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:177)
   at 
 parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
   at 
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:122)
   at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
   at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
   at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
   at org.apache.spark.scheduler.Task.run(Task.scala:51)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)
 Caused by: java.lang.IllegalArgumentException: Column key does not exist.
   at parquet.filter.ColumnRecordFilter$1.bind(ColumnRecordFilter.java:51)
   at 
 org.apache.spark.sql.parquet.ComparisonFilter.bind(ParquetFilters.scala:306)
   at parquet.io.FilteredRecordReader.init(FilteredRecordReader.java:46)
   at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:74)
   at 
 parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:110)
   at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172)
   ... 28 more
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1913) Parquet table column pruning error caused by filter pushdown

2014-05-28 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-1913.
-

Resolution: Fixed

 Parquet table column pruning error caused by filter pushdown
 

 Key: SPARK-1913
 URL: https://issues.apache.org/jira/browse/SPARK-1913
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
 Environment: mac os 10.9.2
Reporter: Chen Chao
Assignee: Cheng Lian

 When scanning Parquet tables, attributes referenced only in predicates that 
 are pushed down are not passed to the `ParquetTableScan` operator, which causes 
 an exception. Verified in the {{sbt hive/console}}:
 {code}
 loadTestTable("src")
 table("src").saveAsParquetFile("src.parquet")
 parquetFile("src.parquet").registerAsTable("src_parquet")
 hql("SELECT value FROM src_parquet WHERE key  10").collect().foreach(println)
 {code}
 Exception
 {code}
 parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in 
 file file:/scratch/rxin/spark/src.parquet/part-r-2.parquet
   at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:177)
   at 
 parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
   at 
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:122)
   at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
   at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
   at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
   at org.apache.spark.scheduler.Task.run(Task.scala:51)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)
 Caused by: java.lang.IllegalArgumentException: Column key does not exist.
   at parquet.filter.ColumnRecordFilter$1.bind(ColumnRecordFilter.java:51)
   at 
 org.apache.spark.sql.parquet.ComparisonFilter.bind(ParquetFilters.scala:306)
   at parquet.io.FilteredRecordReader.init(FilteredRecordReader.java:46)
   at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:74)
   at 
 parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:110)
   at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172)
   ... 28 more
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1960) EOFException when file size 0 exists when use sc.sequenceFile[K,V](path)

2014-05-28 Thread Eunsu Yun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eunsu Yun updated SPARK-1960:
-

Description: 
A java.io.EOFException is thrown when using sc.sequenceFile[K,V] if there is a 
file whose size is 0. 
I also tested sc.textFile() under the same condition and it does not throw an 
EOFException.

val text = sc.sequenceFile[Long, String]("data-gz/*.dat.gz")
val result = text.filter(filterValid)
result.saveAsTextFile("data-out/")


--

java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1845)
at 
org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1810)
at 
org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1759)
at 
org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1773)
at 
org.apache.hadoop.mapred.SequenceFileRecordReader.init(SequenceFileRecordReader.java:49)
at 
org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:64)
at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:156)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:33)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
..

  was:

A java.io.EOFException is thrown when using sc.sequenceFile[K,V] if there is a 
file whose size is 0. 
I also tested sc.textFile() under the same condition and it does not throw an 
EOFException.

val text = sc.sequenceFile[Long, String]("data-gz/*.dat.gz")
val result = text.filter(filterValid)
result.saveAsTextFile("data-out/")


--

java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1845)
at 
org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1810)
at 
org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1759)
at 
org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1773)
at 
org.apache.hadoop.mapred.SequenceFileRecordReader.init(SequenceFileRecordReader.java:49)
at 
org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:64)
at org.apache.spark.rdd.HadoopRDD$$anon$1.init(HadoopRDD.scala:156)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:149)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:33)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
..

Summary: EOFException when file size 0 exists when use 
sc.sequenceFile[K,V](path)  (was: EOFException when 0 size file exists when 
use sc.sequenceFile[K,V](path))
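
Until this is resolved, one possible workaround is to drop zero-length files before building the RDD. This is a sketch not taken from the issue; the glob and filterValid are reused from the description, and the length check and use of the Hadoop FileSystem API are assumptions:

{code}
import org.apache.hadoop.fs.{FileSystem, Path}

// Assumption: skipping zero-length inputs avoids the EOFException seen when
// SequenceFile.Reader tries to read a header from an empty file.
val fs = FileSystem.get(sc.hadoopConfiguration)
val nonEmpty = fs.globStatus(new Path("data-gz/*.dat.gz"))   // illustrative path
  .filter(_.getLen > 0)
  .map(_.getPath.toString)

val text = sc.sequenceFile[Long, String](nonEmpty.mkString(","))
val result = text.filter(filterValid)
result.saveAsTextFile("data-out/")
{code}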

 EOFException when file size 0 exists when use sc.sequenceFile[K,V](path)
 --

 Key: SPARK-1960
 URL: https://issues.apache.org/jira/browse/SPARK-1960
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Eunsu Yun

 A java.io.EOFException is thrown when using sc.sequenceFile[K,V] if there is a 
 file whose size is 0. 
 I also tested sc.textFile() under the same condition and it does not throw an 
 EOFException.
 val text = sc.sequenceFile[Long, String]("data-gz/*.dat.gz")
 val result = text.filter(filterValid)
 result.saveAsTextFile("data-out/")
 --
 java.io.EOFException
   at java.io.DataInputStream.readFully(DataInputStream.java:197)
   at java.io.DataInputStream.readFully(DataInputStream.java:169)
   at