[jira] [Created] (SPARK-1919) In Windows, Spark shell cannot load classes in spark.jars (--jars)

2014-05-23 Thread Andrew Or (JIRA)
Andrew Or created SPARK-1919:


 Summary: In Windows, Spark shell cannot load classes in spark.jars 
(--jars)
 Key: SPARK-1919
 URL: https://issues.apache.org/jira/browse/SPARK-1919
 Project: Spark
  Issue Type: Bug
  Components: Windows
Affects Versions: 1.0.0
Reporter: Andrew Or


Not sure what the issue is, but spark-submit does not have the same problem, 
even when the jars specified are the same.





[jira] [Commented] (SPARK-1918) PySpark shell --py-files does not work for zip files

2014-05-23 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007874#comment-14007874
 ] 

Andrew Or commented on SPARK-1918:
--

https://github.com/apache/spark/pull/853

> PySpark shell --py-files does not work for zip files
> 
>
> Key: SPARK-1918
> URL: https://issues.apache.org/jira/browse/SPARK-1918
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.0
>Reporter: Andrew Or
> Fix For: 1.0.0
>
>
> For the PySpark shell, we never add --py-files to the Python path. This is 
> specific to non-Python files (e.g. zip files), because Python does not 
> automatically look inside zip files even if they are also uploaded to the HTTP 
> server through `sc.addFile`.





[jira] [Created] (SPARK-1918) PySpark shell --py-files does not work for zip files

2014-05-23 Thread Andrew Or (JIRA)
Andrew Or created SPARK-1918:


 Summary: PySpark shell --py-files does not work for zip files
 Key: SPARK-1918
 URL: https://issues.apache.org/jira/browse/SPARK-1918
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.0
Reporter: Andrew Or
 Fix For: 1.0.0


For the PySpark shell, we never add --py-files to the Python path. This is 
specific to non-Python files (e.g. zip files), because Python does not 
automatically look inside zip files even if they are also uploaded to the HTTP 
server through `sc.addFile`.
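
To illustrate the Python-side behavior described above, here is a minimal 
standalone sketch (plain Python, not Spark code; the zip and module names are 
made up): Python can import from a zip archive only once that archive is on 
{{sys.path}}, which is exactly what the shell never does for the --py-files 
entries.

{code}
# Standalone illustration with hypothetical names: a zip becomes importable
# only after it has been added to sys.path.
import sys
import zipfile

# Build a tiny zip containing one module.
with zipfile.ZipFile("deps.zip", "w") as zf:
    zf.writestr("mymodule.py", "VALUE = 42\n")

sys.path.insert(0, "deps.zip")  # without this line, the import below fails
import mymodule
print(mymodule.VALUE)  # 42
{code}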





[jira] [Commented] (SPARK-1138) Spark 0.9.0 does not work with Hadoop / HDFS

2014-05-23 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007837#comment-14007837
 ] 

Reynold Xin commented on SPARK-1138:


Just want to chime in that I also encountered this stack trace, and the problem 
was an older Netty (in particular, I had Netty 3.4 on my classpath). Once I 
included Netty 3.6.6, the problem went away.

> Spark 0.9.0 does not work with Hadoop / HDFS
> 
>
> Key: SPARK-1138
> URL: https://issues.apache.org/jira/browse/SPARK-1138
> Project: Spark
>  Issue Type: Bug
>Reporter: Sam Abeyratne
>
> UPDATE: This problem is certainly related to trying to use Spark 0.9.0 and 
> the latest Cloudera Hadoop / HDFS in the same jar.  It seems no matter how I 
> fiddle with the deps, they do not play nice together.
> I'm getting a java.util.concurrent.TimeoutException when trying to create a 
> spark context with 0.9.  I cannot, whatever I do, change the timeout.  I've 
> tried using System.setProperty, the SparkConf mechanism of creating a 
> SparkContext and the -D flags when executing my jar.  I seem to be able to 
> run simple jobs from the spark-shell OK, but my more complicated jobs require 
> external libraries so I need to build jars and execute them.
> Some code that causes this:
> println("Creating config")
> val conf = new SparkConf()
>   .setMaster(clusterMaster)
>   .setAppName("MyApp")
>   .setSparkHome(sparkHome)
>   .set("spark.akka.askTimeout", parsed.getOrElse(timeouts, "100"))
>   .set("spark.akka.timeout", parsed.getOrElse(timeouts, "100"))
> println("Creating sc")
> implicit val sc = new SparkContext(conf)
> The output:
> Creating config
> Creating sc
> log4j:WARN No appenders could be found for logger 
> (akka.event.slf4j.Slf4jLogger).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
> info.
> [ERROR] [02/26/2014 11:05:25.491] [main] [Remoting] Remoting error: [Startup 
> timed out] [
> akka.remote.RemoteTransportException: Startup timed out
>   at 
> akka.remote.Remoting.akka$remote$Remoting$$notifyError(Remoting.scala:129)
>   at akka.remote.Remoting.start(Remoting.scala:191)
>   at 
> akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
>   at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:579)
>   at akka.actor.ActorSystemImpl._start(ActorSystem.scala:577)
>   at akka.actor.ActorSystemImpl.start(ActorSystem.scala:588)
>   at akka.actor.ActorSystem$.apply(ActorSystem.scala:111)
>   at akka.actor.ActorSystem$.apply(ActorSystem.scala:104)
>   at 
> org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:96)
>   at org.apache.spark.SparkEnv$.create(SparkEnv.scala:126)
>   at org.apache.spark.SparkContext.<init>(SparkContext.scala:139)
>   at 
> com.adbrain.accuracy.EvaluateAdtruthIDs$.main(EvaluateAdtruthIDs.scala:40)
>   at 
> com.adbrain.accuracy.EvaluateAdtruthIDs.main(EvaluateAdtruthIDs.scala)
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after 
> [1 milliseconds]
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:107)
>   at akka.remote.Remoting.start(Remoting.scala:173)
>   ... 11 more
> ]
> Exception in thread "main" java.util.concurrent.TimeoutException: Futures 
> timed out after [1 milliseconds]
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:107)
>   at akka.remote.Remoting.start(Remoting.scala:173)
>   at 
> akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
>   at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:579)
>   at akka.actor.ActorSystemImpl._start(ActorSystem.scala:577)
>   at akka.actor.ActorSystemImpl.start(ActorSystem.scala:588)
>   at akka.actor.ActorSystem$.apply(ActorSystem.scala:111)
>   at akka.actor.ActorSystem$.apply(ActorSystem.scala:104)
>   at 
> org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:96)
>   at org.apache.spark.SparkEnv$.create(SparkEnv.scala:126)
>   at org.apache.spark.SparkCont

[jira] [Commented] (SPARK-1917) PySpark fails to import functions from {{scipy.special}}

2014-05-23 Thread Uri Laserson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007741#comment-14007741
 ] 

Uri Laserson commented on SPARK-1917:
-

https://github.com/apache/spark/pull/866

> PySpark fails to import functions from {{scipy.special}}
> 
>
> Key: SPARK-1917
> URL: https://issues.apache.org/jira/browse/SPARK-1917
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 0.9.0, 1.0.0
>Reporter: Uri Laserson
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> PySpark is able to load {{numpy}} functions, but not {{scipy.special}} 
> functions.  For example, take this snippet:
> {code}
> from numpy import exp
> from scipy.special import gammaln
> a = range(1, 11)
> b = sc.parallelize(a)
> c = b.map(exp)
> d = b.map(gammaln)
> {code}
> Calling {{c.collect()}} will return the expected result.  However, calling 
> {{d.collect()}} will fail with
> {code}
> KeyError: (('gammaln',), , 
> ('scipy.special', 'gammaln'))
> {code}
> in the {{cloudpickle.py}} module, in {{_getobject}}.
> The reason is that {{_getobject}} executes {{__import__(modname)}}, which 
> only loads the top-level package {{X}} in case {{modname}} is like {{X.Y}}.  
> It is failing because {{gammaln}} is not a member of {{scipy}}.  The fix (for 
> which I will shortly submit a PR) is to add {{fromlist=[attribute]}} to the 
> {{__import__}} call, which will load the innermost module.
> See 
> [https://docs.python.org/2/library/functions.html#__import__]
> and
> [http://stackoverflow.com/questions/9544331/from-a-b-import-x-using-import]





[jira] [Updated] (SPARK-1913) Parquet table column pruning error caused by filter pushdown

2014-05-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-1913:
---

Description: 
When scanning Parquet tables, attributes referenced only in predicates that are 
pushed down are not passed to the `ParquetTableScan` operator, which causes an 
exception. Verified in the {{sbt hive/console}}:

{code}
loadTestTable("src")
table("src").saveAsParquetFile("src.parquet")
parquetFile("src.parquet").registerAsTable("src_parquet")
hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println)
{code}

Exception
{code}
parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in 
file file:/scratch/rxin/spark/src.parquet/part-r-2.parquet
at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:177)
at 
parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
at 
org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:122)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.IllegalArgumentException: Column key does not exist.
at parquet.filter.ColumnRecordFilter$1.bind(ColumnRecordFilter.java:51)
at 
org.apache.spark.sql.parquet.ComparisonFilter.bind(ParquetFilters.scala:306)
at parquet.io.FilteredRecordReader.<init>(FilteredRecordReader.java:46)
at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:74)
at 
parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:110)
at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172)
... 28 more
{code}

  was:
When scanning Parquet tables, attributes referenced only in predicates that are 
pushed down are not passed to the `ParquetTableScan` operator, which causes an 
exception. Verified in the {{sbt hive/console}}:

{code}
loadTestTable("src")
table("src").saveAsParquetFile("src.parquet")
parquetFile("src.parquet").registerAsTable("src_parquet")
hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println)
{code}


> Parquet table column pruning error caused by filter pushdown
> 
>
> Key: SPARK-1913
> URL: https://issues.apache.org/jira/browse/SPARK-1913
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
> Environment: mac os 10.9.2
>Reporter: Chen Chao
>
> When scanning Parquet tables, attributes referenced only in predicates that 
> are pushed down are not passed to the `ParquetTableScan` operator, which 
> causes an exception. Verified in the {{sbt hive/console}}:
> {code}
> loadTestTable("src")
> table("src").saveAsParquetFile("src.parquet")
> parquetFile("src.parquet").registerAsTable("src_parquet")
> hql("SELECT value FROM src_parquet WHERE key < 10").collect(

[jira] [Commented] (SPARK-1913) Parquet table column pruning error caused by filter pushdown

2014-05-23 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007710#comment-14007710
 ] 

Reynold Xin commented on SPARK-1913:


I added the exception.

> Parquet table column pruning error caused by filter pushdown
> 
>
> Key: SPARK-1913
> URL: https://issues.apache.org/jira/browse/SPARK-1913
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
> Environment: mac os 10.9.2
>Reporter: Chen Chao
>
> When scanning Parquet tables, attributes referenced only in predicates that 
> are pushed down are not passed to the `ParquetTableScan` operator, which 
> causes an exception. Verified in the {{sbt hive/console}}:
> {code}
> loadTestTable("src")
> table("src").saveAsParquetFile("src.parquet")
> parquetFile("src.parquet").registerAsTable("src_parquet")
> hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println)
> {code}
> Exception
> {code}
> parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in 
> file file:/scratch/rxin/spark/src.parquet/part-r-2.parquet
>   at 
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:177)
>   at 
> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
>   at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:122)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
>   at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>   at org.apache.spark.scheduler.Task.run(Task.scala:51)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> Caused by: java.lang.IllegalArgumentException: Column key does not exist.
>   at parquet.filter.ColumnRecordFilter$1.bind(ColumnRecordFilter.java:51)
>   at 
> org.apache.spark.sql.parquet.ComparisonFilter.bind(ParquetFilters.scala:306)
>   at parquet.io.FilteredRecordReader.<init>(FilteredRecordReader.java:46)
>   at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:74)
>   at 
> parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:110)
>   at 
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172)
>   ... 28 more
> {code}





[jira] [Created] (SPARK-1917) PySpark fails to import functions from {{scipy.special}}

2014-05-23 Thread Uri Laserson (JIRA)
Uri Laserson created SPARK-1917:
---

 Summary: PySpark fails to import functions from {{scipy.special}}
 Key: SPARK-1917
 URL: https://issues.apache.org/jira/browse/SPARK-1917
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 0.9.0, 1.0.0
Reporter: Uri Laserson


PySpark is able to load {{numpy}} functions, but not {{scipy.special}} 
functions.  For example, take this snippet:

{code}
from numpy import exp
from scipy.special import gammaln

a = range(1, 11)
b = sc.parallelize(a)
c = b.map(exp)
d = b.map(gammaln)
{code}

Calling {{c.collect()}} will return the expected result.  However, calling 
{{d.collect()}} will fail with

{code}
KeyError: (('gammaln',), , 
('scipy.special', 'gammaln'))
{code}

in the {{cloudpickle.py}} module, in {{_getobject}}.

The reason is that {{_getobject}} executes {{__import__(modname)}}, which only 
loads the top-level package {{X}} in case {{modname}} is like {{X.Y}}.  It is 
failing because {{gammaln}} is not a member of {{scipy}}.  The fix (for which I 
will shortly submit a PR) is to add {{fromlist=[attribute]}} to the 
{{__import__}} call, which will load the innermost module.

See 
[https://docs.python.org/2/library/functions.html#__import__]
and
[http://stackoverflow.com/questions/9544331/from-a-b-import-x-using-import]
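
The {{__import__}} behavior described above can be reproduced without scipy; a 
minimal sketch using only the standard library (module names chosen purely for 
illustration):

{code}
# __import__("X.Y") returns the top-level package X, so attributes of X.Y are
# not reachable on the result; fromlist makes it return the innermost module.
mod = __import__("logging.handlers")
print(mod.__name__)                          # 'logging' -- top-level package only
print(hasattr(mod, "RotatingFileHandler"))   # False

mod = __import__("logging.handlers", fromlist=["RotatingFileHandler"])
print(mod.__name__)                          # 'logging.handlers'
print(hasattr(mod, "RotatingFileHandler"))   # True
{code}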





[jira] [Commented] (SPARK-1902) Spark shell prints error when :4040 port already in use

2014-05-23 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007684#comment-14007684
 ] 

Patrick Wendell commented on SPARK-1902:


Yeah, this would be a good one to fix. IIRC I spent a long time trying to 
figure out how to silence only this message, but I couldn't do anything except 
for silencing all jetty WARN logs (which we don't want). I also considered 
first checking if the port is free before trying to bind to it, but that has 
race conditions. If someone figures out a better way to do this, that would be 
great.
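
For reference, a minimal sketch of the bind-and-fall-back approach (plain 
Python rather than the Jetty/Scala code in question; the port numbers and retry 
count are illustrative only):

{code}
import errno
import socket

def bind_with_fallback(start_port=4040, max_tries=16):
    """Try successive ports, logging a short message instead of a stack trace."""
    for port in range(start_port, start_port + max_tries):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("0.0.0.0", port))
            return s, port
        except socket.error as e:  # OSError on Python 3
            s.close()
            if e.errno != errno.EADDRINUSE:
                raise
            print("Unable to use port %d; already in use. Attempting port %d..."
                  % (port, port + 1))
    raise RuntimeError("No free port found")
{code}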

> Spark shell prints error when :4040 port already in use
> ---
>
> Key: SPARK-1902
> URL: https://issues.apache.org/jira/browse/SPARK-1902
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Ash
>
> When running two shells on the same machine, I get the below error.  The 
> issue is that the first shell takes port 4040, then the next tries 4040 
> and fails so falls back to 4041, then a third would try 4040 and 4041 before 
> landing on 4042, etc.
> We should catch the error and instead log as "Unable to use port 4041; 
> already in use.  Attempting port 4042..."
> {noformat}
> 14/05/22 11:31:54 WARN component.AbstractLifeCycle: FAILED 
> SelectChannelConnector@0.0.0.0:4041: java.net.BindException: Address already 
> in use
> java.net.BindException: Address already in use
> at sun.nio.ch.Net.bind0(Native Method)
> at sun.nio.ch.Net.bind(Net.java:444)
> at sun.nio.ch.Net.bind(Net.java:436)
> at 
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
> at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
> at 
> org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
> at 
> org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
> at 
> org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
> at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
> at org.eclipse.jetty.server.Server.doStart(Server.java:293)
> at 
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
> at 
> org.apache.spark.ui.JettyUtils$$anonfun$1.apply$mcV$sp(JettyUtils.scala:192)
> at 
> org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
> at 
> org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
> at scala.util.Try$.apply(Try.scala:161)
> at org.apache.spark.ui.JettyUtils$.connect$1(JettyUtils.scala:191)
> at 
> org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:205)
> at org.apache.spark.ui.WebUI.bind(WebUI.scala:99)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:217)
> at 
> org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:957)
> at $line3.$read$$iwC$$iwC.(:8)
> at $line3.$read$$iwC.(:14)
> at $line3.$read.(:16)
> at $line3.$read$.(:20)
> at $line3.$read$.()
> at $line3.$eval$.(:7)
> at $line3.$eval$.()
> at $line3.$eval.$print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
> at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
> at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
> at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
> at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:121)
> at 
> org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:120)
> at 
> org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:263)
> at 
> org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:120)
> at 
> org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:56)
> 

[jira] [Updated] (SPARK-1916) SparkFlumeEvent with body bigger than 1020 bytes are not read properly

2014-05-23 Thread David Lemieux (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lemieux updated SPARK-1916:
-

Attachment: SPARK-1916.diff

Attaching a diff for now. I'll create a pull request shortly.

> SparkFlumeEvent with body bigger than 1020 bytes are not read properly
> --
>
> Key: SPARK-1916
> URL: https://issues.apache.org/jira/browse/SPARK-1916
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 0.9.0
>Reporter: David Lemieux
> Attachments: SPARK-1916.diff
>
>
> The readExternal implementation on SparkFlumeEvent reads only the first 
> 1020 bytes of the actual body when streaming data from Flume.
> This means that any event sent to Spark via Flume will be processed properly 
> if the body is small, but will fail if the body is bigger than 1020 bytes.
> Considering that the default max size for a Flume Avro event is 32K, the 
> implementation should be updated to read the full body.
> The following is related : 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-using-Flume-body-size-limitation-tt6127.html





[jira] [Created] (SPARK-1916) SparkFlumeEvent with body bigger than 1020 bytes are not read properly

2014-05-23 Thread David Lemieux (JIRA)
David Lemieux created SPARK-1916:


 Summary: SparkFlumeEvent with body bigger than 1020 bytes are not 
read properly
 Key: SPARK-1916
 URL: https://issues.apache.org/jira/browse/SPARK-1916
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.9.0
Reporter: David Lemieux


The readExternal implementation on SparkFlumeEvent reads only the first 
1020 bytes of the actual body when streaming data from Flume.

This means that any event sent to Spark via Flume will be processed properly if 
the body is small, but will fail if the body is bigger than 1020 bytes.
Considering that the default max size for a Flume Avro event is 32K, the 
implementation should be updated to read the full body.

The following is related : 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-using-Flume-body-size-limitation-tt6127.html
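
The underlying pitfall (a single read may return fewer bytes than requested) is 
not specific to Java's {{readExternal}}. A minimal standalone sketch of reading 
a length-prefixed body in full (plain Python, not the SparkFlumeEvent code; 
names are illustrative):

{code}
import io
import struct

def read_exact(stream, n):
    """Read exactly n bytes, looping because a single read() may return fewer."""
    buf = b""
    while len(buf) < n:
        chunk = stream.read(n - len(buf))
        if not chunk:
            raise EOFError("stream ended after %d of %d bytes" % (len(buf), n))
        buf += chunk
    return buf

# A length-prefixed body larger than 1020 bytes, as a Flume event body can be.
payload = b"x" * 5000
stream = io.BytesIO(struct.pack(">i", len(payload)) + payload)
body_len = struct.unpack(">i", read_exact(stream, 4))[0]
body = read_exact(stream, body_len)
assert len(body) == 5000
{code}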





[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data

2014-05-23 Thread Michael Malak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007565#comment-14007565
 ] 

Michael Malak commented on SPARK-1867:
--

Thank you, sam, that fixed it for me!

FYI, I had:

{noformat}
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.3.0-cdh5.0.1</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-mapreduce-client-core</artifactId>
  <version>2.3.0-cdh5.0.1</version>
</dependency>
{noformat}

By removing the second one, textFile().count() now works.

> Spark Documentation Error causes java.lang.IllegalStateException: unread 
> block data
> ---
>
> Key: SPARK-1867
> URL: https://issues.apache.org/jira/browse/SPARK-1867
> Project: Spark
>  Issue Type: Bug
>Reporter: sam
>
> I've employed two System Administrators on a contract basis (for quite a bit 
> of money), and both contractors have independently hit the following 
> exception.  What we are doing is:
> 1. Installing Spark 0.9.1 according to the documentation on the website, 
> along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs.
> 2. Building a fat jar with a Spark app with sbt then trying to run it on the 
> cluster
> I've also included code snippets, and sbt deps at the bottom.
> When I've Googled this, there seem to be two somewhat vague responses:
> a) Mismatching spark versions on nodes/user code
> b) Need to add more jars to the SparkConf
> Now I know that (b) is not the problem having successfully run the same code 
> on other clusters while only including one jar (it's a fat jar).
> But I have no idea how to check for (a) - it appears Spark doesn't have any 
> version checks or anything - it would be nice if it checked versions and 
> threw a "mismatching version exception: you have user code using version X 
> and node Y has version Z".
> I would be very grateful for advice on this.
> The exception:
> Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task 
> 0.0:1 failed 32 times (most recent failure: Exception failure: 
> java.lang.IllegalStateException: unread block data)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to 
> java.lang.IllegalStateException: unread block data [duplicate 59]
> My code snippet:
> val conf = new SparkConf()
>.setMaster(clusterMaster)
>.setAppName(appName)
>.setSparkHome(sparkHome)
>.setJars(SparkContext.jarOfClass(this.getClass))
> println("count = " + new SparkContext(conf).textFile(someHdfsPath).count())
> My SBT dependencies:
> // relevant
> "org.apache.spark" % "spark-core_2.10" % "0.9.1",
> "org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0",
> // standard, probably unrelated
> "com.github.seratch" %% "awscala" % "[0.2,)",
> "org.scalacheck" %% "scalacheck" % "1.10.1" % "test",
> "org.specs2" %% "specs2" % "1.14" % "test",
> "org.scala-lang" % "scala-reflect" % "2.10.3",
> "org.scalaz" %% "scalaz-core" % "7.0.5",
> "net.minidev" % "json-smart" % "1.2"




[jira] [Commented] (SPARK-983) Support external sorting for RDD#sortByKey()

2014-05-23 Thread Madhu Siddalingaiah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007479#comment-14007479
 ] 

Madhu Siddalingaiah commented on SPARK-983:
---

I have the beginnings of a SortedIterator working for data that will fit in 
memory. It does more or less the same thing as partition sort in 
OrderedRDDFunctions, but it's an iterator. If we know that a partition cannot 
fit in memory, it's possible to split it up into chunks, sort each chunk, write 
to disk, and merge the chunks on disk.

To determine when to split, is it reasonable to use Runtime.freeMemory() / 
maxMemory() along with configuration limits? I could just keep adding to an 
in-memory sortable list until some memory threshold is reached, then sort/spill 
to disk, repeat until all data has been sorted and spilled. Then it's a basic 
merge operation.

Any comments?
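
For what it's worth, a minimal standalone sketch of the sort/spill/merge idea 
above (plain Python, not the proposed Spark iterator; the fixed chunk size 
stands in for a real memory-pressure check):

{code}
import heapq
import pickle
import tempfile

def external_sort(records, chunk_size=100000):
    """Sort chunks in memory, spill each sorted run to disk, then merge the runs."""
    runs, chunk = [], []

    def spill():
        chunk.sort()
        f = tempfile.TemporaryFile()
        for r in chunk:
            pickle.dump(r, f)
        f.seek(0)
        runs.append(f)
        del chunk[:]

    for r in records:
        chunk.append(r)
        if len(chunk) >= chunk_size:  # stand-in for a memory-pressure check
            spill()
    if chunk:
        spill()

    def read_run(f):
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

    return heapq.merge(*(read_run(f) for f in runs))

print(list(external_sort([5, 3, 1, 4, 2], chunk_size=2)))  # [1, 2, 3, 4, 5]
{code}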

> Support external sorting for RDD#sortByKey()
> 
>
> Key: SPARK-983
> URL: https://issues.apache.org/jira/browse/SPARK-983
> Project: Spark
>  Issue Type: New Feature
>Affects Versions: 0.9.0
>Reporter: Reynold Xin
>
> Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a 
> buffer to hold the entire partition, then sorts it. This will cause an OOM if 
> an entire partition cannot fit in memory, which is especially problematic for 
> skewed data. Rather than OOMing, the behavior should be similar to the 
> [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala],
>  where we fall back to disk if we detect memory pressure.





[jira] [Commented] (SPARK-1790) Update EC2 scripts to support r3 instance types

2014-05-23 Thread Sujeet Varakhedi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007427#comment-14007427
 ] 

Sujeet Varakhedi commented on SPARK-1790:
-

I will work on this

> Update EC2 scripts to support r3 instance types
> ---
>
> Key: SPARK-1790
> URL: https://issues.apache.org/jira/browse/SPARK-1790
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Affects Versions: 0.9.0, 1.0.0, 0.9.1
>Reporter: Matei Zaharia
>  Labels: starter
>
> These were recently added by Amazon as a cheaper high-memory option





[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data

2014-05-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007339#comment-14007339
 ] 

Sean Owen commented on SPARK-1867:
--

Yes, the 'mr1' artifacts are for when you are *not* using YARN. These are 
unusual to use for CDH5, and you would not need those versions.

The stock Spark artifacts are for Hadoop 1, not Hadoop 2. It can be built for 
Hadoop 2 and installed locally if you like. You can use the matched CDH5 
version, which is of course made for Hadoop 2, by targeting '0.9.0-cdh5.0.1' 
for example.

(I don't have a release schedule but assume some later version will be released 
with 5.1 of course)

There shouldn't be any trial and error to it if you express as dependencies all 
the things you directly use. For example, you say your app uses HBase, but I see 
no dependency on the client libraries. This has nothing to do with Spark per se.

If it does come down to trial and error, you're probably trying to do something 
the wrong way around. Which classes are you actually looking for?



> Spark Documentation Error causes java.lang.IllegalStateException: unread 
> block data
> ---
>
> Key: SPARK-1867
> URL: https://issues.apache.org/jira/browse/SPARK-1867
> Project: Spark
>  Issue Type: Bug
>Reporter: sam
>
> I've employed two System Administrators on a contract basis (for quite a bit 
> of money), and both contractors have independently hit the following 
> exception.  What we are doing is:
> 1. Installing Spark 0.9.1 according to the documentation on the website, 
> along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs.
> 2. Building a fat jar with a Spark app with sbt then trying to run it on the 
> cluster
> I've also included code snippets, and sbt deps at the bottom.
> When I've Googled this, there seem to be two somewhat vague responses:
> a) Mismatching spark versions on nodes/user code
> b) Need to add more jars to the SparkConf
> Now I know that (b) is not the problem having successfully run the same code 
> on other clusters while only including one jar (it's a fat jar).
> But I have no idea how to check for (a) - it appears Spark doesn't have any 
> version checks or anything - it would be nice if it checked versions and 
> threw a "mismatching version exception: you have user code using version X 
> and node Y has version Z".
> I would be very grateful for advice on this.
> The exception:
> Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task 
> 0.0:1 failed 32 times (most recent failure: Exception failure: 
> java.lang.IllegalStateException: unread block data)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to 
> java.lang.IllegalStateException: unread block data [duplicate 59]
> My code snippet:
> val conf = new SparkConf()
>.setMaster(clusterMaster)
>.setAppName(appName)
>.setSparkHome(sparkHome)
>.setJars(SparkCont

[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data

2014-05-23 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007283#comment-14007283
 ] 

sam commented on SPARK-1867:


Changing

"org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0",

to

"org.apache.hadoop" % "hadoop-common" % "2.3.0-cdh5.0.0"

in my application code seemed to fix this.  Not entirely sure why. We have 
hadoop-yarn on the cluster, so maybe the "mr1" artifact broke things.

What we need is some kind of script/command that, when we run it on the 
cluster master and give it a list of packages used in our application code 
(e.g. "org.apache.hadoop.fs", etc.), tells us which dependencies we need in our 
sbt build.  Furthermore, it would be good if mismatching version problems were 
caught and an appropriate message given.

Cloudera list all their artefacts, but it's impossible to find which artefact 
contains a particular package that is used in application code.  We have been 
doing trial and error!

You see, we are trying to use HBase, Hadoop, and Spark but we are always 
hitting dependency / version issues.

Anyway, thanks for getting back to me [~michaelmalak].  Any idea when 1.1.0 will 
be released? Also, any idea when Cloudera will distribute 0.9.1? They seem to 
just have 0.9.0.

> Spark Documentation Error causes java.lang.IllegalStateException: unread 
> block data
> ---
>
> Key: SPARK-1867
> URL: https://issues.apache.org/jira/browse/SPARK-1867
> Project: Spark
>  Issue Type: Bug
>Reporter: sam
>
> I've employed two System Administrators on a contract basis (for quite a bit 
> of money), and both contractors have independently hit the following 
> exception.  What we are doing is:
> 1. Installing Spark 0.9.1 according to the documentation on the website, 
> along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs.
> 2. Building a fat jar with a Spark app with sbt then trying to run it on the 
> cluster
> I've also included code snippets, and sbt deps at the bottom.
> When I've Googled this, there seem to be two somewhat vague responses:
> a) Mismatching spark versions on nodes/user code
> b) Need to add more jars to the SparkConf
> Now I know that (b) is not the problem having successfully run the same code 
> on other clusters while only including one jar (it's a fat jar).
> But I have no idea how to check for (a) - it appears Spark doesn't have any 
> version checks or anything - it would be nice if it checked versions and 
> threw a "mismatching version exception: you have user code using version X 
> and node Y has version Z".
> I would be very grateful for advice on this.
> The exception:
> Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task 
> 0.0:1 failed 32 times (most recent failure: Exception failure: 
> java.lang.IllegalStateException: unread block data)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to 
> java.lang.IllegalStateException: unread block data [duplicate 59]
> My code snippet:
> val con

[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data

2014-05-23 Thread Michael Malak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007238#comment-14007238
 ] 

Michael Malak commented on SPARK-1867:
--

I, too, have run into this issue, and I was careful to ensure that my driver, 
master, and worker were all running the same version: CDH5.0.1 and JDK1.7_45.

Mridul Muralidharan has indicated this may have been fixed in the Spark 1.1 
code.
http://mail-archives.apache.org/mod_mbox/spark-dev/201405.mbox/%3CCAJiQeYLnEL%2B6dWiGfoim-V%3DvPydJ_TTFZ6VQ_jUgmuyVT8w%3Ddg%40mail.gmail.com%3E

> Spark Documentation Error causes java.lang.IllegalStateException: unread 
> block data
> ---
>
> Key: SPARK-1867
> URL: https://issues.apache.org/jira/browse/SPARK-1867
> Project: Spark
>  Issue Type: Bug
>Reporter: sam
>
> I've employed two System Administrators on a contract basis (for quite a bit 
> of money), and both contractors have independently hit the following 
> exception.  What we are doing is:
> 1. Installing Spark 0.9.1 according to the documentation on the website, 
> along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs.
> 2. Building a fat jar with a Spark app with sbt then trying to run it on the 
> cluster
> I've also included code snippets, and sbt deps at the bottom.
> When I've Googled this, there seem to be two somewhat vague responses:
> a) Mismatching spark versions on nodes/user code
> b) Need to add more jars to the SparkConf
> Now I know that (b) is not the problem having successfully run the same code 
> on other clusters while only including one jar (it's a fat jar).
> But I have no idea how to check for (a) - it appears Spark doesn't have any 
> version checks or anything - it would be nice if it checked versions and 
> threw a "mismatching version exception: you have user code using version X 
> and node Y has version Z".
> I would be very grateful for advice on this.
> The exception:
> Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task 
> 0.0:1 failed 32 times (most recent failure: Exception failure: 
> java.lang.IllegalStateException: unread block data)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to 
> java.lang.IllegalStateException: unread block data [duplicate 59]
> My code snippet:
> val conf = new SparkConf()
>.setMaster(clusterMaster)
>.setAppName(appName)
>.setSparkHome(sparkHome)
>.setJars(SparkContext.jarOfClass(this.getClass))
> println("count = " + new SparkContext(conf).textFile(someHdfsPath).count())
> My SBT dependencies:
> // relevant
> "org.apache.spark" % "spark-core_2.10" % "0.9.1",
> "org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0",
> // standard, probably unrelated
> "com.github.seratch" %% "awscala" % "[0.2,)",
> "org.scalacheck" %% "scalacheck" % "1.10.1" % "test",
> "org.specs2" %% "specs2" % "1.14" % "test",
> "org.scala-lang" % "scala-reflect" % "2.10.3",
> "org.sca

[jira] [Commented] (SPARK-1912) Compression memory issue during reduce

2014-05-23 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007229#comment-14007229
 ] 

Andrew Ash commented on SPARK-1912:
---

https://github.com/apache/spark/pull/860

> Compression memory issue during reduce
> --
>
> Key: SPARK-1912
> URL: https://issues.apache.org/jira/browse/SPARK-1912
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Wenchen Fan
>
> When we need to read a compressed block, we first create a compression stream 
> instance (LZF or Snappy) and use it to wrap that block.
> Say a reducer task needs to read 1000 local shuffle blocks: it will first 
> prepare to read those 1000 blocks, which means creating 1000 compression 
> stream instances to wrap them. But initializing a compression instance 
> allocates some memory, and having many compression instances alive at the 
> same time is a problem.
> In practice the reducer reads the shuffle blocks one by one, so why create all 
> the compression instances up front? Can we do it lazily, creating the 
> compression instance for a block only when that block is first read?
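
A minimal standalone sketch of the lazy wrapping being proposed (plain Python 
with gzip standing in for LZF/Snappy; not the actual shuffle code):

{code}
import gzip
import io

class LazyCompressedBlock(object):
    """Defer creating the decompression stream until the block is actually read."""
    def __init__(self, raw_bytes):
        self._raw = raw_bytes
        self._stream = None  # nothing allocated up front

    def read(self):
        if self._stream is None:  # created only on first access
            self._stream = gzip.GzipFile(fileobj=io.BytesIO(self._raw))
        return self._stream.read()

# 1000 blocks are "prepared", but only the block being read holds a decompressor.
blocks = [LazyCompressedBlock(gzip.compress(b"block")) for _ in range(1000)]
print(blocks[0].read())
{code}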





[jira] [Commented] (SPARK-1915) AverageFunction should not count if the evaluated value is null.

2014-05-23 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007136#comment-14007136
 ] 

Takuya Ueshin commented on SPARK-1915:
--

Pull-requested: https://github.com/apache/spark/pull/862

> AverageFunction should not count if the evaluated value is null.
> 
>
> Key: SPARK-1915
> URL: https://issues.apache.org/jira/browse/SPARK-1915
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>
> Average values differ depending on whether the calculation is done partially 
> or not, because {{AverageFunction}} (used in the non-partial calculation) 
> counts a row even when the evaluated value is null.
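
A minimal standalone sketch of the difference (plain Python, not the Catalyst 
code): a null-aware average must exclude nulls from both the sum and the count, 
otherwise the partial and non-partial paths disagree.

{code}
def avg_counting_nulls(values):
    """Models the reported behavior: nulls are excluded from the sum but still counted."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) / float(len(values))

def avg_skipping_nulls(values):
    """Null rows contribute to neither the sum nor the count."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) / float(len(non_null))

data = [1.0, None, 3.0]
print(avg_counting_nulls(data))  # 1.333...
print(avg_skipping_nulls(data))  # 2.0
{code}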





[jira] [Commented] (SPARK-1898) In deploy.yarn.Client, use YarnClient rather than YarnClientImpl

2014-05-23 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007100#comment-14007100
 ] 

Thomas Graves commented on SPARK-1898:
--

That is correct: as long as you don't modify anything in common or yarn-alpha, 
0.23 and the really early 2.0 versions aren't affected.  I think he was 
referring to all the various 2.x releases, though.  Unfortunately the APIs 
changed in the early versions up until the stable 2.2 release.  I think the 
other problem we've seen is that not all the various vendor releases map 
straight to an Apache release.

Ideally this shouldn't be a problem, but it should be sufficiently tested on 
everything we say we support.  Based on that, I agree with @tdas: unless this is 
a real blocker and you can explain why, we should wait to put this in and fix as 
many of the APIs as possible in a follow-on release.


> In deploy.yarn.Client, use YarnClient rather than YarnClientImpl
> 
>
> Key: SPARK-1898
> URL: https://issues.apache.org/jira/browse/SPARK-1898
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Colin Patrick McCabe
>
> In {{deploy.yarn.Client}}, we should use {{YarnClient}} rather than 
> {{YarnClientImpl}}.  The latter is annotated as {{Private}} and {{Unstable}} 
> in Hadoop, which means it could change in an incompatible way at any time.





[jira] [Comment Edited] (SPARK-1215) Clustering: Index out of bounds error

2014-05-23 Thread Denis Serduik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007094#comment-14007094
 ] 

Denis Serduik edited comment on SPARK-1215 at 5/23/14 12:30 PM:


Attaching a test dataset.
MLlib failed to find 4 centers with the k-means|| init mode on this data.



was (Author: dmaverick):
attach test dataset



> Clustering: Index out of bounds error
> -
>
> Key: SPARK-1215
> URL: https://issues.apache.org/jira/browse/SPARK-1215
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: dewshick
>Assignee: Xiangrui Meng
>Priority: Minor
> Attachments: test.csv
>
>
> code:
> import org.apache.spark.mllib.clustering._
> val test = sc.makeRDD(Array(4,4,4,4,4).map(e => Array(e.toDouble)))
> val kmeans = new KMeans().setK(4)
> kmeans.run(test) evals with java.lang.ArrayIndexOutOfBoundsException
> error:
> 14/01/17 12:35:54 INFO scheduler.DAGScheduler: Stage 25 (collectAsMap at 
> KMeans.scala:243) finished in 0.047 s
> 14/01/17 12:35:54 INFO spark.SparkContext: Job finished: collectAsMap at 
> KMeans.scala:243, took 16.389537116 s
> Exception in thread "main" java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at com.simontuffs.onejar.Boot.run(Boot.java:340)
>   at com.simontuffs.onejar.Boot.main(Boot.java:166)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
>   at 
> org.apache.spark.mllib.clustering.LocalKMeans$.kMeansPlusPlus(LocalKMeans.scala:47)
>   at 
> org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:247)
>   at 
> org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
>   at scala.collection.immutable.Range.foreach(Range.scala:81)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
>   at scala.collection.immutable.Range.map(Range.scala:46)
>   at 
> org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:244)
>   at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:124)
>   at Clustering$$anonfun$1.apply$mcDI$sp(Clustering.scala:21)
>   at Clustering$$anonfun$1.apply(Clustering.scala:19)
>   at Clustering$$anonfun$1.apply(Clustering.scala:19)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
>   at scala.collection.immutable.Range.foreach(Range.scala:78)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
>   at scala.collection.immutable.Range.map(Range.scala:46)
>   at Clustering$.main(Clustering.scala:19)
>   at Clustering.main(Clustering.scala)
>   ... 6 more





[jira] [Updated] (SPARK-1215) Clustering: Index out of bounds error

2014-05-23 Thread Denis Serduik (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Serduik updated SPARK-1215:
-

Attachment: test.csv

attach test dataset



> Clustering: Index out of bounds error
> -
>
> Key: SPARK-1215
> URL: https://issues.apache.org/jira/browse/SPARK-1215
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: dewshick
>Assignee: Xiangrui Meng
>Priority: Minor
> Attachments: test.csv
>
>
> code:
> import org.apache.spark.mllib.clustering._
> val test = sc.makeRDD(Array(4,4,4,4,4).map(e => Array(e.toDouble)))
> val kmeans = new KMeans().setK(4)
> kmeans.run(test) evals with java.lang.ArrayIndexOutOfBoundsException
> error:
> 14/01/17 12:35:54 INFO scheduler.DAGScheduler: Stage 25 (collectAsMap at 
> KMeans.scala:243) finished in 0.047 s
> 14/01/17 12:35:54 INFO spark.SparkContext: Job finished: collectAsMap at 
> KMeans.scala:243, took 16.389537116 s
> Exception in thread "main" java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at com.simontuffs.onejar.Boot.run(Boot.java:340)
>   at com.simontuffs.onejar.Boot.main(Boot.java:166)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
>   at 
> org.apache.spark.mllib.clustering.LocalKMeans$.kMeansPlusPlus(LocalKMeans.scala:47)
>   at 
> org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:247)
>   at 
> org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
>   at scala.collection.immutable.Range.foreach(Range.scala:81)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
>   at scala.collection.immutable.Range.map(Range.scala:46)
>   at 
> org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:244)
>   at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:124)
>   at Clustering$$anonfun$1.apply$mcDI$sp(Clustering.scala:21)
>   at Clustering$$anonfun$1.apply(Clustering.scala:19)
>   at Clustering$$anonfun$1.apply(Clustering.scala:19)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
>   at scala.collection.immutable.Range.foreach(Range.scala:78)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
>   at scala.collection.immutable.Range.map(Range.scala:46)
>   at Clustering$.main(Clustering.scala:19)
>   at Clustering.main(Clustering.scala)
>   ... 6 more



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1215) Clustering: Index out of bounds error

2014-05-23 Thread Denis Serduik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007090#comment-14007090
 ] 

Denis Serduik commented on SPARK-1215:
--

I don't think the problem is about the size of the dataset. I've faced a 
similar issue on a dataset with about 900 items. As a workaround, we've 
decided to fall back to the random init mode.
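
For reference, a minimal sketch of that workaround (the RDD of training points 
is assumed to be named {{data}}, e.g. built with sc.makeRDD as in the report; 
its element type depends on the Spark version):

{code}
import org.apache.spark.mllib.clustering.KMeans

// Fall back to random initialization instead of k-means|| to avoid the
// ArrayIndexOutOfBoundsException in kMeansPlusPlus.
val model = new KMeans()
  .setK(4)
  .setInitializationMode(KMeans.RANDOM)
  .run(data)
{code}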



> Clustering: Index out of bounds error
> -
>
> Key: SPARK-1215
> URL: https://issues.apache.org/jira/browse/SPARK-1215
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: dewshick
>Assignee: Xiangrui Meng
>Priority: Minor
>
> code:
> import org.apache.spark.mllib.clustering._
> val test = sc.makeRDD(Array(4,4,4,4,4).map(e => Array(e.toDouble)))
> val kmeans = new KMeans().setK(4)
> kmeans.run(test) fails with java.lang.ArrayIndexOutOfBoundsException
> error:
> 14/01/17 12:35:54 INFO scheduler.DAGScheduler: Stage 25 (collectAsMap at 
> KMeans.scala:243) finished in 0.047 s
> 14/01/17 12:35:54 INFO spark.SparkContext: Job finished: collectAsMap at 
> KMeans.scala:243, took 16.389537116 s
> Exception in thread "main" java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at com.simontuffs.onejar.Boot.run(Boot.java:340)
>   at com.simontuffs.onejar.Boot.main(Boot.java:166)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
>   at 
> org.apache.spark.mllib.clustering.LocalKMeans$.kMeansPlusPlus(LocalKMeans.scala:47)
>   at 
> org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:247)
>   at 
> org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
>   at scala.collection.immutable.Range.foreach(Range.scala:81)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
>   at scala.collection.immutable.Range.map(Range.scala:46)
>   at 
> org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:244)
>   at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:124)
>   at Clustering$$anonfun$1.apply$mcDI$sp(Clustering.scala:21)
>   at Clustering$$anonfun$1.apply(Clustering.scala:19)
>   at Clustering$$anonfun$1.apply(Clustering.scala:19)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
>   at scala.collection.immutable.Range.foreach(Range.scala:78)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
>   at scala.collection.immutable.Range.map(Range.scala:46)
>   at Clustering$.main(Clustering.scala:19)
>   at Clustering.main(Clustering.scala)
>   ... 6 more



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1913) Parquet table column pruning error caused by filter pushdown

2014-05-23 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007063#comment-14007063
 ] 

Cheng Lian commented on SPARK-1913:
---

Corresponding PR: https://github.com/apache/spark/pull/863

> Parquet table column pruning error caused by filter pushdown
> 
>
> Key: SPARK-1913
> URL: https://issues.apache.org/jira/browse/SPARK-1913
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
> Environment: mac os 10.9.2
>Reporter: Chen Chao
>
> When scanning Parquet tables, attributes referenced only in predicates that 
> are pushed down are not passed to the `ParquetTableScan` operator, which 
> causes an exception. Verified in the {{sbt hive/console}}:
> {code}
> loadTestTable("src")
> table("src").saveAsParquetFile("src.parquet")
> parquetFile("src.parquet").registerAsTable("src_parquet")
> hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1915) AverageFunction should not count if the evaluated value is null.

2014-05-23 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-1915:


 Summary: AverageFunction should not count if the evaluated value 
is null.
 Key: SPARK-1915
 URL: https://issues.apache.org/jira/browse/SPARK-1915
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin


Average values differ depending on whether the calculation is done partially 
or not, because {{AverageFunction}} (in the non-partial calculation) counts a 
row even if the evaluated value is null.
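
A minimal sketch (plain Scala, not the actual Catalyst code) of the intended 
behaviour, where nulls contribute to neither the sum nor the count, so partial 
and non-partial averages agree:

{code}
// Nulls are skipped entirely: they affect neither the sum nor the count.
def average(values: Seq[java.lang.Double]): Option[Double] = {
  val nonNull = values.filter(_ != null).map(_.doubleValue)
  if (nonNull.isEmpty) None else Some(nonNull.sum / nonNull.size)
}

average(Seq[java.lang.Double](1.0, null, 3.0))  // Some(2.0), not Some(4.0 / 3)
{code}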



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1913) Parquet table column pruning error caused by filter pushdown

2014-05-23 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-1913:
--

Summary: Parquet table column pruning error caused by filter pushdown  
(was: Column pruning for Parquet table)

> Parquet table column pruning error caused by filter pushdown
> 
>
> Key: SPARK-1913
> URL: https://issues.apache.org/jira/browse/SPARK-1913
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
> Environment: mac os 10.9.2
>Reporter: Chen Chao
>
> When scanning Parquet tables, attributes referenced only in predicates that 
> are pushed down are not passed to the `ParquetTableScan` operator, which 
> causes an exception. Verified in the {{sbt hive/console}}:
> {code}
> loadTestTable("src")
> table("src").saveAsParquetFile("src.parquet")
> parquetFile("src.parquet").registerAsTable("src_parquet")
> hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1913) Column pruning for Parquet table

2014-05-23 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-1913:
--

Summary: Column pruning for Parquet table  (was: column pruning problem of 
Parquet  File)

> Column pruning for Parquet table
> 
>
> Key: SPARK-1913
> URL: https://issues.apache.org/jira/browse/SPARK-1913
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
> Environment: mac os 10.9.2
>Reporter: Chen Chao
>
> When scanning Parquet tables, attributes referenced only in predicates that 
> are pushed down are not passed to the `ParquetTableScan` operator, which 
> causes an exception. Verified in the {{sbt hive/console}}:
> {code:java}
> loadTestTable("src")
> table("src").saveAsParquetFile("src.parquet")
> parquetFile("src.parquet").registerAsTable("src_parquet")
> hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1913) Column pruning for Parquet table

2014-05-23 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-1913:
--

Description: 
When scanning Parquet tables, attributes referenced only in predicates that 
are pushed down are not passed to the `ParquetTableScan` operator, which 
causes an exception. Verified in the {{sbt hive/console}}:

{code}
loadTestTable("src")
table("src").saveAsParquetFile("src.parquet")
parquetFile("src.parquet").registerAsTable("src_parquet")
hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println)
{code}

  was:
When scanning Parquet tables, attributes referenced only in predicates that 
are pushed down are not passed to the `ParquetTableScan` operator, which 
causes an exception. Verified in the {{sbt hive/console}}:

{code:java}
loadTestTable("src")
table("src").saveAsParquetFile("src.parquet")
parquetFile("src.parquet").registerAsTable("src_parquet")
hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println)
{code}


> Column pruning for Parquet table
> 
>
> Key: SPARK-1913
> URL: https://issues.apache.org/jira/browse/SPARK-1913
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
> Environment: mac os 10.9.2
>Reporter: Chen Chao
>
> When scanning Parquet tables, attributes referenced only in predicates that 
> are pushed down are not passed to the `ParquetTableScan` operator, which 
> causes an exception. Verified in the {{sbt hive/console}}:
> {code}
> loadTestTable("src")
> table("src").saveAsParquetFile("src.parquet")
> parquetFile("src.parquet").registerAsTable("src_parquet")
> hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1913) column pruning problem of Parquet File

2014-05-23 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-1913:
--

Description: 
When scanning Parquet tables, attributes referenced only in predicates that 
are pushed down are not passed to the `ParquetTableScan` operator, which 
causes an exception. Verified in the {{sbt hive/console}}:

{code:java}
loadTestTable("src")
table("src").saveAsParquetFile("src.parquet")
parquetFile("src.parquet").registerAsTable("src_parquet")
hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println)
{code}

  was:
When scanning Parquet tables, attributes referenced only in predicates that 
are pushed down are not passed to the `ParquetTableScan` operator, which 
causes an exception. Verified in the {{sbt hive/console}}:

{code:scala}
loadTestTable("src")
table("src").saveAsParquetFile("src.parquet")
parquetFile("src.parquet").registerAsTable("src_parquet")
hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println)
{code}


> column pruning problem of Parquet  File
> ---
>
> Key: SPARK-1913
> URL: https://issues.apache.org/jira/browse/SPARK-1913
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
> Environment: mac os 10.9.2
>Reporter: Chen Chao
>
> When scanning Parquet tables, attributes referenced only in predicates that 
> are pushed down are not passed to the `ParquetTableScan` operator, which 
> causes an exception. Verified in the {{sbt hive/console}}:
> {code:java}
> loadTestTable("src")
> table("src").saveAsParquetFile("src.parquet")
> parquetFile("src.parquet").registerAsTable("src_parquet")
> hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1913) column pruning problem of Parquet File

2014-05-23 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-1913:
--

Description: 
When scanning Parquet tables, attributes referenced only in predicates that 
are pushed down are not passed to the `ParquetTableScan` operator, which 
causes an exception. Verified in the {{sbt hive/console}}:

{code:scala}
loadTestTable("src")
table("src").saveAsParquetFile("src.parquet")
parquetFile("src.parquet").registerAsTable("src_parquet")
hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println)
{code}

  was:
case class Person(name: String, age: Int)

If we use a Parquet file, the following statement will throw an exception 
saying java.lang.IllegalArgumentException: Column age does not exist.
sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
We have to add the age column after SELECT in order to make it work:
sql("SELECT name, age FROM parquetFile WHERE age >= 13 AND age <= 19")

The same error also occurs when we use the DSL:
 parquetFile.where('key === 1).select('value as 'a).collect().foreach(println)
will tell us it cannot find column 'key'; we have to fix it like this:
 parquetFile.where('key === 1).select('key, 'value as 
'a).collect().foreach(println)

Obviously, that's not what we want!


> column pruning problem of Parquet  File
> ---
>
> Key: SPARK-1913
> URL: https://issues.apache.org/jira/browse/SPARK-1913
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
> Environment: mac os 10.9.2
>Reporter: Chen Chao
>
> When scanning Parquet tables, attributes referenced only in predicates that 
> are pushed down are not passed to the `ParquetTableScan` operator, which 
> causes an exception. Verified in the {{sbt hive/console}}:
> {code:scala}
> loadTestTable("src")
> table("src").saveAsParquetFile("src.parquet")
> parquetFile("src.parquet").registerAsTable("src_parquet")
> hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1914) Simplify CountFunction not to traverse to evaluate all child expressions.

2014-05-23 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007033#comment-14007033
 ] 

Takuya Ueshin commented on SPARK-1914:
--

Pull-requested: https://github.com/apache/spark/pull/861

> Simplify CountFunction not to traverse to evaluate all child expressions.
> -
>
> Key: SPARK-1914
> URL: https://issues.apache.org/jira/browse/SPARK-1914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>
> {{CountFunction}} should count up only if the child's evaluated value is not 
> null.
> Because it traverses and evaluates all child expressions, it counts up if 
> any of the children is not null, even when the counted child itself is null.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1914) Simplify CountFunction not to traverse to evaluate all child expressions.

2014-05-23 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-1914:
-

Description: 
{{CountFunction}} should count up only if the child's evaluated value is not 
null.

Because it traverses and evaluates all child expressions, it counts up if any 
of the children is not null, even when the counted child itself is null.
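
A minimal sketch (plain Scala, not the actual Catalyst code) of the intended 
behaviour versus the one being fixed, with rows modelled as already-evaluated 
values (the names are illustrative):

{code}
// exprValue is the evaluated value of the counted expression itself;
// childValues are the evaluated values of its individual child expressions.
case class EvaluatedRow(exprValue: Any, childValues: Seq[Any])

// Intended: count a row only when the counted expression is non-null.
def countCorrect(rows: Seq[EvaluatedRow]): Long =
  rows.count(_.exprValue != null)

// Behaviour being fixed: counting whenever any child is non-null.
def countBuggy(rows: Seq[EvaluatedRow]): Long =
  rows.count(_.childValues.exists(_ != null))
{code}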

> Simplify CountFunction not to traverse to evaluate all child expressions.
> -
>
> Key: SPARK-1914
> URL: https://issues.apache.org/jira/browse/SPARK-1914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>
> {{CountFunction}} should count up only if the child's evaluated value is not 
> null.
> Because it traverses and evaluates all child expressions, it counts up if 
> any of the children is not null, even when the counted child itself is null.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1912) Compression memory issue during reduce

2014-05-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-1912:
---

Summary: Compression memory issue during reduce  (was: Compression memory 
issue during shuffle)

> Compression memory issue during reduce
> --
>
> Key: SPARK-1912
> URL: https://issues.apache.org/jira/browse/SPARK-1912
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Wenchen Fan
>
> When we need to read a compressed block, we first create a compression 
> stream instance (LZF or Snappy) and use it to wrap that block.
> Say a reducer task needs to read 1000 local shuffle blocks: it first 
> prepares to read those 1000 blocks, which means creating 1000 compression 
> stream instances to wrap them. The initialization of each compression 
> instance allocates some memory, so having many compression instances alive 
> at the same time is a problem.
> The reducer actually reads the shuffle blocks one by one, so why create the 
> compression instances up front? We could do it lazily: create the 
> compression instance for a block only when that block is first read.
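
A minimal sketch of the lazy approach suggested above (the names and types are 
illustrative, not the actual BlockManager code): the compression stream for a 
block is created only when the reducer actually reaches that block.

{code}
import java.io.InputStream

// openBlock returns the raw (compressed) stream for one shuffle block and
// wrap adds the LZF/Snappy decompression layer. Because Iterator.map is
// lazy, streams are created one at a time as the reducer iterates, instead
// of all of them up front.
def lazilyWrapped(blockIds: Seq[String],
                  openBlock: String => InputStream,
                  wrap: InputStream => InputStream): Iterator[InputStream] =
  blockIds.iterator.map(id => wrap(openBlock(id)))
{code}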



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1913) column pruning problem of Parquet File

2014-05-23 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007026#comment-14007026
 ] 

Cheng Lian commented on SPARK-1913:
---

Attributes referenced only in those filters that are pushed down are not 
considered when building the {{ParquetTableScan}} operator in 
{{ParquetOperations}}. Will submit a PR for this.
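
A hypothetical sketch of that idea (illustrative only, not the actual planner 
code): the scan's output should be the union of the projected columns and any 
columns referenced only by pushed-down filters.

{code}
// 'Column' stands in for Catalyst attributes here.
case class Column(name: String)

def scanOutput(projected: Seq[Column], pushedFilterRefs: Seq[Column]): Seq[Column] =
  (projected ++ pushedFilterRefs).distinct

scanOutput(Seq(Column("value")), Seq(Column("key")))
// List(Column(value), Column(key)): 'key' must reach ParquetTableScan too
{code}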

> column pruning problem of Parquet  File
> ---
>
> Key: SPARK-1913
> URL: https://issues.apache.org/jira/browse/SPARK-1913
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
> Environment: mac os 10.9.2
>Reporter: Chen Chao
>
> case class Person(name: String, age: Int)
> If we use a Parquet file, the following statement will throw an exception 
> saying java.lang.IllegalArgumentException: Column age does not exist.
> sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
> We have to add the age column after SELECT in order to make it work:
> sql("SELECT name, age FROM parquetFile WHERE age >= 13 AND age <= 19")
> The same error also occurs when we use the DSL:
>  parquetFile.where('key === 1).select('value as 'a).collect().foreach(println)
> will tell us it cannot find column 'key'; we have to fix it like this:
>  parquetFile.where('key === 1).select('key, 'value as 
> 'a).collect().foreach(println)
> Obviously, that's not what we want!



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1914) Simplify CountFunction not to traverse to evaluate all child expressions.

2014-05-23 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-1914:


 Summary: Simplify CountFunction not to traverse to evaluate all 
child expressions.
 Key: SPARK-1914
 URL: https://issues.apache.org/jira/browse/SPARK-1914
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1913) column pruning problem of Parquet File

2014-05-23 Thread Chen Chao (JIRA)
Chen Chao created SPARK-1913:


 Summary: column pruning problem of Parquet  File
 Key: SPARK-1913
 URL: https://issues.apache.org/jira/browse/SPARK-1913
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
 Environment: mac os 10.9.2
Reporter: Chen Chao


case class Person(name: String, age: Int)

If we use a Parquet file, the following statement will throw an exception 
saying java.lang.IllegalArgumentException: Column age does not exist.
sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
We have to add the age column after SELECT in order to make it work:
sql("SELECT name, age FROM parquetFile WHERE age >= 13 AND age <= 19")

The same error also occurs when we use the DSL:
 parquetFile.where('key === 1).select('value as 'a).collect().foreach(println)
will tell us it cannot find column 'key'; we have to fix it like this:
 parquetFile.where('key === 1).select('key, 'value as 
'a).collect().foreach(println)

Obviously, that's not what we want!



--
This message was sent by Atlassian JIRA
(v6.2#6252)