[jira] [Created] (SPARK-1919) In Windows, Spark shell cannot load classes in spark.jars (--jars)
Andrew Or created SPARK-1919: Summary: In Windows, Spark shell cannot load classes in spark.jars (--jars) Key: SPARK-1919 URL: https://issues.apache.org/jira/browse/SPARK-1919 Project: Spark Issue Type: Bug Components: Windows Affects Versions: 1.0.0 Reporter: Andrew Or Not sure what the issue is, but Spark submit does not have the same problem, even if the jars specified are the same. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1918) PySpark shell --py-files does not work for zip files
[ https://issues.apache.org/jira/browse/SPARK-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007874#comment-14007874 ] Andrew Or commented on SPARK-1918: -- https://github.com/apache/spark/pull/853 > PySpark shell --py-files does not work for zip files > > > Key: SPARK-1918 > URL: https://issues.apache.org/jira/browse/SPARK-1918 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.0.0 >Reporter: Andrew Or > Fix For: 1.0.0 > > > For pyspark shell, we never add --py-files to the python path. This is > specific to non-python files, because python does not automatically look into > zip files even if they are also uploaded to the HTTP server through > `sc.addFile`. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1918) PySpark shell --py-files does not work for zip files
Andrew Or created SPARK-1918: Summary: PySpark shell --py-files does not work for zip files Key: SPARK-1918 URL: https://issues.apache.org/jira/browse/SPARK-1918 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Andrew Or Fix For: 1.0.0 For pyspark shell, we never add --py-files to the python path. This is specific to non-python files, because python does not automatically look into zip files even if they are also uploaded to the HTTP server through `sc.addFile`. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1138) Spark 0.9.0 does not work with Hadoop / HDFS
[ https://issues.apache.org/jira/browse/SPARK-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007837#comment-14007837 ] Reynold Xin commented on SPARK-1138: Just want to chime in that I also encountered this stack trace, and the problem was an older Netty (in particular, I have Netty 3.4 on my classpath). Once I included Netty 3.6.6, the problem went away. > Spark 0.9.0 does not work with Hadoop / HDFS > > > Key: SPARK-1138 > URL: https://issues.apache.org/jira/browse/SPARK-1138 > Project: Spark > Issue Type: Bug >Reporter: Sam Abeyratne > > UPDATE: This problem is certainly related to trying to use Spark 0.9.0 and > the latest cloudera Hadoop / HDFS in the same jar. It seems no matter how I > fiddle with the deps, the do not play nice together. > I'm getting a java.util.concurrent.TimeoutException when trying to create a > spark context with 0.9. I cannot, whatever I do, change the timeout. I've > tried using System.setProperty, the SparkConf mechanism of creating a > SparkContext and the -D flags when executing my jar. I seem to be able to > run simple jobs from the spark-shell OK, but my more complicated jobs require > external libraries so I need to build jars and execute them. > Some code that causes this: > println("Creating config") > val conf = new SparkConf() > .setMaster(clusterMaster) > .setAppName("MyApp") > .setSparkHome(sparkHome) > .set("spark.akka.askTimeout", parsed.getOrElse(timeouts, "100")) > .set("spark.akka.timeout", parsed.getOrElse(timeouts, "100")) > println("Creating sc") > implicit val sc = new SparkContext(conf) > The output: > Creating config > Creating sc > log4j:WARN No appenders could be found for logger > (akka.event.slf4j.Slf4jLogger). > log4j:WARN Please initialize the log4j system properly. > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more > info. > [ERROR] [02/26/2014 11:05:25.491] [main] [Remoting] Remoting error: [Startup > timed out] [ > akka.remote.RemoteTransportException: Startup timed out > at > akka.remote.Remoting.akka$remote$Remoting$$notifyError(Remoting.scala:129) > at akka.remote.Remoting.start(Remoting.scala:191) > at > akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184) > at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:579) > at akka.actor.ActorSystemImpl._start(ActorSystem.scala:577) > at akka.actor.ActorSystemImpl.start(ActorSystem.scala:588) > at akka.actor.ActorSystem$.apply(ActorSystem.scala:111) > at akka.actor.ActorSystem$.apply(ActorSystem.scala:104) > at > org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:96) > at org.apache.spark.SparkEnv$.create(SparkEnv.scala:126) > at org.apache.spark.SparkContext.(SparkContext.scala:139) > at > com.adbrain.accuracy.EvaluateAdtruthIDs$.main(EvaluateAdtruthIDs.scala:40) > at > com.adbrain.accuracy.EvaluateAdtruthIDs.main(EvaluateAdtruthIDs.scala) > Caused by: java.util.concurrent.TimeoutException: Futures timed out after > [1 milliseconds] > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at akka.remote.Remoting.start(Remoting.scala:173) > ... 
11 more > ] > Exception in thread "main" java.util.concurrent.TimeoutException: Futures > timed out after [1 milliseconds] > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at akka.remote.Remoting.start(Remoting.scala:173) > at > akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184) > at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:579) > at akka.actor.ActorSystemImpl._start(ActorSystem.scala:577) > at akka.actor.ActorSystemImpl.start(ActorSystem.scala:588) > at akka.actor.ActorSystem$.apply(ActorSystem.scala:111) > at akka.actor.ActorSystem$.apply(ActorSystem.scala:104) > at > org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:96) > at org.apache.spark.SparkEnv$.create(SparkEnv.scala:126) > at org.apache.spark.SparkCont
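Reynold's comment above points at a transitive-dependency conflict rather than a Spark bug. A hedged sbt sketch of that workaround: pin Netty to the 3.6.x line so an older Netty 3.4 pulled in transitively cannot shadow the version Akka remoting needs. The coordinates below are an assumption; verify them against your own dependency tree before relying on them.

{code}
// Hypothetical sbt override; coordinates assumed, check your dependency tree first.
dependencyOverrides += "io.netty" % "netty" % "3.6.6.Final"
{code}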
[jira] [Commented] (SPARK-1917) PySpark fails to import functions from {{scipy.special}}
[ https://issues.apache.org/jira/browse/SPARK-1917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007741#comment-14007741 ] Uri Laserson commented on SPARK-1917: - https://github.com/apache/spark/pull/866 > PySpark fails to import functions from {{scipy.special}} > > > Key: SPARK-1917 > URL: https://issues.apache.org/jira/browse/SPARK-1917 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 0.9.0, 1.0.0 >Reporter: Uri Laserson > Original Estimate: 1h > Remaining Estimate: 1h > > PySpark is able to load {{numpy}} functions, but not {{scipy.special}} > functions. For example take this snippet: > {code} > from numpy import exp > from scipy.special import gammaln > a = range(1, 11) > b = sc.parallelize(a) > c = b.map(exp) > d = b.map(special.gammaln) > {code} > Calling {{c.collect()}} will return the expected result. However, calling > {{d.collect()}} will fail with > {code} > KeyError: (('gammaln',), , > ('scipy.special', 'gammaln')) > {code} > in {{cloudpickle.py}} module in {{_getobject}}. > The reason is that {{_getobject}} executes {{__import__(modname)}}, which > only loads the top-level package {{X}} in case {{modname}} is like {{X.Y}}. > It is failing because {{gammaln}} is not a member of {{scipy}}. The fix (for > which I will shortly submit a PR) is to add {{fromlist=[attribute]}} to the > {{__import__}} call, which will load the innermost module. > See > [https://docs.python.org/2/library/functions.html#__import__] > and > [http://stackoverflow.com/questions/9544331/from-a-b-import-x-using-import] -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1913) Parquet table column pruning error caused by filter pushdown
[ https://issues.apache.org/jira/browse/SPARK-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-1913: --- Description: When scanning Parquet tables, attributes referenced only in predicates that are pushed down are not passed to the `ParquetTableScan` operator and causes exception. Verified in the {{sbt hive/console}}: {code} loadTestTable("src") table("src").saveAsParquetFile("src.parquet") parquetFile("src.parquet").registerAsTable("src_parquet") hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println) {code} Exception {code} parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/scratch/rxin/spark/src.parquet/part-r-2.parquet at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:177) at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:122) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717) at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: java.lang.IllegalArgumentException: Column key does not exist. at parquet.filter.ColumnRecordFilter$1.bind(ColumnRecordFilter.java:51) at org.apache.spark.sql.parquet.ComparisonFilter.bind(ParquetFilters.scala:306) at parquet.io.FilteredRecordReader.(FilteredRecordReader.java:46) at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:74) at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:110) at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172) ... 28 more {code} was: When scanning Parquet tables, attributes referenced only in predicates that are pushed down are not passed to the `ParquetTableScan` operator and causes exception. 
Verified in the {{sbt hive/console}}: {code} loadTestTable("src") table("src").saveAsParquetFile("src.parquet") parquetFile("src.parquet").registerAsTable("src_parquet") hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println) {code} > Parquet table column pruning error caused by filter pushdown > > > Key: SPARK-1913 > URL: https://issues.apache.org/jira/browse/SPARK-1913 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 > Environment: mac os 10.9.2 >Reporter: Chen Chao > > When scanning Parquet tables, attributes referenced only in predicates that > are pushed down are not passed to the `ParquetTableScan` operator and causes > exception. Verified in the {{sbt hive/console}}: > {code} > loadTestTable("src") > table("src").saveAsParquetFile("src.parquet") > parquetFile("src.parquet").registerAsTable("src_parquet") > hql("SELECT value FROM src_parquet WHERE key < 10").collect(
[jira] [Commented] (SPARK-1913) Parquet table column pruning error caused by filter pushdown
[ https://issues.apache.org/jira/browse/SPARK-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007710#comment-14007710 ] Reynold Xin commented on SPARK-1913: I added the exception. > Parquet table column pruning error caused by filter pushdown > > > Key: SPARK-1913 > URL: https://issues.apache.org/jira/browse/SPARK-1913 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 > Environment: mac os 10.9.2 >Reporter: Chen Chao > > When scanning Parquet tables, attributes referenced only in predicates that > are pushed down are not passed to the `ParquetTableScan` operator and causes > exception. Verified in the {{sbt hive/console}}: > {code} > loadTestTable("src") > table("src").saveAsParquetFile("src.parquet") > parquetFile("src.parquet").registerAsTable("src_parquet") > hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println) > {code} > Exception > {code} > parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in > file file:/scratch/rxin/spark/src.parquet/part-r-2.parquet > at > parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:177) > at > parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130) > at > org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:122) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717) > at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717) > at > org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080) > at > org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) > at org.apache.spark.scheduler.Task.run(Task.scala:51) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:744) > Caused by: java.lang.IllegalArgumentException: Column key does not exist. 
> at parquet.filter.ColumnRecordFilter$1.bind(ColumnRecordFilter.java:51) > at > org.apache.spark.sql.parquet.ComparisonFilter.bind(ParquetFilters.scala:306) > at parquet.io.FilteredRecordReader.(FilteredRecordReader.java:46) > at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:74) > at > parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:110) > at > parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172) > ... 28 more > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
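The failure above is that {{key}} is referenced only by the pushed-down predicate, so it is never requested from the Parquet scan and the filter cannot bind to the column. A minimal sketch of the idea behind a fix, using hypothetical names rather than the actual Catalyst planner types: the scan must request the union of the projected columns and the columns referenced by pushed-down filters.

{code}
case class Attribute(name: String)
case class PushedFilter(references: Seq[Attribute])

// Columns the Parquet scan must request: the projected columns plus any referenced
// only by pushed-down predicates (here `key` from "WHERE key < 10").
def scanColumns(projected: Seq[Attribute], filters: Seq[PushedFilter]): Seq[Attribute] =
  (projected ++ filters.flatMap(_.references)).distinct

// scanColumns(Seq(Attribute("value")), Seq(PushedFilter(Seq(Attribute("key")))))
// == Seq(Attribute("value"), Attribute("key"))
{code}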
[jira] [Created] (SPARK-1917) PySpark fails to import functions from {{scipy.special}}
Uri Laserson created SPARK-1917: --- Summary: PySpark fails to import functions from {{scipy.special}} Key: SPARK-1917 URL: https://issues.apache.org/jira/browse/SPARK-1917 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.0, 1.0.0 Reporter: Uri Laserson PySpark is able to load {{numpy}} functions, but not {{scipy.special}} functions. For example take this snippet: {code} from numpy import exp from scipy.special import gammaln a = range(1, 11) b = sc.parallelize(a) c = b.map(exp) d = b.map(special.gammaln) {code} Calling {{c.collect()}} will return the expected result. However, calling {{d.collect()}} will fail with {code} KeyError: (('gammaln',), , ('scipy.special', 'gammaln')) {code} in {{cloudpickle.py}} module in {{_getobject}}. The reason is that {{_getobject}} executes {{__import__(modname)}}, which only loads the top-level package {{X}} in case {{modname}} is like {{X.Y}}. It is failing because {{gammaln}} is not a member of {{scipy}}. The fix (for which I will shortly submit a PR) is to add {{fromlist=[attribute]}} to the {{__import__}} call, which will load the innermost module. See [https://docs.python.org/2/library/functions.html#__import__] and [http://stackoverflow.com/questions/9544331/from-a-b-import-x-using-import] -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1902) Spark shell prints error when :4040 port already in use
[ https://issues.apache.org/jira/browse/SPARK-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007684#comment-14007684 ] Patrick Wendell commented on SPARK-1902: Yeah, this would be a good one to fix. IIRC I spent a long time trying to figure out how to silence only this message, but I couldn't do anything except for silencing all jetty WARN logs (which we don't want). I also considered first checking if the port is free before trying to bind to it, but that has race conditions. If someone figures out a better way to do this, that would be great. > Spark shell prints error when :4040 port already in use > --- > > Key: SPARK-1902 > URL: https://issues.apache.org/jira/browse/SPARK-1902 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Ash > > When running two shells on the same machine, I get the below error. The > issue is that the first shell takes port 4040, then the next tries tries 4040 > and fails so falls back to 4041, then a third would try 4040 and 4041 before > landing on 4042, etc. > We should catch the error and instead log as "Unable to use port 4041; > already in use. Attempting port 4042..." > {noformat} > 14/05/22 11:31:54 WARN component.AbstractLifeCycle: FAILED > SelectChannelConnector@0.0.0.0:4041: java.net.BindException: Address already > in use > java.net.BindException: Address already in use > at sun.nio.ch.Net.bind0(Native Method) > at sun.nio.ch.Net.bind(Net.java:444) > at sun.nio.ch.Net.bind(Net.java:436) > at > sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214) > at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) > at > org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187) > at > org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316) > at > org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265) > at > org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64) > at org.eclipse.jetty.server.Server.doStart(Server.java:293) > at > org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64) > at > org.apache.spark.ui.JettyUtils$$anonfun$1.apply$mcV$sp(JettyUtils.scala:192) > at > org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192) > at > org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192) > at scala.util.Try$.apply(Try.scala:161) > at org.apache.spark.ui.JettyUtils$.connect$1(JettyUtils.scala:191) > at > org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:205) > at org.apache.spark.ui.WebUI.bind(WebUI.scala:99) > at org.apache.spark.SparkContext.(SparkContext.scala:217) > at > org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:957) > at $line3.$read$$iwC$$iwC.(:8) > at $line3.$read$$iwC.(:14) > at $line3.$read.(:16) > at $line3.$read$.(:20) > at $line3.$read$.() > at $line3.$eval$.(:7) > at $line3.$eval$.() > at $line3.$eval.$print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788) > at > org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056) > at > 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609) > at > org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796) > at > org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841) > at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753) > at > org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:121) > at > org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:120) > at > org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:263) > at > org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:120) > at > org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:56) >
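A minimal sketch of the behaviour the ticket asks for, using a hypothetical helper rather than Spark's JettyUtils: treat BindException as an expected condition, log a one-line message, and move on to the next port instead of printing the full stack trace.

{code}
import java.net.BindException

// Hypothetical retry helper; `bind` stands in for whatever actually starts the server.
def startOnFreePort(port: Int, remainingRetries: Int)(bind: Int => Unit): Int =
  try {
    bind(port)
    port
  } catch {
    case _: BindException if remainingRetries > 0 =>
      println(s"Unable to use port $port; already in use. Attempting port ${port + 1}...")
      startOnFreePort(port + 1, remainingRetries - 1)(bind)
  }
{code}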
[jira] [Updated] (SPARK-1916) SparkFlumeEvent with body bigger than 1020 bytes are not read properly
[ https://issues.apache.org/jira/browse/SPARK-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lemieux updated SPARK-1916: - Attachment: SPARK-1916.diff Attaching a diff for now. I'll create a pull request shortly. > SparkFlumeEvent with body bigger than 1020 bytes are not read properly > -- > > Key: SPARK-1916 > URL: https://issues.apache.org/jira/browse/SPARK-1916 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 0.9.0 >Reporter: David Lemieux > Attachments: SPARK-1916.diff > > > The readExternal implementation on SparkFlumeEvent will read only the first > 1020 bytes of the actual body when streaming data from flume. > This means that any event sent to Spark via Flume will be processed properly > if the body is small, but will fail if the body is bigger than 1020. > Considering that the default max size for a Flume Avro Event is 32K, the > implementation should be updated to read more. > The following is related : > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-using-Flume-body-size-limitation-tt6127.html -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1916) SparkFlumeEvent with body bigger than 1020 bytes are not read properly
David Lemieux created SPARK-1916: Summary: SparkFlumeEvent with body bigger than 1020 bytes are not read properly Key: SPARK-1916 URL: https://issues.apache.org/jira/browse/SPARK-1916 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0 Reporter: David Lemieux The readExternal implementation on SparkFlumeEvent will read only the first 1020 bytes of the actual body when streaming data from flume. This means that any event sent to Spark via Flume will be processed properly if the body is small, but will fail if the body is bigger than 1020. Considering that the default max size for a Flume Avro Event is 32K, the implementation should be updated to read more. The following is related : http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-using-Flume-body-size-limitation-tt6127.html -- This message was sent by Atlassian JIRA (v6.2#6252)
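A minimal sketch of the likely shape of the fix, simplified and not the actual SparkFlumeEvent code: a single read() call is not guaranteed to fill the buffer, so bodies longer than one chunk get truncated, whereas readFully loops until the declared length has been read.

{code}
import java.io.ObjectInput

// Hypothetical, simplified body reader; assumes the length was written before the bytes.
def readBody(in: ObjectInput): Array[Byte] = {
  val bodyLength = in.readInt()
  val bodyBuff   = new Array[Byte](bodyLength)
  in.readFully(bodyBuff)  // keeps reading until all bodyLength bytes have arrived
  bodyBuff
}
{code}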
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007565#comment-14007565 ] Michael Malak commented on SPARK-1867: -- Thank you, sam, that fixed it for me! FYI, I had: {noformat} org.apache.hadoop hadoop-common 2.3.0-cdh5.0.1 org.apache.hadoop hadoop-mapreduce-client-core 2.3.0-cdh5.0.1 {noformat} By removing the second one, textfile().count now works. > Spark Documentation Error causes java.lang.IllegalStateException: unread > block data > --- > > Key: SPARK-1867 > URL: https://issues.apache.org/jira/browse/SPARK-1867 > Project: Spark > Issue Type: Bug >Reporter: sam > > I've employed two System Administrators on a contract basis (for quite a bit > of money), and both contractors have independently hit the following > exception. What we are doing is: > 1. Installing Spark 0.9.1 according to the documentation on the website, > along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. > 2. Building a fat jar with a Spark app with sbt then trying to run it on the > cluster > I've also included code snippets, and sbt deps at the bottom. > When I've Googled this, there seems to be two somewhat vague responses: > a) Mismatching spark versions on nodes/user code > b) Need to add more jars to the SparkConf > Now I know that (b) is not the problem having successfully run the same code > on other clusters while only including one jar (it's a fat jar). > But I have no idea how to check for (a) - it appears Spark doesn't have any > version checks or anything - it would be nice if it checked versions and > threw a "mismatching version exception: you have user code using version X > and node Y has version Z". > I would be very grateful for advice on this. > The exception: > Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task > 0.0:1 failed 32 times (most recent failure: Exception failure: > java.lang.IllegalStateException: unread block data) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) > at akka.actor.ActorCell.invoke(ActorCell.scala:456) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) > at akka.dispatch.Mailbox.run(Mailbox.scala:219) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to > java.lang.IllegalStateException: unread block data [duplicate 59] > My code snippet: > val conf = new SparkConf() >.setMaster(clusterMaster) >.setAppName(appName) >.setSparkHome(sparkHome) >.setJars(SparkContext.jarOfClass(this.getClass)) > println("count = " + new SparkContext(conf).textFile(someHdfsPath).count()) > My SBT dependencies: > // relevant > "org.apache.spark" % "spark-core_2.10" % "0.9.1", > "org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0", > // standard, probably unrelated > "com.github.seratch" %% "awscala" % "[0.2,)", > "org.scalacheck" %% "scalacheck" % "1.10.1" % "test", > "org.specs2" %% "specs2" % "1.14" % "test", > "org.scala-lang" % "scala-reflect" % "2.10.3", > "org.scalaz" %% "scalaz-core" % "7.0.5", > "net.minidev" % "json-smart" % "1.2" -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-983) Support external sorting for RDD#sortByKey()
[ https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007479#comment-14007479 ] Madhu Siddalingaiah commented on SPARK-983: --- I have the beginnings of a SortedIterator working for data that will fit in memory. It does more or less the same thing as partition sort in OrderedRDDFunctions, but it's an iterator. If we know that a partition cannot fit in memory, it's possible to split it up into chunks, sort each chunk, write to disk, and merge the chunks on disk. To determine when to split, is it reasonable to use Runtime.freeMemory() / maxMemory() along with configuration limits? I could just keep adding to an in-memory sortable list until some memory threshold is reached, then sort/spill to disk, repeat until all data has been sorted and spilled. Then it's a basic merge operation. Any comments? > Support external sorting for RDD#sortByKey() > > > Key: SPARK-983 > URL: https://issues.apache.org/jira/browse/SPARK-983 > Project: Spark > Issue Type: New Feature >Affects Versions: 0.9.0 >Reporter: Reynold Xin > > Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a > buffer to hold the entire partition, then sorts it. This will cause an OOM if > an entire partition cannot fit in memory, which is especially problematic for > skewed data. Rather than OOMing, the behavior should be similar to the > [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala], > where we fallback to disk if we detect memory pressure. -- This message was sent by Atlassian JIRA (v6.2#6252)
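A minimal sort-spill-merge sketch of the approach Madhu describes, using a fixed chunk size in place of a memory-pressure estimate and plain text spill files; hypothetical code, not Spark's implementation.

{code}
import java.io.{BufferedReader, File, FileReader, PrintWriter}
import scala.collection.mutable

def externalSort(input: Iterator[Int], chunkSize: Int): Iterator[Int] = {
  // 1. Sort fixed-size chunks in memory and spill each one to a temporary file.
  val spills = input.grouped(chunkSize).map { chunk =>
    val file = File.createTempFile("spill", ".txt")
    val out  = new PrintWriter(file)
    chunk.sorted.foreach(v => out.println(v))
    out.close()
    file
  }.toList

  // 2. Merge the sorted spill files with a heap keyed on each file's current head element.
  implicit val byHead: Ordering[(Int, BufferedReader)] =
    Ordering.by[(Int, BufferedReader), Int](_._1).reverse  // reversed so the smallest dequeues first
  val heap = mutable.PriorityQueue.empty[(Int, BufferedReader)]
  spills.map(f => new BufferedReader(new FileReader(f))).foreach { reader =>
    val line = reader.readLine()
    if (line != null) heap.enqueue((line.toInt, reader))
  }

  new Iterator[Int] {
    def hasNext: Boolean = heap.nonEmpty
    def next(): Int = {
      val (value, reader) = heap.dequeue()
      val line = reader.readLine()
      if (line != null) heap.enqueue((line.toInt, reader))
      value
    }
  }
}
{code}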
[jira] [Commented] (SPARK-1790) Update EC2 scripts to support r3 instance types
[ https://issues.apache.org/jira/browse/SPARK-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007427#comment-14007427 ] Sujeet Varakhedi commented on SPARK-1790: - I will work on this > Update EC2 scripts to support r3 instance types > --- > > Key: SPARK-1790 > URL: https://issues.apache.org/jira/browse/SPARK-1790 > Project: Spark > Issue Type: Improvement > Components: EC2 >Affects Versions: 0.9.0, 1.0.0, 0.9.1 >Reporter: Matei Zaharia > Labels: starter > > These were recently added by Amazon as a cheaper high-memory option -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007339#comment-14007339 ] Sean Owen commented on SPARK-1867: -- Yes, the 'mr1' artifacts are for when you are *not* using YARN. These are unusual to use for CDH5, and you would not need those versions. The stock Spark artifacts are for Hadoop 1, not Hadoop 2. It can be built for Hadoop 2 and installed locally if you like. You can use the matched CDH5 version, which is of course made for Hadoop 2, by targeting '0.9.0-cdh5.0.1' for example. (I don't have a release schedule but assume some later version will be released with 5.1 of course) There shouldn't be any trial an error to it, if you express as dependencies all the things you directly use. For example you say your app uses HBase but I see no dependence on the client libraries. This is nothing to do with Spark per se. There shouldn't be any trial and error about it, or else you're probably trying to do something the wrong way around. What classes do you expect you're looking for? > Spark Documentation Error causes java.lang.IllegalStateException: unread > block data > --- > > Key: SPARK-1867 > URL: https://issues.apache.org/jira/browse/SPARK-1867 > Project: Spark > Issue Type: Bug >Reporter: sam > > I've employed two System Administrators on a contract basis (for quite a bit > of money), and both contractors have independently hit the following > exception. What we are doing is: > 1. Installing Spark 0.9.1 according to the documentation on the website, > along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. > 2. Building a fat jar with a Spark app with sbt then trying to run it on the > cluster > I've also included code snippets, and sbt deps at the bottom. > When I've Googled this, there seems to be two somewhat vague responses: > a) Mismatching spark versions on nodes/user code > b) Need to add more jars to the SparkConf > Now I know that (b) is not the problem having successfully run the same code > on other clusters while only including one jar (it's a fat jar). > But I have no idea how to check for (a) - it appears Spark doesn't have any > version checks or anything - it would be nice if it checked versions and > threw a "mismatching version exception: you have user code using version X > and node Y has version Z". > I would be very grateful for advice on this. 
> The exception: > Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task > 0.0:1 failed 32 times (most recent failure: Exception failure: > java.lang.IllegalStateException: unread block data) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) > at akka.actor.ActorCell.invoke(ActorCell.scala:456) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) > at akka.dispatch.Mailbox.run(Mailbox.scala:219) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to > java.lang.IllegalStateException: unread block data [duplicate 59] > My code snippet: > val conf = new SparkConf() >.setMaster(clusterMaster) >.setAppName(appName) >.setSparkHome(sparkHome) >.setJars(SparkCont
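A hedged sbt sketch of the dependency alignment Sean describes: take the CDH5-matched Spark artifact and a plain Hadoop 2 client from the same CDH release instead of mixing stock Spark with 'mr1' Hadoop artifacts. The version strings and repository URL are assumptions; check them against the CDH documentation for your cluster.

{code}
// Assumed coordinates for illustration only; verify against your CDH release.
resolvers += "cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

libraryDependencies ++= Seq(
  "org.apache.spark"  % "spark-core_2.10" % "0.9.0-cdh5.0.1",
  "org.apache.hadoop" % "hadoop-client"   % "2.3.0-cdh5.0.1"
)
{code}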
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007283#comment-14007283 ] sam commented on SPARK-1867: Changing "org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0", to "org.apache.hadoop" % "hadoop-common" % "2.3.0-cdh5.0.0" In my application code seemed to fix this. Not entirely sure why. We have hadoop-yarn on the cluster, so maybe the "mr1" broke things. What we need, is some kind of script/command, then when we run it on the cluster master plus give it a list of packages used in our application code (e.g. "org.apache.hadoop.fs", etc), it says what dependencies we need in our sbt. Furthermore, it would be good if mismatching version problems where caught and an appropriate message given. Cloudera list all their artefacts, but it's impossible to find which artefact contains a particular package that is used in application code. We have been doing trial and error! You see, we are trying to use HBase, Hadoop, and Spark but we are always hitting dependency / version issues. Anyway thanks for getting back to me [~michaelmalak]. Any idea when 1.1.0 will be release? Also any idea when cloudera will distribute 0.9.1, they seem to just have 0.9.0. > Spark Documentation Error causes java.lang.IllegalStateException: unread > block data > --- > > Key: SPARK-1867 > URL: https://issues.apache.org/jira/browse/SPARK-1867 > Project: Spark > Issue Type: Bug >Reporter: sam > > I've employed two System Administrators on a contract basis (for quite a bit > of money), and both contractors have independently hit the following > exception. What we are doing is: > 1. Installing Spark 0.9.1 according to the documentation on the website, > along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. > 2. Building a fat jar with a Spark app with sbt then trying to run it on the > cluster > I've also included code snippets, and sbt deps at the bottom. > When I've Googled this, there seems to be two somewhat vague responses: > a) Mismatching spark versions on nodes/user code > b) Need to add more jars to the SparkConf > Now I know that (b) is not the problem having successfully run the same code > on other clusters while only including one jar (it's a fat jar). > But I have no idea how to check for (a) - it appears Spark doesn't have any > version checks or anything - it would be nice if it checked versions and > threw a "mismatching version exception: you have user code using version X > and node Y has version Z". > I would be very grateful for advice on this. 
> The exception: > Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task > 0.0:1 failed 32 times (most recent failure: Exception failure: > java.lang.IllegalStateException: unread block data) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) > at akka.actor.ActorCell.invoke(ActorCell.scala:456) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) > at akka.dispatch.Mailbox.run(Mailbox.scala:219) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to > java.lang.IllegalStateException: unread block data [duplicate 59] > My code snippet: > val con
[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data
[ https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007238#comment-14007238 ] Michael Malak commented on SPARK-1867: -- I, too, have run into this issue, and I was careful to ensure that my driver, master, and worker were all running the same version: CDH5.0.1 and JDK1.7_45. Mridul Muralidharan has indicated this may have been fixed in the Spark 1.1 code. http://mail-archives.apache.org/mod_mbox/spark-dev/201405.mbox/%3CCAJiQeYLnEL%2B6dWiGfoim-V%3DvPydJ_TTFZ6VQ_jUgmuyVT8w%3Ddg%40mail.gmail.com%3E > Spark Documentation Error causes java.lang.IllegalStateException: unread > block data > --- > > Key: SPARK-1867 > URL: https://issues.apache.org/jira/browse/SPARK-1867 > Project: Spark > Issue Type: Bug >Reporter: sam > > I've employed two System Administrators on a contract basis (for quite a bit > of money), and both contractors have independently hit the following > exception. What we are doing is: > 1. Installing Spark 0.9.1 according to the documentation on the website, > along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs. > 2. Building a fat jar with a Spark app with sbt then trying to run it on the > cluster > I've also included code snippets, and sbt deps at the bottom. > When I've Googled this, there seems to be two somewhat vague responses: > a) Mismatching spark versions on nodes/user code > b) Need to add more jars to the SparkConf > Now I know that (b) is not the problem having successfully run the same code > on other clusters while only including one jar (it's a fat jar). > But I have no idea how to check for (a) - it appears Spark doesn't have any > version checks or anything - it would be nice if it checked versions and > threw a "mismatching version exception: you have user code using version X > and node Y has version Z". > I would be very grateful for advice on this. 
> The exception: > Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task > 0.0:1 failed 32 times (most recent failure: Exception failure: > java.lang.IllegalStateException: unread block data) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) > at akka.actor.ActorCell.invoke(ActorCell.scala:456) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) > at akka.dispatch.Mailbox.run(Mailbox.scala:219) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to > java.lang.IllegalStateException: unread block data [duplicate 59] > My code snippet: > val conf = new SparkConf() >.setMaster(clusterMaster) >.setAppName(appName) >.setSparkHome(sparkHome) >.setJars(SparkContext.jarOfClass(this.getClass)) > println("count = " + new SparkContext(conf).textFile(someHdfsPath).count()) > My SBT dependencies: > // relevant > "org.apache.spark" % "spark-core_2.10" % "0.9.1", > "org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0", > // standard, probably unrelated > "com.github.seratch" %% "awscala" % "[0.2,)", > "org.scalacheck" %% "scalacheck" % "1.10.1" % "test", > "org.specs2" %% "specs2" % "1.14" % "test", > "org.scala-lang" % "scala-reflect" % "2.10.3", > "org.sca
[jira] [Commented] (SPARK-1912) Compression memory issue during reduce
[ https://issues.apache.org/jira/browse/SPARK-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007229#comment-14007229 ] Andrew Ash commented on SPARK-1912: --- https://github.com/apache/spark/pull/860 > Compression memory issue during reduce > -- > > Key: SPARK-1912 > URL: https://issues.apache.org/jira/browse/SPARK-1912 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Wenchen Fan > > When we need to read a compressed block, we first create a compression stream instance (LZF or Snappy) and use it to wrap that block. > Let's say a reducer task needs to read 1000 local shuffle blocks: it will first prepare to read all 1000 blocks, which means creating 1000 compression stream instances to wrap them. But initializing a compression instance allocates some memory, and having many compression instances alive at the same time is a problem. > Since the reducer actually reads the shuffle blocks one by one, why create every compression instance up front? We could do it lazily, creating the compression instance for a block only when that block is first read. -- This message was sent by Atlassian JIRA (v6.2#6252)
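A minimal sketch of the lazy-wrapping idea from the description, with hypothetical types rather than Spark's shuffle code: represent each fetched block as a thunk so the decompression stream is only allocated when the reducer actually starts consuming that block.

{code}
import java.io.InputStream

// Hypothetical lazy block wrapper: `openRaw` fetches the block's bytes and
// `wrapCompression` builds the memory-hungry decompression stream around them.
class LazyBlock(openRaw: () => InputStream,
                wrapCompression: InputStream => InputStream) {
  // Allocated only on first access, i.e. when this block is actually read.
  lazy val stream: InputStream = wrapCompression(openRaw())
}
{code}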
[jira] [Commented] (SPARK-1915) AverageFunction should not count if the evaluated value is null.
[ https://issues.apache.org/jira/browse/SPARK-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007136#comment-14007136 ] Takuya Ueshin commented on SPARK-1915: -- Pull-requested: https://github.com/apache/spark/pull/862 > AverageFunction should not count if the evaluated value is null. > > > Key: SPARK-1915 > URL: https://issues.apache.org/jira/browse/SPARK-1915 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin > > Average values differ depending on whether the calculation is done partially or not, > because {{AverageFunction}} (in the non-partial calculation) counts even when the > evaluated value is null. -- This message was sent by Atlassian JIRA (v6.2#6252)
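A minimal sketch of the intended semantics, not the Catalyst AverageFunction itself: a null input must contribute to neither the sum nor the count, otherwise the partial and non-partial code paths disagree.

{code}
// Hypothetical reference implementation of the desired behaviour.
def average(values: Seq[Option[Double]]): Option[Double] = {
  val present = values.flatten            // nulls (None) are dropped entirely
  if (present.isEmpty) None
  else Some(present.sum / present.size)   // divide by the count of non-null values only
}

// average(Seq(Some(1.0), None, Some(3.0))) == Some(2.0), not Some(4.0 / 3)
{code}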
[jira] [Commented] (SPARK-1898) In deploy.yarn.Client, use YarnClient rather than YarnClientImpl
[ https://issues.apache.org/jira/browse/SPARK-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007100#comment-14007100 ] Thomas Graves commented on SPARK-1898: -- That is correct: as long as you don't modify anything in common or yarn-alpha, 0.23 and really early 2.0 versions aren't affected. I think he was referring to all the various 2.x releases, though. Unfortunately, APIs changed in the early versions up until the stable 2.2 release. The other problem we've seen is that not all the various vendor releases map straight to an Apache release. Ideally this shouldn't be a problem, but it should be sufficiently tested on everything we say we support. Based on that, I agree with @tdas: unless this is a real blocker and you can explain why, we should wait to put this in and fix as many of the APIs as possible in a follow-on release. > In deploy.yarn.Client, use YarnClient rather than YarnClientImpl > > > Key: SPARK-1898 > URL: https://issues.apache.org/jira/browse/SPARK-1898 > Project: Spark > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Colin Patrick McCabe > > In {{deploy.yarn.Client}}, we should use {{YarnClient}} rather than > {{YarnClientImpl}}. The latter is annotated as {{Private}} and {{Unstable}} > in Hadoop, which means it could change in an incompatible way at any time. -- This message was sent by Atlassian JIRA (v6.2#6252)
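A short sketch of the change the ticket asks for, assuming the stable Hadoop 2.2+ client API: obtain the client from the YarnClient factory rather than instantiating the @Private, @Unstable YarnClientImpl directly.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.yarn.client.api.YarnClient

// Public, stable factory in Hadoop 2.2+; YarnClientImpl stays an implementation detail.
val yarnClient = YarnClient.createYarnClient()
yarnClient.init(new Configuration())
yarnClient.start()
{code}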
[jira] [Comment Edited] (SPARK-1215) Clustering: Index out of bounds error
[ https://issues.apache.org/jira/browse/SPARK-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007094#comment-14007094 ] Denis Serduik edited comment on SPARK-1215 at 5/23/14 12:30 PM: attach test dataset MLLib failed to find 4 centers with k-means|| init mode on this data was (Author: dmaverick): attach test dataset > Clustering: Index out of bounds error > - > > Key: SPARK-1215 > URL: https://issues.apache.org/jira/browse/SPARK-1215 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: dewshick >Assignee: Xiangrui Meng >Priority: Minor > Attachments: test.csv > > > code: > import org.apache.spark.mllib.clustering._ > val test = sc.makeRDD(Array(4,4,4,4,4).map(e => Array(e.toDouble))) > val kmeans = new KMeans().setK(4) > kmeans.run(test) evals with java.lang.ArrayIndexOutOfBoundsException > error: > 14/01/17 12:35:54 INFO scheduler.DAGScheduler: Stage 25 (collectAsMap at > KMeans.scala:243) finished in 0.047 s > 14/01/17 12:35:54 INFO spark.SparkContext: Job finished: collectAsMap at > KMeans.scala:243, took 16.389537116 s > Exception in thread "main" java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at com.simontuffs.onejar.Boot.run(Boot.java:340) > at com.simontuffs.onejar.Boot.main(Boot.java:166) > Caused by: java.lang.ArrayIndexOutOfBoundsException: -1 > at > org.apache.spark.mllib.clustering.LocalKMeans$.kMeansPlusPlus(LocalKMeans.scala:47) > at > org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:247) > at > org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233) > at scala.collection.immutable.Range.foreach(Range.scala:81) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:233) > at scala.collection.immutable.Range.map(Range.scala:46) > at > org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:244) > at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:124) > at Clustering$$anonfun$1.apply$mcDI$sp(Clustering.scala:21) > at Clustering$$anonfun$1.apply(Clustering.scala:19) > at Clustering$$anonfun$1.apply(Clustering.scala:19) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233) > at scala.collection.immutable.Range.foreach(Range.scala:78) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:233) > at scala.collection.immutable.Range.map(Range.scala:46) > at Clustering$.main(Clustering.scala:19) > at Clustering.main(Clustering.scala) > ... 6 more -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1215) Clustering: Index out of bounds error
[ https://issues.apache.org/jira/browse/SPARK-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denis Serduik updated SPARK-1215: - Attachment: test.csv attach test dataset > Clustering: Index out of bounds error > - > > Key: SPARK-1215 > URL: https://issues.apache.org/jira/browse/SPARK-1215 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: dewshick >Assignee: Xiangrui Meng >Priority: Minor > Attachments: test.csv > > > code: > import org.apache.spark.mllib.clustering._ > val test = sc.makeRDD(Array(4,4,4,4,4).map(e => Array(e.toDouble))) > val kmeans = new KMeans().setK(4) > kmeans.run(test) evals with java.lang.ArrayIndexOutOfBoundsException > error: > 14/01/17 12:35:54 INFO scheduler.DAGScheduler: Stage 25 (collectAsMap at > KMeans.scala:243) finished in 0.047 s > 14/01/17 12:35:54 INFO spark.SparkContext: Job finished: collectAsMap at > KMeans.scala:243, took 16.389537116 s > Exception in thread "main" java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at com.simontuffs.onejar.Boot.run(Boot.java:340) > at com.simontuffs.onejar.Boot.main(Boot.java:166) > Caused by: java.lang.ArrayIndexOutOfBoundsException: -1 > at > org.apache.spark.mllib.clustering.LocalKMeans$.kMeansPlusPlus(LocalKMeans.scala:47) > at > org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:247) > at > org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233) > at scala.collection.immutable.Range.foreach(Range.scala:81) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:233) > at scala.collection.immutable.Range.map(Range.scala:46) > at > org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:244) > at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:124) > at Clustering$$anonfun$1.apply$mcDI$sp(Clustering.scala:21) > at Clustering$$anonfun$1.apply(Clustering.scala:19) > at Clustering$$anonfun$1.apply(Clustering.scala:19) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233) > at scala.collection.immutable.Range.foreach(Range.scala:78) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:233) > at scala.collection.immutable.Range.map(Range.scala:46) > at Clustering$.main(Clustering.scala:19) > at Clustering.main(Clustering.scala) > ... 6 more -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1215) Clustering: Index out of bounds error
[ https://issues.apache.org/jira/browse/SPARK-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007090#comment-14007090 ] Denis Serduik commented on SPARK-1215: -- I don't think that the problem is about size of dataset. I've faced with similar issue on dataset with about 900 items. As a workaround we've decided to fallback with random init mode. > Clustering: Index out of bounds error > - > > Key: SPARK-1215 > URL: https://issues.apache.org/jira/browse/SPARK-1215 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: dewshick >Assignee: Xiangrui Meng >Priority: Minor > > code: > import org.apache.spark.mllib.clustering._ > val test = sc.makeRDD(Array(4,4,4,4,4).map(e => Array(e.toDouble))) > val kmeans = new KMeans().setK(4) > kmeans.run(test) evals with java.lang.ArrayIndexOutOfBoundsException > error: > 14/01/17 12:35:54 INFO scheduler.DAGScheduler: Stage 25 (collectAsMap at > KMeans.scala:243) finished in 0.047 s > 14/01/17 12:35:54 INFO spark.SparkContext: Job finished: collectAsMap at > KMeans.scala:243, took 16.389537116 s > Exception in thread "main" java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at com.simontuffs.onejar.Boot.run(Boot.java:340) > at com.simontuffs.onejar.Boot.main(Boot.java:166) > Caused by: java.lang.ArrayIndexOutOfBoundsException: -1 > at > org.apache.spark.mllib.clustering.LocalKMeans$.kMeansPlusPlus(LocalKMeans.scala:47) > at > org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:247) > at > org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233) > at scala.collection.immutable.Range.foreach(Range.scala:81) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:233) > at scala.collection.immutable.Range.map(Range.scala:46) > at > org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:244) > at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:124) > at Clustering$$anonfun$1.apply$mcDI$sp(Clustering.scala:21) > at Clustering$$anonfun$1.apply(Clustering.scala:19) > at Clustering$$anonfun$1.apply(Clustering.scala:19) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233) > at scala.collection.immutable.Range.foreach(Range.scala:78) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:233) > at scala.collection.immutable.Range.map(Range.scala:46) > at Clustering$.main(Clustering.scala:19) > at Clustering.main(Clustering.scala) > ... 6 more -- This message was sent by Atlassian JIRA (v6.2#6252)
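A sketch of the workaround Denis mentions, assuming the MLlib 0.9/1.0 API: switch the initialization mode from the default k-means|| to random, which sidesteps the ArrayIndexOutOfBoundsException in kMeansPlusPlus on such data.

{code}
import org.apache.spark.mllib.clustering.KMeans

// Fall back to random initialization instead of the default "k-means||".
val kmeans = new KMeans()
  .setK(4)
  .setInitializationMode("random")
// kmeans.run(data)
{code}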
[jira] [Commented] (SPARK-1913) Parquet table column pruning error caused by filter pushdown
[ https://issues.apache.org/jira/browse/SPARK-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007063#comment-14007063 ] Cheng Lian commented on SPARK-1913: --- Corresponding PR: https://github.com/apache/spark/pull/863 > Parquet table column pruning error caused by filter pushdown > > > Key: SPARK-1913 > URL: https://issues.apache.org/jira/browse/SPARK-1913 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 > Environment: mac os 10.9.2 >Reporter: Chen Chao > > When scanning Parquet tables, attributes referenced only in predicates that > are pushed down are not passed to the `ParquetTableScan` operator and causes > exception. Verified in the {{sbt hive/console}}: > {code} > loadTestTable("src") > table("src").saveAsParquetFile("src.parquet") > parquetFile("src.parquet").registerAsTable("src_parquet") > hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1915) AverageFunction should not count if the evaluated value is null.
Takuya Ueshin created SPARK-1915: Summary: AverageFunction should not count if the evaluated value is null. Key: SPARK-1915 URL: https://issues.apache.org/jira/browse/SPARK-1915 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Average values differ depending on whether the calculation is done partially or not, because {{AverageFunction}} (in the non-partial calculation) counts rows even if the evaluated value is null. -- This message was sent by Atlassian JIRA (v6.2#6252)
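To make the discrepancy concrete, a small worked example (plain Scala arithmetic, not the Catalyst code): over the values (1.0, NULL, 3.0), standard SQL semantics ignore the NULL row, so AVG should be 2.0; an implementation that also counts the NULL row returns roughly 1.33.
{code}
// Expected SQL semantics: AVG ignores NULL values.
val values = Seq(Some(1.0), None, Some(3.0))

val sum        = values.flatten.sum          // 4.0
val avgCorrect = sum / values.flatten.size   // 4.0 / 2 = 2.0  (NULL row not counted)
val avgBuggy   = sum / values.size           // 4.0 / 3 ≈ 1.33 (NULL row counted)
{code}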
[jira] [Updated] (SPARK-1913) Parquet table column pruning error caused by filter pushdown
[ https://issues.apache.org/jira/browse/SPARK-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-1913: -- Summary: Parquet table column pruning error caused by filter pushdown (was: Column pruning for Parquet table) > Parquet table column pruning error caused by filter pushdown > > > Key: SPARK-1913 > URL: https://issues.apache.org/jira/browse/SPARK-1913 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 > Environment: mac os 10.9.2 >Reporter: Chen Chao > > When scanning Parquet tables, attributes referenced only in predicates that > are pushed down are not passed to the `ParquetTableScan` operator and causes > exception. Verified in the {{sbt hive/console}}: > {code} > loadTestTable("src") > table("src").saveAsParquetFile("src.parquet") > parquetFile("src.parquet").registerAsTable("src_parquet") > hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1913) Column pruning for Parquet table
[ https://issues.apache.org/jira/browse/SPARK-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-1913: -- Summary: Column pruning for Parquet table (was: column pruning problem of Parquet File) > Column pruning for Parquet table > > > Key: SPARK-1913 > URL: https://issues.apache.org/jira/browse/SPARK-1913 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 > Environment: mac os 10.9.2 >Reporter: Chen Chao > > When scanning Parquet tables, attributes referenced only in predicates that > are pushed down are not passed to the `ParquetTableScan` operator and causes > exception. Verified in the {{sbt hive/console}}: > {code:java} > loadTestTable("src") > table("src").saveAsParquetFile("src.parquet") > parquetFile("src.parquet").registerAsTable("src_parquet") > hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1913) Column pruning for Parquet table
[ https://issues.apache.org/jira/browse/SPARK-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-1913: -- Description: When scanning Parquet tables, attributes referenced only in predicates that are pushed down are not passed to the `ParquetTableScan` operator and causes exception. Verified in the {{sbt hive/console}}: {code} loadTestTable("src") table("src").saveAsParquetFile("src.parquet") parquetFile("src.parquet").registerAsTable("src_parquet") hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println) {code} was: When scanning Parquet tables, attributes referenced only in predicates that are pushed down are not passed to the `ParquetTableScan` operator and causes exception. Verified in the {{sbt hive/console}}: {code:java} loadTestTable("src") table("src").saveAsParquetFile("src.parquet") parquetFile("src.parquet").registerAsTable("src_parquet") hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println) {code} > Column pruning for Parquet table > > > Key: SPARK-1913 > URL: https://issues.apache.org/jira/browse/SPARK-1913 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 > Environment: mac os 10.9.2 >Reporter: Chen Chao > > When scanning Parquet tables, attributes referenced only in predicates that > are pushed down are not passed to the `ParquetTableScan` operator and causes > exception. Verified in the {{sbt hive/console}}: > {code} > loadTestTable("src") > table("src").saveAsParquetFile("src.parquet") > parquetFile("src.parquet").registerAsTable("src_parquet") > hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1913) column pruning problem of Parquet File
[ https://issues.apache.org/jira/browse/SPARK-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-1913: -- Description: When scanning Parquet tables, attributes referenced only in predicates that are pushed down are not passed to the `ParquetTableScan` operator and causes exception. Verified in the {{sbt hive/console}}: {code:java} loadTestTable("src") table("src").saveAsParquetFile("src.parquet") parquetFile("src.parquet").registerAsTable("src_parquet") hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println) {code} was: When scanning Parquet tables, attributes referenced only in predicates that are pushed down are not passed to the `ParquetTableScan` operator and causes exception. Verified in the {{sbt hive/console}}: {code:scala} loadTestTable("src") table("src").saveAsParquetFile("src.parquet") parquetFile("src.parquet").registerAsTable("src_parquet") hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println) {code} > column pruning problem of Parquet File > --- > > Key: SPARK-1913 > URL: https://issues.apache.org/jira/browse/SPARK-1913 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 > Environment: mac os 10.9.2 >Reporter: Chen Chao > > When scanning Parquet tables, attributes referenced only in predicates that > are pushed down are not passed to the `ParquetTableScan` operator and causes > exception. Verified in the {{sbt hive/console}}: > {code:java} > loadTestTable("src") > table("src").saveAsParquetFile("src.parquet") > parquetFile("src.parquet").registerAsTable("src_parquet") > hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1913) column pruning problem of Parquet File
[ https://issues.apache.org/jira/browse/SPARK-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-1913: -- Description: When scanning Parquet tables, attributes referenced only in predicates that are pushed down are not passed to the `ParquetTableScan` operator and causes exception. Verified in the {{sbt hive/console}}: {code:scala} loadTestTable("src") table("src").saveAsParquetFile("src.parquet") parquetFile("src.parquet").registerAsTable("src_parquet") hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println) {code} was: case class Person(name: String, age: Int) if we use Parquet file, the following statement will throw a exception says java.lang.IllegalArgumentException: Column age does not exist. sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19") we have to add age column after SELECT in order to make it right: sql("SELECT name , age FROM parquetFile WHERE age >= 13 AND age <= 19") The same error also occurs when we use DSL: parquetFile.where('key === 1).select('value as 'a).collect().foreach(println) will tell us can not find column 'key',we have to fix like this : parquetFile.where('key === 1).select('key ,'value as 'a).collect().foreach(println) Obviously, that's not the way we want! > column pruning problem of Parquet File > --- > > Key: SPARK-1913 > URL: https://issues.apache.org/jira/browse/SPARK-1913 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 > Environment: mac os 10.9.2 >Reporter: Chen Chao > > When scanning Parquet tables, attributes referenced only in predicates that > are pushed down are not passed to the `ParquetTableScan` operator and causes > exception. Verified in the {{sbt hive/console}}: > {code:scala} > loadTestTable("src") > table("src").saveAsParquetFile("src.parquet") > parquetFile("src.parquet").registerAsTable("src_parquet") > hql("SELECT value FROM src_parquet WHERE key < 10").collect().foreach(println) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1914) Simplify CountFunction not to traverse to evaluate all child expressions.
[ https://issues.apache.org/jira/browse/SPARK-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007033#comment-14007033 ] Takuya Ueshin commented on SPARK-1914: -- Pull-requested: https://github.com/apache/spark/pull/861 > Simplify CountFunction not to traverse to evaluate all child expressions. > - > > Key: SPARK-1914 > URL: https://issues.apache.org/jira/browse/SPARK-1914 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin > > {{CountFunction}} should count up only if the child's evaluated value is not > null. > Because it traverses to evaluate all child expressions, even if the child is > null, it counts up if one of the all children is not null. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1914) Simplify CountFunction not to traverse to evaluate all child expressions.
[ https://issues.apache.org/jira/browse/SPARK-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-1914: - Description: {{CountFunction}} should count up only if the child's evaluated value is not null. Because it traverses and evaluates all child expressions, it counts up if any of the children is not null, even if the child itself is null. > Simplify CountFunction not to traverse to evaluate all child expressions. > - > > Key: SPARK-1914 > URL: https://issues.apache.org/jira/browse/SPARK-1914 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin > > {{CountFunction}} should count up only if the child's evaluated value is not > null. > Because it traverses and evaluates all child expressions, it counts up if any > of the children is not null, even if the child itself is null. -- This message was sent by Atlassian JIRA (v6.2#6252)
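For reference, the expected behaviour in a minimal sketch (plain Scala, not the Catalyst implementation), restricted to the unambiguous single-child case COUNT(expr):
{code}
// Expected SQL semantics for COUNT(expr): only rows where expr is non-null are counted.
val a = Seq(Some(1), None, Some(3))

val countCorrect = a.count(_.isDefined)   // 2 -- the null evaluation is skipped
val countBuggy   = a.size                 // 3 -- every row counted regardless of nulls
{code}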
[jira] [Updated] (SPARK-1912) Compression memory issue during reduce
[ https://issues.apache.org/jira/browse/SPARK-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-1912: --- Summary: Compression memory issue during reduce (was: Compression memory issue during shuffle) > Compression memory issue during reduce > -- > > Key: SPARK-1912 > URL: https://issues.apache.org/jira/browse/SPARK-1912 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Wenchen Fan > > When we need to read a compressed block, we first create a compression > stream instance (LZF or Snappy) and use it to wrap that block. > Say a reducer task needs to read 1000 local shuffle blocks: it first prepares > to read those 1000 blocks, which means creating 1000 compression stream > instances to wrap them. But the initialization of each compression instance > allocates some memory, so having many instances alive at the same time is a > problem. > Since the reducer actually reads the shuffle blocks one by one, why create the > compression instances up front? Can we do it lazily, creating the compression > instance for a block only when that block is first read? -- This message was sent by Atlassian JIRA (v6.2#6252)
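A minimal sketch of the lazy approach suggested above (not the actual shuffle/BlockManager code; fetchBlock and wrapForCompression are hypothetical placeholders for the block fetch and the LZF/Snappy wrapping):
{code}
import java.io.InputStream

// Iterator.map is lazy, so the compression stream for a block is created only
// when that element is consumed; a reducer that reads blocks one by one then
// keeps at most one compression stream instance alive at a time.
def lazyBlockStreams(
    blockIds: Seq[String],
    fetchBlock: String => InputStream,               // placeholder: raw block bytes
    wrapForCompression: InputStream => InputStream   // placeholder: LZF/Snappy wrapper
  ): Iterator[InputStream] =
  blockIds.iterator.map(id => wrapForCompression(fetchBlock(id)))
{code}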
[jira] [Commented] (SPARK-1913) column pruning problem of Parquet File
[ https://issues.apache.org/jira/browse/SPARK-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14007026#comment-14007026 ] Cheng Lian commented on SPARK-1913: --- Attributes referenced only in those filters that are pushed down are not considered when building the {{ParquetTableScan}} operator in {{ParquetOperations}}. Will submit a PR for this. > column pruning problem of Parquet File > --- > > Key: SPARK-1913 > URL: https://issues.apache.org/jira/browse/SPARK-1913 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 > Environment: mac os 10.9.2 >Reporter: Chen Chao > > case class Person(name: String, age: Int) > if we use Parquet file, the following statement will throw a exception says > java.lang.IllegalArgumentException: Column age does not exist. > sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19") > we have to add age column after SELECT in order to make it right: > sql("SELECT name , age FROM parquetFile WHERE age >= 13 AND age <= 19") > The same error also occurs when we use DSL: > parquetFile.where('key === 1).select('value as 'a).collect().foreach(println) > will tell us can not find column 'key',we have to fix like this : > parquetFile.where('key === 1).select('key ,'value as > 'a).collect().foreach(println) > Obviously, that's not the way we want! -- This message was sent by Atlassian JIRA (v6.2#6252)
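Schematically, the fix described in the comment above amounts to also requesting the columns that only the pushed-down filters reference (a hedged sketch with made-up names, not the actual {{ParquetOperations}} code):
{code}
// Made-up stand-ins: "Attr" for a Catalyst attribute, "references" for the
// attributes an expression mentions.
case class Attr(name: String)
case class Expr(references: Seq[Attr])

// Request the union of projection-referenced and filter-referenced columns,
// so predicates pushed down to the scan can still see the columns they need.
def requestedColumns(projectList: Seq[Expr], pushedDownFilters: Seq[Expr]): Seq[Attr] =
  (projectList.flatMap(_.references) ++ pushedDownFilters.flatMap(_.references)).distinct
{code}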
[jira] [Created] (SPARK-1914) Simplify CountFunction not to traverse to evaluate all child expressions.
Takuya Ueshin created SPARK-1914: Summary: Simplify CountFunction not to traverse to evaluate all child expressions. Key: SPARK-1914 URL: https://issues.apache.org/jira/browse/SPARK-1914 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1913) column pruning problem of Parquet File
Chen Chao created SPARK-1913: Summary: column pruning problem of Parquet File Key: SPARK-1913 URL: https://issues.apache.org/jira/browse/SPARK-1913 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Environment: mac os 10.9.2 Reporter: Chen Chao case class Person(name: String, age: Int) If we use a Parquet file, the following statement throws an exception saying java.lang.IllegalArgumentException: Column age does not exist. sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19") We have to add the age column after SELECT to make it work: sql("SELECT name, age FROM parquetFile WHERE age >= 13 AND age <= 19") The same error also occurs when we use the DSL: parquetFile.where('key === 1).select('value as 'a).collect().foreach(println) complains that it cannot find column 'key', and we have to fix it like this: parquetFile.where('key === 1).select('key, 'value as 'a).collect().foreach(println) Obviously, that's not what we want! -- This message was sent by Atlassian JIRA (v6.2#6252)