[jira] [Commented] (SPARK-2400) config spark.yarn.max.executor.failures is not explained accurately
[ https://issues.apache.org/jira/browse/SPARK-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054542#comment-14054542 ] Chen Chao commented on SPARK-2400: -- PR @ https://github.com/apache/spark/pull/1282/files > config spark.yarn.max.executor.failures is not explained accurately > --- > > Key: SPARK-2400 > URL: https://issues.apache.org/jira/browse/SPARK-2400 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Chen Chao >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2400) config spark.yarn.max.executor.failures is not explained accurately
[ https://issues.apache.org/jira/browse/SPARK-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054540#comment-14054540 ] Chen Chao commented on SPARK-2400: -- it should be "numExecutors * 2, with minimum of 3" rather than "2*numExecutors". > config spark.yarn.max.executor.failures is not explained accurately > --- > > Key: SPARK-2400 > URL: https://issues.apache.org/jira/browse/SPARK-2400 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Chen Chao >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2400) config spark.yarn.max.executor.failures is not explained accurately
Chen Chao created SPARK-2400: Summary: config spark.yarn.max.executor.failures is not explained accurately Key: SPARK-2400 URL: https://issues.apache.org/jira/browse/SPARK-2400 Project: Spark Issue Type: Bug Components: YARN Reporter: Chen Chao Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2399) Add support for LZ4 compression
[ https://issues.apache.org/jira/browse/SPARK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Bowyer updated SPARK-2399: --- Description: LZ4 is a compression codec of the same ideas as googles snappy, but has some advantages: * It is faster than snappy with a similar compression ratio * The implementation is Apache licensed and not GPL It has shown promise in both the lucene and hadoop communities, and it looks like its a really easy add to spark io compression. Attached is a patch that does this was: LZ4 is a compression codec of the same ideas as googles snappy, but has some advantages: * It is faster than snappy with a similar compression ration * The implementation is Apache licensed and not GPL It has shown promise in both the lucene and hadoop communities, and it looks like its a really easy add to spark io compression. Attached is a patch that does this > Add support for LZ4 compression > --- > > Key: SPARK-2399 > URL: https://issues.apache.org/jira/browse/SPARK-2399 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Greg Bowyer > Labels: compression, lz4 > Attachments: SPARK-2399-Make-spark-compression-able-to-use-LZ4.patch > > > LZ4 is a compression codec of the same ideas as googles snappy, but has some > advantages: > * It is faster than snappy with a similar compression ratio > * The implementation is Apache licensed and not GPL > It has shown promise in both the lucene and hadoop communities, and it looks > like its a really easy add to spark io compression. > Attached is a patch that does this -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2399) Add support for LZ4 compression
Greg Bowyer created SPARK-2399: -- Summary: Add support for LZ4 compression Key: SPARK-2399 URL: https://issues.apache.org/jira/browse/SPARK-2399 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Greg Bowyer Attachments: SPARK-2399-Make-spark-compression-able-to-use-LZ4.patch LZ4 is a compression codec of the same ideas as googles snappy, but has some advantages: * It is faster than snappy with a similar compression ration * The implementation is Apache licensed and not GPL It has shown promise in both the lucene and hadoop communities, and it looks like its a really easy add to spark io compression. Attached is a patch that does this -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2399) Add support for LZ4 compression
[ https://issues.apache.org/jira/browse/SPARK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Bowyer updated SPARK-2399: --- Attachment: SPARK-2399-Make-spark-compression-able-to-use-LZ4.patch > Add support for LZ4 compression > --- > > Key: SPARK-2399 > URL: https://issues.apache.org/jira/browse/SPARK-2399 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Greg Bowyer > Labels: compression, lz4 > Attachments: SPARK-2399-Make-spark-compression-able-to-use-LZ4.patch > > > LZ4 is a compression codec of the same ideas as googles snappy, but has some > advantages: > * It is faster than snappy with a similar compression ration > * The implementation is Apache licensed and not GPL > It has shown promise in both the lucene and hadoop communities, and it looks > like its a really easy add to spark io compression. > Attached is a patch that does this -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2215) Multi-way join
[ https://issues.apache.org/jira/browse/SPARK-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054491#comment-14054491 ] Yin Huai commented on SPARK-2215: - Instead of implementing a multi-way join operator, how about we just rely on AddExchange to properly insert Exchange Operator to avoid unnecessary data movements? > Multi-way join > -- > > Key: SPARK-2215 > URL: https://issues.apache.org/jira/browse/SPARK-2215 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Cheng Hao >Priority: Minor > > Support the multi-way join (multiple table joins) in a single reduce stage if > they have the same join keys. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2303) Poisson regression model for count data
[ https://issues.apache.org/jira/browse/SPARK-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054470#comment-14054470 ] Gang Bai commented on SPARK-2303: - This change has been merged into another JIRA SPARK-2311. Closing this one. > Poisson regression model for count data > --- > > Key: SPARK-2303 > URL: https://issues.apache.org/jira/browse/SPARK-2303 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Gang Bai > > Modeling count data is of great importance in solving real-world statistic > problems. Currently mllib.regression provides models mostly for numeric data, > i.e fitting curves with various regularization on resulted weights, but still > lacks the support of count data modeling. > A very basic model for this is the Poisson regression. Following the patterns > in mllib and reusing the components, we address the parameter estimation for > Poisson regression in a maximum likelihood manner. In detail, to add Poisson > regression to mllib.regression, we need to: > # Add the gradient of the negative log-likelihood into > mllib/optimization/Gradients.scala. > # Add the implementations of PoissonRegressionModel, which extends > GeneralizedLinearModel with RegressionModel. Here we need the implementation > of the predict method. > # Add the implementations of the generalized linear algorithm class. Here we > can use either LBFGS or GradientDescent as the optimizer. So we implement > both as class PoissonRegressionWithSGD and class PoissonRegressionWithLBFGS > respectively. > # Add the companion object PoissonRegressionWithSGD and > PoissonRegressionWithLBFGS as drivers. > # Test suites > ## Test the gradient computation. > ## Test the regression method using generated data, which requires a > PoissonRegressionDataGenerator. > ## Test the regression method using a real-world data set. > # Add the documents. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (SPARK-2303) Poisson regression model for count data
[ https://issues.apache.org/jira/browse/SPARK-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Bai closed SPARK-2303. --- Resolution: Fixed > Poisson regression model for count data > --- > > Key: SPARK-2303 > URL: https://issues.apache.org/jira/browse/SPARK-2303 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Gang Bai > > Modeling count data is of great importance in solving real-world statistic > problems. Currently mllib.regression provides models mostly for numeric data, > i.e fitting curves with various regularization on resulted weights, but still > lacks the support of count data modeling. > A very basic model for this is the Poisson regression. Following the patterns > in mllib and reusing the components, we address the parameter estimation for > Poisson regression in a maximum likelihood manner. In detail, to add Poisson > regression to mllib.regression, we need to: > # Add the gradient of the negative log-likelihood into > mllib/optimization/Gradients.scala. > # Add the implementations of PoissonRegressionModel, which extends > GeneralizedLinearModel with RegressionModel. Here we need the implementation > of the predict method. > # Add the implementations of the generalized linear algorithm class. Here we > can use either LBFGS or GradientDescent as the optimizer. So we implement > both as class PoissonRegressionWithSGD and class PoissonRegressionWithLBFGS > respectively. > # Add the companion object PoissonRegressionWithSGD and > PoissonRegressionWithLBFGS as drivers. > # Test suites > ## Test the gradient computation. > ## Test the regression method using generated data, which requires a > PoissonRegressionDataGenerator. > ## Test the regression method using a real-world data set. > # Add the documents. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2181) The keys for sorting the columns of Executor page in SparkUI are incorrect
[ https://issues.apache.org/jira/browse/SPARK-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054455#comment-14054455 ] Shuo Xiang commented on SPARK-2181: --- Hi, I've merged the latest version but the column of Memory Usage inside the RDD storage Info tab is not correct. It is still sorted by the string not the value. > The keys for sorting the columns of Executor page in SparkUI are incorrect > -- > > Key: SPARK-2181 > URL: https://issues.apache.org/jira/browse/SPARK-2181 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Shuo Xiang >Assignee: Guoqiang Li >Priority: Minor > Fix For: 1.1.0, 1.0.2 > > > Under the Executor page of SparkUI, each column is sorted alphabetically > (after clicking). However, it should be sorted by the value, not the string. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2235) Spark SQL basicOperator add Intersect operator
[ https://issues.apache.org/jira/browse/SPARK-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2235. - Resolution: Fixed Fix Version/s: 1.1.0 Assignee: (was: Michael Armbrust) > Spark SQL basicOperator add Intersect operator > --- > > Key: SPARK-2235 > URL: https://issues.apache.org/jira/browse/SPARK-2235 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Yanjie Gao > Fix For: 1.1.0 > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2201) Improve FlumeInputDStream's stability and make it scalable
[ https://issues.apache.org/jira/browse/SPARK-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054442#comment-14054442 ] sunshangchun commented on SPARK-2201: - I don't like it's a problem. It's a external module and take no effect on spark core module. Again, spark core module has already used zookeeper to select leader master. Thanks > Improve FlumeInputDStream's stability and make it scalable > -- > > Key: SPARK-2201 > URL: https://issues.apache.org/jira/browse/SPARK-2201 > Project: Spark > Issue Type: Improvement >Reporter: sunshangchun > > Currently: > FlumeUtils.createStream(ssc, "localhost", port); > This means that only one flume receiver can work with FlumeInputDStream .so > the solution is not scalable. > I use a zookeeper to solve this problem. > Spark flume receivers register themselves to a zk path when started, and a > flume agent get physical hosts and push events to them. > Some works need to be done here: > 1.receiver create tmp node in zk, listeners just watch those tmp nodes. > 2. when spark FlumeReceivers started, they acquire a physical host > (localhost's ip and an idle port) and register itself to zookeeper. > 3. A new flume sink. In the method of appendEvents, they get physical hosts > and push data to them in a round-robin manner. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn
[ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054434#comment-14054434 ] Guoqiang Li commented on SPARK-2398: Seems to be related to [SPARK-1930|https://issues.apache.org/jira/browse/SPARK-1930]. Can you post the yarn node manager log? > Trouble running Spark 1.0 on Yarn > -- > > Key: SPARK-2398 > URL: https://issues.apache.org/jira/browse/SPARK-2398 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Nishkam Ravi > > Trouble running workloads in Spark-on-YARN cluster mode for Spark 1.0. > For example: SparkPageRank when run in standalone mode goes through without > any errors (tested for up to 30GB input dataset on a 6-node cluster). Also > runs fine for a 1GB dataset in yarn cluster mode. Starts to choke (in yarn > cluster mode) as the input data size is increased. Confirmed for 16GB input > dataset. > The same workload runs fine with Spark 0.9 in both standalone and yarn > cluster mode (for up to 30 GB input dataset on a 6-node cluster). > Commandline used: > (/opt/cloudera/parcels/CDH/lib/spark/bin/spark-submit --master yarn > --deploy-mode cluster --properties-file pagerank.conf --driver-memory 30g > --driver-cores 16 --num-executors 5 --class > org.apache.spark.examples.SparkPageRank > /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0-SNAPSHOT.jar > pagerank_in $NUM_ITER) > pagerank.conf: > spark.masterspark://c1704.halxg.cloudera.com:7077 > spark.home /opt/cloudera/parcels/CDH/lib/spark > spark.executor.memory 32g > spark.default.parallelism 118 > spark.cores.max 96 > spark.storage.memoryFraction0.6 > spark.shuffle.memoryFraction0.3 > spark.shuffle.compress true > spark.shuffle.spill.compresstrue > spark.broadcast.compresstrue > spark.rdd.compress false > spark.io.compression.codec org.apache.spark.io.LZFCompressionCodec > spark.io.compression.snappy.block.size 32768 > spark.reducer.maxMbInFlight 48 > spark.local.dir /var/lib/jenkins/workspace/tmp > spark.driver.memory 30g > spark.executor.cores16 > spark.locality.wait 6000 > spark.executor.instances5 > UI shows ExecutorLostFailure. Yarn logs contain numerous exceptions: > 14/07/07 17:59:49 WARN network.SendingConnection: Error writing in connection > to ConnectionManagerId(a1016.halxg.cloudera.com,54105) > java.nio.channels.AsynchronousCloseException > at > java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205) > at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496) > at > org.apache.spark.network.SendingConnection.write(Connection.scala:361) > at > org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:142) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > > java.io.IOException: Filesystem closed > at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703) > at > org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619) > at java.io.FilterInputStream.close(FilterInputStream.java:181) > at org.apache.hadoop.util.LineReader.close(LineReader.java:150) > at > org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:244) > at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226) > at > org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63) > at > org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197) > at > org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) > at > org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97) > at org.apache.spark.scheduler.Task.run(Task.scala:51) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(Thr
[jira] [Created] (SPARK-2398) Trouble running Spark 1.0 on Yarn
Nishkam Ravi created SPARK-2398: --- Summary: Trouble running Spark 1.0 on Yarn Key: SPARK-2398 URL: https://issues.apache.org/jira/browse/SPARK-2398 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Nishkam Ravi Trouble running workloads in Spark-on-YARN cluster mode for Spark 1.0. For example: SparkPageRank when run in standalone mode goes through without any errors (tested for up to 30GB input dataset on a 6-node cluster). Also runs fine for a 1GB dataset in yarn cluster mode. Starts to choke (in yarn cluster mode) as the input data size is increased. Confirmed for 16GB input dataset. The same workload runs fine with Spark 0.9 in both standalone and yarn cluster mode (for up to 30 GB input dataset on a 6-node cluster). Commandline used: (/opt/cloudera/parcels/CDH/lib/spark/bin/spark-submit --master yarn --deploy-mode cluster --properties-file pagerank.conf --driver-memory 30g --driver-cores 16 --num-executors 5 --class org.apache.spark.examples.SparkPageRank /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0-SNAPSHOT.jar pagerank_in $NUM_ITER) pagerank.conf: spark.masterspark://c1704.halxg.cloudera.com:7077 spark.home /opt/cloudera/parcels/CDH/lib/spark spark.executor.memory 32g spark.default.parallelism 118 spark.cores.max 96 spark.storage.memoryFraction0.6 spark.shuffle.memoryFraction0.3 spark.shuffle.compress true spark.shuffle.spill.compresstrue spark.broadcast.compresstrue spark.rdd.compress false spark.io.compression.codec org.apache.spark.io.LZFCompressionCodec spark.io.compression.snappy.block.size 32768 spark.reducer.maxMbInFlight 48 spark.local.dir /var/lib/jenkins/workspace/tmp spark.driver.memory 30g spark.executor.cores16 spark.locality.wait 6000 spark.executor.instances5 UI shows ExecutorLostFailure. Yarn logs contain numerous exceptions: 14/07/07 17:59:49 WARN network.SendingConnection: Error writing in connection to ConnectionManagerId(a1016.halxg.cloudera.com,54105) java.nio.channels.AsynchronousCloseException at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496) at org.apache.spark.network.SendingConnection.write(Connection.scala:361) at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:142) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703) at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619) at java.io.FilterInputStream.close(FilterInputStream.java:181) at org.apache.hadoop.util.LineReader.close(LineReader.java:150) at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:244) at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226) at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63) at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97) at org.apache.spark.scheduler.Task.run(Task.scala:51) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) --- 14/07/07 17:59:52 WARN network.SendingConnection: Error finishing connection to a1016.halxg.cloudera.com/10.20.184.116:54105 java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) at org.apache.spark.network.SendingConnection.finishConnect(Connection.scala:313)
[jira] [Commented] (SPARK-2201) Improve FlumeInputDStream's stability and make it scalable
[ https://issues.apache.org/jira/browse/SPARK-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054431#comment-14054431 ] llai commented on SPARK-2201: - is it a good idea to add dependency of zookeeper? > Improve FlumeInputDStream's stability and make it scalable > -- > > Key: SPARK-2201 > URL: https://issues.apache.org/jira/browse/SPARK-2201 > Project: Spark > Issue Type: Improvement >Reporter: sunshangchun > > Currently: > FlumeUtils.createStream(ssc, "localhost", port); > This means that only one flume receiver can work with FlumeInputDStream .so > the solution is not scalable. > I use a zookeeper to solve this problem. > Spark flume receivers register themselves to a zk path when started, and a > flume agent get physical hosts and push events to them. > Some works need to be done here: > 1.receiver create tmp node in zk, listeners just watch those tmp nodes. > 2. when spark FlumeReceivers started, they acquire a physical host > (localhost's ip and an idle port) and register itself to zookeeper. > 3. A new flume sink. In the method of appendEvents, they get physical hosts > and push data to them in a round-robin manner. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2376) Selecting list values inside nested JSON objects raises java.lang.IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-2376. - Resolution: Fixed Fix Version/s: 1.0.2 1.1.0 > Selecting list values inside nested JSON objects raises > java.lang.IllegalArgumentException > -- > > Key: SPARK-2376 > URL: https://issues.apache.org/jira/browse/SPARK-2376 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.1 >Reporter: Nicholas Chammas >Assignee: Yin Huai > Fix For: 1.1.0, 1.0.2 > > > Repro script for PySpark, deployed via {{spark-ec2}} at git revision > {{9d5ecf8205b924dc8a3c13fed68beb78cc5c7553}}: > {code} > from pyspark.sql import SQLContext > sqlContext = SQLContext(sc) > raw = sc.parallelize([ > """ > { > "name": "Nick", > "history": { > "countries": [] > } > } > """ > ]) > profiles = sqlContext.jsonRDD(raw) > profiles.registerAsTable("profiles") > profiles.printSchema() > sqlContext.sql("SELECT name FROM profiles;").collect() # works fine > sqlContext.sql("SELECT history FROM profiles;").collect() # raises exception > {code} > Attempting to select the top-level struct that has a nested list value yields > the following error: > {code} > 14/07/06 00:10:26 INFO scheduler.TaskSetManager: Loss was due to > net.razorvine.pickle.PickleException: couldn't introspect javabean: > java.lang.IllegalArgumentException: wrong number of arguments [duplicate 3] > 14/07/06 00:10:26 ERROR scheduler.TaskSetManager: Task 26.0:15 failed 4 > times; aborting job > 14/07/06 00:10:26 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 26.0, > whose tasks have all completed, from pool > 14/07/06 00:10:26 INFO scheduler.TaskSchedulerImpl: Cancelling stage 26 > 14/07/06 00:10:26 INFO scheduler.DAGScheduler: Failed to run collect at > :1 > Traceback (most recent call last): > File "", line 1, in > File "/root/spark/python/pyspark/rdd.py", line 649, in collect > bytesInJava = self._jrdd.collect().iterator() > File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line > 537, in __call__ > File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line > 300, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o286.collect. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task > 26.0:15 failed 4 times, most recent failure: Exception failure in TID 394 on > host ip-10-183-59-125.ec2.internal: net.razorvine.pickle.PickleException: > couldn't introspect javabean: java.lang.IllegalArgumentException: wrong > number of arguments > net.razorvine.pickle.Pickler.put_javabean(Pickler.java:603) > net.razorvine.pickle.Pickler.dispatch(Pickler.java:299) > net.razorvine.pickle.Pickler.save(Pickler.java:125) > net.razorvine.pickle.Pickler.put_map(Pickler.java:322) > net.razorvine.pickle.Pickler.dispatch(Pickler.java:286) > net.razorvine.pickle.Pickler.save(Pickler.java:125) > net.razorvine.pickle.Pickler.put_map(Pickler.java:322) > net.razorvine.pickle.Pickler.dispatch(Pickler.java:286) > net.razorvine.pickle.Pickler.save(Pickler.java:125) > net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:392) > net.razorvine.pickle.Pickler.dispatch(Pickler.java:195) > net.razorvine.pickle.Pickler.save(Pickler.java:125) > net.razorvine.pickle.Pickler.dump(Pickler.java:95) > net.razorvine.pickle.Pickler.dumps(Pickler.java:80) > > org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$3.apply(SchemaRDD.scala:385) > > org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$3.apply(SchemaRDD.scala:385) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > > org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:317) > > org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:203) > > org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:178) > > org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:178) > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1213) > > org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:177) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1041) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1025) > at > org.apache.spark.
[jira] [Created] (SPARK-2397) Get rid of LocalHiveContext
Michael Armbrust created SPARK-2397: --- Summary: Get rid of LocalHiveContext Key: SPARK-2397 URL: https://issues.apache.org/jira/browse/SPARK-2397 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Priority: Blocker Fix For: 1.1.0 HiveLocalContext is nearly completely redundant with HiveContext. We should consider deprecating it and removing all uses. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2360) CSV import to SchemaRDDs
[ https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054262#comment-14054262 ] Hossein Falaki edited comment on SPARK-2360 at 7/8/14 12:45 AM: As a point for comparison the interface in some other popular packages are: _R_: {code} read.csv(filePath, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...) {code} Where: * header: a logical value indicating whether the file contains the names of the variables as its first line. * sep: the field separator character. * quote: the set of quoting characters. To disable quoting altogether, use ‘quote = ""’ * dec: the character used in the file for decimal points. * fill: If ‘TRUE’ then in case the rows have unequal length, blank fields are implicitly added. _pandas_: {code} pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None, compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0, skipinitialspace=False, lineterminator=None, header='infer', index_col=None, names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, na_values=None, na_fvalues=None, true_values=None, false_values=None, delimiter=None, converters=None, dtype=None, usecols=None, engine=None, delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, tupleize_cols=False, infer_datetime_format=False) {code} The description of fields can be found here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html was (Author: falaki): As a point for comparison the interface in some other popular packages are: _R_: {code} read.csv(filePath, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...) {code} Where: header: a logical value indicating whether the file contains the names of the variables as its first line. sep: the field separator character. quote: the set of quoting characters. To disable quoting altogether, use ‘quote = ""’ dec: the character used in the file for decimal points. fill: If ‘TRUE’ then in case the rows have unequal length, blank fields are implicitly added. _pandas_: {code} pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None, compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0, skipinitialspace=False, lineterminator=None, header='infer', index_col=None, names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, na_values=None, na_fvalues=None, true_values=None, false_values=None, delimiter=None, converters=None, dtype=None, usecols=None, engine=None, delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, tupleize_cols=False, infer_datetime_format=False) {code} The description of fields can be found here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html > CSV import to SchemaRDDs > > > Key: SPARK-2360 > URL: https://issues.apache.org/jira/browse/SPARK-2360 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Priority: Minor > > I think the first step it to design the interface that we want to present to > users. Mostly this is defining options when importing. Off the top of my > head: > - What is the separator? > - Provide column names or infer them from the first row. > - how to handle multiple files with possibly different schemas > - do we have a method to let users specify the datatypes of the columns or > are they just strings? > - what types of quoting / escaping do we want to support? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2390) Files in staging directory cannot be deleted and wastes the space of HDFS
[ https://issues.apache.org/jira/browse/SPARK-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054333#comment-14054333 ] Kousuke Saruta edited comment on SPARK-2390 at 7/8/14 12:41 AM: Pull Requested at https://github.com/apache/spark/pull/1326 was (Author: sarutak): Pull Requested at https://github.com/apache/spark/pull/1317 > Files in staging directory cannot be deleted and wastes the space of HDFS > - > > Key: SPARK-2390 > URL: https://issues.apache.org/jira/browse/SPARK-2390 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0, 1.0.1 >Reporter: Kousuke Saruta > > When running jobs with YARN Cluster mode and using HistoryServer, the files > in the Staging Directory cannot be deleted. > HistoryServer uses directory where event log is written, and the directory is > represented as a instance of o.a.h.f.FileSystem created by using > FileSystem.get. > {code:title=FileLogger.scala} > private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir)) > {code} > {code:title=utils.getHadoopFileSystem} > def getHadoopFileSystem(path: URI): FileSystem = { > FileSystem.get(path, SparkHadoopUtil.get.newConfiguration()) > } > {code} > On the other hand, ApplicationMaster has a instance named fs, which also > created by using FileSystem.get. > {code:title=ApplicationMaster} > private val fs = FileSystem.get(yarnConf) > {code} > FileSystem.get returns cached same instance when URI passed to the method > represents same file system and the method is called by same user. > Because of the behavior, when the directory for event log is on HDFS, fs of > ApplicationMaster and fileSystem of FileLogger is same instance. > When shutting down ApplicationMaster, fileSystem.close is called in > FileLogger#stop, which is invoked by SparkContext#stop indirectly. > {code:title=FileLogger.stop} > def stop() { > hadoopDataStream.foreach(_.close()) > writer.foreach(_.close()) > fileSystem.close() > } > {code} > And ApplicationMaster#cleanupStagingDir also called by JVM shutdown hook. In > this method, fs.delete(stagingDirPath) is invoked. > Because fs.delete in ApplicationMaster is called after fileSystem.close in > FileLogger, fs.delete fails and results not deleting files in the staging > directory. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2390) Files in staging directory cannot be deleted and wastes the space of HDFS
[ https://issues.apache.org/jira/browse/SPARK-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054333#comment-14054333 ] Kousuke Saruta commented on SPARK-2390: --- Pull Requested at https://github.com/apache/spark/pull/1317 > Files in staging directory cannot be deleted and wastes the space of HDFS > - > > Key: SPARK-2390 > URL: https://issues.apache.org/jira/browse/SPARK-2390 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0, 1.0.1 >Reporter: Kousuke Saruta > > When running jobs with YARN Cluster mode and using HistoryServer, the files > in the Staging Directory cannot be deleted. > HistoryServer uses directory where event log is written, and the directory is > represented as a instance of o.a.h.f.FileSystem created by using > FileSystem.get. > {code:title=FileLogger.scala} > private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir)) > {code} > {code:title=utils.getHadoopFileSystem} > def getHadoopFileSystem(path: URI): FileSystem = { > FileSystem.get(path, SparkHadoopUtil.get.newConfiguration()) > } > {code} > On the other hand, ApplicationMaster has a instance named fs, which also > created by using FileSystem.get. > {code:title=ApplicationMaster} > private val fs = FileSystem.get(yarnConf) > {code} > FileSystem.get returns cached same instance when URI passed to the method > represents same file system and the method is called by same user. > Because of the behavior, when the directory for event log is on HDFS, fs of > ApplicationMaster and fileSystem of FileLogger is same instance. > When shutting down ApplicationMaster, fileSystem.close is called in > FileLogger#stop, which is invoked by SparkContext#stop indirectly. > {code:title=FileLogger.stop} > def stop() { > hadoopDataStream.foreach(_.close()) > writer.foreach(_.close()) > fileSystem.close() > } > {code} > And ApplicationMaster#cleanupStagingDir also called by JVM shutdown hook. In > this method, fs.delete(stagingDirPath) is invoked. > Because fs.delete in ApplicationMaster is called after fileSystem.close in > FileLogger, fs.delete fails and results not deleting files in the staging > directory. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2375) JSON schema inference may not resolve type conflicts correctly for a field inside an array of structs
[ https://issues.apache.org/jira/browse/SPARK-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2375. - Resolution: Fixed Fix Version/s: 1.0.2 1.1.0 > JSON schema inference may not resolve type conflicts correctly for a field > inside an array of structs > - > > Key: SPARK-2375 > URL: https://issues.apache.org/jira/browse/SPARK-2375 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.1 >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 1.1.0, 1.0.2 > > > For example, for > {code} > {"array": [{"field":214748364700}, {"field":1}]} > {code} > the type of field is resolved as IntType. While, for > {code} > {"array": [{"field":1}, {"field":214748364700}]} > {code} > the type of field is resolved as LongType. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2386) RowWriteSupport should use the exact types to cast.
[ https://issues.apache.org/jira/browse/SPARK-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2386. - Resolution: Fixed Fix Version/s: 1.0.2 1.1.0 Assignee: Takuya Ueshin > RowWriteSupport should use the exact types to cast. > --- > > Key: SPARK-2386 > URL: https://issues.apache.org/jira/browse/SPARK-2386 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin > Fix For: 1.1.0, 1.0.2 > > > When execute {{saveAsParquetFile}} with non-primitive type, > {{RowWriteSupport}} uses wrong type {{Int}} for {{ByteType}} and > {{ShortType}}. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2339) SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery
[ https://issues.apache.org/jira/browse/SPARK-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2339. - Resolution: Fixed Fix Version/s: 1.0.2 1.1.0 > SQL parser in sql-core is case sensitive, but a table alias is converted to > lower case when we create Subquery > -- > > Key: SPARK-2339 > URL: https://issues.apache.org/jira/browse/SPARK-2339 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 1.1.0, 1.0.2 > > > Reported by > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-throws-exception-td8599.html > After we get the table from the catalog, because the table has an alias, we > will temporarily insert a Subquery. Then, we convert the table alias to lower > case no matter if the parser is case sensitive or not. > To see the issue ... > {code} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.createSchemaRDD > case class Person(name: String, age: Int) > val people = > sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p > => Person(p(0), p(1).trim.toInt)) > people.registerAsTable("people") > sqlContext.sql("select PEOPLE.name from people PEOPLE") > {code} > The plan is ... > {code} > == Query Plan == > Project ['PEOPLE.name] > ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at > basicOperators.scala:176 > {code} > You can find that "PEOPLE.name" is not resolved. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2017) web ui stage page becomes unresponsive when the number of tasks is large
[ https://issues.apache.org/jira/browse/SPARK-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054307#comment-14054307 ] Reynold Xin commented on SPARK-2017: We gotta be careful with pagination because right now the web UI allows sorting via some client-side JavaScript. If we enable pagination, we need to make sure the sorting is global across pages (i.e. server side). This might go overboard, but I think rendering is the bottleneck. If that's indeed the case, maybe the right solution is for the server to just send all the data in JSON, and the client side decides what to render (pagination, etc) ... > web ui stage page becomes unresponsive when the number of tasks is large > > > Key: SPARK-2017 > URL: https://issues.apache.org/jira/browse/SPARK-2017 > Project: Spark > Issue Type: Sub-task >Reporter: Reynold Xin > Labels: starter > > {code} > sc.parallelize(1 to 100, 100).count() > {code} > The above code creates one million tasks to be executed. The stage detail web > ui page takes forever to load (if it ever completes). > There are again a few different alternatives: > 0. Limit the number of tasks we show. > 1. Pagination > 2. By default only show the aggregate metrics and failed tasks, and hide the > successful ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2396) Spark EC2 scripts fail when trying to log in to EC2 instances
Stephen M. Hopper created SPARK-2396: Summary: Spark EC2 scripts fail when trying to log in to EC2 instances Key: SPARK-2396 URL: https://issues.apache.org/jira/browse/SPARK-2396 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.0.0 Environment: Windows 8, Cygwin and command prompt, Python 2.7 Reporter: Stephen M. Hopper I cannot seem to successfully start up a Spark EC2 cluster using the spark-ec2 script. I'm using variations on the following command: ./spark-ec2 --instance-type=m1.small --region=us-west-1 --spot-price=0.05 --spark-version=1.0.0 -k my-key-name -i my-key-name.pem -s 1 launch spark-test-cluster The script always allocates the EC2 instances without much trouble, but can never seem to complete the SSH step to install Spark on the cluster. It always complains about my SSH key. If I try to log in with my ssh key doing something like this: ssh -i my-key-name.pem root@ it fails. However, if I log in to the AWS console, click on my instance and select "connect", it displays the instructions for SSHing into my instance (which are no different from the ssh command from above). So, if I rerun the SSH command from above, I'm able to log in. Next, if I try to rerun the spark-ec2 command from above (replacing "launch" with "start"), the script logs in and starts installing Spark. However, it eventually errors out with the following output: Cloning into 'spark-ec2'... remote: Counting objects: 1465, done. remote: Compressing objects: 100% (697/697), done. remote: Total 1465 (delta 485), reused 1465 (delta 485) Receiving objects: 100% (1465/1465), 228.51 KiB | 287 KiB/s, done. Resolving deltas: 100% (485/485), done. Connection to ec2-.us-west-1.compute.amazonaws.com closed. Searching for existing cluster spark-test-cluster... Found 1 master(s), 1 slaves Starting slaves... Starting master... Waiting for instances to start up... Waiting 120 more seconds... Deploying files to master... Traceback (most recent call last): File "./spark_ec2.py", line 823, in main() File "./spark_ec2.py", line 815, in main real_main() File "./spark_ec2.py", line 806, in real_main setup_cluster(conn, master_nodes, slave_nodes, opts, False) File "./spark_ec2.py", line 450, in setup_cluster deploy_files(conn, "deploy.generic", opts, master_nodes, slave_nodes, modules) File "./spark_ec2.py", line 593, in deploy_files subprocess.check_call(command) File "E:\windows_programs\Python27\lib\subprocess.py", line 535, in check_call retcode = call(*popenargs, **kwargs) File "E:\windows_programs\Python27\lib\subprocess.py", line 522, in call return Popen(*popenargs, **kwargs).wait() File "E:\windows_programs\Python27\lib\subprocess.py", line 710, in __init__ errread, errwrite) File "E:\windows_programs\Python27\lib\subprocess.py", line 958, in _execute_child startupinfo) WindowsError: [Error 2] The system cannot find the file specified So, in short, am I missing something or is this a bug? Any help would be appreciated. Other notes: -I've tried both us-west-1 and us-east-1 regions. -I've tried several different instance types. -I've tried playing with the permissions on the ssh key (600, 400, etc.), but to no avail -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2390) Files in staging directory cannot be deleted and wastes the space of HDFS
[ https://issues.apache.org/jira/browse/SPARK-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054298#comment-14054298 ] Kousuke Saruta commented on SPARK-2390: --- Thank you for your comment [~mridulm80]. I noticed FileSystem is to be closed by shutdown hook, and the shutdown hook deletes files in the staging directory and then close FileSystem. So, in this case, I also think there is no problem to remove fs.close. > Files in staging directory cannot be deleted and wastes the space of HDFS > - > > Key: SPARK-2390 > URL: https://issues.apache.org/jira/browse/SPARK-2390 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0, 1.0.1 >Reporter: Kousuke Saruta > > When running jobs with YARN Cluster mode and using HistoryServer, the files > in the Staging Directory cannot be deleted. > HistoryServer uses directory where event log is written, and the directory is > represented as a instance of o.a.h.f.FileSystem created by using > FileSystem.get. > {code:title=FileLogger.scala} > private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir)) > {code} > {code:title=utils.getHadoopFileSystem} > def getHadoopFileSystem(path: URI): FileSystem = { > FileSystem.get(path, SparkHadoopUtil.get.newConfiguration()) > } > {code} > On the other hand, ApplicationMaster has a instance named fs, which also > created by using FileSystem.get. > {code:title=ApplicationMaster} > private val fs = FileSystem.get(yarnConf) > {code} > FileSystem.get returns cached same instance when URI passed to the method > represents same file system and the method is called by same user. > Because of the behavior, when the directory for event log is on HDFS, fs of > ApplicationMaster and fileSystem of FileLogger is same instance. > When shutting down ApplicationMaster, fileSystem.close is called in > FileLogger#stop, which is invoked by SparkContext#stop indirectly. > {code:title=FileLogger.stop} > def stop() { > hadoopDataStream.foreach(_.close()) > writer.foreach(_.close()) > fileSystem.close() > } > {code} > And ApplicationMaster#cleanupStagingDir also called by JVM shutdown hook. In > this method, fs.delete(stagingDirPath) is invoked. > Because fs.delete in ApplicationMaster is called after fileSystem.close in > FileLogger, fs.delete fails and results not deleting files in the staging > directory. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2395) Optimize common like expressions
Michael Armbrust created SPARK-2395: --- Summary: Optimize common like expressions Key: SPARK-2395 URL: https://issues.apache.org/jira/browse/SPARK-2395 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust LIKE queries that are really just a contains, startsWith or endsWith would be much faster if we avoided regular expressions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2360) CSV import to SchemaRDDs
[ https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054262#comment-14054262 ] Hossein Falaki edited comment on SPARK-2360 at 7/7/14 11:04 PM: As a point for comparison the interface in some other popular packages are: _R_: {code} read.csv(filePath, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...) {code} Where: header: a logical value indicating whether the file contains the names of the variables as its first line. sep: the field separator character. quote: the set of quoting characters. To disable quoting altogether, use ‘quote = ""’ dec: the character used in the file for decimal points. fill: If ‘TRUE’ then in case the rows have unequal length, blank fields are implicitly added. _pandas_: {code} pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None, compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0, skipinitialspace=False, lineterminator=None, header='infer', index_col=None, names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, na_values=None, na_fvalues=None, true_values=None, false_values=None, delimiter=None, converters=None, dtype=None, usecols=None, engine=None, delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, tupleize_cols=False, infer_datetime_format=False) {code} The description of fields can be found here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html was (Author: falaki): As a point for comparison the interface in some other popular packages are: _R_: {code} read.csv(filePath, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...) {code} Where: header: a logical value indicating whether the file contains the names of the variables as its first line. sep: the field separator character. quote: the set of quoting characters. To disable quoting altogether, use ‘quote = ""’ dec: the character used in the file for decimal points. fill: If ‘TRUE’ then in case the rows have unequal length, blank fields are implicitly added. __pandas__: ``` pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None, compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0, skipinitialspace=False, lineterminator=None, header='infer', index_col=None, names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, na_values=None, na_fvalues=None, true_values=None, false_values=None, delimiter=None, converters=None, dtype=None, usecols=None, engine=None, delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, tupleize_cols=False, infer_datetime_format=False) ``` The description of fields can be found here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html > CSV import to SchemaRDDs > > > Key: SPARK-2360 > URL: https://issues.apache.org/jira/browse/SPARK-2360 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Priority: Minor > > I think the first step it to design the interface that we want to present to > users. Mostly this is defining options when importing. Off the top of my > head: > - What is the separator? > - Provide column names or infer them from the first row. > - how to handle multiple files with possibly different schemas > - do we have a method to let users specify the datatypes of the columns or > are they just strings? > - what types of quoting / escaping do we want to support? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2360) CSV import to SchemaRDDs
[ https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054262#comment-14054262 ] Hossein Falaki edited comment on SPARK-2360 at 7/7/14 11:03 PM: As a point for comparison the interface in some other popular packages are: _R_: {code} read.csv(filePath, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...) {code} Where: header: a logical value indicating whether the file contains the names of the variables as its first line. sep: the field separator character. quote: the set of quoting characters. To disable quoting altogether, use ‘quote = ""’ dec: the character used in the file for decimal points. fill: If ‘TRUE’ then in case the rows have unequal length, blank fields are implicitly added. __pandas__: ``` pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None, compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0, skipinitialspace=False, lineterminator=None, header='infer', index_col=None, names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, na_values=None, na_fvalues=None, true_values=None, false_values=None, delimiter=None, converters=None, dtype=None, usecols=None, engine=None, delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, tupleize_cols=False, infer_datetime_format=False) ``` The description of fields can be found here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html was (Author: falaki): As a point for comparison the interface in some other popular packages are: __R__: ``` read.csv(filePath, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...) ``` Where: header: a logical value indicating whether the file contains the names of the variables as its first line. sep: the field separator character. quote: the set of quoting characters. To disable quoting altogether, use ‘quote = ""’ dec: the character used in the file for decimal points. fill: If ‘TRUE’ then in case the rows have unequal length, blank fields are implicitly added. __pandas__: ``` pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None, compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0, skipinitialspace=False, lineterminator=None, header='infer', index_col=None, names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, na_values=None, na_fvalues=None, true_values=None, false_values=None, delimiter=None, converters=None, dtype=None, usecols=None, engine=None, delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, tupleize_cols=False, infer_datetime_format=False) ``` The description of fields can be found here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html > CSV import to SchemaRDDs > > > Key: SPARK-2360 > URL: https://issues.apache.org/jira/browse/SPARK-2360 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Priority: Minor > > I think the first step it to design the interface that we want to present to > users. Mostly this is defining options when importing. Off the top of my > head: > - What is the separator? > - Provide column names or infer them from the first row. > - how to handle multiple files with possibly different schemas > - do we have a method to let users specify the datatypes of the columns or > are they just strings? > - what types of quoting / escaping do we want to support? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2360) CSV import to SchemaRDDs
[ https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054262#comment-14054262 ] Hossein Falaki commented on SPARK-2360: --- As a point for comparison the interface in some other popular packages are: __R__: ``` read.csv(filePath, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...) ``` Where: header: a logical value indicating whether the file contains the names of the variables as its first line. sep: the field separator character. quote: the set of quoting characters. To disable quoting altogether, use ‘quote = ""’ dec: the character used in the file for decimal points. fill: If ‘TRUE’ then in case the rows have unequal length, blank fields are implicitly added. __pandas__: ``` pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None, compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0, skipinitialspace=False, lineterminator=None, header='infer', index_col=None, names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, na_values=None, na_fvalues=None, true_values=None, false_values=None, delimiter=None, converters=None, dtype=None, usecols=None, engine=None, delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, tupleize_cols=False, infer_datetime_format=False) ``` The description of fields can be found here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html > CSV import to SchemaRDDs > > > Key: SPARK-2360 > URL: https://issues.apache.org/jira/browse/SPARK-2360 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Priority: Minor > > I think the first step it to design the interface that we want to present to > users. Mostly this is defining options when importing. Off the top of my > head: > - What is the separator? > - Provide column names or infer them from the first row. > - how to handle multiple files with possibly different schemas > - do we have a method to let users specify the datatypes of the columns or > are they just strings? > - what types of quoting / escaping do we want to support? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2360) CSV import to SchemaRDDs
[ https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2360: Description: I think the first step it to design the interface that we want to present to users. Mostly this is defining options when importing. Off the top of my head: - What is the separator? - Provide column names or infer them from the first row. - how to handle multiple files with possibly different schemas - do we have a method to let users specify the datatypes of the columns or are they just strings? - what types of quoting / escaping do we want to support? > CSV import to SchemaRDDs > > > Key: SPARK-2360 > URL: https://issues.apache.org/jira/browse/SPARK-2360 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Priority: Minor > > I think the first step it to design the interface that we want to present to > users. Mostly this is defining options when importing. Off the top of my > head: > - What is the separator? > - Provide column names or infer them from the first row. > - how to handle multiple files with possibly different schemas > - do we have a method to let users specify the datatypes of the columns or > are they just strings? > - what types of quoting / escaping do we want to support? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (SPARK-2378) Implement functionality to read csv files
[ https://issues.apache.org/jira/browse/SPARK-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust closed SPARK-2378. --- Resolution: Duplicate > Implement functionality to read csv files > - > > Key: SPARK-2378 > URL: https://issues.apache.org/jira/browse/SPARK-2378 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 1.0.0 >Reporter: Hossein Falaki >Priority: Minor > Fix For: 1.0.0 > > > Similar to jsonFile(), csvFile() could be used to read data into a SchemaRDD > (or a normal RDD). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer
[ https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1977. -- Resolution: Fixed Fix Version/s: 1.1.0 1.0.1 Issue resolved by pull request 1319 [https://github.com/apache/spark/pull/1319] > mutable.BitSet in ALS not serializable with KryoSerializer > -- > > Key: SPARK-1977 > URL: https://issues.apache.org/jira/browse/SPARK-1977 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Neville Li >Priority: Minor > Fix For: 1.0.1, 1.1.0 > > > OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member. > KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't > register mutable.BitSet. > Right now we have to register mutable.BitSet manually. A proper fix would be > using immutable.BitSet in ALS or register mutable.BitSet in upstream chill. > {code} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID > 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: > com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: > scala.collection.mutable.HashSet > Serialization trace: > shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock) > > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) > > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43) > com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115) > > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125) > org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) > > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77) > org.apache.spark.rdd.RDD.iterator(RDD.scala:227) > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) > org.apache.spark.scheduler.Task.run(Task.scala:51) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) > > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > java.lang.Thread.run(Thread.java:662) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) > at > org.apache.spark.scheduler.DAGSchedu
[jira] [Created] (SPARK-2394) Make it easier to read LZO-compressed files from EC2 clusters
Nicholas Chammas created SPARK-2394: --- Summary: Make it easier to read LZO-compressed files from EC2 clusters Key: SPARK-2394 URL: https://issues.apache.org/jira/browse/SPARK-2394 Project: Spark Issue Type: Improvement Components: EC2, Input/Output Affects Versions: 1.0.0 Reporter: Nicholas Chammas Priority: Minor Amazon hosts [a large Google n-grams data set on S3|https://aws.amazon.com/datasets/8172056142375670]. This data set is perfect, among other things, for putting together interesting and easily reproducible public demos of Spark's capabilities. The problem is that the data set is compressed using LZO, and it is currently more painful than it should be to get your average {{spark-ec2}} cluster to read input compressed in this way. This is what one has to go through to get a Spark cluster created with {{spark-ec2}} to read LZO-compressed files: # Install the latest LZO release, perhaps via {{yum}}. # Download [{{hadoop-lzo}}|https://github.com/twitter/hadoop-lzo] and build it. To build {{hadoop-lzo}} you need Maven. # Install Maven. For some reason, [you cannot install Maven with {{yum}}|http://stackoverflow.com/questions/7532928/how-do-i-install-maven-with-yum], so install it manually. # Update your {{core-site.xml}} and {{spark-env.sh}} with [the appropriate configs|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3cca+-p3aga6f86qcsowp7k_r+8r-dgbmj3gz+4xljzjpr90db...@mail.gmail.com%3E]. # Make [the appropriate calls|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGSPeNE5miQRFHC7-ZwNbicaXfh1-ZXdKJ=saw_mgr...@mail.gmail.com%3E] to {{sc.newAPIHadoopFile}}. This seems like a bit too much work for what we're trying to accomplish. If we expect this to be a common pattern -- reading LZO-compressed files from a {{spark-ec2}} cluster -- it would be great if we could somehow make this less painful. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2384) Add tooltips for shuffle write and scheduler delay in UI
[ https://issues.apache.org/jira/browse/SPARK-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054207#comment-14054207 ] Kay Ousterhout commented on SPARK-2384: --- https://github.com/apache/spark/pull/1314 > Add tooltips for shuffle write and scheduler delay in UI > > > Key: SPARK-2384 > URL: https://issues.apache.org/jira/browse/SPARK-2384 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout >Priority: Minor > > There are a few common points of confusion in the UI that could be clarified > with tooltips. We should add tooltips to explain the scheduler delay and the > shuffle data (to explain why shuffle read is typically < shuffle write, as > many many people have expressed confusion about). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2339) SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery
[ https://issues.apache.org/jira/browse/SPARK-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-2339: Component/s: SQL > SQL parser in sql-core is case sensitive, but a table alias is converted to > lower case when we create Subquery > -- > > Key: SPARK-2339 > URL: https://issues.apache.org/jira/browse/SPARK-2339 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Yin Huai >Assignee: Yin Huai > > Reported by > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-throws-exception-td8599.html > After we get the table from the catalog, because the table has an alias, we > will temporarily insert a Subquery. Then, we convert the table alias to lower > case no matter if the parser is case sensitive or not. > To see the issue ... > {code} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.createSchemaRDD > case class Person(name: String, age: Int) > val people = > sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p > => Person(p(0), p(1).trim.toInt)) > people.registerAsTable("people") > sqlContext.sql("select PEOPLE.name from people PEOPLE") > {code} > The plan is ... > {code} > == Query Plan == > Project ['PEOPLE.name] > ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at > basicOperators.scala:176 > {code} > You can find that "PEOPLE.name" is not resolved. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2205) Unnecessary exchange operators in a join on multiple tables with the same join key.
[ https://issues.apache.org/jira/browse/SPARK-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-2205: Target Version/s: 1.1.0 (was: 1.0.1, 1.1.0) > Unnecessary exchange operators in a join on multiple tables with the same > join key. > --- > > Key: SPARK-2205 > URL: https://issues.apache.org/jira/browse/SPARK-2205 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Minor > > {code} > hql("select * from src x join src y on (x.key=y.key) join src z on > (y.key=z.key)") > SchemaRDD[1] at RDD at SchemaRDD.scala:100 > == Query Plan == > Project [key#4:0,value#5:1,key#6:2,value#7:3,key#8:4,value#9:5] > HashJoin [key#6], [key#8], BuildRight > Exchange (HashPartitioning [key#6], 200) >HashJoin [key#4], [key#6], BuildRight > Exchange (HashPartitioning [key#4], 200) > HiveTableScan [key#4,value#5], (MetastoreRelation default, src, > Some(x)), None > Exchange (HashPartitioning [key#6], 200) > HiveTableScan [key#6,value#7], (MetastoreRelation default, src, > Some(y)), None > Exchange (HashPartitioning [key#8], 200) >HiveTableScan [key#8,value#9], (MetastoreRelation default, src, Some(z)), > None > {code} > However, this is fine... > {code} > hql("select * from src x join src y on (x.key=y.key) join src z on > (x.key=z.key)") > res5: org.apache.spark.sql.SchemaRDD = > SchemaRDD[5] at RDD at SchemaRDD.scala:100 > == Query Plan == > Project [key#26:0,value#27:1,key#28:2,value#29:3,key#30:4,value#31:5] > HashJoin [key#26], [key#30], BuildRight > HashJoin [key#26], [key#28], BuildRight >Exchange (HashPartitioning [key#26], 200) > HiveTableScan [key#26,value#27], (MetastoreRelation default, src, > Some(x)), None >Exchange (HashPartitioning [key#28], 200) > HiveTableScan [key#28,value#29], (MetastoreRelation default, src, > Some(y)), None > Exchange (HashPartitioning [key#30], 200) >HiveTableScan [key#30,value#31], (MetastoreRelation default, src, > Some(z)), None > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2339) SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery
[ https://issues.apache.org/jira/browse/SPARK-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-2339: Fix Version/s: (was: 1.1.0) > SQL parser in sql-core is case sensitive, but a table alias is converted to > lower case when we create Subquery > -- > > Key: SPARK-2339 > URL: https://issues.apache.org/jira/browse/SPARK-2339 > Project: Spark > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Yin Huai > > Reported by > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-throws-exception-td8599.html > After we get the table from the catalog, because the table has an alias, we > will temporarily insert a Subquery. Then, we convert the table alias to lower > case no matter if the parser is case sensitive or not. > To see the issue ... > {code} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.createSchemaRDD > case class Person(name: String, age: Int) > val people = > sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p > => Person(p(0), p(1).trim.toInt)) > people.registerAsTable("people") > sqlContext.sql("select PEOPLE.name from people PEOPLE") > {code} > The plan is ... > {code} > == Query Plan == > Project ['PEOPLE.name] > ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at > basicOperators.scala:176 > {code} > You can find that "PEOPLE.name" is not resolved. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2339) SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery
[ https://issues.apache.org/jira/browse/SPARK-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-2339: Target Version/s: 1.1.0 > SQL parser in sql-core is case sensitive, but a table alias is converted to > lower case when we create Subquery > -- > > Key: SPARK-2339 > URL: https://issues.apache.org/jira/browse/SPARK-2339 > Project: Spark > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Yin Huai > > Reported by > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-throws-exception-td8599.html > After we get the table from the catalog, because the table has an alias, we > will temporarily insert a Subquery. Then, we convert the table alias to lower > case no matter if the parser is case sensitive or not. > To see the issue ... > {code} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.createSchemaRDD > case class Person(name: String, age: Int) > val people = > sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p > => Person(p(0), p(1).trim.toInt)) > people.registerAsTable("people") > sqlContext.sql("select PEOPLE.name from people PEOPLE") > {code} > The plan is ... > {code} > == Query Plan == > Project ['PEOPLE.name] > ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at > basicOperators.scala:176 > {code} > You can find that "PEOPLE.name" is not resolved. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2393) Simple cost estimation and auto selection of broadcast join
Michael Armbrust created SPARK-2393: --- Summary: Simple cost estimation and auto selection of broadcast join Key: SPARK-2393 URL: https://issues.apache.org/jira/browse/SPARK-2393 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Priority: Critical -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2376) Selecting list values inside nested JSON objects raises java.lang.IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2376: Target Version/s: 1.1.0 > Selecting list values inside nested JSON objects raises > java.lang.IllegalArgumentException > -- > > Key: SPARK-2376 > URL: https://issues.apache.org/jira/browse/SPARK-2376 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.1 >Reporter: Nicholas Chammas > > Repro script for PySpark, deployed via {{spark-ec2}} at git revision > {{9d5ecf8205b924dc8a3c13fed68beb78cc5c7553}}: > {code} > from pyspark.sql import SQLContext > sqlContext = SQLContext(sc) > raw = sc.parallelize([ > """ > { > "name": "Nick", > "history": { > "countries": [] > } > } > """ > ]) > profiles = sqlContext.jsonRDD(raw) > profiles.registerAsTable("profiles") > profiles.printSchema() > sqlContext.sql("SELECT name FROM profiles;").collect() # works fine > sqlContext.sql("SELECT history FROM profiles;").collect() # raises exception > {code} > Attempting to select the top-level struct that has a nested list value yields > the following error: > {code} > 14/07/06 00:10:26 INFO scheduler.TaskSetManager: Loss was due to > net.razorvine.pickle.PickleException: couldn't introspect javabean: > java.lang.IllegalArgumentException: wrong number of arguments [duplicate 3] > 14/07/06 00:10:26 ERROR scheduler.TaskSetManager: Task 26.0:15 failed 4 > times; aborting job > 14/07/06 00:10:26 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 26.0, > whose tasks have all completed, from pool > 14/07/06 00:10:26 INFO scheduler.TaskSchedulerImpl: Cancelling stage 26 > 14/07/06 00:10:26 INFO scheduler.DAGScheduler: Failed to run collect at > :1 > Traceback (most recent call last): > File "", line 1, in > File "/root/spark/python/pyspark/rdd.py", line 649, in collect > bytesInJava = self._jrdd.collect().iterator() > File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line > 537, in __call__ > File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line > 300, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o286.collect. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task > 26.0:15 failed 4 times, most recent failure: Exception failure in TID 394 on > host ip-10-183-59-125.ec2.internal: net.razorvine.pickle.PickleException: > couldn't introspect javabean: java.lang.IllegalArgumentException: wrong > number of arguments > net.razorvine.pickle.Pickler.put_javabean(Pickler.java:603) > net.razorvine.pickle.Pickler.dispatch(Pickler.java:299) > net.razorvine.pickle.Pickler.save(Pickler.java:125) > net.razorvine.pickle.Pickler.put_map(Pickler.java:322) > net.razorvine.pickle.Pickler.dispatch(Pickler.java:286) > net.razorvine.pickle.Pickler.save(Pickler.java:125) > net.razorvine.pickle.Pickler.put_map(Pickler.java:322) > net.razorvine.pickle.Pickler.dispatch(Pickler.java:286) > net.razorvine.pickle.Pickler.save(Pickler.java:125) > net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:392) > net.razorvine.pickle.Pickler.dispatch(Pickler.java:195) > net.razorvine.pickle.Pickler.save(Pickler.java:125) > net.razorvine.pickle.Pickler.dump(Pickler.java:95) > net.razorvine.pickle.Pickler.dumps(Pickler.java:80) > > org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$3.apply(SchemaRDD.scala:385) > > org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$3.apply(SchemaRDD.scala:385) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > > org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:317) > > org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:203) > > org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:178) > > org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:178) > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1213) > > org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:177) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1041) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1025) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1023) > at > scala.collect
[jira] [Commented] (SPARK-2017) web ui stage page becomes unresponsive when the number of tasks is large
[ https://issues.apache.org/jira/browse/SPARK-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054178#comment-14054178 ] Masayoshi TSUZUKI commented on SPARK-2017: -- Pagination seems to be better because with aggregated metrics, 1. we can't identify the skew of tasks between the executors. 2. the same problem will appear again when many tasks fail in a certain stage. In addition, when some errors or problems occur under the production environment, we would like to see the status of tasks near the time even if those tasks mostly succeeded. Although every status of tasks is written in the log file, web ui is very useful in operation phase. > web ui stage page becomes unresponsive when the number of tasks is large > > > Key: SPARK-2017 > URL: https://issues.apache.org/jira/browse/SPARK-2017 > Project: Spark > Issue Type: Sub-task >Reporter: Reynold Xin > Labels: starter > > {code} > sc.parallelize(1 to 100, 100).count() > {code} > The above code creates one million tasks to be executed. The stage detail web > ui page takes forever to load (if it ever completes). > There are again a few different alternatives: > 0. Limit the number of tasks we show. > 1. Pagination > 2. By default only show the aggregate metrics and failed tasks, and hide the > successful ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2375) JSON schema inference may not resolve type conflicts correctly for a field inside an array of structs
[ https://issues.apache.org/jira/browse/SPARK-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2375: Target Version/s: 1.1.0 Fix Version/s: (was: 1.1.0) > JSON schema inference may not resolve type conflicts correctly for a field > inside an array of structs > - > > Key: SPARK-2375 > URL: https://issues.apache.org/jira/browse/SPARK-2375 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.1 >Reporter: Yin Huai > > For example, for > {code} > {"array": [{"field":214748364700}, {"field":1}]} > {code} > the type of field is resolved as IntType. While, for > {code} > {"array": [{"field":1}, {"field":214748364700}]} > {code} > the type of field is resolved as LongType. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2205) Unnecessary exchange operators in a join on multiple tables with the same join key.
[ https://issues.apache.org/jira/browse/SPARK-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2205: Priority: Minor (was: Major) > Unnecessary exchange operators in a join on multiple tables with the same > join key. > --- > > Key: SPARK-2205 > URL: https://issues.apache.org/jira/browse/SPARK-2205 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Minor > > {code} > hql("select * from src x join src y on (x.key=y.key) join src z on > (y.key=z.key)") > SchemaRDD[1] at RDD at SchemaRDD.scala:100 > == Query Plan == > Project [key#4:0,value#5:1,key#6:2,value#7:3,key#8:4,value#9:5] > HashJoin [key#6], [key#8], BuildRight > Exchange (HashPartitioning [key#6], 200) >HashJoin [key#4], [key#6], BuildRight > Exchange (HashPartitioning [key#4], 200) > HiveTableScan [key#4,value#5], (MetastoreRelation default, src, > Some(x)), None > Exchange (HashPartitioning [key#6], 200) > HiveTableScan [key#6,value#7], (MetastoreRelation default, src, > Some(y)), None > Exchange (HashPartitioning [key#8], 200) >HiveTableScan [key#8,value#9], (MetastoreRelation default, src, Some(z)), > None > {code} > However, this is fine... > {code} > hql("select * from src x join src y on (x.key=y.key) join src z on > (x.key=z.key)") > res5: org.apache.spark.sql.SchemaRDD = > SchemaRDD[5] at RDD at SchemaRDD.scala:100 > == Query Plan == > Project [key#26:0,value#27:1,key#28:2,value#29:3,key#30:4,value#31:5] > HashJoin [key#26], [key#30], BuildRight > HashJoin [key#26], [key#28], BuildRight >Exchange (HashPartitioning [key#26], 200) > HiveTableScan [key#26,value#27], (MetastoreRelation default, src, > Some(x)), None >Exchange (HashPartitioning [key#28], 200) > HiveTableScan [key#28,value#29], (MetastoreRelation default, src, > Some(y)), None > Exchange (HashPartitioning [key#30], 200) >HiveTableScan [key#30,value#31], (MetastoreRelation default, src, > Some(z)), None > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2190) Specialized ColumnType for Timestamp
[ https://issues.apache.org/jira/browse/SPARK-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2190: Priority: Critical (was: Major) > Specialized ColumnType for Timestamp > > > Key: SPARK-2190 > URL: https://issues.apache.org/jira/browse/SPARK-2190 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Michael Armbrust >Assignee: Cheng Lian >Priority: Critical > > I'm going to call this a bug since currently its like 300X slower than it > needs to be. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2087) Support SQLConf per session
[ https://issues.apache.org/jira/browse/SPARK-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2087: Priority: Minor (was: Major) > Support SQLConf per session > --- > > Key: SPARK-2087 > URL: https://issues.apache.org/jira/browse/SPARK-2087 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Zongheng Yang >Priority: Minor > > For things like the SharkServer we should support configuration per thread > instead of globally for a context. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2097) UDF Support
[ https://issues.apache.org/jira/browse/SPARK-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2097: Priority: Critical (was: Major) > UDF Support > --- > > Key: SPARK-2097 > URL: https://issues.apache.org/jira/browse/SPARK-2097 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Critical > > Right now we only support UDFs that are written against the Hive API or are > called directly as expressions in the DSL. It would be nice to have native > support for registering scala/python functions as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2010) Support for nested data in PySpark SQL
[ https://issues.apache.org/jira/browse/SPARK-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2010: Priority: Critical (was: Major) > Support for nested data in PySpark SQL > -- > > Key: SPARK-2010 > URL: https://issues.apache.org/jira/browse/SPARK-2010 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michael Armbrust >Assignee: Kan Zhang >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2054) Code Generation for Expression Evaluation
[ https://issues.apache.org/jira/browse/SPARK-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2054: Priority: Critical (was: Major) > Code Generation for Expression Evaluation > - > > Key: SPARK-2054 > URL: https://issues.apache.org/jira/browse/SPARK-2054 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2391) LIMIT queries ship a whole partition of data
[ https://issues.apache.org/jira/browse/SPARK-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2391: Priority: Blocker (was: Major) > LIMIT queries ship a whole partition of data > > > Key: SPARK-2391 > URL: https://issues.apache.org/jira/browse/SPARK-2391 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Blocker > > Basically the problem here is that Spark's take() runs jobs using allowLocal > = true. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2179) Public API for DataTypes and Schema
[ https://issues.apache.org/jira/browse/SPARK-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2179: Priority: Critical (was: Major) > Public API for DataTypes and Schema > --- > > Key: SPARK-2179 > URL: https://issues.apache.org/jira/browse/SPARK-2179 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michael Armbrust >Assignee: Yin Huai >Priority: Critical > > We want something like the following: > * Expose DataType in the SQL package and lock down all the internal details > (TypeTags, etc) > * Programatic API for viewing the schema of an RDD as a StructType > * Method that creates a schema RDD given (RDD[A], StructType, A => Row) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2212) Hash Outer Joins
[ https://issues.apache.org/jira/browse/SPARK-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2212: Priority: Minor (was: Major) > Hash Outer Joins > > > Key: SPARK-2212 > URL: https://issues.apache.org/jira/browse/SPARK-2212 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Cheng Hao >Assignee: Cheng Hao >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2390) Files in staging directory cannot be deleted and wastes the space of HDFS
[ https://issues.apache.org/jira/browse/SPARK-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054162#comment-14054162 ] Mridul Muralidharan commented on SPARK-2390: Here, and a bunch of other places, spark currently closes the Filesystem instance : this is incorrect, and should not be done. The fix would be to remove the fs.close; not force creation of new instances. > Files in staging directory cannot be deleted and wastes the space of HDFS > - > > Key: SPARK-2390 > URL: https://issues.apache.org/jira/browse/SPARK-2390 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0, 1.0.1 >Reporter: Kousuke Saruta > > When running jobs with YARN Cluster mode and using HistoryServer, the files > in the Staging Directory cannot be deleted. > HistoryServer uses directory where event log is written, and the directory is > represented as a instance of o.a.h.f.FileSystem created by using > FileSystem.get. > {code:title=FileLogger.scala} > private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir)) > {code} > {code:title=utils.getHadoopFileSystem} > def getHadoopFileSystem(path: URI): FileSystem = { > FileSystem.get(path, SparkHadoopUtil.get.newConfiguration()) > } > {code} > On the other hand, ApplicationMaster has a instance named fs, which also > created by using FileSystem.get. > {code:title=ApplicationMaster} > private val fs = FileSystem.get(yarnConf) > {code} > FileSystem.get returns cached same instance when URI passed to the method > represents same file system and the method is called by same user. > Because of the behavior, when the directory for event log is on HDFS, fs of > ApplicationMaster and fileSystem of FileLogger is same instance. > When shutting down ApplicationMaster, fileSystem.close is called in > FileLogger#stop, which is invoked by SparkContext#stop indirectly. > {code:title=FileLogger.stop} > def stop() { > hadoopDataStream.foreach(_.close()) > writer.foreach(_.close()) > fileSystem.close() > } > {code} > And ApplicationMaster#cleanupStagingDir also called by JVM shutdown hook. In > this method, fs.delete(stagingDirPath) is invoked. > Because fs.delete in ApplicationMaster is called after fileSystem.close in > FileLogger, fs.delete fails and results not deleting files in the staging > directory. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2392) Executors should not start an HTTP server
Andrew Or created SPARK-2392: Summary: Executors should not start an HTTP server Key: SPARK-2392 URL: https://issues.apache.org/jira/browse/SPARK-2392 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or In the long run we should separate out classes used by the driver vs executors in SparkEnv. For now, we should at least not start an unused HTTP on every executor. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2392) Executors should not start their own HTTP servers
[ https://issues.apache.org/jira/browse/SPARK-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2392: - Summary: Executors should not start their own HTTP servers (was: Executors should not start an HTTP server) > Executors should not start their own HTTP servers > - > > Key: SPARK-2392 > URL: https://issues.apache.org/jira/browse/SPARK-2392 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or > > In the long run we should separate out classes used by the driver vs > executors in SparkEnv. For now, we should at least not start an unused HTTP > on every executor. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer
[ https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054043#comment-14054043 ] Neville Li commented on SPARK-1977: --- There you go: https://github.com/apache/spark/pull/1319 > mutable.BitSet in ALS not serializable with KryoSerializer > -- > > Key: SPARK-1977 > URL: https://issues.apache.org/jira/browse/SPARK-1977 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Neville Li >Priority: Minor > > OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member. > KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't > register mutable.BitSet. > Right now we have to register mutable.BitSet manually. A proper fix would be > using immutable.BitSet in ALS or register mutable.BitSet in upstream chill. > {code} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID > 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: > com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: > scala.collection.mutable.HashSet > Serialization trace: > shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock) > > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) > > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43) > com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115) > > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125) > org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) > > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77) > org.apache.spark.rdd.RDD.iterator(RDD.scala:227) > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) > org.apache.spark.scheduler.Task.run(Task.scala:51) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) > > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > java.lang.Thread.run(Thread.java:662) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) > at scala.Option.
[jira] [Commented] (SPARK-2390) Files in staging directory cannot be deleted and wastes the space of HDFS
[ https://issues.apache.org/jira/browse/SPARK-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054041#comment-14054041 ] Kousuke Saruta commented on SPARK-2390: --- I modified the way to create a instance of FileSystem in FileLogger as follows, the symptom didn't appear. {code} private val fileSystem = FileSystem.newInstance(new URI(logDir), SparkHadoopUtil.get.newConfiguration()) {code} > Files in staging directory cannot be deleted and wastes the space of HDFS > - > > Key: SPARK-2390 > URL: https://issues.apache.org/jira/browse/SPARK-2390 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0, 1.0.1 >Reporter: Kousuke Saruta > > When running jobs with YARN Cluster mode and using HistoryServer, the files > in the Staging Directory cannot be deleted. > HistoryServer uses directory where event log is written, and the directory is > represented as a instance of o.a.h.f.FileSystem created by using > FileSystem.get. > {code:title=FileLogger.scala} > private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir)) > {code} > {code:title=utils.getHadoopFileSystem} > def getHadoopFileSystem(path: URI): FileSystem = { > FileSystem.get(path, SparkHadoopUtil.get.newConfiguration()) > } > {code} > On the other hand, ApplicationMaster has a instance named fs, which also > created by using FileSystem.get. > {code:title=ApplicationMaster} > private val fs = FileSystem.get(yarnConf) > {code} > FileSystem.get returns cached same instance when URI passed to the method > represents same file system and the method is called by same user. > Because of the behavior, when the directory for event log is on HDFS, fs of > ApplicationMaster and fileSystem of FileLogger is same instance. > When shutting down ApplicationMaster, fileSystem.close is called in > FileLogger#stop, which is invoked by SparkContext#stop indirectly. > {code:title=FileLogger.stop} > def stop() { > hadoopDataStream.foreach(_.close()) > writer.foreach(_.close()) > fileSystem.close() > } > {code} > And ApplicationMaster#cleanupStagingDir also called by JVM shutdown hook. In > this method, fs.delete(stagingDirPath) is invoked. > Because fs.delete in ApplicationMaster is called after fileSystem.close in > FileLogger, fs.delete fails and results not deleting files in the staging > directory. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2390) Files in staging directory cannot be deleted and wastes the space of HDFS
[ https://issues.apache.org/jira/browse/SPARK-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-2390: -- Affects Version/s: 1.0.1 > Files in staging directory cannot be deleted and wastes the space of HDFS > - > > Key: SPARK-2390 > URL: https://issues.apache.org/jira/browse/SPARK-2390 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0, 1.0.1 >Reporter: Kousuke Saruta > > When running jobs with YARN Cluster mode and using HistoryServer, the files > in the Staging Directory cannot be deleted. > HistoryServer uses directory where event log is written, and the directory is > represented as a instance of o.a.h.f.FileSystem created by using > FileSystem.get. > {code:title=FileLogger.scala} > private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir)) > {code} > {code:title=utils.getHadoopFileSystem} > def getHadoopFileSystem(path: URI): FileSystem = { > FileSystem.get(path, SparkHadoopUtil.get.newConfiguration()) > } > {code} > On the other hand, ApplicationMaster has a instance named fs, which also > created by using FileSystem.get. > {code:title=ApplicationMaster} > private val fs = FileSystem.get(yarnConf) > {code} > FileSystem.get returns cached same instance when URI passed to the method > represents same file system and the method is called by same user. > Because of the behavior, when the directory for event log is on HDFS, fs of > ApplicationMaster and fileSystem of FileLogger is same instance. > When shutting down ApplicationMaster, fileSystem.close is called in > FileLogger#stop, which is invoked by SparkContext#stop indirectly. > {code:title=FileLogger.stop} > def stop() { > hadoopDataStream.foreach(_.close()) > writer.foreach(_.close()) > fileSystem.close() > } > {code} > And ApplicationMaster#cleanupStagingDir also called by JVM shutdown hook. In > this method, fs.delete(stagingDirPath) is invoked. > Because fs.delete in ApplicationMaster is called after fileSystem.close in > FileLogger, fs.delete fails and results not deleting files in the staging > directory. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2311) Added additional GLMs (Poisson and Gamma) into MLlib
[ https://issues.apache.org/jira/browse/SPARK-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaokai Wei updated SPARK-2311: --- Priority: Major (was: Minor) > Added additional GLMs (Poisson and Gamma) into MLlib > > > Key: SPARK-2311 > URL: https://issues.apache.org/jira/browse/SPARK-2311 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiaokai Wei > > Though GeneralizedLinearModel in MLlib 1.0.0 has some important GLMs such as > Logistic Regression, Linear Regression, some other important GLMs like > Poisson Regression are still missing. > Poisson Regression and Gamma Regression are two widely used models in > industry with many applications. This patch added Poisson and Gamma > Regression as additional GeneralizedLinearModels. > https://github.com/apache/spark/pull/1237 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer
[ https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054003#comment-14054003 ] Xiangrui Meng commented on SPARK-1977: -- Do you mind creating a PR registering mutable.BitSet in MovieLensALS.scala and close PR #925? Thanks! > mutable.BitSet in ALS not serializable with KryoSerializer > -- > > Key: SPARK-1977 > URL: https://issues.apache.org/jira/browse/SPARK-1977 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Neville Li >Priority: Minor > > OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member. > KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't > register mutable.BitSet. > Right now we have to register mutable.BitSet manually. A proper fix would be > using immutable.BitSet in ALS or register mutable.BitSet in upstream chill. > {code} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID > 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: > com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: > scala.collection.mutable.HashSet > Serialization trace: > shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock) > > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) > > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43) > com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115) > > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125) > org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) > > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77) > org.apache.spark.rdd.RDD.iterator(RDD.scala:227) > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) > org.apache.spark.scheduler.Task.run(Task.scala:51) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) > > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > java.lang.Thread.run(Thread.java:662) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.a
[jira] [Commented] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer
[ https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053999#comment-14053999 ] Neville Li commented on SPARK-1977: --- [~mengxr] sounds good to me. > mutable.BitSet in ALS not serializable with KryoSerializer > -- > > Key: SPARK-1977 > URL: https://issues.apache.org/jira/browse/SPARK-1977 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Neville Li >Priority: Minor > > OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member. > KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't > register mutable.BitSet. > Right now we have to register mutable.BitSet manually. A proper fix would be > using immutable.BitSet in ALS or register mutable.BitSet in upstream chill. > {code} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID > 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: > com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: > scala.collection.mutable.HashSet > Serialization trace: > shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock) > > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) > > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43) > com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115) > > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125) > org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) > > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77) > org.apache.spark.rdd.RDD.iterator(RDD.scala:227) > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) > org.apache.spark.scheduler.Task.run(Task.scala:51) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) > > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > java.lang.Thread.run(Thread.java:662) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) > at scala.Option.foreach(Option.scala:236) >
[jira] [Updated] (SPARK-2386) RowWriteSupport should use the exact types to cast.
[ https://issues.apache.org/jira/browse/SPARK-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2386: Target Version/s: 1.1.0 > RowWriteSupport should use the exact types to cast. > --- > > Key: SPARK-2386 > URL: https://issues.apache.org/jira/browse/SPARK-2386 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin > > When execute {{saveAsParquetFile}} with non-primitive type, > {{RowWriteSupport}} uses wrong type {{Int}} for {{ByteType}} and > {{ShortType}}. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer
[ https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053994#comment-14053994 ] Xiangrui Meng commented on SPARK-1977: -- I think now I understand when it happens. We use storage level MEMORY_AND_DISK for user/product in/out links, which contains BitSet objects. If the dataset is large, these RDDs will be pushed from in memory storage to on disk storage, where the latter requires serialization. So the easiest way to re-produce this error is changing the storage level of inLinks/outLinks to DISK_ONLY and run with kryo. [~neville] Instead of mapping mutable.BitSet to immutable.BitSet, which introduces overhead, we can register mutable.BitSet in our MovieLensALS example code and wait for the next Chill release. Does it sound good to you? > mutable.BitSet in ALS not serializable with KryoSerializer > -- > > Key: SPARK-1977 > URL: https://issues.apache.org/jira/browse/SPARK-1977 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Neville Li >Priority: Minor > > OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member. > KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't > register mutable.BitSet. > Right now we have to register mutable.BitSet manually. A proper fix would be > using immutable.BitSet in ALS or register mutable.BitSet in upstream chill. > {code} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID > 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: > com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: > scala.collection.mutable.HashSet > Serialization trace: > shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock) > > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626) > > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43) > com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34) > com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) > > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115) > > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125) > org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) > > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77) > org.apache.spark.rdd.RDD.iterator(RDD.scala:227) > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) > org.apache.spark.scheduler.Task.run(Task.scala:51) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) > > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > java.lang.Thread.run(Thread.java:662) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) > at > org.apache.spark.scheduler.
[jira] [Updated] (SPARK-2391) LIMIT queries ship a whole partition of data
[ https://issues.apache.org/jira/browse/SPARK-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2391: Target Version/s: 1.1.0 Fix Version/s: (was: 1.1.0) > LIMIT queries ship a whole partition of data > > > Key: SPARK-2391 > URL: https://issues.apache.org/jira/browse/SPARK-2391 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Michael Armbrust >Assignee: Michael Armbrust > > Basically the problem here is that Spark's take() runs jobs using allowLocal > = true. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2391) LIMIT queries ship a whole partition of data
[ https://issues.apache.org/jira/browse/SPARK-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2391: Fix Version/s: 1.1.0 > LIMIT queries ship a whole partition of data > > > Key: SPARK-2391 > URL: https://issues.apache.org/jira/browse/SPARK-2391 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Michael Armbrust >Assignee: Michael Armbrust > > Basically the problem here is that Spark's take() runs jobs using allowLocal > = true. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2391) LIMIT queries ship a whole partition of data
Michael Armbrust created SPARK-2391: --- Summary: LIMIT queries ship a whole partition of data Key: SPARK-2391 URL: https://issues.apache.org/jira/browse/SPARK-2391 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Michael Armbrust Assignee: Michael Armbrust Basically the problem here is that Spark's take() runs jobs using allowLocal = true. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2390) Files in staging directory cannot be deleted and wastes the space of HDFS
Kousuke Saruta created SPARK-2390: - Summary: Files in staging directory cannot be deleted and wastes the space of HDFS Key: SPARK-2390 URL: https://issues.apache.org/jira/browse/SPARK-2390 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Kousuke Saruta When running jobs with YARN Cluster mode and using HistoryServer, the files in the Staging Directory cannot be deleted. HistoryServer uses directory where event log is written, and the directory is represented as a instance of o.a.h.f.FileSystem created by using FileSystem.get. {code:title=FileLogger.scala} private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir)) {code} {code:title=utils.getHadoopFileSystem} def getHadoopFileSystem(path: URI): FileSystem = { FileSystem.get(path, SparkHadoopUtil.get.newConfiguration()) } {code} On the other hand, ApplicationMaster has a instance named fs, which also created by using FileSystem.get. {code:title=ApplicationMaster} private val fs = FileSystem.get(yarnConf) {code} FileSystem.get returns cached same instance when URI passed to the method represents same file system and the method is called by same user. Because of the behavior, when the directory for event log is on HDFS, fs of ApplicationMaster and fileSystem of FileLogger is same instance. When shutting down ApplicationMaster, fileSystem.close is called in FileLogger#stop, which is invoked by SparkContext#stop indirectly. {code:title=FileLogger.stop} def stop() { hadoopDataStream.foreach(_.close()) writer.foreach(_.close()) fileSystem.close() } {code} And ApplicationMaster#cleanupStagingDir also called by JVM shutdown hook. In this method, fs.delete(stagingDirPath) is invoked. Because fs.delete in ApplicationMaster is called after fileSystem.close in FileLogger, fs.delete fails and results not deleting files in the staging directory. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2152) the error of comput rightNodeAgg about Decision tree algorithm in Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053936#comment-14053936 ] Jon Sondag commented on SPARK-2152: --- https://github.com/apache/spark/pull/1316 (also resolves SPARK-2160) > the error of comput rightNodeAgg about Decision tree algorithm in Spark > MLlib > > > Key: SPARK-2152 > URL: https://issues.apache.org/jira/browse/SPARK-2152 > Project: Spark > Issue Type: Bug >Affects Versions: 1.0.0 > Environment: windows7 ,32 operator,and 3G mem >Reporter: caoli > Labels: features > Original Estimate: 4h > Remaining Estimate: 4h > > the error of comput rightNodeAgg about Decision tree algorithm in Spark > MLlib about the function extractLeftRightNodeAggregates() ,when compute > rightNodeAgg used bindata index is error. in the DecisionTree.scala file > about Line 980: > rightNodeAgg(featureIndex)(2 * (numBins - 2 - splitIndex)) = > binData(shift + (2 * (numBins - 2 - splitIndex))) + > rightNodeAgg(featureIndex)(2 * (numBins - 1 - splitIndex)) > > the binData(shift + (2 * (numBins - 2 - splitIndex))) index compute is > error, so the result of rightNodeAgg include repeated data about "bins" -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2160) error of Decision tree algorithm in Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053938#comment-14053938 ] Jon Sondag commented on SPARK-2160: --- https://github.com/apache/spark/pull/1316 (also resolves SPARK-2152) > error of Decision tree algorithm in Spark MLlib > -- > > Key: SPARK-2160 > URL: https://issues.apache.org/jira/browse/SPARK-2160 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.0 >Reporter: caoli > Labels: patch > Fix For: 1.1.0 > > Original Estimate: 4h > Remaining Estimate: 4h > > the error of comput rightNodeAgg about Decision tree algorithm in Spark > MLlib , in the function extractLeftRightNodeAggregates() ,when compute > rightNodeAgg used bindata index is error. in the DecisionTree.scala file > about Line980: > rightNodeAgg(featureIndex)(2 * (numBins - 2 - splitIndex)) = > binData(shift + (2 * (numBins - 2 - splitIndex))) + > rightNodeAgg(featureIndex)(2 * (numBins - 1 - splitIndex)) > > the binData(shift + (2 * (numBins - 2 - splitIndex))) index compute is > error, so the result of rightNodeAgg include repeated data about "bins" -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2384) Add tooltips for shuffle write and scheduler delay in UI
[ https://issues.apache.org/jira/browse/SPARK-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053834#comment-14053834 ] Sandy Ryza commented on SPARK-2384: --- This is a great idea > Add tooltips for shuffle write and scheduler delay in UI > > > Key: SPARK-2384 > URL: https://issues.apache.org/jira/browse/SPARK-2384 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout >Priority: Minor > > There are a few common points of confusion in the UI that could be clarified > with tooltips. We should add tooltips to explain the scheduler delay and the > shuffle data (to explain why shuffle read is typically < shuffle write, as > many many people have expressed confusion about). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2389) globally shared SparkContext / shared Spark "application"
Robert Stupp created SPARK-2389: --- Summary: globally shared SparkContext / shared Spark "application" Key: SPARK-2389 URL: https://issues.apache.org/jira/browse/SPARK-2389 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Robert Stupp The documentation (in Cluster Mode Overview) cites: bq. Each application gets its own executor processes, which *stay up for the duration of the whole application* and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that *data cannot be shared* across different Spark applications (instances of SparkContext) without writing it to an external storage system. IMO this is a limitation that should be lifted to support any number of --driver-- client processes to share executors and to share (persistent / cached) data. This is especially useful if you have a bunch of frontend servers (dump web app servers) that want to use Spark as a _big computing machine_. Most important is the fact that Spark is quite good in caching/persisting data in memory / on disk thus removing load from backend data stores. Means: it would be really great to let different --driver-- client JVMs operate on the same RDDs and benefit from Spark's caching/persistence. It would however introduce some administration mechanisms to * start a shared context * update the executor configuration (# of worker nodes, # of cpus, etc) on the fly * stop a shared context Even "conventional" batch MR applications would benefit if ran fequently against the same data set. As an implicit requirement, RDD persistence could get a TTL for its materialized state. With such a feature the overall performance of today's web applications could then be increased by adding more web app servers, more spark nodes, more nosql nodes etc -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2388) Streaming from multiple different Kafka topics is problematic
Sergey created SPARK-2388: - Summary: Streaming from multiple different Kafka topics is problematic Key: SPARK-2388 URL: https://issues.apache.org/jira/browse/SPARK-2388 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.0 Reporter: Sergey Fix For: 1.0.1 Default way of creating stream out of Kafka source would be as val stream = KafkaUtils.createStream(ssc,"localhost:2181","logs", Map("retarget" -> 2,"datapair" -> 2)) However, if two topics - in this case "retarget" and "datapair" - are very different, there is no way to set up different filter, mapping functions, etc), as they are effectively merged. However, instance of KafkaInputDStream, created with this call internally calls ConsumerConnector.createMessageStream() which returns *map* of KafkaStreams, keyed by topic. It would be great if this map would be exposed somehow, so aforementioned call val streamS = KafkaUtils.createStreamS(...) returned map of streams. Regards, Sergey Malov Collective Media -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2387) Remove the stage barrier for better resource utilization
Rui Li created SPARK-2387: - Summary: Remove the stage barrier for better resource utilization Key: SPARK-2387 URL: https://issues.apache.org/jira/browse/SPARK-2387 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Rui Li DAGScheduler divides a Spark job into multiple stages according to RDD dependencies. Whenever there’s a shuffle dependency, DAGScheduler creates a shuffle map stage on the map side, and another stage depending on that stage. Currently, the downstream stage cannot start until all its depended stages have finished. This barrier between stages leads to idle slots when waiting for the last few upstream tasks to finish and thus wasting cluster resources. Therefore we propose to remove the barrier and pre-start the reduce stage once there're free slots. This can achieve better resource utilization and improve the overall job performance, especially when there're lots of executors granted to the application. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2386) RowWriteSupport should use the exact types to cast.
[ https://issues.apache.org/jira/browse/SPARK-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053537#comment-14053537 ] Takuya Ueshin commented on SPARK-2386: -- PRed: https://github.com/apache/spark/pull/1315 > RowWriteSupport should use the exact types to cast. > --- > > Key: SPARK-2386 > URL: https://issues.apache.org/jira/browse/SPARK-2386 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin > > When execute {{saveAsParquetFile}} with non-primitive type, > {{RowWriteSupport}} uses wrong type {{Int}} for {{ByteType}} and > {{ShortType}}. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2386) RowWriteSupport should use the exact types to cast.
Takuya Ueshin created SPARK-2386: Summary: RowWriteSupport should use the exact types to cast. Key: SPARK-2386 URL: https://issues.apache.org/jira/browse/SPARK-2386 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin When execute {{saveAsParquetFile}} with non-primitive type, {{RowWriteSupport}} uses wrong type {{Int}} for {{ByteType}} and {{ShortType}}. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2385) Missing guide for running JDBC server on YARN
Yi Tian created SPARK-2385: -- Summary: Missing guide for running JDBC server on YARN Key: SPARK-2385 URL: https://issues.apache.org/jira/browse/SPARK-2385 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.0.0 Reporter: Yi Tian Priority: Minor There are no document for "running JDBC server on YARN" -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2382) build error:
[ https://issues.apache.org/jira/browse/SPARK-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053433#comment-14053433 ] Sean Owen commented on SPARK-2382: -- I can't reproduce this. It is almost certainly a problem with your environment or network, possibly temporary. It's not a Spark issue so I think this should be closed. > build error: > - > > Key: SPARK-2382 > URL: https://issues.apache.org/jira/browse/SPARK-2382 > Project: Spark > Issue Type: Question > Components: Build >Affects Versions: 1.0.0 > Environment: Ubuntu 12.0.4 precise. > spark@ubuntu-cdh5-spark:~/spark-1.0.0$ mvn -version > Apache Maven 3.0.4 > Maven home: /usr/share/maven > Java version: 1.6.0_31, vendor: Sun Microsystems Inc. > Java home: /usr/lib/jvm/j2sdk1.6-oracle/jre > Default locale: en_US, platform encoding: UTF-8 > OS name: "linux", version: "3.11.0-15-generic", arch: "amd64", family: "unix" >Reporter: Mukul Jain > Labels: newbie > > Unable to build. maven can't download dependency .. checked my http_proxy and > https_proxy setting they are working fine. Other http and https dependencies > were downloaded fine.. build process gets stuck always at this repo. manually > down loading also fails and receive an exception. > [INFO] > > [INFO] Building Spark Project External MQTT 1.0.0 > [INFO] > > Downloading: > https://repository.apache.org/content/repositories/releases/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom > Jul 6, 2014 4:53:26 PM org.apache.commons.httpclient.HttpMethodDirector > executeWithRetry > INFO: I/O exception (java.net.ConnectException) caught when processing > request: Connection timed out > Jul 6, 2014 4:53:26 PM org.apache.commons.httpclient.HttpMethodDirector > executeWithRetry > INFO: Retrying request -- This message was sent by Atlassian JIRA (v6.2#6252)