date:20140707

[jira] [Commented] (SPARK-2400) config spark.yarn.max.executor.failures is not explained accurately

2014-07-07 Thread Chen Chao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054542#comment-14054542
 ] 

Chen Chao commented on SPARK-2400:
--

PR @ https://github.com/apache/spark/pull/1282/files

> config spark.yarn.max.executor.failures is not explained accurately
> ---
>
> Key: SPARK-2400
> URL: https://issues.apache.org/jira/browse/SPARK-2400
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Chen Chao
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2400) config spark.yarn.max.executor.failures is not explained accurately

2014-07-07 Thread Chen Chao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054540#comment-14054540
 ] 

Chen Chao commented on SPARK-2400:
--

it should be "numExecutors * 2, with minimum of 3" rather than "2*numExecutors".


> config spark.yarn.max.executor.failures is not explained accurately
> ---
>
> Key: SPARK-2400
> URL: https://issues.apache.org/jira/browse/SPARK-2400
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Chen Chao
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (SPARK-2400) config spark.yarn.max.executor.failures is not explained accurately

2014-07-07 Thread Chen Chao (JIRA)

Chen Chao created SPARK-2400:


 Summary: config spark.yarn.max.executor.failures is not explained 
accurately
 Key: SPARK-2400
 URL: https://issues.apache.org/jira/browse/SPARK-2400
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Chen Chao
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2399) Add support for LZ4 compression

2014-07-07 Thread Greg Bowyer (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Bowyer updated SPARK-2399:
---

Description: 
LZ4 is a compression codec of the same ideas as googles snappy, but has some 
advantages:
* It is faster than snappy with a similar compression ratio
* The implementation is Apache licensed and not GPL

It has shown promise in both the lucene and hadoop communities, and it looks 
like its a really easy add to spark io compression.

Attached is a patch that does this

  was:
LZ4 is a compression codec of the same ideas as googles snappy, but has some 
advantages:
* It is faster than snappy with a similar compression ration
* The implementation is Apache licensed and not GPL

It has shown promise in both the lucene and hadoop communities, and it looks 
like its a really easy add to spark io compression.

Attached is a patch that does this


> Add support for LZ4 compression
> ---
>
> Key: SPARK-2399
> URL: https://issues.apache.org/jira/browse/SPARK-2399
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Greg Bowyer
>  Labels: compression, lz4
> Attachments: SPARK-2399-Make-spark-compression-able-to-use-LZ4.patch
>
>
> LZ4 is a compression codec of the same ideas as googles snappy, but has some 
> advantages:
> * It is faster than snappy with a similar compression ratio
> * The implementation is Apache licensed and not GPL
> It has shown promise in both the lucene and hadoop communities, and it looks 
> like its a really easy add to spark io compression.
> Attached is a patch that does this



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (SPARK-2399) Add support for LZ4 compression

2014-07-07 Thread Greg Bowyer (JIRA)

Greg Bowyer created SPARK-2399:
--

 Summary: Add support for LZ4 compression
 Key: SPARK-2399
 URL: https://issues.apache.org/jira/browse/SPARK-2399
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Greg Bowyer
 Attachments: SPARK-2399-Make-spark-compression-able-to-use-LZ4.patch

LZ4 is a compression codec of the same ideas as googles snappy, but has some 
advantages:
* It is faster than snappy with a similar compression ration
* The implementation is Apache licensed and not GPL

It has shown promise in both the lucene and hadoop communities, and it looks 
like its a really easy add to spark io compression.

Attached is a patch that does this



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2399) Add support for LZ4 compression

2014-07-07 Thread Greg Bowyer (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Bowyer updated SPARK-2399:
---

Attachment: SPARK-2399-Make-spark-compression-able-to-use-LZ4.patch

> Add support for LZ4 compression
> ---
>
> Key: SPARK-2399
> URL: https://issues.apache.org/jira/browse/SPARK-2399
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Greg Bowyer
>  Labels: compression, lz4
> Attachments: SPARK-2399-Make-spark-compression-able-to-use-LZ4.patch
>
>
> LZ4 is a compression codec of the same ideas as googles snappy, but has some 
> advantages:
> * It is faster than snappy with a similar compression ration
> * The implementation is Apache licensed and not GPL
> It has shown promise in both the lucene and hadoop communities, and it looks 
> like its a really easy add to spark io compression.
> Attached is a patch that does this



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2215) Multi-way join

2014-07-07 Thread Yin Huai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054491#comment-14054491
 ] 

Yin Huai commented on SPARK-2215:
-

Instead of implementing a multi-way join operator, how about we just rely on 
AddExchange to properly insert Exchange Operator to avoid unnecessary data 
movements?

> Multi-way join
> --
>
> Key: SPARK-2215
> URL: https://issues.apache.org/jira/browse/SPARK-2215
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Minor
>
> Support the multi-way join (multiple table joins) in a single reduce stage if 
> they have the same join keys.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2303) Poisson regression model for count data

2014-07-07 Thread Gang Bai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054470#comment-14054470
 ] 

Gang Bai commented on SPARK-2303:
-

This change has been merged into another JIRA SPARK-2311. Closing this one.

> Poisson regression model for count data
> ---
>
> Key: SPARK-2303
> URL: https://issues.apache.org/jira/browse/SPARK-2303
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Gang Bai
>
> Modeling count data is of great importance in solving real-world statistic 
> problems. Currently mllib.regression provides models mostly for numeric data, 
> i.e fitting curves with various regularization on resulted weights, but still 
> lacks the support of count data modeling.
> A very basic model for this is the Poisson regression. Following the patterns 
> in mllib and reusing the components, we address the parameter estimation for 
> Poisson regression in a maximum likelihood manner. In detail, to add Poisson 
> regression to mllib.regression, we need to:
>  # Add the gradient of the negative log-likelihood into 
> mllib/optimization/Gradients.scala.
>  # Add the implementations of PoissonRegressionModel, which extends 
> GeneralizedLinearModel with RegressionModel. Here we need the implementation 
> of the predict method.
>  # Add the implementations of the generalized linear algorithm class. Here we 
> can use either LBFGS or GradientDescent as the optimizer. So we implement 
> both as class PoissonRegressionWithSGD and class PoissonRegressionWithLBFGS 
> respectively.
>  # Add the companion object PoissonRegressionWithSGD and 
> PoissonRegressionWithLBFGS as drivers.
>  # Test suites
>  ## Test the gradient computation.
>  ## Test the regression method using generated data, which requires a 
> PoissonRegressionDataGenerator.
>  ## Test the regression method using a real-world data set.
>  # Add the documents.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Closed] (SPARK-2303) Poisson regression model for count data

2014-07-07 Thread Gang Bai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Bai closed SPARK-2303.
---

Resolution: Fixed

> Poisson regression model for count data
> ---
>
> Key: SPARK-2303
> URL: https://issues.apache.org/jira/browse/SPARK-2303
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Gang Bai
>
> Modeling count data is of great importance in solving real-world statistic 
> problems. Currently mllib.regression provides models mostly for numeric data, 
> i.e fitting curves with various regularization on resulted weights, but still 
> lacks the support of count data modeling.
> A very basic model for this is the Poisson regression. Following the patterns 
> in mllib and reusing the components, we address the parameter estimation for 
> Poisson regression in a maximum likelihood manner. In detail, to add Poisson 
> regression to mllib.regression, we need to:
>  # Add the gradient of the negative log-likelihood into 
> mllib/optimization/Gradients.scala.
>  # Add the implementations of PoissonRegressionModel, which extends 
> GeneralizedLinearModel with RegressionModel. Here we need the implementation 
> of the predict method.
>  # Add the implementations of the generalized linear algorithm class. Here we 
> can use either LBFGS or GradientDescent as the optimizer. So we implement 
> both as class PoissonRegressionWithSGD and class PoissonRegressionWithLBFGS 
> respectively.
>  # Add the companion object PoissonRegressionWithSGD and 
> PoissonRegressionWithLBFGS as drivers.
>  # Test suites
>  ## Test the gradient computation.
>  ## Test the regression method using generated data, which requires a 
> PoissonRegressionDataGenerator.
>  ## Test the regression method using a real-world data set.
>  # Add the documents.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2181) The keys for sorting the columns of Executor page in SparkUI are incorrect

2014-07-07 Thread Shuo Xiang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054455#comment-14054455
 ] 

Shuo Xiang commented on SPARK-2181:
---

Hi, I've merged the latest version but the column of Memory Usage inside the 
RDD storage Info tab is not correct. It is still sorted by the string not the 
value.



> The keys for sorting the columns of Executor page in SparkUI are incorrect
> --
>
> Key: SPARK-2181
> URL: https://issues.apache.org/jira/browse/SPARK-2181
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shuo Xiang
>Assignee: Guoqiang Li
>Priority: Minor
> Fix For: 1.1.0, 1.0.2
>
>
> Under the Executor page of SparkUI, each column is sorted alphabetically 
> (after clicking). However, it should be sorted by the value, not the string.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Resolved] (SPARK-2235) Spark SQL basicOperator add Intersect operator

2014-07-07 Thread Michael Armbrust (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2235.
-

   Resolution: Fixed
Fix Version/s: 1.1.0
 Assignee: (was: Michael Armbrust)

>  Spark SQL basicOperator add Intersect operator
> ---
>
> Key: SPARK-2235
> URL: https://issues.apache.org/jira/browse/SPARK-2235
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Yanjie Gao
> Fix For: 1.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2201) Improve FlumeInputDStream's stability and make it scalable

2014-07-07 Thread sunshangchun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054442#comment-14054442
 ] 

sunshangchun commented on SPARK-2201:
-

I don't like it's a problem. 
It's a external module and take no effect on spark core module.
Again, spark core module has already used zookeeper to select leader master.

Thanks 

> Improve FlumeInputDStream's stability and make it scalable
> --
>
> Key: SPARK-2201
> URL: https://issues.apache.org/jira/browse/SPARK-2201
> Project: Spark
>  Issue Type: Improvement
>Reporter: sunshangchun
>
> Currently:
> FlumeUtils.createStream(ssc, "localhost", port); 
> This means that only one flume receiver can work with FlumeInputDStream .so 
> the solution is not scalable. 
> I use a zookeeper to solve this problem.
> Spark flume receivers register themselves to a zk path when started, and a 
> flume agent get physical hosts and push events to them.
> Some works need to be done here: 
> 1.receiver create tmp node in zk,  listeners just watch those tmp nodes.
> 2. when spark FlumeReceivers started, they acquire a physical host 
> (localhost's ip and an idle port) and register itself to zookeeper.
> 3. A new flume sink. In the method of appendEvents, they get physical hosts 
> and push data to them in a round-robin manner.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn

2014-07-07 Thread Guoqiang Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054434#comment-14054434
 ] 

Guoqiang Li commented on SPARK-2398:


Seems to be related to 
[SPARK-1930|https://issues.apache.org/jira/browse/SPARK-1930].
Can you post the yarn node manager log?

> Trouble running Spark 1.0 on Yarn 
> --
>
> Key: SPARK-2398
> URL: https://issues.apache.org/jira/browse/SPARK-2398
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Nishkam Ravi
>
> Trouble running workloads in Spark-on-YARN cluster mode for Spark 1.0. 
> For example: SparkPageRank when run in standalone mode goes through without 
> any errors (tested for up to 30GB input dataset on a 6-node cluster).  Also 
> runs fine for a 1GB dataset in yarn cluster mode. Starts to choke (in yarn 
> cluster mode) as the input data size is increased. Confirmed for 16GB input 
> dataset.
> The same workload runs fine with Spark 0.9 in both standalone and yarn 
> cluster mode (for up to 30 GB input dataset on a 6-node cluster).
> Commandline used:
> (/opt/cloudera/parcels/CDH/lib/spark/bin/spark-submit --master yarn 
> --deploy-mode cluster --properties-file pagerank.conf  --driver-memory 30g 
> --driver-cores 16 --num-executors 5 --class 
> org.apache.spark.examples.SparkPageRank 
> /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0-SNAPSHOT.jar
>  pagerank_in $NUM_ITER)
> pagerank.conf:
> spark.masterspark://c1704.halxg.cloudera.com:7077
> spark.home  /opt/cloudera/parcels/CDH/lib/spark
> spark.executor.memory   32g
> spark.default.parallelism   118
> spark.cores.max 96
> spark.storage.memoryFraction0.6
> spark.shuffle.memoryFraction0.3
> spark.shuffle.compress  true
> spark.shuffle.spill.compresstrue
> spark.broadcast.compresstrue
> spark.rdd.compress  false
> spark.io.compression.codec  org.apache.spark.io.LZFCompressionCodec
> spark.io.compression.snappy.block.size  32768
> spark.reducer.maxMbInFlight 48
> spark.local.dir  /var/lib/jenkins/workspace/tmp
> spark.driver.memory 30g
> spark.executor.cores16
> spark.locality.wait 6000
> spark.executor.instances5
> UI shows ExecutorLostFailure. Yarn logs contain numerous exceptions:
> 14/07/07 17:59:49 WARN network.SendingConnection: Error writing in connection 
> to ConnectionManagerId(a1016.halxg.cloudera.com,54105)
> java.nio.channels.AsynchronousCloseException
> at 
> java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496)
> at 
> org.apache.spark.network.SendingConnection.write(Connection.scala:361)
> at 
> org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:142)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> 
> java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)
> at java.io.FilterInputStream.close(FilterInputStream.java:181)
> at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
> at 
> org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:244)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226)
> at 
> org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
> at 
> org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197)
> at 
> org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
> at 
> org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
> at org.apache.spark.scheduler.Task.run(Task.scala:51)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(Thr

[jira] [Created] (SPARK-2398) Trouble running Spark 1.0 on Yarn

2014-07-07 Thread Nishkam Ravi (JIRA)

Nishkam Ravi created SPARK-2398:
---

 Summary: Trouble running Spark 1.0 on Yarn 
 Key: SPARK-2398
 URL: https://issues.apache.org/jira/browse/SPARK-2398
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Nishkam Ravi


Trouble running workloads in Spark-on-YARN cluster mode for Spark 1.0. 

For example: SparkPageRank when run in standalone mode goes through without any 
errors (tested for up to 30GB input dataset on a 6-node cluster).  Also runs 
fine for a 1GB dataset in yarn cluster mode. Starts to choke (in yarn cluster 
mode) as the input data size is increased. Confirmed for 16GB input dataset.

The same workload runs fine with Spark 0.9 in both standalone and yarn cluster 
mode (for up to 30 GB input dataset on a 6-node cluster).

Commandline used:

(/opt/cloudera/parcels/CDH/lib/spark/bin/spark-submit --master yarn 
--deploy-mode cluster --properties-file pagerank.conf  --driver-memory 30g 
--driver-cores 16 --num-executors 5 --class 
org.apache.spark.examples.SparkPageRank 
/opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0-SNAPSHOT.jar
 pagerank_in $NUM_ITER)

pagerank.conf:

spark.masterspark://c1704.halxg.cloudera.com:7077
spark.home  /opt/cloudera/parcels/CDH/lib/spark
spark.executor.memory   32g
spark.default.parallelism   118
spark.cores.max 96
spark.storage.memoryFraction0.6
spark.shuffle.memoryFraction0.3
spark.shuffle.compress  true
spark.shuffle.spill.compresstrue
spark.broadcast.compresstrue
spark.rdd.compress  false
spark.io.compression.codec  org.apache.spark.io.LZFCompressionCodec
spark.io.compression.snappy.block.size  32768
spark.reducer.maxMbInFlight 48
spark.local.dir  /var/lib/jenkins/workspace/tmp
spark.driver.memory 30g
spark.executor.cores16
spark.locality.wait 6000
spark.executor.instances5

UI shows ExecutorLostFailure. Yarn logs contain numerous exceptions:

14/07/07 17:59:49 WARN network.SendingConnection: Error writing in connection 
to ConnectionManagerId(a1016.halxg.cloudera.com,54105)
java.nio.channels.AsynchronousCloseException
at 
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496)
at 
org.apache.spark.network.SendingConnection.write(Connection.scala:361)
at 
org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:142)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)



java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)
at java.io.FilterInputStream.close(FilterInputStream.java:181)
at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
at 
org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:244)
at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226)
at 
org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
at 
org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197)
at 
org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
at 
org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
---

14/07/07 17:59:52 WARN network.SendingConnection: Error finishing connection to 
a1016.halxg.cloudera.com/10.20.184.116:54105
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at 
org.apache.spark.network.SendingConnection.finishConnect(Connection.scala:313)

[jira] [Commented] (SPARK-2201) Improve FlumeInputDStream's stability and make it scalable

2014-07-07 Thread llai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054431#comment-14054431
 ] 

llai commented on SPARK-2201:
-

is it a good idea to add dependency of zookeeper?

> Improve FlumeInputDStream's stability and make it scalable
> --
>
> Key: SPARK-2201
> URL: https://issues.apache.org/jira/browse/SPARK-2201
> Project: Spark
>  Issue Type: Improvement
>Reporter: sunshangchun
>
> Currently:
> FlumeUtils.createStream(ssc, "localhost", port); 
> This means that only one flume receiver can work with FlumeInputDStream .so 
> the solution is not scalable. 
> I use a zookeeper to solve this problem.
> Spark flume receivers register themselves to a zk path when started, and a 
> flume agent get physical hosts and push events to them.
> Some works need to be done here: 
> 1.receiver create tmp node in zk,  listeners just watch those tmp nodes.
> 2. when spark FlumeReceivers started, they acquire a physical host 
> (localhost's ip and an idle port) and register itself to zookeeper.
> 3. A new flume sink. In the method of appendEvents, they get physical hosts 
> and push data to them in a round-robin manner.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Resolved] (SPARK-2376) Selecting list values inside nested JSON objects raises java.lang.IllegalArgumentException

2014-07-07 Thread Yin Huai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-2376.
-

   Resolution: Fixed
Fix Version/s: 1.0.2
   1.1.0

> Selecting list values inside nested JSON objects raises 
> java.lang.IllegalArgumentException
> --
>
> Key: SPARK-2376
> URL: https://issues.apache.org/jira/browse/SPARK-2376
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.1
>Reporter: Nicholas Chammas
>Assignee: Yin Huai
> Fix For: 1.1.0, 1.0.2
>
>
> Repro script for PySpark, deployed via {{spark-ec2}} at git revision 
> {{9d5ecf8205b924dc8a3c13fed68beb78cc5c7553}}:
> {code}
> from pyspark.sql import SQLContext
> sqlContext = SQLContext(sc)
> raw = sc.parallelize([
> """
> {
> "name": "Nick",
> "history": {
> "countries": []
> }
> }
> """
> ])
> profiles = sqlContext.jsonRDD(raw)
> profiles.registerAsTable("profiles")
> profiles.printSchema()
> sqlContext.sql("SELECT name FROM profiles;").collect() # works fine
> sqlContext.sql("SELECT history FROM profiles;").collect()  # raises exception
> {code}
> Attempting to select the top-level struct that has a nested list value yields 
> the following error:
> {code}
> 14/07/06 00:10:26 INFO scheduler.TaskSetManager: Loss was due to 
> net.razorvine.pickle.PickleException: couldn't introspect javabean: 
> java.lang.IllegalArgumentException: wrong number of arguments [duplicate 3]
> 14/07/06 00:10:26 ERROR scheduler.TaskSetManager: Task 26.0:15 failed 4 
> times; aborting job
> 14/07/06 00:10:26 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 26.0, 
> whose tasks have all completed, from pool 
> 14/07/06 00:10:26 INFO scheduler.TaskSchedulerImpl: Cancelling stage 26
> 14/07/06 00:10:26 INFO scheduler.DAGScheduler: Failed to run collect at 
> :1
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/root/spark/python/pyspark/rdd.py", line 649, in collect
> bytesInJava = self._jrdd.collect().iterator()
>   File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 
> 537, in __call__
>   File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 
> 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o286.collect.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 
> 26.0:15 failed 4 times, most recent failure: Exception failure in TID 394 on 
> host ip-10-183-59-125.ec2.internal: net.razorvine.pickle.PickleException: 
> couldn't introspect javabean: java.lang.IllegalArgumentException: wrong 
> number of arguments
> net.razorvine.pickle.Pickler.put_javabean(Pickler.java:603)
> net.razorvine.pickle.Pickler.dispatch(Pickler.java:299)
> net.razorvine.pickle.Pickler.save(Pickler.java:125)
> net.razorvine.pickle.Pickler.put_map(Pickler.java:322)
> net.razorvine.pickle.Pickler.dispatch(Pickler.java:286)
> net.razorvine.pickle.Pickler.save(Pickler.java:125)
> net.razorvine.pickle.Pickler.put_map(Pickler.java:322)
> net.razorvine.pickle.Pickler.dispatch(Pickler.java:286)
> net.razorvine.pickle.Pickler.save(Pickler.java:125)
> net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:392)
> net.razorvine.pickle.Pickler.dispatch(Pickler.java:195)
> net.razorvine.pickle.Pickler.save(Pickler.java:125)
> net.razorvine.pickle.Pickler.dump(Pickler.java:95)
> net.razorvine.pickle.Pickler.dumps(Pickler.java:80)
> 
> org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$3.apply(SchemaRDD.scala:385)
> 
> org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$3.apply(SchemaRDD.scala:385)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> 
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:317)
> 
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:203)
> 
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:178)
> 
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:178)
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1213)
> 
> org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:177)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1041)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1025)
>   at 
> org.apache.spark.

[jira] [Created] (SPARK-2397) Get rid of LocalHiveContext

2014-07-07 Thread Michael Armbrust (JIRA)

Michael Armbrust created SPARK-2397:
---

 Summary: Get rid of LocalHiveContext
 Key: SPARK-2397
 URL: https://issues.apache.org/jira/browse/SPARK-2397
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
Priority: Blocker
 Fix For: 1.1.0


HiveLocalContext is nearly completely redundant with HiveContext.  We should 
consider deprecating it and removing all uses.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Comment Edited] (SPARK-2360) CSV import to SchemaRDDs

2014-07-07 Thread Hossein Falaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054262#comment-14054262
 ] 

Hossein Falaki edited comment on SPARK-2360 at 7/8/14 12:45 AM:


As a point for comparison the interface in some other popular packages are:
_R_:
{code}
read.csv(filePath, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = 
TRUE, comment.char = "", ...)
{code}

Where:
* header: a logical value indicating whether the file contains the names of the 
variables as its first line.
* sep: the field separator character. 
* quote: the set of quoting characters. To disable quoting altogether, use 
‘quote = ""’
* dec: the character used in the file for decimal points.
* fill: If ‘TRUE’ then in case the rows have unequal length, blank fields are 
implicitly added.

_pandas_:
{code}
pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None, 
compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0, 
skipinitialspace=False, lineterminator=None, header='infer', index_col=None, 
names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, 
na_values=None, na_fvalues=None, true_values=None, false_values=None, 
delimiter=None, converters=None, dtype=None, usecols=None, engine=None, 
delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, 
use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, 
error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, 
decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, 
date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, 
verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, 
tupleize_cols=False, infer_datetime_format=False)
{code}
The description of fields can be found here: 
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html


was (Author: falaki):
As a point for comparison the interface in some other popular packages are:
_R_:
{code}
read.csv(filePath, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = 
TRUE, comment.char = "", ...)
{code}

Where:
header: a logical value indicating whether the file contains the names of the 
variables as its first line.
sep: the field separator character. 
quote: the set of quoting characters. To disable quoting altogether, use ‘quote 
= ""’
dec: the character used in the file for decimal points.
fill: If ‘TRUE’ then in case the rows have unequal length, blank fields are 
implicitly added.

_pandas_:
{code}
pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None, 
compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0, 
skipinitialspace=False, lineterminator=None, header='infer', index_col=None, 
names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, 
na_values=None, na_fvalues=None, true_values=None, false_values=None, 
delimiter=None, converters=None, dtype=None, usecols=None, engine=None, 
delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, 
use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, 
error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, 
decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, 
date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, 
verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, 
tupleize_cols=False, infer_datetime_format=False)
{code}
The description of fields can be found here: 
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html

> CSV import to SchemaRDDs
> 
>
> Key: SPARK-2360
> URL: https://issues.apache.org/jira/browse/SPARK-2360
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Priority: Minor
>
> I think the first step it to design the interface that we want to present to 
> users.  Mostly this is defining options when importing.  Off the top of my 
> head:
> - What is the separator?
> - Provide column names or infer them from the first row.
> - how to handle multiple files with possibly different schemas
> - do we have a method to let users specify the datatypes of the columns or 
> are they just strings?
> - what types of quoting / escaping do we want to support?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Comment Edited] (SPARK-2390) Files in staging directory cannot be deleted and wastes the space of HDFS

2014-07-07 Thread Kousuke Saruta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054333#comment-14054333
 ] 

Kousuke Saruta edited comment on SPARK-2390 at 7/8/14 12:41 AM:


Pull Requested at https://github.com/apache/spark/pull/1326


was (Author: sarutak):
Pull Requested at https://github.com/apache/spark/pull/1317

> Files in staging directory cannot be deleted and wastes the space of HDFS
> -
>
> Key: SPARK-2390
> URL: https://issues.apache.org/jira/browse/SPARK-2390
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Kousuke Saruta
>
> When running jobs with YARN Cluster mode and using HistoryServer, the files 
> in the Staging Directory cannot be deleted.
> HistoryServer uses directory where event log is written, and the directory is 
> represented as a instance of o.a.h.f.FileSystem created by using 
> FileSystem.get.
> {code:title=FileLogger.scala}
> private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir))
> {code}
> {code:title=utils.getHadoopFileSystem}
> def getHadoopFileSystem(path: URI): FileSystem = {
>   FileSystem.get(path, SparkHadoopUtil.get.newConfiguration())
> }
> {code}
> On the other hand, ApplicationMaster has a instance named fs, which also 
> created by using FileSystem.get.
> {code:title=ApplicationMaster}
> private val fs = FileSystem.get(yarnConf)
> {code}
> FileSystem.get returns cached same instance when URI passed to the method 
> represents same file system and the method is called by same user.
> Because of the behavior, when the directory for event log is on HDFS, fs of 
> ApplicationMaster and fileSystem of FileLogger is same instance.
> When shutting down ApplicationMaster, fileSystem.close is called in 
> FileLogger#stop, which is invoked by SparkContext#stop indirectly.
> {code:title=FileLogger.stop}
> def stop() {
>   hadoopDataStream.foreach(_.close())
>   writer.foreach(_.close())
>   fileSystem.close()
> }
> {code}
> And  ApplicationMaster#cleanupStagingDir also called by JVM shutdown hook. In 
> this method, fs.delete(stagingDirPath) is invoked. 
> Because fs.delete in ApplicationMaster is called after fileSystem.close in 
> FileLogger, fs.delete fails and results not deleting files in the staging 
> directory.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2390) Files in staging directory cannot be deleted and wastes the space of HDFS

2014-07-07 Thread Kousuke Saruta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054333#comment-14054333
 ] 

Kousuke Saruta commented on SPARK-2390:
---

Pull Requested at https://github.com/apache/spark/pull/1317

> Files in staging directory cannot be deleted and wastes the space of HDFS
> -
>
> Key: SPARK-2390
> URL: https://issues.apache.org/jira/browse/SPARK-2390
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Kousuke Saruta
>
> When running jobs with YARN Cluster mode and using HistoryServer, the files 
> in the Staging Directory cannot be deleted.
> HistoryServer uses directory where event log is written, and the directory is 
> represented as a instance of o.a.h.f.FileSystem created by using 
> FileSystem.get.
> {code:title=FileLogger.scala}
> private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir))
> {code}
> {code:title=utils.getHadoopFileSystem}
> def getHadoopFileSystem(path: URI): FileSystem = {
>   FileSystem.get(path, SparkHadoopUtil.get.newConfiguration())
> }
> {code}
> On the other hand, ApplicationMaster has a instance named fs, which also 
> created by using FileSystem.get.
> {code:title=ApplicationMaster}
> private val fs = FileSystem.get(yarnConf)
> {code}
> FileSystem.get returns cached same instance when URI passed to the method 
> represents same file system and the method is called by same user.
> Because of the behavior, when the directory for event log is on HDFS, fs of 
> ApplicationMaster and fileSystem of FileLogger is same instance.
> When shutting down ApplicationMaster, fileSystem.close is called in 
> FileLogger#stop, which is invoked by SparkContext#stop indirectly.
> {code:title=FileLogger.stop}
> def stop() {
>   hadoopDataStream.foreach(_.close())
>   writer.foreach(_.close())
>   fileSystem.close()
> }
> {code}
> And  ApplicationMaster#cleanupStagingDir also called by JVM shutdown hook. In 
> this method, fs.delete(stagingDirPath) is invoked. 
> Because fs.delete in ApplicationMaster is called after fileSystem.close in 
> FileLogger, fs.delete fails and results not deleting files in the staging 
> directory.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Resolved] (SPARK-2375) JSON schema inference may not resolve type conflicts correctly for a field inside an array of structs

2014-07-07 Thread Michael Armbrust (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2375.
-

   Resolution: Fixed
Fix Version/s: 1.0.2
   1.1.0

> JSON schema inference may not resolve type conflicts correctly for a field 
> inside an array of structs
> -
>
> Key: SPARK-2375
> URL: https://issues.apache.org/jira/browse/SPARK-2375
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.1
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 1.1.0, 1.0.2
>
>
> For example, for
> {code}
> {"array": [{"field":214748364700}, {"field":1}]}
> {code}
> the type of field is resolved as IntType. While, for
> {code}
> {"array": [{"field":1}, {"field":214748364700}]}
> {code}
> the type of field is resolved as LongType.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Resolved] (SPARK-2386) RowWriteSupport should use the exact types to cast.

2014-07-07 Thread Michael Armbrust (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2386.
-

   Resolution: Fixed
Fix Version/s: 1.0.2
   1.1.0
 Assignee: Takuya Ueshin

> RowWriteSupport should use the exact types to cast.
> ---
>
> Key: SPARK-2386
> URL: https://issues.apache.org/jira/browse/SPARK-2386
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 1.1.0, 1.0.2
>
>
> When execute {{saveAsParquetFile}} with non-primitive type, 
> {{RowWriteSupport}} uses wrong type {{Int}} for {{ByteType}} and 
> {{ShortType}}.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Resolved] (SPARK-2339) SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery

2014-07-07 Thread Michael Armbrust (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2339.
-

   Resolution: Fixed
Fix Version/s: 1.0.2
   1.1.0

> SQL parser in sql-core is case sensitive, but a table alias is converted to 
> lower case when we create Subquery
> --
>
> Key: SPARK-2339
> URL: https://issues.apache.org/jira/browse/SPARK-2339
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 1.1.0, 1.0.2
>
>
> Reported by 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-throws-exception-td8599.html
> After we get the table from the catalog, because the table has an alias, we 
> will temporarily insert a Subquery. Then, we convert the table alias to lower 
> case no matter if the parser is case sensitive or not.
> To see the issue ...
> {code}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.createSchemaRDD
> case class Person(name: String, age: Int)
> val people = 
> sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p 
> => Person(p(0), p(1).trim.toInt))
> people.registerAsTable("people")
> sqlContext.sql("select PEOPLE.name from people PEOPLE")
> {code}
> The plan is ...
> {code}
> == Query Plan ==
> Project ['PEOPLE.name]
>  ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at 
> basicOperators.scala:176
> {code}
> You can find that "PEOPLE.name" is not resolved.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2017) web ui stage page becomes unresponsive when the number of tasks is large

2014-07-07 Thread Reynold Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054307#comment-14054307
 ] 

Reynold Xin commented on SPARK-2017:


We gotta be careful with pagination because right now the web UI allows sorting 
via some client-side JavaScript. If we enable pagination, we need to make sure 
the sorting is global across pages (i.e. server side).

This might go overboard, but I think rendering is the bottleneck. If that's 
indeed the case, maybe the right solution is for the server to just send all 
the data in JSON, and the client side decides what to render (pagination, etc) 
...


> web ui stage page becomes unresponsive when the number of tasks is large
> 
>
> Key: SPARK-2017
> URL: https://issues.apache.org/jira/browse/SPARK-2017
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Reynold Xin
>  Labels: starter
>
> {code}
> sc.parallelize(1 to 100, 100).count()
> {code}
> The above code creates one million tasks to be executed. The stage detail web 
> ui page takes forever to load (if it ever completes).
> There are again a few different alternatives:
> 0. Limit the number of tasks we show.
> 1. Pagination
> 2. By default only show the aggregate metrics and failed tasks, and hide the 
> successful ones.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (SPARK-2396) Spark EC2 scripts fail when trying to log in to EC2 instances

2014-07-07 Thread Stephen M. Hopper (JIRA)

Stephen M. Hopper created SPARK-2396:


 Summary: Spark EC2 scripts fail when trying to log in to EC2 
instances
 Key: SPARK-2396
 URL: https://issues.apache.org/jira/browse/SPARK-2396
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.0.0
 Environment: Windows 8, Cygwin and command prompt, Python 2.7
Reporter: Stephen M. Hopper


I cannot seem to successfully start up a Spark EC2 cluster using the spark-ec2 
script.

I'm using variations on the following command:
./spark-ec2 --instance-type=m1.small --region=us-west-1 --spot-price=0.05 
--spark-version=1.0.0 -k my-key-name -i my-key-name.pem -s 1 launch 
spark-test-cluster

The script always allocates the EC2 instances without much trouble, but can 
never seem to complete the SSH step to install Spark on the cluster.  It always 
complains about my SSH key.  If I try to log in with my ssh key doing something 
like this:

ssh -i my-key-name.pem root@

it fails.  However, if I log in to the AWS console, click on my instance and 
select "connect", it displays the instructions for SSHing into my instance 
(which are no different from the ssh command from above).  So, if I rerun the 
SSH command from above, I'm able to log in.

Next, if I try to rerun the spark-ec2 command from above (replacing "launch" 
with "start"), the script logs in and starts installing Spark.  However, it 
eventually errors out with the following output:

Cloning into 'spark-ec2'...
remote: Counting objects: 1465, done.
remote: Compressing objects: 100% (697/697), done.
remote: Total 1465 (delta 485), reused 1465 (delta 485)
Receiving objects: 100% (1465/1465), 228.51 KiB | 287 KiB/s, done.
Resolving deltas: 100% (485/485), done.
Connection to ec2-.us-west-1.compute.amazonaws.com closed.
Searching for existing cluster spark-test-cluster...
Found 1 master(s), 1 slaves
Starting slaves...
Starting master...
Waiting for instances to start up...
Waiting 120 more seconds...
Deploying files to master...
Traceback (most recent call last):
  File "./spark_ec2.py", line 823, in 
main()
  File "./spark_ec2.py", line 815, in main
real_main()
  File "./spark_ec2.py", line 806, in real_main
setup_cluster(conn, master_nodes, slave_nodes, opts, False)
  File "./spark_ec2.py", line 450, in setup_cluster
deploy_files(conn, "deploy.generic", opts, master_nodes, slave_nodes, 
modules)
  File "./spark_ec2.py", line 593, in deploy_files
subprocess.check_call(command)
  File "E:\windows_programs\Python27\lib\subprocess.py", line 535, in check_call
retcode = call(*popenargs, **kwargs)
  File "E:\windows_programs\Python27\lib\subprocess.py", line 522, in call
return Popen(*popenargs, **kwargs).wait()
  File "E:\windows_programs\Python27\lib\subprocess.py", line 710, in __init__
errread, errwrite)
  File "E:\windows_programs\Python27\lib\subprocess.py", line 958, in 
_execute_child
startupinfo)
WindowsError: [Error 2] The system cannot find the file specified


So, in short, am I missing something or is this a bug?  Any help would be 
appreciated.

Other notes:
-I've tried both us-west-1 and us-east-1 regions.
-I've tried several different instance types.
-I've tried playing with the permissions on the ssh key (600, 400, etc.), but 
to no avail



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2390) Files in staging directory cannot be deleted and wastes the space of HDFS

2014-07-07 Thread Kousuke Saruta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054298#comment-14054298
 ] 

Kousuke Saruta commented on SPARK-2390:
---

Thank you for your comment [~mridulm80].

I noticed FileSystem is to be closed by shutdown hook, and the shutdown hook 
deletes files in the staging directory and then close FileSystem.
So, in this case, I also think there is no problem to remove fs.close.


> Files in staging directory cannot be deleted and wastes the space of HDFS
> -
>
> Key: SPARK-2390
> URL: https://issues.apache.org/jira/browse/SPARK-2390
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Kousuke Saruta
>
> When running jobs with YARN Cluster mode and using HistoryServer, the files 
> in the Staging Directory cannot be deleted.
> HistoryServer uses directory where event log is written, and the directory is 
> represented as a instance of o.a.h.f.FileSystem created by using 
> FileSystem.get.
> {code:title=FileLogger.scala}
> private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir))
> {code}
> {code:title=utils.getHadoopFileSystem}
> def getHadoopFileSystem(path: URI): FileSystem = {
>   FileSystem.get(path, SparkHadoopUtil.get.newConfiguration())
> }
> {code}
> On the other hand, ApplicationMaster has a instance named fs, which also 
> created by using FileSystem.get.
> {code:title=ApplicationMaster}
> private val fs = FileSystem.get(yarnConf)
> {code}
> FileSystem.get returns cached same instance when URI passed to the method 
> represents same file system and the method is called by same user.
> Because of the behavior, when the directory for event log is on HDFS, fs of 
> ApplicationMaster and fileSystem of FileLogger is same instance.
> When shutting down ApplicationMaster, fileSystem.close is called in 
> FileLogger#stop, which is invoked by SparkContext#stop indirectly.
> {code:title=FileLogger.stop}
> def stop() {
>   hadoopDataStream.foreach(_.close())
>   writer.foreach(_.close())
>   fileSystem.close()
> }
> {code}
> And  ApplicationMaster#cleanupStagingDir also called by JVM shutdown hook. In 
> this method, fs.delete(stagingDirPath) is invoked. 
> Because fs.delete in ApplicationMaster is called after fileSystem.close in 
> FileLogger, fs.delete fails and results not deleting files in the staging 
> directory.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (SPARK-2395) Optimize common like expressions

2014-07-07 Thread Michael Armbrust (JIRA)

Michael Armbrust created SPARK-2395:
---

 Summary: Optimize common like expressions
 Key: SPARK-2395
 URL: https://issues.apache.org/jira/browse/SPARK-2395
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust


LIKE queries that are really just a contains, startsWith or endsWith would be 
much faster if we avoided regular expressions.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Comment Edited] (SPARK-2360) CSV import to SchemaRDDs

2014-07-07 Thread Hossein Falaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054262#comment-14054262
 ] 

Hossein Falaki edited comment on SPARK-2360 at 7/7/14 11:04 PM:


As a point for comparison the interface in some other popular packages are:
_R_:
{code}
read.csv(filePath, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = 
TRUE, comment.char = "", ...)
{code}

Where:
header: a logical value indicating whether the file contains the names of the 
variables as its first line.
sep: the field separator character. 
quote: the set of quoting characters. To disable quoting altogether, use ‘quote 
= ""’
dec: the character used in the file for decimal points.
fill: If ‘TRUE’ then in case the rows have unequal length, blank fields are 
implicitly added.

_pandas_:
{code}
pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None, 
compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0, 
skipinitialspace=False, lineterminator=None, header='infer', index_col=None, 
names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, 
na_values=None, na_fvalues=None, true_values=None, false_values=None, 
delimiter=None, converters=None, dtype=None, usecols=None, engine=None, 
delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, 
use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, 
error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, 
decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, 
date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, 
verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, 
tupleize_cols=False, infer_datetime_format=False)
{code}
The description of fields can be found here: 
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html


was (Author: falaki):
As a point for comparison the interface in some other popular packages are:
_R_:
{code}
read.csv(filePath, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = 
TRUE, comment.char = "", ...)
{code}

Where:
header: a logical value indicating whether the file contains the names of the 
variables as its first line.
sep: the field separator character. 
quote: the set of quoting characters. To disable quoting altogether, use ‘quote 
= ""’
dec: the character used in the file for decimal points.
fill: If ‘TRUE’ then in case the rows have unequal length, blank fields are 
implicitly added.

__pandas__:
```
pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None, 
compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0, 
skipinitialspace=False, lineterminator=None, header='infer', index_col=None, 
names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, 
na_values=None, na_fvalues=None, true_values=None, false_values=None, 
delimiter=None, converters=None, dtype=None, usecols=None, engine=None, 
delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, 
use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, 
error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, 
decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, 
date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, 
verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, 
tupleize_cols=False, infer_datetime_format=False)
```
The description of fields can be found here: 
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html

> CSV import to SchemaRDDs
> 
>
> Key: SPARK-2360
> URL: https://issues.apache.org/jira/browse/SPARK-2360
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Priority: Minor
>
> I think the first step it to design the interface that we want to present to 
> users.  Mostly this is defining options when importing.  Off the top of my 
> head:
> - What is the separator?
> - Provide column names or infer them from the first row.
> - how to handle multiple files with possibly different schemas
> - do we have a method to let users specify the datatypes of the columns or 
> are they just strings?
> - what types of quoting / escaping do we want to support?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Comment Edited] (SPARK-2360) CSV import to SchemaRDDs

2014-07-07 Thread Hossein Falaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054262#comment-14054262
 ] 

Hossein Falaki edited comment on SPARK-2360 at 7/7/14 11:03 PM:


As a point for comparison the interface in some other popular packages are:
_R_:
{code}
read.csv(filePath, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = 
TRUE, comment.char = "", ...)
{code}

Where:
header: a logical value indicating whether the file contains the names of the 
variables as its first line.
sep: the field separator character. 
quote: the set of quoting characters. To disable quoting altogether, use ‘quote 
= ""’
dec: the character used in the file for decimal points.
fill: If ‘TRUE’ then in case the rows have unequal length, blank fields are 
implicitly added.

__pandas__:
```
pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None, 
compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0, 
skipinitialspace=False, lineterminator=None, header='infer', index_col=None, 
names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, 
na_values=None, na_fvalues=None, true_values=None, false_values=None, 
delimiter=None, converters=None, dtype=None, usecols=None, engine=None, 
delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, 
use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, 
error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, 
decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, 
date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, 
verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, 
tupleize_cols=False, infer_datetime_format=False)
```
The description of fields can be found here: 
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html


was (Author: falaki):
As a point for comparison the interface in some other popular packages are:
__R__:
```
read.csv(filePath, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = 
TRUE, comment.char = "", ...)
```
Where:
header: a logical value indicating whether the file contains the names of the 
variables as its first line.
sep: the field separator character. 
quote: the set of quoting characters. To disable quoting altogether, use ‘quote 
= ""’
dec: the character used in the file for decimal points.
fill: If ‘TRUE’ then in case the rows have unequal length, blank fields are 
implicitly added.

__pandas__:
```
pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None, 
compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0, 
skipinitialspace=False, lineterminator=None, header='infer', index_col=None, 
names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, 
na_values=None, na_fvalues=None, true_values=None, false_values=None, 
delimiter=None, converters=None, dtype=None, usecols=None, engine=None, 
delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, 
use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, 
error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, 
decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, 
date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, 
verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, 
tupleize_cols=False, infer_datetime_format=False)
```
The description of fields can be found here: 
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html

> CSV import to SchemaRDDs
> 
>
> Key: SPARK-2360
> URL: https://issues.apache.org/jira/browse/SPARK-2360
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Priority: Minor
>
> I think the first step it to design the interface that we want to present to 
> users.  Mostly this is defining options when importing.  Off the top of my 
> head:
> - What is the separator?
> - Provide column names or infer them from the first row.
> - how to handle multiple files with possibly different schemas
> - do we have a method to let users specify the datatypes of the columns or 
> are they just strings?
> - what types of quoting / escaping do we want to support?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2360) CSV import to SchemaRDDs

2014-07-07 Thread Hossein Falaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054262#comment-14054262
 ] 

Hossein Falaki commented on SPARK-2360:
---

As a point for comparison the interface in some other popular packages are:
__R__:
```
read.csv(filePath, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = 
TRUE, comment.char = "", ...)
```
Where:
header: a logical value indicating whether the file contains the names of the 
variables as its first line.
sep: the field separator character. 
quote: the set of quoting characters. To disable quoting altogether, use ‘quote 
= ""’
dec: the character used in the file for decimal points.
fill: If ‘TRUE’ then in case the rows have unequal length, blank fields are 
implicitly added.

__pandas__:
```
pandas.io.parsers.read_csv(filepath_or_buffer, sep=', ', dialect=None, 
compression=None, doublequote=True, escapechar=None, quotechar='"', quoting=0, 
skipinitialspace=False, lineterminator=None, header='infer', index_col=None, 
names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, 
na_values=None, na_fvalues=None, true_values=None, false_values=None, 
delimiter=None, converters=None, dtype=None, usecols=None, engine=None, 
delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, 
use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, 
error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, 
decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, 
date_parser=None, memory_map=False, nrows=None, iterator=False, chunksize=None, 
verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, 
tupleize_cols=False, infer_datetime_format=False)
```
The description of fields can be found here: 
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html

> CSV import to SchemaRDDs
> 
>
> Key: SPARK-2360
> URL: https://issues.apache.org/jira/browse/SPARK-2360
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Priority: Minor
>
> I think the first step it to design the interface that we want to present to 
> users.  Mostly this is defining options when importing.  Off the top of my 
> head:
> - What is the separator?
> - Provide column names or infer them from the first row.
> - how to handle multiple files with possibly different schemas
> - do we have a method to let users specify the datatypes of the columns or 
> are they just strings?
> - what types of quoting / escaping do we want to support?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2360) CSV import to SchemaRDDs

2014-07-07 Thread Michael Armbrust (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2360:


Description: 
I think the first step it to design the interface that we want to present to 
users.  Mostly this is defining options when importing.  Off the top of my head:
- What is the separator?
- Provide column names or infer them from the first row.
- how to handle multiple files with possibly different schemas
- do we have a method to let users specify the datatypes of the columns or are 
they just strings?
- what types of quoting / escaping do we want to support?

> CSV import to SchemaRDDs
> 
>
> Key: SPARK-2360
> URL: https://issues.apache.org/jira/browse/SPARK-2360
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Priority: Minor
>
> I think the first step it to design the interface that we want to present to 
> users.  Mostly this is defining options when importing.  Off the top of my 
> head:
> - What is the separator?
> - Provide column names or infer them from the first row.
> - how to handle multiple files with possibly different schemas
> - do we have a method to let users specify the datatypes of the columns or 
> are they just strings?
> - what types of quoting / escaping do we want to support?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Closed] (SPARK-2378) Implement functionality to read csv files

2014-07-07 Thread Michael Armbrust (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust closed SPARK-2378.
---

Resolution: Duplicate

> Implement functionality to read csv files
> -
>
> Key: SPARK-2378
> URL: https://issues.apache.org/jira/browse/SPARK-2378
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 1.0.0
>Reporter: Hossein Falaki
>Priority: Minor
> Fix For: 1.0.0
>
>
> Similar to jsonFile(), csvFile() could be used to read data into a SchemaRDD 
> (or a normal RDD).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Resolved] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer

2014-07-07 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-1977.
--

   Resolution: Fixed
Fix Version/s: 1.1.0
   1.0.1

Issue resolved by pull request 1319
[https://github.com/apache/spark/pull/1319]

> mutable.BitSet in ALS not serializable with KryoSerializer
> --
>
> Key: SPARK-1977
> URL: https://issues.apache.org/jira/browse/SPARK-1977
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Neville Li
>Priority: Minor
> Fix For: 1.0.1, 1.1.0
>
>
> OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member.
> KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't 
> register mutable.BitSet.
> Right now we have to register mutable.BitSet manually. A proper fix would be 
> using immutable.BitSet in ALS or register mutable.BitSet in upstream chill.
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID 
> 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: 
> com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: 
> scala.collection.mutable.HashSet
> Serialization trace:
> shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
> 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
> 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115)
> 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
> org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
> org.apache.spark.scheduler.Task.run(Task.scala:51)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> java.lang.Thread.run(Thread.java:662)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
>   at 
> org.apache.spark.scheduler.DAGSchedu

[jira] [Created] (SPARK-2394) Make it easier to read LZO-compressed files from EC2 clusters

2014-07-07 Thread Nicholas Chammas (JIRA)

Nicholas Chammas created SPARK-2394:
---

 Summary: Make it easier to read LZO-compressed files from EC2 
clusters
 Key: SPARK-2394
 URL: https://issues.apache.org/jira/browse/SPARK-2394
 Project: Spark
  Issue Type: Improvement
  Components: EC2, Input/Output
Affects Versions: 1.0.0
Reporter: Nicholas Chammas
Priority: Minor


Amazon hosts [a large Google n-grams data set on 
S3|https://aws.amazon.com/datasets/8172056142375670]. This data set is perfect, 
among other things, for putting together interesting and easily reproducible 
public demos of Spark's capabilities.

The problem is that the data set is compressed using LZO, and it is currently 
more painful than it should be to get your average {{spark-ec2}} cluster to 
read input compressed in this way.

This is what one has to go through to get a Spark cluster created with 
{{spark-ec2}} to read LZO-compressed files:
# Install the latest LZO release, perhaps via {{yum}}.
# Download [{{hadoop-lzo}}|https://github.com/twitter/hadoop-lzo] and build it. 
To build {{hadoop-lzo}} you need Maven. 
# Install Maven. For some reason, [you cannot install Maven with 
{{yum}}|http://stackoverflow.com/questions/7532928/how-do-i-install-maven-with-yum],
 so install it manually.
# Update your {{core-site.xml}} and {{spark-env.sh}} with [the appropriate 
configs|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3cca+-p3aga6f86qcsowp7k_r+8r-dgbmj3gz+4xljzjpr90db...@mail.gmail.com%3E].
# Make [the appropriate 
calls|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGSPeNE5miQRFHC7-ZwNbicaXfh1-ZXdKJ=saw_mgr...@mail.gmail.com%3E]
 to {{sc.newAPIHadoopFile}}.

This seems like a bit too much work for what we're trying to accomplish.

If we expect this to be a common pattern -- reading LZO-compressed files from a 
{{spark-ec2}} cluster -- it would be great if we could somehow make this less 
painful.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2384) Add tooltips for shuffle write and scheduler delay in UI

2014-07-07 Thread Kay Ousterhout (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054207#comment-14054207
 ] 

Kay Ousterhout commented on SPARK-2384:
---

https://github.com/apache/spark/pull/1314

> Add tooltips for shuffle write and scheduler delay in UI
> 
>
> Key: SPARK-2384
> URL: https://issues.apache.org/jira/browse/SPARK-2384
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
>
> There are a few common points of confusion in the UI that could be clarified 
> with tooltips.  We should add tooltips to explain the scheduler delay and the 
> shuffle data (to explain why shuffle read is typically < shuffle write, as 
> many many people have expressed confusion about).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2339) SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery

2014-07-07 Thread Yin Huai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-2339:


Component/s: SQL

> SQL parser in sql-core is case sensitive, but a table alias is converted to 
> lower case when we create Subquery
> --
>
> Key: SPARK-2339
> URL: https://issues.apache.org/jira/browse/SPARK-2339
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> Reported by 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-throws-exception-td8599.html
> After we get the table from the catalog, because the table has an alias, we 
> will temporarily insert a Subquery. Then, we convert the table alias to lower 
> case no matter if the parser is case sensitive or not.
> To see the issue ...
> {code}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.createSchemaRDD
> case class Person(name: String, age: Int)
> val people = 
> sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p 
> => Person(p(0), p(1).trim.toInt))
> people.registerAsTable("people")
> sqlContext.sql("select PEOPLE.name from people PEOPLE")
> {code}
> The plan is ...
> {code}
> == Query Plan ==
> Project ['PEOPLE.name]
>  ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at 
> basicOperators.scala:176
> {code}
> You can find that "PEOPLE.name" is not resolved.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2205) Unnecessary exchange operators in a join on multiple tables with the same join key.

2014-07-07 Thread Yin Huai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-2205:


Target Version/s: 1.1.0  (was: 1.0.1, 1.1.0)

> Unnecessary exchange operators in a join on multiple tables with the same 
> join key.
> ---
>
> Key: SPARK-2205
> URL: https://issues.apache.org/jira/browse/SPARK-2205
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>
> {code}
> hql("select * from src x join src y on (x.key=y.key) join src z on 
> (y.key=z.key)")
> SchemaRDD[1] at RDD at SchemaRDD.scala:100
> == Query Plan ==
> Project [key#4:0,value#5:1,key#6:2,value#7:3,key#8:4,value#9:5]
>  HashJoin [key#6], [key#8], BuildRight
>   Exchange (HashPartitioning [key#6], 200)
>HashJoin [key#4], [key#6], BuildRight
> Exchange (HashPartitioning [key#4], 200)
>  HiveTableScan [key#4,value#5], (MetastoreRelation default, src, 
> Some(x)), None
> Exchange (HashPartitioning [key#6], 200)
>  HiveTableScan [key#6,value#7], (MetastoreRelation default, src, 
> Some(y)), None
>   Exchange (HashPartitioning [key#8], 200)
>HiveTableScan [key#8,value#9], (MetastoreRelation default, src, Some(z)), 
> None
> {code}
> However, this is fine...
> {code}
> hql("select * from src x join src y on (x.key=y.key) join src z on 
> (x.key=z.key)")
> res5: org.apache.spark.sql.SchemaRDD = 
> SchemaRDD[5] at RDD at SchemaRDD.scala:100
> == Query Plan ==
> Project [key#26:0,value#27:1,key#28:2,value#29:3,key#30:4,value#31:5]
>  HashJoin [key#26], [key#30], BuildRight
>   HashJoin [key#26], [key#28], BuildRight
>Exchange (HashPartitioning [key#26], 200)
> HiveTableScan [key#26,value#27], (MetastoreRelation default, src, 
> Some(x)), None
>Exchange (HashPartitioning [key#28], 200)
> HiveTableScan [key#28,value#29], (MetastoreRelation default, src, 
> Some(y)), None
>   Exchange (HashPartitioning [key#30], 200)
>HiveTableScan [key#30,value#31], (MetastoreRelation default, src, 
> Some(z)), None
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2339) SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery

2014-07-07 Thread Yin Huai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-2339:


Fix Version/s: (was: 1.1.0)

> SQL parser in sql-core is case sensitive, but a table alias is converted to 
> lower case when we create Subquery
> --
>
> Key: SPARK-2339
> URL: https://issues.apache.org/jira/browse/SPARK-2339
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Yin Huai
>
> Reported by 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-throws-exception-td8599.html
> After we get the table from the catalog, because the table has an alias, we 
> will temporarily insert a Subquery. Then, we convert the table alias to lower 
> case no matter if the parser is case sensitive or not.
> To see the issue ...
> {code}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.createSchemaRDD
> case class Person(name: String, age: Int)
> val people = 
> sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p 
> => Person(p(0), p(1).trim.toInt))
> people.registerAsTable("people")
> sqlContext.sql("select PEOPLE.name from people PEOPLE")
> {code}
> The plan is ...
> {code}
> == Query Plan ==
> Project ['PEOPLE.name]
>  ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at 
> basicOperators.scala:176
> {code}
> You can find that "PEOPLE.name" is not resolved.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2339) SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery

2014-07-07 Thread Yin Huai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-2339:


Target Version/s: 1.1.0

> SQL parser in sql-core is case sensitive, but a table alias is converted to 
> lower case when we create Subquery
> --
>
> Key: SPARK-2339
> URL: https://issues.apache.org/jira/browse/SPARK-2339
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Yin Huai
>
> Reported by 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-throws-exception-td8599.html
> After we get the table from the catalog, because the table has an alias, we 
> will temporarily insert a Subquery. Then, we convert the table alias to lower 
> case no matter if the parser is case sensitive or not.
> To see the issue ...
> {code}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.createSchemaRDD
> case class Person(name: String, age: Int)
> val people = 
> sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p 
> => Person(p(0), p(1).trim.toInt))
> people.registerAsTable("people")
> sqlContext.sql("select PEOPLE.name from people PEOPLE")
> {code}
> The plan is ...
> {code}
> == Query Plan ==
> Project ['PEOPLE.name]
>  ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at 
> basicOperators.scala:176
> {code}
> You can find that "PEOPLE.name" is not resolved.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (SPARK-2393) Simple cost estimation and auto selection of broadcast join

2014-07-07 Thread Michael Armbrust (JIRA)

Michael Armbrust created SPARK-2393:
---

 Summary: Simple cost estimation and auto selection of broadcast 
join
 Key: SPARK-2393
 URL: https://issues.apache.org/jira/browse/SPARK-2393
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2376) Selecting list values inside nested JSON objects raises java.lang.IllegalArgumentException

2014-07-07 Thread Michael Armbrust (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2376:


Target Version/s: 1.1.0

> Selecting list values inside nested JSON objects raises 
> java.lang.IllegalArgumentException
> --
>
> Key: SPARK-2376
> URL: https://issues.apache.org/jira/browse/SPARK-2376
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.1
>Reporter: Nicholas Chammas
>
> Repro script for PySpark, deployed via {{spark-ec2}} at git revision 
> {{9d5ecf8205b924dc8a3c13fed68beb78cc5c7553}}:
> {code}
> from pyspark.sql import SQLContext
> sqlContext = SQLContext(sc)
> raw = sc.parallelize([
> """
> {
> "name": "Nick",
> "history": {
> "countries": []
> }
> }
> """
> ])
> profiles = sqlContext.jsonRDD(raw)
> profiles.registerAsTable("profiles")
> profiles.printSchema()
> sqlContext.sql("SELECT name FROM profiles;").collect() # works fine
> sqlContext.sql("SELECT history FROM profiles;").collect()  # raises exception
> {code}
> Attempting to select the top-level struct that has a nested list value yields 
> the following error:
> {code}
> 14/07/06 00:10:26 INFO scheduler.TaskSetManager: Loss was due to 
> net.razorvine.pickle.PickleException: couldn't introspect javabean: 
> java.lang.IllegalArgumentException: wrong number of arguments [duplicate 3]
> 14/07/06 00:10:26 ERROR scheduler.TaskSetManager: Task 26.0:15 failed 4 
> times; aborting job
> 14/07/06 00:10:26 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 26.0, 
> whose tasks have all completed, from pool 
> 14/07/06 00:10:26 INFO scheduler.TaskSchedulerImpl: Cancelling stage 26
> 14/07/06 00:10:26 INFO scheduler.DAGScheduler: Failed to run collect at 
> :1
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/root/spark/python/pyspark/rdd.py", line 649, in collect
> bytesInJava = self._jrdd.collect().iterator()
>   File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 
> 537, in __call__
>   File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 
> 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o286.collect.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 
> 26.0:15 failed 4 times, most recent failure: Exception failure in TID 394 on 
> host ip-10-183-59-125.ec2.internal: net.razorvine.pickle.PickleException: 
> couldn't introspect javabean: java.lang.IllegalArgumentException: wrong 
> number of arguments
> net.razorvine.pickle.Pickler.put_javabean(Pickler.java:603)
> net.razorvine.pickle.Pickler.dispatch(Pickler.java:299)
> net.razorvine.pickle.Pickler.save(Pickler.java:125)
> net.razorvine.pickle.Pickler.put_map(Pickler.java:322)
> net.razorvine.pickle.Pickler.dispatch(Pickler.java:286)
> net.razorvine.pickle.Pickler.save(Pickler.java:125)
> net.razorvine.pickle.Pickler.put_map(Pickler.java:322)
> net.razorvine.pickle.Pickler.dispatch(Pickler.java:286)
> net.razorvine.pickle.Pickler.save(Pickler.java:125)
> net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:392)
> net.razorvine.pickle.Pickler.dispatch(Pickler.java:195)
> net.razorvine.pickle.Pickler.save(Pickler.java:125)
> net.razorvine.pickle.Pickler.dump(Pickler.java:95)
> net.razorvine.pickle.Pickler.dumps(Pickler.java:80)
> 
> org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$3.apply(SchemaRDD.scala:385)
> 
> org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$3.apply(SchemaRDD.scala:385)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> 
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:317)
> 
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:203)
> 
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:178)
> 
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:178)
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1213)
> 
> org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:177)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1041)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1025)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1023)
>   at 
> scala.collect

[jira] [Commented] (SPARK-2017) web ui stage page becomes unresponsive when the number of tasks is large

2014-07-07 Thread Masayoshi TSUZUKI (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054178#comment-14054178
 ] 

Masayoshi TSUZUKI commented on SPARK-2017:
--

Pagination seems to be better because with aggregated metrics,
1. we can't identify the skew of tasks between the executors.
2. the same problem will appear again when many tasks fail in a certain stage.

In addition, when some errors or problems occur under the production 
environment, we would like to see the status of tasks near the time even if 
those tasks mostly succeeded. Although every status of tasks is written in the 
log file, web ui is very useful in operation phase.

> web ui stage page becomes unresponsive when the number of tasks is large
> 
>
> Key: SPARK-2017
> URL: https://issues.apache.org/jira/browse/SPARK-2017
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Reynold Xin
>  Labels: starter
>
> {code}
> sc.parallelize(1 to 100, 100).count()
> {code}
> The above code creates one million tasks to be executed. The stage detail web 
> ui page takes forever to load (if it ever completes).
> There are again a few different alternatives:
> 0. Limit the number of tasks we show.
> 1. Pagination
> 2. By default only show the aggregate metrics and failed tasks, and hide the 
> successful ones.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2375) JSON schema inference may not resolve type conflicts correctly for a field inside an array of structs

2014-07-07 Thread Michael Armbrust (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2375:


Target Version/s: 1.1.0
   Fix Version/s: (was: 1.1.0)

> JSON schema inference may not resolve type conflicts correctly for a field 
> inside an array of structs
> -
>
> Key: SPARK-2375
> URL: https://issues.apache.org/jira/browse/SPARK-2375
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.1
>Reporter: Yin Huai
>
> For example, for
> {code}
> {"array": [{"field":214748364700}, {"field":1}]}
> {code}
> the type of field is resolved as IntType. While, for
> {code}
> {"array": [{"field":1}, {"field":214748364700}]}
> {code}
> the type of field is resolved as LongType.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2205) Unnecessary exchange operators in a join on multiple tables with the same join key.

2014-07-07 Thread Michael Armbrust (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2205:


Priority: Minor  (was: Major)

> Unnecessary exchange operators in a join on multiple tables with the same 
> join key.
> ---
>
> Key: SPARK-2205
> URL: https://issues.apache.org/jira/browse/SPARK-2205
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>
> {code}
> hql("select * from src x join src y on (x.key=y.key) join src z on 
> (y.key=z.key)")
> SchemaRDD[1] at RDD at SchemaRDD.scala:100
> == Query Plan ==
> Project [key#4:0,value#5:1,key#6:2,value#7:3,key#8:4,value#9:5]
>  HashJoin [key#6], [key#8], BuildRight
>   Exchange (HashPartitioning [key#6], 200)
>HashJoin [key#4], [key#6], BuildRight
> Exchange (HashPartitioning [key#4], 200)
>  HiveTableScan [key#4,value#5], (MetastoreRelation default, src, 
> Some(x)), None
> Exchange (HashPartitioning [key#6], 200)
>  HiveTableScan [key#6,value#7], (MetastoreRelation default, src, 
> Some(y)), None
>   Exchange (HashPartitioning [key#8], 200)
>HiveTableScan [key#8,value#9], (MetastoreRelation default, src, Some(z)), 
> None
> {code}
> However, this is fine...
> {code}
> hql("select * from src x join src y on (x.key=y.key) join src z on 
> (x.key=z.key)")
> res5: org.apache.spark.sql.SchemaRDD = 
> SchemaRDD[5] at RDD at SchemaRDD.scala:100
> == Query Plan ==
> Project [key#26:0,value#27:1,key#28:2,value#29:3,key#30:4,value#31:5]
>  HashJoin [key#26], [key#30], BuildRight
>   HashJoin [key#26], [key#28], BuildRight
>Exchange (HashPartitioning [key#26], 200)
> HiveTableScan [key#26,value#27], (MetastoreRelation default, src, 
> Some(x)), None
>Exchange (HashPartitioning [key#28], 200)
> HiveTableScan [key#28,value#29], (MetastoreRelation default, src, 
> Some(y)), None
>   Exchange (HashPartitioning [key#30], 200)
>HiveTableScan [key#30,value#31], (MetastoreRelation default, src, 
> Some(z)), None
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2190) Specialized ColumnType for Timestamp

2014-07-07 Thread Michael Armbrust (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2190:


Priority: Critical  (was: Major)

> Specialized ColumnType for Timestamp
> 
>
> Key: SPARK-2190
> URL: https://issues.apache.org/jira/browse/SPARK-2190
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Michael Armbrust
>Assignee: Cheng Lian
>Priority: Critical
>
> I'm going to call this a bug since currently its like 300X slower than it 
> needs to  be.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2087) Support SQLConf per session

2014-07-07 Thread Michael Armbrust (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2087:


Priority: Minor  (was: Major)

> Support SQLConf per session
> ---
>
> Key: SPARK-2087
> URL: https://issues.apache.org/jira/browse/SPARK-2087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Zongheng Yang
>Priority: Minor
>
> For things like the SharkServer we should support configuration per thread 
> instead of globally for a context.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2097) UDF Support

2014-07-07 Thread Michael Armbrust (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2097:


Priority: Critical  (was: Major)

> UDF Support
> ---
>
> Key: SPARK-2097
> URL: https://issues.apache.org/jira/browse/SPARK-2097
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Critical
>
> Right now we only support UDFs that are written against the Hive API or are 
> called directly as expressions in the DSL.  It would be nice to have native 
> support for registering scala/python functions as well.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2010) Support for nested data in PySpark SQL

2014-07-07 Thread Michael Armbrust (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2010:


Priority: Critical  (was: Major)

> Support for nested data in PySpark SQL
> --
>
> Key: SPARK-2010
> URL: https://issues.apache.org/jira/browse/SPARK-2010
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Kan Zhang
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2054) Code Generation for Expression Evaluation

2014-07-07 Thread Michael Armbrust (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2054:


Priority: Critical  (was: Major)

> Code Generation for Expression Evaluation
> -
>
> Key: SPARK-2054
> URL: https://issues.apache.org/jira/browse/SPARK-2054
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2391) LIMIT queries ship a whole partition of data

2014-07-07 Thread Michael Armbrust (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2391:


Priority: Blocker  (was: Major)

> LIMIT queries ship a whole partition of data
> 
>
> Key: SPARK-2391
> URL: https://issues.apache.org/jira/browse/SPARK-2391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
>
> Basically the problem here is that Spark's take() runs jobs using allowLocal 
> = true.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2179) Public API for DataTypes and Schema

2014-07-07 Thread Michael Armbrust (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2179:


Priority: Critical  (was: Major)

> Public API for DataTypes and Schema
> ---
>
> Key: SPARK-2179
> URL: https://issues.apache.org/jira/browse/SPARK-2179
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Yin Huai
>Priority: Critical
>
> We want something like the following:
>  * Expose DataType in the SQL package and lock down all the internal details 
> (TypeTags, etc)
>  * Programatic API for viewing the schema of an RDD as a StructType
>  * Method that creates a schema RDD given (RDD[A], StructType, A => Row)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2212) Hash Outer Joins

2014-07-07 Thread Michael Armbrust (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2212:


Priority: Minor  (was: Major)

> Hash Outer Joins
> 
>
> Key: SPARK-2212
> URL: https://issues.apache.org/jira/browse/SPARK-2212
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2390) Files in staging directory cannot be deleted and wastes the space of HDFS

2014-07-07 Thread Mridul Muralidharan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054162#comment-14054162
 ] 

Mridul Muralidharan commented on SPARK-2390:


Here, and a bunch of other places, spark currently closes the Filesystem 
instance : this is incorrect, and should not be done.
The fix would be to remove the fs.close; not force creation of new instances.

> Files in staging directory cannot be deleted and wastes the space of HDFS
> -
>
> Key: SPARK-2390
> URL: https://issues.apache.org/jira/browse/SPARK-2390
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Kousuke Saruta
>
> When running jobs with YARN Cluster mode and using HistoryServer, the files 
> in the Staging Directory cannot be deleted.
> HistoryServer uses directory where event log is written, and the directory is 
> represented as a instance of o.a.h.f.FileSystem created by using 
> FileSystem.get.
> {code:title=FileLogger.scala}
> private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir))
> {code}
> {code:title=utils.getHadoopFileSystem}
> def getHadoopFileSystem(path: URI): FileSystem = {
>   FileSystem.get(path, SparkHadoopUtil.get.newConfiguration())
> }
> {code}
> On the other hand, ApplicationMaster has a instance named fs, which also 
> created by using FileSystem.get.
> {code:title=ApplicationMaster}
> private val fs = FileSystem.get(yarnConf)
> {code}
> FileSystem.get returns cached same instance when URI passed to the method 
> represents same file system and the method is called by same user.
> Because of the behavior, when the directory for event log is on HDFS, fs of 
> ApplicationMaster and fileSystem of FileLogger is same instance.
> When shutting down ApplicationMaster, fileSystem.close is called in 
> FileLogger#stop, which is invoked by SparkContext#stop indirectly.
> {code:title=FileLogger.stop}
> def stop() {
>   hadoopDataStream.foreach(_.close())
>   writer.foreach(_.close())
>   fileSystem.close()
> }
> {code}
> And  ApplicationMaster#cleanupStagingDir also called by JVM shutdown hook. In 
> this method, fs.delete(stagingDirPath) is invoked. 
> Because fs.delete in ApplicationMaster is called after fileSystem.close in 
> FileLogger, fs.delete fails and results not deleting files in the staging 
> directory.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (SPARK-2392) Executors should not start an HTTP server

2014-07-07 Thread Andrew Or (JIRA)

Andrew Or created SPARK-2392:


 Summary: Executors should not start an HTTP server
 Key: SPARK-2392
 URL: https://issues.apache.org/jira/browse/SPARK-2392
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Or


In the long run we should separate out classes used by the driver vs executors  
in SparkEnv. For now, we should at least not start an unused HTTP on every 
executor.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2392) Executors should not start their own HTTP servers

2014-07-07 Thread Andrew Or (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-2392:
-

Summary: Executors should not start their own HTTP servers  (was: Executors 
should not start an HTTP server)

> Executors should not start their own HTTP servers
> -
>
> Key: SPARK-2392
> URL: https://issues.apache.org/jira/browse/SPARK-2392
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>
> In the long run we should separate out classes used by the driver vs 
> executors  in SparkEnv. For now, we should at least not start an unused HTTP 
> on every executor.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer

2014-07-07 Thread Neville Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054043#comment-14054043
 ] 

Neville Li commented on SPARK-1977:
---

There you go:
https://github.com/apache/spark/pull/1319

> mutable.BitSet in ALS not serializable with KryoSerializer
> --
>
> Key: SPARK-1977
> URL: https://issues.apache.org/jira/browse/SPARK-1977
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Neville Li
>Priority: Minor
>
> OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member.
> KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't 
> register mutable.BitSet.
> Right now we have to register mutable.BitSet manually. A proper fix would be 
> using immutable.BitSet in ALS or register mutable.BitSet in upstream chill.
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID 
> 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: 
> com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: 
> scala.collection.mutable.HashSet
> Serialization trace:
> shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
> 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
> 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115)
> 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
> org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
> org.apache.spark.scheduler.Task.run(Task.scala:51)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> java.lang.Thread.run(Thread.java:662)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
>   at scala.Option.

[jira] [Commented] (SPARK-2390) Files in staging directory cannot be deleted and wastes the space of HDFS

2014-07-07 Thread Kousuke Saruta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054041#comment-14054041
 ] 

Kousuke Saruta commented on SPARK-2390:
---

I modified the way to create a instance of FileSystem in FileLogger as follows, 
the symptom didn't appear.

{code}
private val fileSystem = FileSystem.newInstance(new URI(logDir), 
SparkHadoopUtil.get.newConfiguration())  
{code}



> Files in staging directory cannot be deleted and wastes the space of HDFS
> -
>
> Key: SPARK-2390
> URL: https://issues.apache.org/jira/browse/SPARK-2390
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Kousuke Saruta
>
> When running jobs with YARN Cluster mode and using HistoryServer, the files 
> in the Staging Directory cannot be deleted.
> HistoryServer uses directory where event log is written, and the directory is 
> represented as a instance of o.a.h.f.FileSystem created by using 
> FileSystem.get.
> {code:title=FileLogger.scala}
> private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir))
> {code}
> {code:title=utils.getHadoopFileSystem}
> def getHadoopFileSystem(path: URI): FileSystem = {
>   FileSystem.get(path, SparkHadoopUtil.get.newConfiguration())
> }
> {code}
> On the other hand, ApplicationMaster has a instance named fs, which also 
> created by using FileSystem.get.
> {code:title=ApplicationMaster}
> private val fs = FileSystem.get(yarnConf)
> {code}
> FileSystem.get returns cached same instance when URI passed to the method 
> represents same file system and the method is called by same user.
> Because of the behavior, when the directory for event log is on HDFS, fs of 
> ApplicationMaster and fileSystem of FileLogger is same instance.
> When shutting down ApplicationMaster, fileSystem.close is called in 
> FileLogger#stop, which is invoked by SparkContext#stop indirectly.
> {code:title=FileLogger.stop}
> def stop() {
>   hadoopDataStream.foreach(_.close())
>   writer.foreach(_.close())
>   fileSystem.close()
> }
> {code}
> And  ApplicationMaster#cleanupStagingDir also called by JVM shutdown hook. In 
> this method, fs.delete(stagingDirPath) is invoked. 
> Because fs.delete in ApplicationMaster is called after fileSystem.close in 
> FileLogger, fs.delete fails and results not deleting files in the staging 
> directory.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2390) Files in staging directory cannot be deleted and wastes the space of HDFS

2014-07-07 Thread Kousuke Saruta (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-2390:
--

Affects Version/s: 1.0.1

> Files in staging directory cannot be deleted and wastes the space of HDFS
> -
>
> Key: SPARK-2390
> URL: https://issues.apache.org/jira/browse/SPARK-2390
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Kousuke Saruta
>
> When running jobs with YARN Cluster mode and using HistoryServer, the files 
> in the Staging Directory cannot be deleted.
> HistoryServer uses directory where event log is written, and the directory is 
> represented as a instance of o.a.h.f.FileSystem created by using 
> FileSystem.get.
> {code:title=FileLogger.scala}
> private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir))
> {code}
> {code:title=utils.getHadoopFileSystem}
> def getHadoopFileSystem(path: URI): FileSystem = {
>   FileSystem.get(path, SparkHadoopUtil.get.newConfiguration())
> }
> {code}
> On the other hand, ApplicationMaster has a instance named fs, which also 
> created by using FileSystem.get.
> {code:title=ApplicationMaster}
> private val fs = FileSystem.get(yarnConf)
> {code}
> FileSystem.get returns cached same instance when URI passed to the method 
> represents same file system and the method is called by same user.
> Because of the behavior, when the directory for event log is on HDFS, fs of 
> ApplicationMaster and fileSystem of FileLogger is same instance.
> When shutting down ApplicationMaster, fileSystem.close is called in 
> FileLogger#stop, which is invoked by SparkContext#stop indirectly.
> {code:title=FileLogger.stop}
> def stop() {
>   hadoopDataStream.foreach(_.close())
>   writer.foreach(_.close())
>   fileSystem.close()
> }
> {code}
> And  ApplicationMaster#cleanupStagingDir also called by JVM shutdown hook. In 
> this method, fs.delete(stagingDirPath) is invoked. 
> Because fs.delete in ApplicationMaster is called after fileSystem.close in 
> FileLogger, fs.delete fails and results not deleting files in the staging 
> directory.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2311) Added additional GLMs (Poisson and Gamma) into MLlib

2014-07-07 Thread Xiaokai Wei (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaokai Wei updated SPARK-2311:
---

Priority: Major  (was: Minor)

> Added additional GLMs (Poisson and Gamma) into MLlib
> 
>
> Key: SPARK-2311
> URL: https://issues.apache.org/jira/browse/SPARK-2311
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiaokai Wei
>
> Though GeneralizedLinearModel in MLlib 1.0.0 has some important GLMs such as 
> Logistic Regression, Linear Regression, some other important GLMs like 
> Poisson Regression are still missing. 
> Poisson Regression and Gamma Regression are two widely used models in 
> industry with many applications. This patch added Poisson and Gamma 
> Regression as additional GeneralizedLinearModels.
> https://github.com/apache/spark/pull/1237



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer

2014-07-07 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054003#comment-14054003
 ] 

Xiangrui Meng commented on SPARK-1977:
--

Do you mind creating a PR registering mutable.BitSet in MovieLensALS.scala and 
close PR #925? Thanks!

> mutable.BitSet in ALS not serializable with KryoSerializer
> --
>
> Key: SPARK-1977
> URL: https://issues.apache.org/jira/browse/SPARK-1977
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Neville Li
>Priority: Minor
>
> OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member.
> KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't 
> register mutable.BitSet.
> Right now we have to register mutable.BitSet manually. A proper fix would be 
> using immutable.BitSet in ALS or register mutable.BitSet in upstream chill.
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID 
> 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: 
> com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: 
> scala.collection.mutable.HashSet
> Serialization trace:
> shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
> 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
> 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115)
> 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
> org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
> org.apache.spark.scheduler.Task.run(Task.scala:51)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> java.lang.Thread.run(Thread.java:662)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.a

[jira] [Commented] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer

2014-07-07 Thread Neville Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053999#comment-14053999
 ] 

Neville Li commented on SPARK-1977:
---

[~mengxr] sounds good to me.

> mutable.BitSet in ALS not serializable with KryoSerializer
> --
>
> Key: SPARK-1977
> URL: https://issues.apache.org/jira/browse/SPARK-1977
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Neville Li
>Priority: Minor
>
> OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member.
> KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't 
> register mutable.BitSet.
> Right now we have to register mutable.BitSet manually. A proper fix would be 
> using immutable.BitSet in ALS or register mutable.BitSet in upstream chill.
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID 
> 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: 
> com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: 
> scala.collection.mutable.HashSet
> Serialization trace:
> shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
> 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
> 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115)
> 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
> org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
> org.apache.spark.scheduler.Task.run(Task.scala:51)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> java.lang.Thread.run(Thread.java:662)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
>   at scala.Option.foreach(Option.scala:236)
>

[jira] [Updated] (SPARK-2386) RowWriteSupport should use the exact types to cast.

2014-07-07 Thread Michael Armbrust (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2386:


Target Version/s: 1.1.0

> RowWriteSupport should use the exact types to cast.
> ---
>
> Key: SPARK-2386
> URL: https://issues.apache.org/jira/browse/SPARK-2386
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>
> When execute {{saveAsParquetFile}} with non-primitive type, 
> {{RowWriteSupport}} uses wrong type {{Int}} for {{ByteType}} and 
> {{ShortType}}.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-1977) mutable.BitSet in ALS not serializable with KryoSerializer

2014-07-07 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-1977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053994#comment-14053994
 ] 

Xiangrui Meng commented on SPARK-1977:
--

I think now I understand when it happens. We use storage level MEMORY_AND_DISK 
for user/product in/out links, which contains BitSet objects. If the dataset is 
large, these RDDs will be pushed from in memory storage to on disk storage, 
where the latter requires serialization. So the easiest way to re-produce this 
error is changing the storage level of inLinks/outLinks to DISK_ONLY and run 
with kryo.

[~neville] Instead of mapping mutable.BitSet to immutable.BitSet, which 
introduces overhead, we can register mutable.BitSet in our MovieLensALS example 
code and wait for the next Chill release. Does it sound good to you?

> mutable.BitSet in ALS not serializable with KryoSerializer
> --
>
> Key: SPARK-1977
> URL: https://issues.apache.org/jira/browse/SPARK-1977
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Neville Li
>Priority: Minor
>
> OutLinkBlock in ALS.scala has an Array[mutable.BitSet] member.
> KryoSerializer uses AllScalaRegistrar from Twitter chill but it doesn't 
> register mutable.BitSet.
> Right now we have to register mutable.BitSet manually. A proper fix would be 
> using immutable.BitSet in ALS or register mutable.BitSet in upstream chill.
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1724.0:9 failed 4 times, most recent failure: Exception failure in TID 
> 68548 on host lon4-hadoopslave-b232.lon4.spotify.net: 
> com.esotericsoftware.kryo.KryoException: java.lang.ArrayStoreException: 
> scala.collection.mutable.HashSet
> Serialization trace:
> shouldSend (org.apache.spark.mllib.recommendation.OutLinkBlock)
> 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
> 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:43)
> com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:34)
> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
> 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:115)
> 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
> org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:155)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
> org.apache.spark.scheduler.Task.run(Task.scala:51)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> java.lang.Thread.run(Thread.java:662)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
>   at 
> org.apache.spark.scheduler.

[jira] [Updated] (SPARK-2391) LIMIT queries ship a whole partition of data

2014-07-07 Thread Michael Armbrust (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2391:


Target Version/s: 1.1.0
   Fix Version/s: (was: 1.1.0)

> LIMIT queries ship a whole partition of data
> 
>
> Key: SPARK-2391
> URL: https://issues.apache.org/jira/browse/SPARK-2391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>
> Basically the problem here is that Spark's take() runs jobs using allowLocal 
> = true.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (SPARK-2391) LIMIT queries ship a whole partition of data

2014-07-07 Thread Michael Armbrust (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2391:


Fix Version/s: 1.1.0

> LIMIT queries ship a whole partition of data
> 
>
> Key: SPARK-2391
> URL: https://issues.apache.org/jira/browse/SPARK-2391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>
> Basically the problem here is that Spark's take() runs jobs using allowLocal 
> = true.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (SPARK-2391) LIMIT queries ship a whole partition of data

2014-07-07 Thread Michael Armbrust (JIRA)

Michael Armbrust created SPARK-2391:
---

 Summary: LIMIT queries ship a whole partition of data
 Key: SPARK-2391
 URL: https://issues.apache.org/jira/browse/SPARK-2391
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Michael Armbrust
Assignee: Michael Armbrust


Basically the problem here is that Spark's take() runs jobs using allowLocal = 
true.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (SPARK-2390) Files in staging directory cannot be deleted and wastes the space of HDFS

2014-07-07 Thread Kousuke Saruta (JIRA)

Kousuke Saruta created SPARK-2390:
-

 Summary: Files in staging directory cannot be deleted and wastes 
the space of HDFS
 Key: SPARK-2390
 URL: https://issues.apache.org/jira/browse/SPARK-2390
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Kousuke Saruta


When running jobs with YARN Cluster mode and using HistoryServer, the files in 
the Staging Directory cannot be deleted.

HistoryServer uses directory where event log is written, and the directory is 
represented as a instance of o.a.h.f.FileSystem created by using FileSystem.get.

{code:title=FileLogger.scala}
private val fileSystem = Utils.getHadoopFileSystem(new URI(logDir))
{code}
{code:title=utils.getHadoopFileSystem}
def getHadoopFileSystem(path: URI): FileSystem = {
  FileSystem.get(path, SparkHadoopUtil.get.newConfiguration())
}
{code}

On the other hand, ApplicationMaster has a instance named fs, which also 
created by using FileSystem.get.

{code:title=ApplicationMaster}
private val fs = FileSystem.get(yarnConf)
{code}

FileSystem.get returns cached same instance when URI passed to the method 
represents same file system and the method is called by same user.

Because of the behavior, when the directory for event log is on HDFS, fs of 
ApplicationMaster and fileSystem of FileLogger is same instance.

When shutting down ApplicationMaster, fileSystem.close is called in 
FileLogger#stop, which is invoked by SparkContext#stop indirectly.
{code:title=FileLogger.stop}
def stop() {
  hadoopDataStream.foreach(_.close())
  writer.foreach(_.close())
  fileSystem.close()
}
{code}

And  ApplicationMaster#cleanupStagingDir also called by JVM shutdown hook. In 
this method, fs.delete(stagingDirPath) is invoked. 

Because fs.delete in ApplicationMaster is called after fileSystem.close in 
FileLogger, fs.delete fails and results not deleting files in the staging 
directory.




--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2152) the error of comput rightNodeAgg about Decision tree algorithm in Spark MLlib

2014-07-07 Thread Jon Sondag (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053936#comment-14053936
 ] 

Jon Sondag commented on SPARK-2152:
---

https://github.com/apache/spark/pull/1316 (also resolves SPARK-2160)

> the error of comput rightNodeAgg about  Decision tree algorithm  in Spark 
> MLlib 
> 
>
> Key: SPARK-2152
> URL: https://issues.apache.org/jira/browse/SPARK-2152
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.0
> Environment: windows7 ,32 operator,and 3G mem
>Reporter: caoli
>  Labels: features
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
>  the error of comput rightNodeAgg about  Decision tree algorithm  in Spark 
> MLlib  about  the function extractLeftRightNodeAggregates() ,when compute 
> rightNodeAgg  used bindata index is error. in the DecisionTree.scala file 
> about  Line 980:
>  rightNodeAgg(featureIndex)(2 * (numBins - 2 - splitIndex)) =
> binData(shift + (2 * (numBins - 2 - splitIndex))) +
>   rightNodeAgg(featureIndex)(2 * (numBins - 1 - splitIndex))  
>   
>  the   binData(shift + (2 * (numBins - 2 - splitIndex)))  index compute is 
> error, so the result of rightNodeAgg  include  repeated data about "bins"  



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2160) error of Decision tree algorithm in Spark MLlib

2014-07-07 Thread Jon Sondag (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053938#comment-14053938
 ] 

Jon Sondag commented on SPARK-2160:
---

https://github.com/apache/spark/pull/1316 (also resolves SPARK-2152)

> error of  Decision tree algorithm  in Spark MLlib 
> --
>
> Key: SPARK-2160
> URL: https://issues.apache.org/jira/browse/SPARK-2160
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: caoli
>  Labels: patch
> Fix For: 1.1.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> the error of comput rightNodeAgg about  Decision tree algorithm  in Spark 
> MLlib  , in the function extractLeftRightNodeAggregates() ,when compute 
> rightNodeAgg  used bindata index is error. in the DecisionTree.scala file 
> about  Line980:
>  rightNodeAgg(featureIndex)(2 * (numBins - 2 - splitIndex)) =
> binData(shift + (2 * (numBins - 2 - splitIndex))) +
>   rightNodeAgg(featureIndex)(2 * (numBins - 1 - splitIndex))  
>   
>  the   binData(shift + (2 * (numBins - 2 - splitIndex)))  index compute is 
> error, so the result of rightNodeAgg  include  repeated data about "bins"  



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2384) Add tooltips for shuffle write and scheduler delay in UI

2014-07-07 Thread Sandy Ryza (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053834#comment-14053834
 ] 

Sandy Ryza commented on SPARK-2384:
---

This is a great idea

> Add tooltips for shuffle write and scheduler delay in UI
> 
>
> Key: SPARK-2384
> URL: https://issues.apache.org/jira/browse/SPARK-2384
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
>
> There are a few common points of confusion in the UI that could be clarified 
> with tooltips.  We should add tooltips to explain the scheduler delay and the 
> shuffle data (to explain why shuffle read is typically < shuffle write, as 
> many many people have expressed confusion about).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (SPARK-2389) globally shared SparkContext / shared Spark "application"

2014-07-07 Thread Robert Stupp (JIRA)

Robert Stupp created SPARK-2389:
---

 Summary: globally shared SparkContext / shared Spark "application"
 Key: SPARK-2389
 URL: https://issues.apache.org/jira/browse/SPARK-2389
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Robert Stupp


The documentation (in Cluster Mode Overview) cites:

bq. Each application gets its own executor processes, which *stay up for the 
duration of the whole application* and run tasks in multiple threads. This has 
the benefit of isolating applications from each other, on both the scheduling 
side (each driver schedules its own tasks) and executor side (tasks from 
different applications run in different JVMs). However, it also means that 
*data cannot be shared* across different Spark applications (instances of 
SparkContext) without writing it to an external storage system.

IMO this is a limitation that should be lifted to support any number of 
--driver-- client processes to share executors and to share (persistent / 
cached) data.

This is especially useful if you have a bunch of frontend servers (dump web app 
servers) that want to use Spark as a _big computing machine_. Most important is 
the fact that Spark is quite good in caching/persisting data in memory / on 
disk thus removing load from backend data stores.

Means: it would be really great to let different --driver-- client JVMs operate 
on the same RDDs and benefit from Spark's caching/persistence.

It would however introduce some administration mechanisms to
* start a shared context
* update the executor configuration (# of worker nodes, # of cpus, etc) on the 
fly
* stop a shared context

Even "conventional" batch MR applications would benefit if ran fequently 
against the same data set.

As an implicit requirement, RDD persistence could get a TTL for its 
materialized state.

With such a feature the overall performance of today's web applications could 
then be increased by adding more web app servers, more spark nodes, more nosql 
nodes etc




--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (SPARK-2388) Streaming from multiple different Kafka topics is problematic

2014-07-07 Thread Sergey (JIRA)

Sergey created SPARK-2388:
-

 Summary: Streaming from multiple different Kafka topics is 
problematic
 Key: SPARK-2388
 URL: https://issues.apache.org/jira/browse/SPARK-2388
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.0
Reporter: Sergey
 Fix For: 1.0.1


Default way of creating stream out of Kafka source would be as

val stream = KafkaUtils.createStream(ssc,"localhost:2181","logs", 
Map("retarget" -> 2,"datapair" -> 2))

However, if two topics - in this case "retarget" and "datapair" - are very 
different, there is no way to set up different filter, mapping functions, etc), 
as they are effectively merged.

However, instance of KafkaInputDStream, created with this call internally calls 
ConsumerConnector.createMessageStream() which returns *map* of KafkaStreams, 
keyed by topic. It would be great if this map would be exposed somehow, so 
aforementioned call 

val streamS = KafkaUtils.createStreamS(...)

returned map of streams.

Regards,
Sergey Malov
Collective Media



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (SPARK-2387) Remove the stage barrier for better resource utilization

2014-07-07 Thread Rui Li (JIRA)

Rui Li created SPARK-2387:
-

 Summary: Remove the stage barrier for better resource utilization
 Key: SPARK-2387
 URL: https://issues.apache.org/jira/browse/SPARK-2387
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Rui Li


DAGScheduler divides a Spark job into multiple stages according to RDD 
dependencies. Whenever there’s a shuffle dependency, DAGScheduler creates a 
shuffle map stage on the map side, and another stage depending on that stage.
Currently, the downstream stage cannot start until all its depended stages have 
finished. This barrier between stages leads to idle slots when waiting for the 
last few upstream tasks to finish and thus wasting cluster resources.
Therefore we propose to remove the barrier and pre-start the reduce stage once 
there're free slots. This can achieve better resource utilization and improve 
the overall job performance, especially when there're lots of executors granted 
to the application.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2386) RowWriteSupport should use the exact types to cast.

2014-07-07 Thread Takuya Ueshin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053537#comment-14053537
 ] 

Takuya Ueshin commented on SPARK-2386:
--

PRed: https://github.com/apache/spark/pull/1315

> RowWriteSupport should use the exact types to cast.
> ---
>
> Key: SPARK-2386
> URL: https://issues.apache.org/jira/browse/SPARK-2386
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>
> When execute {{saveAsParquetFile}} with non-primitive type, 
> {{RowWriteSupport}} uses wrong type {{Int}} for {{ByteType}} and 
> {{ShortType}}.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (SPARK-2386) RowWriteSupport should use the exact types to cast.

2014-07-07 Thread Takuya Ueshin (JIRA)

Takuya Ueshin created SPARK-2386:


 Summary: RowWriteSupport should use the exact types to cast.
 Key: SPARK-2386
 URL: https://issues.apache.org/jira/browse/SPARK-2386
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin


When execute {{saveAsParquetFile}} with non-primitive type, {{RowWriteSupport}} 
uses wrong type {{Int}} for {{ByteType}} and {{ShortType}}.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (SPARK-2385) Missing guide for running JDBC server on YARN

2014-07-07 Thread Yi Tian (JIRA)

Yi Tian created SPARK-2385:
--

 Summary: Missing guide for running JDBC server on YARN
 Key: SPARK-2385
 URL: https://issues.apache.org/jira/browse/SPARK-2385
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.0.0
Reporter: Yi Tian
Priority: Minor


There are no document for "running JDBC server on YARN"



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2382) build error:

2014-07-07 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14053433#comment-14053433
 ] 

Sean Owen commented on SPARK-2382:
--

I can't reproduce this. It is almost certainly a problem with your environment 
or network, possibly temporary. It's not a Spark issue so I think this should 
be closed.

> build error: 
> -
>
> Key: SPARK-2382
> URL: https://issues.apache.org/jira/browse/SPARK-2382
> Project: Spark
>  Issue Type: Question
>  Components: Build
>Affects Versions: 1.0.0
> Environment: Ubuntu 12.0.4 precise. 
> spark@ubuntu-cdh5-spark:~/spark-1.0.0$ mvn -version
> Apache Maven 3.0.4
> Maven home: /usr/share/maven
> Java version: 1.6.0_31, vendor: Sun Microsystems Inc.
> Java home: /usr/lib/jvm/j2sdk1.6-oracle/jre
> Default locale: en_US, platform encoding: UTF-8
> OS name: "linux", version: "3.11.0-15-generic", arch: "amd64", family: "unix"
>Reporter: Mukul Jain
>  Labels: newbie
>
> Unable to build. maven can't download dependency .. checked my http_proxy and 
> https_proxy setting they are working fine. Other http and https dependencies 
> were downloaded fine.. build process gets stuck always at this repo. manually 
> down loading also fails and receive an exception. 
> [INFO] 
> 
> [INFO] Building Spark Project External MQTT 1.0.0
> [INFO] 
> 
> Downloading: 
> https://repository.apache.org/content/repositories/releases/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom
> Jul 6, 2014 4:53:26 PM org.apache.commons.httpclient.HttpMethodDirector 
> executeWithRetry
> INFO: I/O exception (java.net.ConnectException) caught when processing 
> request: Connection timed out
> Jul 6, 2014 4:53:26 PM org.apache.commons.httpclient.HttpMethodDirector 
> executeWithRetry
> INFO: Retrying request



--
This message was sent by Atlassian JIRA
(v6.2#6252)

77 matches

Mail list logo