Re: Return binary mode in ThriftServer

2016-06-13 Thread Reynold Xin
Thanks for the email. Things like this (and bugs) are exactly the reason
the preview releases exist. It seems like enough people have run into
problems with this one that maybe we should just bring it back for backward
compatibility.

On Monday, June 13, 2016, Egor Pahomov  wrote:

> In May, due to SPARK-15095, binary mode was "removed" (the code is there,
> but you cannot turn it on) from Spark 2.0. In 1.6.1 binary was the default,
> and in 2.0.0-preview it was removed. It's really annoying:
>
>    - I cannot use Tableau+Spark anymore.
>    - I need to change the connection URL in the SQL client for every analyst
>    in my organization, and with SQuirreL I'm experiencing problems with that.
>    - We have parts of our infrastructure that connect to the data
>    infrastructure through ThriftServer, and of course the format was binary.
>
> I've created a ticket to get binary mode back (
> https://issues.apache.org/jira/browse/SPARK-15934), but that's not the
> point. I experienced this problem a month ago but didn't do anything about
> it, because I assumed I was just doing something wrong. But the
> documentation was released recently, it contained no information about this
> change, and that made me dig into it.
>
> Most of what I describe is just annoying, but the new Tableau+Spark
> incompatibility is, I believe, a big deal. Maybe I'm wrong and there are
> ways to make things work; I just wouldn't expect the move to 2.0.0 to be so
> time consuming.
>
> My point: do we have any guidelines for making such radical changes?
>
> --
> Sincerely yours,
> Egor Pakhomov
>


Return binary mode in ThriftServer

2016-06-13 Thread Egor Pahomov
In May, due to SPARK-15095, binary mode was "removed" (the code is there, but
you cannot turn it on) from Spark 2.0. In 1.6.1 binary was the default, and in
2.0.0-preview it was removed. It's really annoying:

   - I cannot use Tableau+Spark anymore.
   - I need to change the connection URL in the SQL client for every analyst
   in my organization, and with SQuirreL I'm experiencing problems with that
   (see the connection sketch after this list).
   - We have parts of our infrastructure that connect to the data
   infrastructure through ThriftServer, and of course the format was binary.
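To make the second point concrete, here is a minimal sketch of the client-side
change (host name, ports and credentials are placeholders, not from my setup;
it assumes the standard Hive JDBC driver and the default thrift ports):

```scala
import java.sql.DriverManager

object ThriftServerConnect {
  def main(args: Array[String]): Unit = {
    // Placeholder driver load; the Hive JDBC driver must be on the classpath.
    Class.forName("org.apache.hive.jdbc.HiveDriver")

    // Binary (TCP) transport -- the 1.6.1 default that Tableau/SQuirreL used.
    val binaryUrl = "jdbc:hive2://thrift-host:10000/default"

    // HTTP transport -- what 2.0.0-preview leaves enabled; note the extra
    // transportMode/httpPath parameters every analyst's client now needs.
    val httpUrl =
      "jdbc:hive2://thrift-host:10001/default;transportMode=http;httpPath=cliservice"

    val conn = DriverManager.getConnection(httpUrl, "analyst", "")
    val rs = conn.createStatement().executeQuery("SELECT 1")
    while (rs.next()) println(rs.getInt(1))
    conn.close()
  }
}
```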

I've created a ticket to get binary mode back (
https://issues.apache.org/jira/browse/SPARK-15934), but that's not the point.
I experienced this problem a month ago but didn't do anything about it,
because I assumed I was just doing something wrong. But the documentation was
released recently, it contained no information about this change, and that
made me dig into it.

Most of what I describe is just annoying, but the new Tableau+Spark
incompatibility is, I believe, a big deal. Maybe I'm wrong and there are ways
to make things work; I just wouldn't expect the move to 2.0.0 to be so time
consuming.

My point: do we have any guidelines for making such radical changes?

-- 
Sincerely yours,
Egor Pakhomov


Utilizing YARN AM RPC port field

2016-06-13 Thread Mingyu Kim
Hi all,

 

YARN provides a way for an ApplicationMaster to register an RPC port so that a
client outside the YARN cluster can reach the application for any RPCs, but
Spark's YARN AMs simply register a dummy port number of 0. (See
https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnRMClient.scala#L74)
This is useful for long-running Spark application use cases where jobs are
submitted via a form of RPC to an already started Spark context running in
YARN cluster mode. Spark Job Server
(https://github.com/spark-jobserver/spark-jobserver) and Livy
(https://github.com/cloudera/hue/tree/master/apps/spark/java) are good
open-source examples of these use cases. The current workaround is to have the
Spark AM make a callback to a configured URL with the port number of the RPC
server, so the client can communicate with the AM.
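For reference, a rough sketch (my own illustration, not Spark's actual code) of
how an AM can register a real port with the RM via Hadoop's AMRMClient; the
host name, port and tracking URL below are placeholders:

```scala
import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest
import org.apache.hadoop.yarn.conf.YarnConfiguration

object RegisterAmWithPort {
  def main(args: Array[String]): Unit = {
    val conf = new YarnConfiguration()
    val amClient = AMRMClient.createAMRMClient[ContainerRequest]()
    amClient.init(conf)
    amClient.start()

    // Spark currently passes 0 as the second argument (YarnRMClient.scala#L74);
    // the idea is to pass the port of an RPC server the AM actually runs.
    val rpcPort = 8090 // placeholder: the AM's already-bound RPC/HTTP port
    amClient.registerApplicationMaster(
      "am-host.example.com",             // placeholder AM host name
      rpcPort,
      "http://am-host.example.com:8090") // placeholder tracking URL
  }
}
```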

 

Utilizing the YARN AM RPC port allows the port number reporting to be done in
a secure way (i.e., with the AM RPC port field and a Kerberized YARN cluster,
you don't need to reinvent a way to verify the authenticity of the port number
reporting) and removes the callback from the YARN cluster back to a client,
which means you can operate YARN in a low-trust environment and run other
client applications behind a firewall.

 

A couple of proposals I have for utilizing the YARN AM RPC port are below.
(Note that you cannot simply pre-configure the port number and pass it to the
Spark AM via configuration, because of potential port conflicts on the YARN
node.)

 

- Start up an empty Jetty server during Spark AM initialization, set its port
number when registering the AM with the RM, and pass a reference to the Jetty
server into the Spark application (e.g., through SparkContext) so the
application can dynamically add servlets/resources to it (see the first
sketch below).

- Have an optional static method in the main class (e.g.,
initializeRpcPort()) that sets up an RPC server and returns its port. The
Spark AM can call this method, register the port number with the RM, and then
continue on to invoke the main method (see the second sketch below). I don't
see this making a good API, though.
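Rough sketch of the first proposal; class names, paths and ports are
placeholders, and it assumes an unshaded Jetty 9 dependency rather than
whatever Spark actually bundles:

```scala
import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse}
import org.eclipse.jetty.server.{Server, ServerConnector}
import org.eclipse.jetty.servlet.{ServletContextHandler, ServletHolder}

object AmJettySketch {
  def main(args: Array[String]): Unit = {
    // Port 0 lets the OS pick a free port, avoiding conflicts on the YARN node.
    val server = new Server(0)
    val handler = new ServletContextHandler()
    handler.setContextPath("/")

    // In the actual proposal the application would receive `handler` (e.g. via
    // SparkContext) and register its own servlets; a trivial one is added here.
    handler.addServlet(new ServletHolder(new HttpServlet {
      override def doGet(req: HttpServletRequest, resp: HttpServletResponse): Unit =
        resp.getWriter.println("ok")
    }), "/status")

    server.setHandler(handler)
    server.start()

    // This is the port that would be passed to registerApplicationMaster(...)
    // instead of the dummy 0.
    val boundPort =
      server.getConnectors.head.asInstanceOf[ServerConnector].getLocalPort
    println(s"AM endpoint listening on port $boundPort")
  }
}
```

And a sketch of the second proposal (the method name is from the mail; the
ServerSocket is only a stand-in for a real RPC server):

```scala
object MyLongRunningApp {
  // Optional hook the AM could look up (e.g. reflectively) before calling
  // main(); it returns the port the AM should register with the RM.
  def initializeRpcPort(): Int = {
    val socket = new java.net.ServerSocket(0) // stand-in for a real RPC server
    socket.getLocalPort                       // a real impl would keep serving
  }

  def main(args: Array[String]): Unit = {
    // normal long-running Spark application entry point
  }
}
```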

 

I’m curious to hear what other people think. Would this be useful for anyone? 
What do you think about the proposals? Please feel free to suggest other ideas. 
Thanks!

 

Mingyu





Re: tpcds q1 - java.lang.NegativeArraySizeException

2016-06-13 Thread Sameer Agarwal
I'm unfortunately not able to reproduce this on master. Does the query
always fail deterministically?

On Mon, Jun 13, 2016 at 12:54 PM, Ovidiu-Cristian MARCU <
ovidiu-cristian.ma...@inria.fr> wrote:

> Yes, commit ad102af
>
> On 13 Jun 2016, at 21:25, Reynold Xin  wrote:
>
> Did you try this on master?
>
>
> On Mon, Jun 13, 2016 at 11:26 AM, Ovidiu-Cristian MARCU <
> ovidiu-cristian.ma...@inria.fr> wrote:
>
>> Hi,
>>
>> Running the first query of TPC-DS on a standalone setup (4 nodes, tpcds2
>> generated for scale 10 and transformed to Parquet under HDFS), I get one
>> exception [1].
>> Close to this problem I found the issue
>> https://issues.apache.org/jira/browse/SPARK-12089, but it seems to have
>> been resolved.
>>
>> Running the second query is successful.
>>
>> OpenJDK 64-Bit Server VM 1.7.0_101-b00 on Linux 3.2.0-4-amd64
>> Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
>> TPCDS Snappy:            Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
>> -----------------------------------------------------------------------------------
>> q2                             4512 / 8142          0.0        61769.4        1.0X
>>
>> Best,
>> Ovidiu
>>
>> [1]
>> WARN TaskSetManager: Lost task 17.0 in stage 80.0 (TID 4469,
>> 172.16.96.70): java.lang.NegativeArraySizeException
>> at
>> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:61)
>> at
>> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:214)
>> at
>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>> Source)
>> at
>> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>> at
>> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$3$$anon$2.hasNext(WholeStageCodegenExec.scala:386)
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>> at
>> scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
>> at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:628)
>> at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
>> at
>> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365)
>> at
>> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362)
>> at
>> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>> at
>> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>> at org.apache.spark.scheduler.Task.run(Task.scala:85)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> ERROR TaskSetManager: Task 17 in stage 80.0 failed 4 times; aborting job
>>
>> Driver stacktrace:
>> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>> at
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>> at
>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:806)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:806)
>> at scala.Option.foreach(Option.scala:257)
>> at
>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:806)
>> at
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1644)
>> at
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1603)
>> at
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1592)
>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1872)
>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1935)
>> at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:974)
>> 

Re: tpcds q1 - java.lang.NegativeArraySizeException

2016-06-13 Thread Ovidiu-Cristian MARCU
Yes, commit ad102af 

> On 13 Jun 2016, at 21:25, Reynold Xin  wrote:
> 
> Did you try this on master?
> 
> 
> On Mon, Jun 13, 2016 at 11:26 AM, Ovidiu-Cristian MARCU wrote:
> Hi,
> 
> Running the first query of TPC-DS on a standalone setup (4 nodes, tpcds2
> generated for scale 10 and transformed to Parquet under HDFS), I get one
> exception [1].
> Close to this problem I found the issue
> https://issues.apache.org/jira/browse/SPARK-12089, but it seems to have
> been resolved.
> 
> Running the second query is successful.
> 
> OpenJDK 64-Bit Server VM 1.7.0_101-b00 on Linux 3.2.0-4-amd64
> Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
> TPCDS Snappy:            Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
> ----------------------------------------------------------------------------------
> q2                             4512 / 8142          0.0        61769.4        1.0X
> 
> Best,
> Ovidiu
> 
> [1]
> WARN TaskSetManager: Lost task 17.0 in stage 80.0 (TID 4469, 172.16.96.70): 
> java.lang.NegativeArraySizeException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:61)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:214)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$3$$anon$2.hasNext(WholeStageCodegenExec.scala:386)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
>   at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:628)
>   at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 
> ERROR TaskSetManager: Task 17 in stage 80.0 failed 4 times; aborting job
> 
> Driver stacktrace:
>   at org.apache.spark.scheduler.DAGScheduler.org 
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:806)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:806)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:806)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1644)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1603)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1592)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1872)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1935)

Re: tpcds q1 - java.lang.NegativeArraySizeException

2016-06-13 Thread Reynold Xin
Did you try this on master?


On Mon, Jun 13, 2016 at 11:26 AM, Ovidiu-Cristian MARCU <
ovidiu-cristian.ma...@inria.fr> wrote:

> Hi,
>
> Running the first query of TPC-DS on a standalone setup (4 nodes, tpcds2
> generated for scale 10 and transformed to Parquet under HDFS), I get one
> exception [1].
> Close to this problem I found the issue
> https://issues.apache.org/jira/browse/SPARK-12089, but it seems to have
> been resolved.
>
> Running the second query is successful.
>
> OpenJDK 64-Bit Server VM 1.7.0_101-b00 on Linux 3.2.0-4-amd64
> Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
> TPCDS Snappy:            Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
> ----------------------------------------------------------------------------------
> q2                             4512 / 8142          0.0        61769.4        1.0X
>
> Best,
> Ovidiu
>
> [1]
> WARN TaskSetManager: Lost task 17.0 in stage 80.0 (TID 4469,
> 172.16.96.70): java.lang.NegativeArraySizeException
> at
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:61)
> at
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:214)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
> Source)
> at
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$3$$anon$2.hasNext(WholeStageCodegenExec.scala:386)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at
> scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
> at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:628)
> at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
> at
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365)
> at
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362)
> at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
> at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> ERROR TaskSetManager: Task 17 in stage 80.0 failed 4 times; aborting job
>
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:806)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:806)
> at scala.Option.foreach(Option.scala:257)
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:806)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1644)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1603)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1592)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1872)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1935)
> at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:974)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:357)
> at org.apache.spark.rdd.RDD.reduce(RDD.scala:956)
> at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1371)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at
> 

Re: [YARN] Small fix for yarn.Client to use buildPath (not Path.SEPARATOR)

2016-06-13 Thread Sean Owen
Yeah, it does the same thing anyway. It's fine to use the method
consistently. I think there's an instance in ClientSuite that can use it too.

On Mon, Jun 13, 2016 at 6:50 PM, Jacek Laskowski  wrote:
> Hi,
>
> Just noticed that yarn.Client#populateClasspath uses Path.SEPARATOR
> [1] to build a CLASSPATH entry, while another similar-looking line uses
> the buildPath method [2].
>
> Could a pull request with a change to use buildPath at [1] be
> accepted? I'm always unsure how to handle such small fixes.
>
> [1] 
> https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1298
> [2] Path.SEPARATOR
>
> Regards,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



tpcds q1 - java.lang.NegativeArraySizeException

2016-06-13 Thread Ovidiu-Cristian MARCU
Hi,

Running the first query of TPC-DS on a standalone setup (4 nodes, tpcds2
generated for scale 10 and transformed to Parquet under HDFS), I get one
exception [1].
Close to this problem I found the issue
https://issues.apache.org/jira/browse/SPARK-12089, but it seems to have been
resolved.
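My guess at the mechanism (just an assumption on my part, not a confirmed
diagnosis): the size computed while growing the row buffer overflows Int, so
the requested array length goes negative and the allocation throws the
NegativeArraySizeException seen in the BufferHolder.grow frame of the trace
below. A minimal illustration of that arithmetic:

```scala
// Not a reproduction of the query failure -- just the integer overflow that
// can produce a NegativeArraySizeException when a buffer grows past Int.MaxValue.
val currentSize = 1 << 30                    // ~1 GB already allocated
val neededExtra = (1 << 30) + (1 << 29)      // growth request
val newSize     = currentSize + neededExtra  // overflows Int -> negative value
println(newSize)                             // prints a negative number
// new Array[Byte](newSize)                  // would throw NegativeArraySizeException
```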

Running the second query is successful.

OpenJDK 64-Bit Server VM 1.7.0_101-b00 on Linux 3.2.0-4-amd64
Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
TPCDS Snappy:            Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
----------------------------------------------------------------------------------
q2                             4512 / 8142          0.0        61769.4        1.0X

Best,
Ovidiu

[1]
WARN TaskSetManager: Lost task 17.0 in stage 80.0 (TID 4469, 172.16.96.70): 
java.lang.NegativeArraySizeException
at 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:61)
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:214)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$3$$anon$2.hasNext(WholeStageCodegenExec.scala:386)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:628)
at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
at 
org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365)
at 
org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

ERROR TaskSetManager: Task 17 in stage 80.0 failed 4 times; aborting job

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:806)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:806)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:806)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1644)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1603)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1592)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1872)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1935)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:974)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:357)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:956)
at 

[YARN] Small fix for yarn.Client to use buildPath (not Path.SEPARATOR)

2016-06-13 Thread Jacek Laskowski
Hi,

Just noticed that yarn.Client#populateClasspath uses Path.SEPARATOR
[1] to build a CLASSPATH entry, while another similar-looking line uses
the buildPath method [2]; a small sketch of the equivalence follows.
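For reference, a minimal sketch of why the two forms produce the same entry,
assuming buildPath is the mkString(Path.SEPARATOR) helper in object Client
(the classpath components below are made-up placeholders):

```scala
import org.apache.hadoop.fs.Path

object BuildPathSketch {
  // Same shape as the helper in object Client: join components with Path.SEPARATOR.
  def buildPath(components: String*): String = components.mkString(Path.SEPARATOR)

  def main(args: Array[String]): Unit = {
    val viaSeparator = "{{PWD}}" + Path.SEPARATOR + "__spark_conf__"
    val viaBuildPath = buildPath("{{PWD}}", "__spark_conf__")
    // Either way the CLASSPATH entry comes out identical; using buildPath is
    // just the more consistent style.
    assert(viaSeparator == viaBuildPath)
    println(viaBuildPath)
  }
}
```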

Could a pull request with a change to use buildPath at [1] be
accepted? I'm always unsure how to handle such small fixes.

[1] 
https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1298
[2] Path.SEPARATOR

Regards,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org