Re: Unit testing framework for Spark Jobs?

2016-03-02 Thread Yin Yang
Cycling prior bits: http://search-hadoop.com/m/q3RTto4sby1Cd2rt=Re+Unit+test+with+sqlContext On Wed, Mar 2, 2016 at 9:54 AM, SRK wrote: > Hi, > > What is a good unit testing framework for Spark batch/streaming jobs? I > have > core spark, spark sql with dataframes
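
For reference, a minimal sketch of what such a test can look like with plain ScalaTest and a local-mode SparkContext (the suite name, data and assertions are illustrative, not from the thread; libraries such as spark-testing-base wrap the same pattern):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.scalatest.{BeforeAndAfterAll, FunSuite}

  class WordCountSuite extends FunSuite with BeforeAndAfterAll {
    private var sc: SparkContext = _

    override def beforeAll(): Unit = {
      // Local-mode context shared by all tests in the suite
      sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("unit-test"))
    }

    override def afterAll(): Unit = {
      if (sc != null) sc.stop()
    }

    test("word count produces expected counts") {
      val counts = sc.parallelize(Seq("a", "b", "a"))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
        .collectAsMap()
      assert(counts("a") === 2)
      assert(counts("b") === 1)
    }
  }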

Re: a basic question on first use of PySpark shell and example, which is failing

2016-02-29 Thread Yin Yang
RDDOperationScope is in the spark-core_2.1x jar file: 7148 Mon Feb 29 09:21:32 PST 2016 org/apache/spark/rdd/RDDOperationScope.class Can you check whether the spark-core jar is on the classpath? FYI On Mon, Feb 29, 2016 at 1:40 PM, Taylor, Ronald C wrote: > Hi Jules,
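
A quick way to confirm from a running spark-shell which jar the class is actually loaded from (illustrative, not from the thread):

  // Throws ClassNotFoundException if spark-core is not on the classpath
  val cls = Class.forName("org.apache.spark.rdd.RDDOperationScope")
  // Prints the jar the class was loaded from
  println(cls.getProtectionDomain.getCodeSource.getLocation)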

Re: spark 1.6 new memory management - some issues with tasks not using all executors

2016-02-29 Thread Yin Yang
The default value for spark.shuffle.reduceLocality.enabled is true. To reduce surprise to users of 1.5 and earlier releases, should the default value be set to false? On Mon, Feb 29, 2016 at 5:38 AM, Lior Chaga wrote: > Hi Koret, > Try
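
For anyone who wants the pre-1.6 behaviour back while evaluating, a sketch of how the flag can be turned off (the flag name is from the message above; everything else is illustrative):

  spark-submit --conf spark.shuffle.reduceLocality.enabled=false ...

or in code:

  val conf = new SparkConf()
    .setAppName("my-job")
    .set("spark.shuffle.reduceLocality.enabled", "false")
  val sc = new SparkContext(conf)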

Re: Ordering two dimensional arrays of (String, Int) in the order of second element

2016-02-27 Thread Yin Yang
Is there a particular reason you cannot use a temporary table? Thanks On Sat, Feb 27, 2016 at 10:59 AM, Ashok Kumar <ashok34...@yahoo.com> wrote: > Thank you sir. > > Can one do this sorting without using temporary table if possible? > > Best > > > On Saturday, 27
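
If the data is an in-memory array or an RDD of pairs rather than a table, sorting on the second element can also be done directly; a sketch with illustrative data (descending order assumed):

  val arr = Array(("a", 3), ("b", 1), ("c", 2))

  // Plain Scala collection: sort by the count, highest first
  val sorted = arr.sortBy { case (_, count) => -count }

  // RDD of pairs: same idea with RDD.sortBy
  val sortedRdd = sc.parallelize(arr).sortBy({ case (_, count) => count }, ascending = false)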

Re: Ordering two dimensional arrays of (String, Int) in the order of second element

2016-02-27 Thread Yin Yang
scala> Seq((1, "b", "test"), (2, "a", "foo")).toDF("id", "a", "b").registerTempTable("test") scala> val df = sql("SELECT struct(id, b, a) from test order by b") df: org.apache.spark.sql.DataFrame = [struct(id, b, a): struct] scala> df.show ++ |struct(id, b, a)|

Re: Ordering two dimensional arrays of (String, Int) in the order of second element

2016-02-27 Thread Yin Yang
Is this what you are looking for? scala> Seq((2, "a", "test"), (2, "b", "foo")).toDF("id", "a", "b").registerTempTable("test") scala> val df = sql("SELECT struct(id, b, a) from test") df: org.apache.spark.sql.DataFrame = [struct(id, b, a): struct] scala> df.show ++ |struct(id, b, a)|

Re: Spark SQL support for sub-queries

2016-02-26 Thread Yin Yang
I tried the following: scala> Seq((2, "a", "test"), (2, "b", "foo")).toDF("id", "a", "b").registerTempTable("test") scala> val df = sql("SELECT maxRow.* FROM (SELECT max(struct(id, b, a)) as maxRow FROM test) a") df: org.apache.spark.sql.DataFrame = [id: int, b: string ... 1 more field] scala>
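
The same idea can be phrased with the DataFrame API instead of a subquery; a sketch (assuming the usual spark-shell imports; the field-access style is illustrative):

  import org.apache.spark.sql.functions.{max, struct}

  val df = sqlContext.table("test")
  val maxRow = df.agg(max(struct($"id", $"b", $"a")).as("maxRow"))
    .select($"maxRow.id", $"maxRow.b", $"maxRow.a")
  maxRow.show()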

Re: Spark 1.5 on Mesos

2016-02-26 Thread Yin Yang
Have you read this? https://spark.apache.org/docs/latest/running-on-mesos.html On Fri, Feb 26, 2016 at 11:03 AM, Ashish Soni wrote: > Hi All , > > Is there any proper documentation as how to run spark on mesos , I am > trying from the last few days and not able to make
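
Beyond that page, the core of it is pointing --master at the Mesos master URL; a sketch with placeholder host, class and paths:

  spark-submit \
    --master mesos://mesos-master.example.com:5050 \
    --class com.example.MyApp \
    --conf spark.executor.uri=hdfs:///dist/spark-1.5.2-bin-hadoop2.6.tgz \
    my-app.jar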

Re: Spark SQL support for sub-queries

2016-02-26 Thread Yin Yang
Since collect is involved, the approach would be slower compared to the SQL Mich gave in his first email. On Fri, Feb 26, 2016 at 1:42 AM, Michał Zieliński < zielinski.mich...@gmail.com> wrote: > You need to collect the value. > > val m: Int = d.agg(max($"id")).collect.apply(0).getInt(0) >
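
For context, a sketch of the collect-based variant under discussion (assuming the usual functions/implicits imports; d is the DataFrame from the quoted reply, and the follow-up filter is illustrative). The first statement runs a job just to bring the max back to the driver, and any follow-up query runs a second one:

  // Job 1: compute the max id and pull it to the driver
  val maxId: Int = d.agg(max($"id")).collect()(0).getInt(0)

  // Job 2: reuse the driver-side value in another query
  val top = d.filter($"id" === maxId)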

Re: DirectFileOutputCommiter

2016-02-25 Thread Yin Yang
The header of DirectOutputCommitter.scala says Databricks. Did you get it from Databricks? On Thu, Feb 25, 2016 at 3:01 PM, Teng Qiu wrote: > interesting in this topic as well, why the DirectFileOutputCommitter not > included? > > we added it in our fork, under >

Re: Spark 1.6.0 running jobs in yarn shows negative no of tasks in executor

2016-02-25 Thread Yin Yang
Which release of Hadoop are you using? Can you share a bit about the logic of your job? Pastebinning the relevant portion of the logs would give us more clues. Thanks On Thu, Feb 25, 2016 at 8:54 AM, unk1102 wrote: > Hi I have spark job which I run on yarn and sometimes it

Re: Running executors missing in sparkUI

2016-02-25 Thread Yin Yang
Which Spark / Hadoop release are you running? Thanks On Thu, Feb 25, 2016 at 4:28 AM, Jan Štěrba wrote: > Hello, > > I have quite a weird behaviour that I can't quite wrap my head around. > I am running Spark on a Hadoop YARN cluster. I have Spark configured > in such a

Re: Error:java.lang.RuntimeException: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2016-02-24 Thread Yin Yang
See slides starting with slide #25 of http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications FYI On Wed, Feb 24, 2016 at 7:25 PM, xiazhuchang wrote: > When cache data to memory, the code DiskStore$getBytes will be called. If > there is
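
For the common case behind this error — a single cached or shuffled partition growing past 2GB — a frequently suggested mitigation is simply more, smaller partitions; a sketch (the partition counts are illustrative, pick them from your data size):

  import org.apache.spark.storage.StorageLevel

  // More partitions before caching keeps each block well under 2GB
  val repartitioned = rdd.repartition(400)
  repartitioned.persist(StorageLevel.MEMORY_AND_DISK)

  // For Spark SQL shuffles, raising the shuffle partition count helps similarly
  sqlContext.setConf("spark.sql.shuffle.partitions", "400")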

Re: Filter on a column having multiple values

2016-02-24 Thread Yin Yang
However, when the number of choices gets big, the following notation becomes cumbersome. On Wed, Feb 24, 2016 at 3:41 PM, Mich Talebzadeh < mich.talebza...@cloudtechnologypartners.co.uk> wrote: > You can use operators here. > > t.filter($"column1" === 1 || $"column1" === 2) > > > > > > On
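
For a long list of values, Column.isin is the usual way around the chained || notation; a sketch with illustrative values (t is the DataFrame from the quoted reply):

  val allowed = Seq(1, 2, 5, 7, 11)
  val filtered = t.filter($"column1".isin(allowed: _*))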

Re: Execution plan in spark

2016-02-24 Thread Yin Yang
Is the following what you were looking for? sqlContext.sql(""" CREATE TEMPORARY TABLE partitionedParquet USING org.apache.spark.sql.parquet OPTIONS ( path '/tmp/partitioned' )""") table("partitionedParquet").explain(true) On Wed, Feb 24, 2016 at 1:16 AM, Ashok
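
Once the temporary table is registered, the same explain(true) call also works on any query over it; a sketch (the predicate column is a placeholder):

  sqlContext.sql("SELECT * FROM partitionedParquet WHERE someColumn = 1").explain(true)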

Re: metrics not reported by spark-cassandra-connector

2016-02-23 Thread Yin Yang
Hi, Sa: Have you asked on the spark-cassandra-connector mailing list? It seems you would get a better response there. Cheers