Re: not found: value SQLContextSingleton

2015-04-11 Thread Tathagata Das
Have you created a class called SQLContextSingleton? If so, is it on the
compile classpath?
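
For reference, the lazily initialized singleton used in the Spark Streaming
examples looks roughly like this (a sketch assuming the Spark 1.x SQLContext
API; adapt it to however your object is actually defined):

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Lazily instantiated, shared SQLContext so every foreachRDD call reuses the
// same instance instead of creating a new one per batch.
object SQLContextSingleton {
  @transient private var instance: SQLContext = _

  def getInstance(sparkContext: SparkContext): SQLContext = {
    if (instance == null) {
      instance = new SQLContext(sparkContext)
    }
    instance
  }
}

If the object lives in another package, "not found: value SQLContextSingleton"
usually just means the import for it is missing in KafkaConsumer.scala.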

On Fri, Apr 10, 2015 at 6:47 AM, Mukund Ranjan (muranjan) <
muran...@cisco.com> wrote:

>  Hi All,
>
>
>  Any idea why I am getting this error?
>
>
>
>
>
>  wordsTenSeconds.foreachRDD((rdd: RDD[String], time: Time) => {
>
>   val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
>   // This line is creating this error
>
> })
>
>
>
>  — E R R O R ——
>
> [error]
> /Users/muranjan/workspace/kafka/src/main/scala/kafka/KafkaConsumer.scala:103:
> not found: value SQLContextSingleton
> [error]   val sqlContext =
> SQLContextSingleton.getInstance(rdd.sparkContext)
> [error]^
> [error]
> /Users/muranjan/workspace/kafka/src/main/scala/kafka/KafkaConsumer.scala:158:
> not found: value SQLContextSingleton
> [error]   val sqlContext =
> SQLContextSingleton.getInstance(rdd.sparkContext)
> [error]^
> [error] two errors found
> [error] (compile:compileIncremental) Compilation failed
>
>  — E R R O R ——
>
>  Thanks,
> Mukund
>


Re: Spark support for Hadoop Formats (Avro)

2015-04-11 Thread ๏̯͡๏
The read itself seems to be successful, since the values for each field in
the record are different and correct. The problem appears when I collect it
or trigger the next processing step (a join with another table); each of
these probably triggers serialization, and that is when all the fields in
the record end up with the value of the first field (or element).
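
One thing worth ruling out (an assumption on my part, not something confirmed
in this thread): Hadoop input formats reuse the record objects they hand out,
so serializing, caching, or shuffling a reused, mutable Avro record can smear
field values in exactly this way. A minimal sketch of deep-copying each record
as soon as it is read, before any collect or join:

import org.apache.avro.specific.SpecificData

// Copy every Avro record into a fresh object so later serialization does not
// operate on a shared, mutated instance. viEvents is the pair RDD from the
// snippets quoted below.
val viEventsCopied = viEvents.map { case (itemId, event) =>
  (itemId, SpecificData.get().deepCopy(event.getSchema, event))
}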



On Sun, Apr 12, 2015 at 9:14 AM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:

> We have a very large amount of processing on Hadoop (400 M/R jobs, 1-day
> duration, 100s of TB of data, 100s of joins). We are exploring Spark as an
> alternative to speed up our processing time. We use Scala + Scoobie today,
> and Avro is the data format across steps.
>
>
> I observed some strange behavior: I read sample data (Avro format, 10
> records), collect it, and print each record. All the data for each element
> within a record is wiped out, and I only see the data of the first element
> copied into everything.
>
> Is this a problem with Spark, or with using Avro?
>
>
> Example:
>
> I ran through that RDD and printed 4 fields from each record; they all
> printed correctly.
>
>
> val x = viEvents.map {
>   case (itemId, event) =>
> println(event.get("guid"), itemId, event.get("itemId"),
> event.get("siteId"))
> (itemId, event)
> }
>
> The above code prints
>
> (27c9fbc014b4f61526f0574001b73b00,261197590161,261197590161,3)
> (27c9fbc014b4f61526f0574001b73b00,261197590161,261197590161,3)
> (27c9fbc014b4f61526f0574001b73b00,261197590161,261197590161,3)
> (340da8c014a46272c0c8c830011c3bf0,221733319941,221733319941,77)
> (340da8c014a46272c0c8c830011c3bf0,181704048554,181704048554,77)
> (340da8c014a46272c0c8c830011c3bf0,231524481696,231524481696,77)
> (340da8c014a46272c0c8c830011c3bf0,271830464992,271830464992,77)
> (393938d71480a2aaf8e440d1fff709f4,141586046141,141586046141,0)
> (3a792a7c14c0a35882346c04fff4e236,161605492016,161605492016,0)
> (3a792a7c14c0a35882346c04fff4e236,161605492016,161605492016,0)
>
> viEvents.collect.foreach(a => println(a._2.get("guid"), a._1,
> a._2.get("itemId"), a._2.get("siteId")))
>
> *Now I collected it, which might have led to serialization of the RDD.* When
> I print the same 4 fields, *I only get the guid value for all of them. Does
> this have something to do with serialization?*
>
>
> (27c9fbc014b4f61526f0574001b73b00,261197590161,27c9fbc014b4f61526f0574001b73b00,27c9fbc014b4f61526f0574001b73b00)
>
> (27c9fbc014b4f61526f0574001b73b00,261197590161,27c9fbc014b4f61526f0574001b73b00,27c9fbc014b4f61526f0574001b73b00)
>
> (27c9fbc014b4f61526f0574001b73b00,261197590161,27c9fbc014b4f61526f0574001b73b00,27c9fbc014b4f61526f0574001b73b00)
>
> (340da8c014a46272c0c8c830011c3bf0,221733319941,340da8c014a46272c0c8c830011c3bf0,340da8c014a46272c0c8c830011c3bf0)
>
> (340da8c014a46272c0c8c830011c3bf0,181704048554,340da8c014a46272c0c8c830011c3bf0,340da8c014a46272c0c8c830011c3bf0)
>
> (340da8c014a46272c0c8c830011c3bf0,231524481696,340da8c014a46272c0c8c830011c3bf0,340da8c014a46272c0c8c830011c3bf0)
>
> (340da8c014a46272c0c8c830011c3bf0,271830464992,340da8c014a46272c0c8c830011c3bf0,340da8c014a46272c0c8c830011c3bf0)
>
> (393938d71480a2aaf8e440d1fff709f4,141586046141,393938d71480a2aaf8e440d1fff709f4,393938d71480a2aaf8e440d1fff709f4)
>
> (3a792a7c14c0a35882346c04fff4e236,161605492016,3a792a7c14c0a35882346c04fff4e236,3a792a7c14c0a35882346c04fff4e236)
>
> (3a792a7c14c0a35882346c04fff4e236,161605492016,3a792a7c14c0a35882346c04fff4e236,3a792a7c14c0a35882346c04fff4e236)
>
>
>
> The RDD is of type org.apache.spark.rdd.RDD[(Long,
>  com.ebay.ep.poc.spark.reporting.process.detail.model.DetailInputRecord)]
>
> At the time of context creation I did this:
> val conf = new SparkConf()
>   .setAppName(detail)
>   .set("spark.serializer", "org.apache.spark.serializer.
> *KryoSerializer*")
>   .set("spark.kryoserializer.buffer.mb",
> arguments.get("buffersize").get)
>   .set("spark.kryoserializer.buffer.max.mb",
> arguments.get("maxbuffersize").get)
>   .set("spark.driver.maxResultSize",
> arguments.get("maxResultSize").get)
>   .set("spark.yarn.maxAppAttempts", "1")
>
> .registerKryoClasses(Array(classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum],
>
> classOf[com.ebay.ep.poc.spark.reporting.process.detail.model.DetailInputRecord],
>
> classOf[com.ebay.ep.poc.spark.reporting.process.detail.model.InputRecord],
>
> classOf[com.ebay.ep.poc.spark.reporting.process.model.SessionRecord],
>
> classOf[com.ebay.ep.poc.spark.reporting.process.model.DataRecord],
>
> classOf[com.ebay.ep.poc.spark.reporting.process.model.ExperimentationRecord]))
>
> The class hierarchy is
>
> DetailInputRecord extends InputRecord extends SessionRecord extends
> ExperimentationRecord extends
>org.apache.avro.generic.GenericRecord.Record(schema: Schema)
>
>
> Please suggest.
>
>
>
> --
> Deepak
>
>


-- 
Deepak


Spark support for Hadoop Formats (Avro)

2015-04-11 Thread ๏̯͡๏
We have a very large amount of processing on Hadoop (400 M/R jobs, 1-day
duration, 100s of TB of data, 100s of joins). We are exploring Spark as an
alternative to speed up our processing time. We use Scala + Scoobie today,
and Avro is the data format across steps.


I observed some strange behavior: I read sample data (Avro format, 10
records), collect it, and print each record. All the data for each element
within a record is wiped out, and I only see the data of the first element
copied into everything.

Is this a problem with Spark, or with using Avro?


Example:

I ran through that RDD and printed 4 fields from each record; they all
printed correctly.


val x = viEvents.map {
  case (itemId, event) =>
println(event.get("guid"), itemId, event.get("itemId"),
event.get("siteId"))
(itemId, event)
}

The above code prints

(27c9fbc014b4f61526f0574001b73b00,261197590161,261197590161,3)
(27c9fbc014b4f61526f0574001b73b00,261197590161,261197590161,3)
(27c9fbc014b4f61526f0574001b73b00,261197590161,261197590161,3)
(340da8c014a46272c0c8c830011c3bf0,221733319941,221733319941,77)
(340da8c014a46272c0c8c830011c3bf0,181704048554,181704048554,77)
(340da8c014a46272c0c8c830011c3bf0,231524481696,231524481696,77)
(340da8c014a46272c0c8c830011c3bf0,271830464992,271830464992,77)
(393938d71480a2aaf8e440d1fff709f4,141586046141,141586046141,0)
(3a792a7c14c0a35882346c04fff4e236,161605492016,161605492016,0)
(3a792a7c14c0a35882346c04fff4e236,161605492016,161605492016,0)

viEvents.collect.foreach(a => println(a._2.get("guid"), a._1,
a._2.get("itemId"), a._2.get("siteId")))

*Now I collected it, which might have led to serialization of the RDD.* When
I print the same 4 fields, *I only get the guid value for all of them. Does
this have something to do with serialization?*

(27c9fbc014b4f61526f0574001b73b00,261197590161,27c9fbc014b4f61526f0574001b73b00,27c9fbc014b4f61526f0574001b73b00)
(27c9fbc014b4f61526f0574001b73b00,261197590161,27c9fbc014b4f61526f0574001b73b00,27c9fbc014b4f61526f0574001b73b00)
(27c9fbc014b4f61526f0574001b73b00,261197590161,27c9fbc014b4f61526f0574001b73b00,27c9fbc014b4f61526f0574001b73b00)
(340da8c014a46272c0c8c830011c3bf0,221733319941,340da8c014a46272c0c8c830011c3bf0,340da8c014a46272c0c8c830011c3bf0)
(340da8c014a46272c0c8c830011c3bf0,181704048554,340da8c014a46272c0c8c830011c3bf0,340da8c014a46272c0c8c830011c3bf0)
(340da8c014a46272c0c8c830011c3bf0,231524481696,340da8c014a46272c0c8c830011c3bf0,340da8c014a46272c0c8c830011c3bf0)
(340da8c014a46272c0c8c830011c3bf0,271830464992,340da8c014a46272c0c8c830011c3bf0,340da8c014a46272c0c8c830011c3bf0)
(393938d71480a2aaf8e440d1fff709f4,141586046141,393938d71480a2aaf8e440d1fff709f4,393938d71480a2aaf8e440d1fff709f4)
(3a792a7c14c0a35882346c04fff4e236,161605492016,3a792a7c14c0a35882346c04fff4e236,3a792a7c14c0a35882346c04fff4e236)
(3a792a7c14c0a35882346c04fff4e236,161605492016,3a792a7c14c0a35882346c04fff4e236,3a792a7c14c0a35882346c04fff4e236)



The RDD is of type org.apache.spark.rdd.RDD[(Long,
 com.ebay.ep.poc.spark.reporting.process.detail.model.DetailInputRecord)]

At the time of context creation I did this:
val conf = new SparkConf()
  .setAppName(detail)
  .set("spark.serializer", "org.apache.spark.serializer.*KryoSerializer*
")
  .set("spark.kryoserializer.buffer.mb",
arguments.get("buffersize").get)
  .set("spark.kryoserializer.buffer.max.mb",
arguments.get("maxbuffersize").get)
  .set("spark.driver.maxResultSize", arguments.get("maxResultSize").get)
  .set("spark.yarn.maxAppAttempts", "1")

.registerKryoClasses(Array(classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum],

classOf[com.ebay.ep.poc.spark.reporting.process.detail.model.DetailInputRecord],

classOf[com.ebay.ep.poc.spark.reporting.process.detail.model.InputRecord],

classOf[com.ebay.ep.poc.spark.reporting.process.model.SessionRecord],
  classOf[com.ebay.ep.poc.spark.reporting.process.model.DataRecord],

classOf[com.ebay.ep.poc.spark.reporting.process.model.ExperimentationRecord]))

The class hierarchy is

DetailInputRecord extends InputRecord extends SessionRecord extends
ExperimentationRecord extends
   org.apache.avro.generic.GenericRecord.Record(schema: Schema)
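
For what it's worth, one isolation step I can try (assuming the 10-record
sample can tolerate the slower default serializer) is to re-run the same
collect with Kryo switched off:

// Same job, but falling back to Java serialization for a single debug run.
// If the fields print correctly here, the problem lies in how Kryo handles
// these Avro records; if they are still wrong, serializer config is not the
// cause.
val debugConf = new SparkConf()
  .setAppName(detail)
  .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")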


Please suggest.



-- 
Deepak


Re: DataFrame column name restriction

2015-04-11 Thread Michael Armbrust
That is a good question. Names with `.` in them are in particular broken
by SPARK-5632, which I'd like to fix.

There is a more general question of whether strings that are passed to
DataFrames should be treated as quoted identifiers (i.e. `as though they
were in backticks`) or interpreted as normal identifiers in SQL. I've
opened this JIRA to discuss further: SPARK-6865
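
In the meantime, a simple workaround (a sketch, not an official
recommendation) is to keep dots out of column names, e.g. by using an
underscore for the derived column in the example below:

// Mirrors the shell session quoted below, but the derived column is named
// "b_b" instead of "b.b", which the analyzer resolves without trouble.
import sqlContext.implicits._  // for the $"..." column syntax, as in the shell
case class A(a: Int)
val df = sqlContext.createDataFrame(Seq(A(10), A(20))).withColumn("b_b", $"a" + 1)
df.select("a", "b_b").show()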


On Fri, Apr 10, 2015 at 7:18 PM, Justin Yip  wrote:

> Hello,
>
> Are there any restrictions on column names? I tried to use ".", but
> sqlContext.sql cannot find the column. I would guess that "." is tricky
> since it affects accessing StructType fields, but are there any other
> restrictions on column names?
>
> scala> case class A(a: Int)
> defined class A
>
> scala> sqlContext.createDataFrame(Seq(A(10), A(20))).withColumn("b.b",
> $"a" + 1)
> res19: org.apache.spark.sql.DataFrame = [a: int, b.b: int]
>
> scala> res19.registerTempTable("res19")
>
> scala> res19.select("a")
> res24: org.apache.spark.sql.DataFrame = [a: int]
>
> scala> res19.select("a", "b.b")
> org.apache.spark.sql.AnalysisException: cannot resolve 'b.b' given input
> columns a, b.b;
> at
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> 
>
>
> Thanks.
>
> Justin
>


Re: Spark on Mesos / Executor Memory

2015-04-11 Thread Tim Chen
(Adding spark user list)

Hi Tom,

If I understand correctly, you're saying that you're running into memory
problems because the scheduler is allocating too many CPUs and not enough
memory to accommodate them, right?

In the case of fine-grained mode I don't think that's a problem, since we
have a fixed amount of CPU and memory per task.
However, in coarse-grained mode you can run into that problem if you're
within the spark.cores.max limit, because memory is a fixed number.
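
For context, these are the knobs involved in coarse-grained mode today (a
sketch assuming Spark 1.3 on Mesos; the heap is fixed per executor while the
cores actually acquired depend on the offers, which is the mismatch being
discussed):

import org.apache.spark.SparkConf

// Coarse-grained Mesos: long-running executors whose heap is fixed up front,
// with cores bounded only by spark.cores.max across the whole application.
val conf = new SparkConf()
  .setMaster("mesos://zk://zk1:2181/mesos")  // placeholder Mesos master URL
  .set("spark.mesos.coarse", "true")
  .set("spark.cores.max", "16")              // total cores the app may acquire
  .set("spark.executor.memory", "8g")        // per-executor heap, independent of cores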

I have a patch out to configure the maximum number of CPUs a coarse-grained
executor should use, and it also allows multiple executors in coarse-grained
mode. So you could, say, launch multiple executors of at most 4 cores each,
each with spark.executor.memory (plus overhead, etc.), on a slave (
https://github.com/apache/spark/pull/4027).

It also might be interesting to include a cores-to-memory multiplier so
that with a larger number of cores we scale the memory by some factor, but
I'm not entirely sure that would be intuitive to use, or that people would
know what to set it to, as that can easily change with different workloads.

Tim







On Sat, Apr 11, 2015 at 9:51 AM, Tom Arnfeld  wrote:

> We're running Spark 1.3.0 (with a couple of patches over the top for
> docker related bits).
>
> I don't think SPARK-4158 is related to what we're seeing; things do run
> fine on the cluster, given a ridiculously large executor memory
> configuration. As for SPARK-3535, although that looks useful, I think
> we're seeing something else.
>
> Put a different way, the amount of memory required at any given time by
> the Spark JVM process is directly proportional to the number of CPUs it
> has, because more CPUs mean more tasks, and more tasks mean more memory.
> Even if we're using coarse mode, the amount of executor memory should be
> proportional to the number of CPUs in the offer.
>
> On 11 April 2015 at 17:39, Brenden Matthews  wrote:
>
>> I ran into some issues with it a while ago, and submitted a couple PRs to
>> fix it:
>>
>> https://github.com/apache/spark/pull/2401
>> https://github.com/apache/spark/pull/3024
>>
>> Do these look relevant? What version of Spark are you running?
>>
>> On Sat, Apr 11, 2015 at 9:33 AM, Tom Arnfeld  wrote:
>>
>>> Hey,
>>>
>>> Not sure whether it's best to ask this on the spark mailing list or the
>>> mesos one, so I'll try here first :-)
>>>
>>> I'm having a bit of trouble with out of memory errors in my spark
>>> jobs... it seems fairly odd to me that memory resources can only be set at
>>> the executor level, and not also at the task level. For example, as far as
>>> I can tell there's only a *spark.executor.memory* config option.
>>>
>>> Surely the memory requirements of a single executor are quite
>>> dramatically influenced by the number of concurrent tasks running? Given a
>>> shared cluster, I have no idea what % of an individual slave my executor is
>>> going to get, so I basically have to set the executor memory to a value
>>> that's correct when the whole machine is in use...
>>>
>>> Has anyone else running Spark on Mesos come across this, or maybe
>>> someone could correct my understanding of the config options?
>>>
>>> Thanks!
>>>
>>> Tom.
>>>
>>
>>
>


Re: Unusual behavior with leftouterjoin

2015-04-11 Thread ๏̯͡๏
I ran through that RDD and printed 4 fields from each record; they all
printed correctly.


val x = viEvents.map {
  case (itemId, event) =>
println(event.get("guid"), itemId, event.get("itemId"),
event.get("siteId"))
(itemId, event)
}

The above code prints

(27c9fbc014b4f61526f0574001b73b00,261197590161,261197590161,3)
(27c9fbc014b4f61526f0574001b73b00,261197590161,261197590161,3)
(27c9fbc014b4f61526f0574001b73b00,261197590161,261197590161,3)
(340da8c014a46272c0c8c830011c3bf0,221733319941,221733319941,77)
(340da8c014a46272c0c8c830011c3bf0,181704048554,181704048554,77)
(340da8c014a46272c0c8c830011c3bf0,231524481696,231524481696,77)
(340da8c014a46272c0c8c830011c3bf0,271830464992,271830464992,77)
(393938d71480a2aaf8e440d1fff709f4,141586046141,141586046141,0)
(3a792a7c14c0a35882346c04fff4e236,161605492016,161605492016,0)
(3a792a7c14c0a35882346c04fff4e236,161605492016,161605492016,0)

viEvents.collect.foreach(a => println(a._2.get("guid"), a._1,
a._2.get("itemId"), a._2.get("siteId")))

*Now I collected it, which might have led to serialization of the RDD.*
When I print the same 4 fields, *I only get the guid value for all of them.
Does this have something to do with serialization?*

(27c9fbc014b4f61526f0574001b73b00,261197590161,27c9fbc014b4f61526f0574001b73b00,27c9fbc014b4f61526f0574001b73b00)
(27c9fbc014b4f61526f0574001b73b00,261197590161,27c9fbc014b4f61526f0574001b73b00,27c9fbc014b4f61526f0574001b73b00)
(27c9fbc014b4f61526f0574001b73b00,261197590161,27c9fbc014b4f61526f0574001b73b00,27c9fbc014b4f61526f0574001b73b00)
(340da8c014a46272c0c8c830011c3bf0,221733319941,340da8c014a46272c0c8c830011c3bf0,340da8c014a46272c0c8c830011c3bf0)
(340da8c014a46272c0c8c830011c3bf0,181704048554,340da8c014a46272c0c8c830011c3bf0,340da8c014a46272c0c8c830011c3bf0)
(340da8c014a46272c0c8c830011c3bf0,231524481696,340da8c014a46272c0c8c830011c3bf0,340da8c014a46272c0c8c830011c3bf0)
(340da8c014a46272c0c8c830011c3bf0,271830464992,340da8c014a46272c0c8c830011c3bf0,340da8c014a46272c0c8c830011c3bf0)
(393938d71480a2aaf8e440d1fff709f4,141586046141,393938d71480a2aaf8e440d1fff709f4,393938d71480a2aaf8e440d1fff709f4)
(3a792a7c14c0a35882346c04fff4e236,161605492016,3a792a7c14c0a35882346c04fff4e236,3a792a7c14c0a35882346c04fff4e236)
(3a792a7c14c0a35882346c04fff4e236,161605492016,3a792a7c14c0a35882346c04fff4e236,3a792a7c14c0a35882346c04fff4e236)



The RDD is of type org.apache.spark.rdd.RDD[(Long,
 com.ebay.ep.poc.spark.reporting.process.detail.model.DetailInputRecord)]

At the time of context creation I did this:
val conf = new SparkConf()
  .setAppName(detail)
  .set("spark.serializer", "org.apache.spark.serializer.*KryoSerializer*
")
  .set("spark.kryoserializer.buffer.mb",
arguments.get("buffersize").get)
  .set("spark.kryoserializer.buffer.max.mb",
arguments.get("maxbuffersize").get)
  .set("spark.driver.maxResultSize", arguments.get("maxResultSize").get)
  .set("spark.yarn.maxAppAttempts", "1")

.registerKryoClasses(Array(classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum],

classOf[com.ebay.ep.poc.spark.reporting.process.detail.model.DetailInputRecord],

classOf[com.ebay.ep.poc.spark.reporting.process.detail.model.InputRecord],

classOf[com.ebay.ep.poc.spark.reporting.process.model.SessionRecord],
  classOf[com.ebay.ep.poc.spark.reporting.process.model.DataRecord],

classOf[com.ebay.ep.poc.spark.reporting.process.model.ExperimentationRecord]))

The class hierarchy is

DetailInputRecord extends InputRecord extends SessionRecord extends
ExperimentationRecord extends
   org.apache.avro.generic.GenericRecord.Record(schema: Schema)


Please suggest.







On Sat, Apr 11, 2015 at 4:50 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:

> I have two RDDs
>
> leftRDD = RDD[(Long, (DetailInputRecord, VISummary, Long))]
> and
> rightRDD =
> RDD[(Long, com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum)]
>
> DetailInputRecord is an object that contains (guid, sessionKey,
> sessionStartDate, siteId)
>
> There are 10 records in leftRDD (confirmed with leftRDD.count), and each
> DetailInputRecord in leftRDD has data in its members.
>
> I do leftRDD.leftOuterJoin(rightRDD)
>
> viEventsWithListings  = leftRDD
> spsLvlMetric   = rightRDD
>
> val viEventsWithListingsJoinSpsLevelMetric =
> viEventsWithListings.leftOuterJoin(spsLvlMetric).map  {
>   case (viJoinSpsLevelMetric) => {
> val (sellerId, ((viEventDetail, viSummary, itemId), spsLvlMetric))
> = viJoinSpsLevelMetric
>
> println("sellerId:" + sellerId)
> println("sessionKey:" + viEventDetail.get("sessionKey"))
> println("guid:" + viEventDetail.get("guid"))
> println("sessionStartDate:" +
> viEventDetail.get("sessionStartDate"))
> println("siteId:" + viEventDetail.get("siteId"))
>
> if (spsLvlMetric.isDefined) {
>
> // do something
>
>  }
> }
>
> I print  each of the items within the DetailInputRecord (viEventDetail) of
> viEventsWithListings before and w

Tasks going into NODE_LOCAL at beginning of job

2015-04-11 Thread Jeetendra Gangele
I have 3 transformations followed by a foreach; the job's tasks go into the
NODE_LOCAL locality level, the executors sit waiting for a long time, and no
task runs.

Regards,
Jeetendra


RE: HiveThriftServer2

2015-04-11 Thread Mohammed Guller
Thanks, Cheng.

BTW, there is another thread on the same topic. It looks like the thrift-server 
will be published for 1.3.1.
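
Once that happens, pulling it in should be a one-line build.sbt change
(assuming the artifact is published under the same naming scheme as the other
Spark modules; the version below is the one mentioned above, not yet verified):

libraryDependencies += "org.apache.spark" %% "spark-hive-thriftserver" % "1.3.1"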

Mohammed

From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: Saturday, April 11, 2015 5:37 AM
To: Mohammed Guller; user@spark.apache.org
Subject: Re: HiveThriftServer2

Unfortunately the spark-hive-thriftserver artifact hasn't been published yet;
you may either publish it locally or use it as an unmanaged SBT dependency.
On 4/8/15 8:58 AM, Mohammed Guller wrote:
Hi -

I want to create an instance of HiveThriftServer2 in my Scala application, so  
I imported the following line:

import org.apache.spark.sql.hive.thriftserver._

However, when I compile the code, I get the following error:

object thriftserver is not a member of package org.apache.spark.sql.hive

I tried to include the following in build.sbt, but it looks like it is not 
published:

"org.apache.spark" %% "spark-hive-thriftserver" % "1.3.0",

What library dependency do I need to include in my build.sbt to use the 
ThriftServer2 object?

Thanks,
Mohammed





Re: Yarn application state monitor thread dying on IOException

2015-04-11 Thread Steve Loughran

> On 10 Apr 2015, at 13:40, Lorenz Knies  wrote:
> 
> I would consider it a bug that the "Yarn application state monitor" thread
> dies on an exception that is, I think, even expected (at least in the Java
> methods called further down the stack).
> 
> What do you think? Is it a problem that we compiled against Hadoop 2.5?

Code path still exists in Hadoop 2.6, so no.

Looks more like YarnClientSchedulerBackend.asyncMonitorApplication should be 
catching the IOE (RM failure triggered) and retrying. 
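
Something along these lines, purely to illustrate the shape of the fix (the
names and numbers here are made up, not the actual Spark internals):

import java.io.IOException

// Poll the RM until the application finishes, retrying transient IOExceptions
// with a simple backoff instead of letting the monitor thread die.
def monitorWithRetry(pollOnce: () => Boolean, maxRetries: Int = 3): Unit = {
  var failures = 0
  var finished = false
  while (!finished && failures <= maxRetries) {
    try {
      finished = pollOnce() // true once the application reaches a final state
      failures = 0          // a successful poll resets the retry budget
    } catch {
      case _: IOException =>
        failures += 1       // RM briefly unreachable: back off and try again
        Thread.sleep(1000L * failures)
    }
  }
}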

Why not file a JIRA on it, in the SPARK project?

That said, loss of RM is pretty dramatic in a Hadoop cluster, because unless 
you have RM HA enabled, the restarted RM will have no idea what is running, the 
node managers will kill the processes spawned in the containers, etc etc. Your 
app was probably doomed anyway.

Re: How to use Joda Time with Spark SQL?

2015-04-11 Thread Cheng Lian
One possible approach is defining a UDT (user-defined type) for Joda time. 
A UDT maps an arbitrary type to and from Spark SQL data types. You may 
check ExamplePointUDT [1] for more details.


[1]: 
https://github.com/apache/spark/blob/694aef0d71d2683eaf63cbd1d8e95c2da423b72e/sql/core/src/main/scala/org/apache/spark/sql/test/ExamplePointUDT.scala
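
A rough sketch of what such a UDT could look like for Joda's DateTime,
modelled on ExamplePointUDT (UserDefinedType is a developer API, so the exact
method signatures may differ between Spark versions; this assumes the 1.x
shape):

import org.apache.spark.sql.types._
import org.joda.time.DateTime

// Stores a DateTime as epoch milliseconds in a LongType column and rebuilds
// the DateTime on the way out.
class DateTimeUDT extends UserDefinedType[DateTime] {
  override def sqlType: DataType = LongType

  override def serialize(obj: Any): Any = obj match {
    case dt: DateTime => dt.getMillis
  }

  override def deserialize(datum: Any): DateTime = datum match {
    case millis: Long => new DateTime(millis)
  }

  override def userClass: Class[DateTime] = classOf[DateTime]
}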


On 4/8/15 6:09 AM, adamgerst wrote:

I've been using Joda Time in all my Spark jobs (via the nscala-time
package) and had not run into any issues until I started trying to use
Spark SQL. When I try to convert a case class that has a
com.github.nscala_time.time.Imports.DateTime object in it, an exception is
thrown with a MatchError.

My assumption is that this is because the basic types of Spark SQL are
java.sql.Timestamp and java.sql.Date, and therefore Spark doesn't know what
to do with the DateTime value.

How can I get around this? I would prefer not to have to change my code to
make the values be Timestamps but I'm concerned that might be the only way.
Would something like implicit conversions work here?

It seems that even if I specify the schema manually I would still have the
issue, since you have to specify the column type, which has to be an
org.apache.spark.sql.types.DataType.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-Joda-Time-with-Spark-SQL-tp22415.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: HiveThriftServer2

2015-04-11 Thread Cheng Lian
Unfortunately the spark-hive-thriftserver artifact hasn't been published yet; 
you may either publish it locally or use it as an unmanaged SBT dependency.


On 4/8/15 8:58 AM, Mohammed Guller wrote:


Hi –

I want to create an instance of HiveThriftServer2 in my Scala 
application, so  I imported the following line:


import org.apache.spark.sql.hive.thriftserver._

However, when I compile the code, I get the following error:

object thriftserver is not a member of package org.apache.spark.sql.hive

I tried to include the following in build.sbt, but it looks like it is 
not published:


"org.apache.spark" %% "spark-hive-thriftserver" % "1.3.0",

What library dependency do I need to include in my build.sbt to use 
the ThriftServer2 object?


Thanks,

Mohammed





Re: Spark SQL or rules hot reload

2015-04-11 Thread Cheng Lian
What do you mean by "rules"? Spark SQL optimization rules? Currently 
these are entirely private to Spark SQL, and are not configurable during 
runtime.


Cheng

On 4/10/15 2:55 PM, Bruce Dou wrote:

Hi,

How do you manage the life cycle of Spark SQL and of the rules applied to
the data stream? Can rules be enabled or disabled without deploying a new
jar and code? What is the best practice?


Thanks. 



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Microsoft SQL jdbc support from spark sql

2015-04-11 Thread Cheng Lian
Your first DDL should be correct (as long as the JDBC URL is correct). 
The string after USING should be the data source name 
("org.apache.spark.sql.jdbc" or simply "jdbc").


The SQLException here indicates that Spark SQL couldn't find the SQL Server 
JDBC driver on the classpath.


As Denny said, currently the JDBC data source is not extensible and 
only provides good support for MySQL and PostgreSQL, but I'd expect that 
simple standard SQL queries should work.
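
For example, something along these lines should at least get past the driver
resolution error (a sketch only: the URL is copied from the message below and
not verified, and the jar path is a placeholder):

// Launch the shell with the SQL Server driver jar on the classpath, e.g.
//   ./bin/spark-shell --driver-class-path /path/to/sqljdbc4.jar --jars /path/to/sqljdbc4.jar
// Then the first DDL from the message below, issued here via sqlContext.sql
// (the "\;" escapes are not needed inside a Scala string), should be able to
// load the driver:
sqlContext.sql(
  """
    |CREATE TEMPORARY TABLE c
    |USING org.apache.spark.sql.jdbc
    |OPTIONS (
    |  url "jdbc:sqlserver://10.1.0.12:1433;databaseName=dbname",
    |  dbtable "Customer"
    |)
  """.stripMargin)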


Cheng

On 4/7/15 1:40 PM, bipin wrote:

Hi, I am trying to pull data from MS SQL Server. I have tried using the
spark.sql.jdbc data source:

CREATE TEMPORARY TABLE c
USING org.apache.spark.sql.jdbc
OPTIONS (
url "jdbc:sqlserver://10.1.0.12:1433\;databaseName=dbname\;",
dbtable "Customer"
);

But it shows java.sql.SQLException: No suitable driver found for
jdbc:sqlserver

I have the JDBC drivers for MS SQL, but I am not sure how to use them. I
provided the jars to the SQL shell and then tried the following:

CREATE TEMPORARY TABLE c
USING com.microsoft.sqlserver.jdbc.SQLServerDriver
OPTIONS (
url "jdbc:sqlserver://10.1.0.12:1433\;databaseName=dbname\;",
dbtable "Customer"
);

But this gives ERROR CliDriver: scala.MatchError: SQLServerDriver:4 (of
class com.microsoft.sqlserver.jdbc.SQLServerDriver)

Can anyone tell me the proper way to connect to MS SQL Server?
Thanks






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Microsoft-SQL-jdbc-support-from-spark-sql-tp22399.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

.




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Unusual behavior with leftouterjoin

2015-04-11 Thread ๏̯͡๏
I have two RDDs

leftRDD = RDD[(Long, (DetailInputRecord, VISummary, Long))]
and
rightRDD =
RDD[(Long, com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum)]

DetailInputRecord is an object that contains (guid, sessionKey,
sessionStartDate, siteId)

There are 10 records in leftRDD (confirmed with leftRDD.count), and each
DetailInputRecord in leftRDD has data in its members.

I do leftRDD.leftOuterJoin(rightRDD)

viEventsWithListings  = leftRDD
spsLvlMetric   = rightRDD

val viEventsWithListingsJoinSpsLevelMetric =
viEventsWithListings.leftOuterJoin(spsLvlMetric).map  {
  case (viJoinSpsLevelMetric) => {
val (sellerId, ((viEventDetail, viSummary, itemId), spsLvlMetric))
= viJoinSpsLevelMetric

println("sellerId:" + sellerId)
println("sessionKey:" + viEventDetail.get("sessionKey"))
println("guid:" + viEventDetail.get("guid"))
println("sessionStartDate:" + viEventDetail.get("sessionStartDate"))
println("siteId:" + viEventDetail.get("siteId"))

if (spsLvlMetric.isDefined) {

// do something

 }
}

I print each of the fields of the DetailInputRecord (viEventDetail) in
viEventsWithListings both before and within the leftOuterJoin. Before the
leftOuterJoin I get the correct value for each member of the record (10
records in total).

Within the join, when I do the print, I get only the guid as the value for
all members. How is this possible?

Within the join (print statements; these are all guids):
sessionKey:27c9fbc014b4f61526f0574001b73b00
guid:27c9fbc014b4f61526f0574001b73b00
sessionStartDate:27c9fbc014b4f61526f0574001b73b00
siteId:27c9fbc014b4f61526f0574001b73b00

What went wrong? I have debugged this multiple times but fail to understand
the reason.
I'd appreciate your help.
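
One workaround I am considering (a sketch only, not a root-cause fix) is to
project the fields I need out of the Avro record into plain Scala types
before the join, so that only simple values cross the serialization boundary:

// ViDetail is a hypothetical helper case class; viEventsWithListings and
// spsLvlMetric are the same RDDs used above.
case class ViDetail(guid: String, sessionKey: String,
                    sessionStartDate: String, siteId: String)

val leftProjected = viEventsWithListings.map {
  case (sellerId, (detail, viSummary, itemId)) =>
    (sellerId, (ViDetail(
        String.valueOf(detail.get("guid")),
        String.valueOf(detail.get("sessionKey")),
        String.valueOf(detail.get("sessionStartDate")),
        String.valueOf(detail.get("siteId"))), viSummary, itemId))
}

val joined = leftProjected.leftOuterJoin(spsLvlMetric)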
-- 
Deepak