Re: rpc.RpcTimeoutException: Futures timed out after [120 seconds]

2016-05-20 Thread Sahil Sareen
I'm not sure whether this happens on small files or big ones, as I always
have a mix of them.
Did you see this only for big files?
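
For now I'm considering just raising the relevant timeouts while I look at
the GC angle. A minimal sketch (the 300s values are arbitrary, and this only
buys headroom rather than fixing the underlying GC/load problem):

// Sketch only: illustrative values, not a fix for the root cause.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.rpc.askTimeout", "300s")    // the timeout named in the error message
  .set("spark.network.timeout", "300s")   // default for several related network timeouts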

On Fri, May 20, 2016 at 7:36 PM, Mail.com <pradeep.mi...@mail.com> wrote:

> Hi Sahil,
>
> I have seen this with high GC time. Do you ever get this error with
> small-volume files?
>
> Pradeep
>
> On May 20, 2016, at 9:32 AM, Sahil Sareen <sareen...@gmail.com> wrote:
>
> Hey all
>
> I'm using Spark 1.6.1 and occasionally seeing executors get lost, which hurts
> my application's performance due to these errors.
> Can someone please list all the possible problems that could cause this?
>
>
> Full log:
>
> 16/05/19 02:17:54 ERROR ContextCleaner: Error cleaning broadcast 266685
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120
> seconds]. This timeout is controlled by spark.rpc.askTimeout
> at org.apache.spark.rpc.RpcTimeout.org
> $apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcEnv.scala:214)
> at
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:229)
> at
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:225)
> at
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:242)
> at
> org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:136)
> at
> org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:228)
> at
> org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
> at
> org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:67)
> at
> org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:214)
> at
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:170)
> at
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:161)
> at scala.Option.foreach(Option.scala:257)
> at
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:161)
> at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
> at org.apache.spark.ContextCleaner.org
> $apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:154)
> at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:67)
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
> [120 seconds]
> at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
> at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
> at
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:190)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:241)
> ... 12 more
> 16/05/19 02:18:26 ERROR TaskSchedulerImpl: Lost executor
> 20160421-192532-1677787146-5050-40596-S23 on ip-10-0-1-70.ec2.internal:
> Executor heartbeat timed out after 161447 ms
> 16/05/19 02:18:53 ERROR TaskSchedulerImpl: Lost executor
> 20160421-192532-1677787146-5050-40596-S23 on ip-10-0-1-70.ec2.internal:
> remote Rpc client disassociated
>
>
>
> Thanks
> Sahil
>
>


rpc.RpcTimeoutException: Futures timed out after [120 seconds]

2016-05-20 Thread Sahil Sareen
Hey all

I'm using Spark 1.6.1 and occasionally seeing executors get lost, which hurts my
application's performance due to these errors.
Can someone please list all the possible problems that could cause this?


Full log:

16/05/19 02:17:54 ERROR ContextCleaner: Error cleaning broadcast 266685
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120
seconds]. This timeout is controlled by spark.rpc.askTimeout
at org.apache.spark.rpc.RpcTimeout.org
$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcEnv.scala:214)
at
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:229)
at
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:225)
at
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:242)
at
org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:136)
at
org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:228)
at
org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
at
org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:67)
at
org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:214)
at
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:170)
at
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:161)
at scala.Option.foreach(Option.scala:257)
at
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:161)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
at org.apache.spark.ContextCleaner.org
$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:154)
at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:67)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after
[120 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:241)
... 12 more
16/05/19 02:18:26 ERROR TaskSchedulerImpl: Lost executor
20160421-192532-1677787146-5050-40596-S23 on ip-10-0-1-70.ec2.internal:
Executor heartbeat timed out after 161447 ms
16/05/19 02:18:53 ERROR TaskSchedulerImpl: Lost executor
20160421-192532-1677787146-5050-40596-S23 on ip-10-0-1-70.ec2.internal:
remote Rpc client disassociated



Thanks
Sahil


Re: GraphX can show graph?

2016-01-28 Thread Sahil Sareen
Try Neo4j for visualization; GraphX does a pretty good job at distributed
graph processing.

On Thu, Jan 28, 2016 at 12:42 PM, Balachandar R.A.  wrote:

> Hi
>
> I am new to GraphX. I have a simple CSV file which I could load and
> compute a few graph statistics on. However, I am not sure whether it is possible
> to create and show a graph (for visualization purposes) using GraphX. Any
> pointer to a tutorial or information related to this would be really helpful
>
> Thanks and regards
> Bala
>


Neo4j and Spark/GraphX

2016-01-27 Thread Sahil Sareen
Hey everyone!

I'm using Spark and GraphX for graph processing and wish to export a
subgraph to Neo4j (from the spark-submit console) for visualisation and the
basic graph querying that Neo4j supports.
I looked at the Mazerunner project but it seems to be overkill. Any
alternatives?
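
The simplest fallback I can think of is dumping the subgraph to CSV and then
pulling that into Neo4j with LOAD CSV, but that loses any live connection.
A rough sketch (paths and attribute formatting are just placeholders):

// Sketch: write vertices and edges of a GraphX subgraph as CSV parts that
// Neo4j's LOAD CSV (or neo4j-import) can consume.
subgraph.vertices
  .map { case (id, attr) => s"$id,$attr" }
  .saveAsTextFile("/tmp/export/nodes")
subgraph.edges
  .map(e => s"${e.srcId},${e.dstId},${e.attr}")
  .saveAsTextFile("/tmp/export/edges")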

-Sahil


Re: JSON to SQL

2016-01-27 Thread Sahil Sareen
Isn't this just a matter of defining a case class and using
parse(json).extract[CaseClassName] with Jackson?
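
A minimal sketch of what I mean, using json4s with its Jackson backend (the
case class shape and field names are just illustrative):

import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Mirror whatever fields the incoming JSON actually has.
case class Record(a: Int, b: List[Int], c: String, d: List[Int])

implicit val formats = DefaultFormats

val json = """{"a":1,"b":[2,3],"c":"Field","d":[4,5,6,7,8]}"""
val record = parse(json).extract[Record]  // Record(1,List(2, 3),Field,List(4, 5, 6, 7, 8))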

-Sahil

On Wed, Jan 27, 2016 at 11:08 PM, Andrés Ivaldi  wrote:

> We don't have domain objects; it's a service like a pipeline: data is read
> from the source and saved into a relational database
>
> I can read the structure from DataFrames and do some transformations. I
> would prefer to do it with Spark to be consistent with the rest of the process
>
>
>
> On Wed, Jan 27, 2016 at 12:25 PM, Al Pivonka  wrote:
>
>> Are you using a relational database?
>> If so, why not use a NoSQL DB first and then pull from it into your relational store?
>>
>> Or utilize a library that understands JSON structure, like Jackson, to
>> obtain the data from the JSON and then persist the domain objects?
>>
>> On Wed, Jan 27, 2016 at 9:45 AM, Andrés Ivaldi 
>> wrote:
>>
>>> Sure,
>>> The job is like an ETL, but without an interface, so I decide the rules for
>>> how the JSON will be saved into a SQL table.
>>>
>>> I need to flatten the hierarchies where possible (in the case of lists,
>>> flatten them as well); nested objects won't be processed for now.
>>>
>>> {"a":1,"b":[2,3],"c":"Field", "d":[4,5,6,7,8] }
>>> {"a":11,"b":[22,33],"c":"Field1", "d":[44,55,66,77,88] }
>>> {"a":111,"b":[222,333],"c":"Field2", "d":[44,55,666,777,888] }
>>>
>>> I would like something like this on my SQL table
>>>
>>> a     b        c       d
>>> 1     2,3      Field   4,5,6,7,8
>>> 11    22,33    Field1  44,55,66,77,88
>>> 111   222,333  Field2  444,555,666,777,888
>>>
>>> Right now this is what I need.
>>>
>>> I will later add more intelligence, like detecting lists or nested
>>> objects and creating relations in other tables.
>>>
>>>
>>>
>>> On Wed, Jan 27, 2016 at 11:25 AM, Al Pivonka 
>>> wrote:
>>>
 More detail is needed.
 Can you provide some context for the use case?

 On Wed, Jan 27, 2016 at 8:33 AM, Andrés Ivaldi 
 wrote:

> Hello, I'm trying to save a JSON file into a SQL table.
>
> If I try to do this directly, an IllegalArgumentException is raised; I
> suppose this is because JSON has a hierarchical structure, is that
> correct?
>
> If that is the problem, how can I flatten the JSON structure? The JSON
> structure to be processed would be unknown, so I need to do it
> programmatically
>
> regards
>
> --
> Ing. Ivaldi Andres
>



 --
 Those who say it can't be done, are usually interrupted by those doing
 it.

>>>
>>>
>>>
>>> --
>>> Ing. Ivaldi Andres
>>>
>>
>>
>>
>> --
>> Those who say it can't be done, are usually interrupted by those doing it.
>>
>
>
>
> --
> Ing. Ivaldi Andres
>


Difference between DataFrame.cache() and hiveContext.cacheTable()?

2015-12-18 Thread Sahil Sareen
Is there any difference between the following snippets:

val df = hiveContext.createDataFrame(rows, schema)
df.registerTempTable("myTable")
df.cache()

and

val df = hiveContext.createDataFrame(rows, schema)
df.registerTempTable("myTable")
hiveContext.cacheTable("myTable")
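
Not an answer to my own question, but a quick way to inspect this either way;
a small sketch (isCached is on SQLContext/HiveContext in 1.x, and rows/schema
are the same as above):

val df = hiveContext.createDataFrame(rows, schema)
df.registerTempTable("myTable")
hiveContext.cacheTable("myTable")
println(hiveContext.isCached("myTable"))  // reports whether the in-memory columnar cache holds the table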


-Sahil




Does calling sqlContext.cacheTable("oldTableName") remove the cached contents of the oldTable

2015-12-18 Thread Sahil Sareen
Spark 1.5.2

dfOld.registerTempTable("oldTableName")
sqlContext.cacheTable("oldTableName")
// 
// do something
// 
dfNew.registerTempTable("oldTableName")
sqlContext.cacheTable("oldTableName")


Now when I use the "oldTableName" table I do get the latest contents
from dfNew but do the contents of dfOld get removed from the memory?

Or is the right usage to do this:
dfOld.registerTempTable("oldTableName")
sqlContext.cacheTable("oldTableName")
// 
// do something
// 
dfNew.registerTempTable("oldTableName")
sqlContext.uncacheTable("oldTableName") <== uncache the old
contents first
sqlContext.cacheTable("oldTableName")

-Sahil




Re: Does calling sqlContext.cacheTable("oldTableName") remove the cached contents of the oldTable

2015-12-18 Thread Sahil Sareen
Thanks Ted!

Yes, The schema might be different or the same.
What would be the answer for each situation?

On Fri, Dec 18, 2015 at 6:02 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> CacheManager#cacheQuery() is called where:
>   * Caches the data produced by the logical representation of the given
> [[Queryable]].
> ...
> val planToCache = query.queryExecution.analyzed
> if (lookupCachedData(planToCache).nonEmpty) {
>
> Is the schema for dfNew different from that of dfOld ?
>
> Cheers
>
> On Fri, Dec 18, 2015 at 3:33 AM, Sahil Sareen <sareen...@gmail.com> wrote:
>
>> Spark 1.5.2
>>
>> dfOld.registerTempTable("oldTableName")
>> sqlContext.cacheTable("oldTableName")
>> // 
>> // do something
>> // 
>> dfNew.registerTempTable("oldTableName")
>> sqlContext.cacheTable("oldTableName")
>>
>>
>> Now when I use the "oldTableName" table I do get the latest contents
>> from dfNew but do the contents of dfOld get removed from the memory?
>>
>> Or is the right usage to do this:
>> dfOld.registerTempTable("oldTableName")
>> sqlContext.cacheTable("oldTableName")
>> // 
>> // do something
>> // 
>> dfNew.registerTempTable("oldTableName")
>> sqlContext.uncacheTable("oldTableName") <== uncache the old
>> contents first
>> sqlContext.cacheTable("oldTableName")
>>
>> -Sahil
>>
>>
>>
>


Re: Does calling sqlContext.cacheTable("oldTableName") remove the cached contents of the oldTable

2015-12-18 Thread Sahil Sareen
So I looked at the function; my only worry is whether the cache gets
cleared when I overwrite it with the same table name. I ran the experiment
below, and isCached reports the table as not cached, but I want to confirm
this: besides the old table's values no longer being used, are they actually
removed/overwritten in memory?

scala> df.collect
res54: Array[org.apache.spark.sql.Row] = Array([blue,#0033FF],
[red,#FF], [green,#FSKA])  <=== 3 rows

scala> df2.collect
res55: Array[org.apache.spark.sql.Row] = Array([blue,#0033FF],
[red,#FF])  <=== 2 rows

scala> df.registerTempTable("myColorsTable")

scala> sqlContext.isCached("myColorsTable")
res58: Boolean = false

scala> sqlContext.cacheTable("myColorsTable") <=== cache table in df(3 rows)

scala> sqlContext.isCached("myColorsTable")
res60: Boolean = true

scala> sqlContext.sql("select * from myColorsTable").foreach(println)
<=== sql is running on df(3 rows)
[blue,#0033FF]
[red,#FF]
[green,#FSKA]

scala> df2.registerTempTable("myColorsTable") <=== register another
table with the same table name
scala> sqlContext.isCached("myColorsTable")
res63: Boolean = false

scala> sqlContext.sql("select * from myColorsTable").foreach(println)
<=== sql is running on df2(2 rows)
[blue,#0033FF]
[red,#FF]
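
For completeness, the explicit version of what I'd expect to have to do (a
sketch against the Spark 1.5 SQLContext API):

sqlContext.uncacheTable("myColorsTable")   // drop the cached blocks for the old df
df2.registerTempTable("myColorsTable")     // rebind the temp table name
sqlContext.cacheTable("myColorsTable")     // cache the new contents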


On Fri, Dec 18, 2015 at 11:17 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> This method in CacheManager:
>   private[sql] def lookupCachedData(plan: LogicalPlan): Option[CachedData]
> = readLock {
> cachedData.find(cd => plan.sameResult(cd.plan))
>
> led me to the following in
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
> :
>
>   def sameResult(plan: LogicalPlan): Boolean = {
>
> There is detailed comment above this method which should give some idea.
>
> Cheers
>
> On Fri, Dec 18, 2015 at 9:21 AM, Sahil Sareen <sareen...@gmail.com> wrote:
>
>> Thanks Ted!
>>
>> Yes, The schema might be different or the same.
>> What would be the answer for each situation?
>>
>> On Fri, Dec 18, 2015 at 6:02 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> CacheManager#cacheQuery() is called where:
>>>   * Caches the data produced by the logical representation of the given
>>> [[Queryable]].
>>> ...
>>> val planToCache = query.queryExecution.analyzed
>>> if (lookupCachedData(planToCache).nonEmpty) {
>>>
>>> Is the schema for dfNew different from that of dfOld ?
>>>
>>> Cheers
>>>
>>> On Fri, Dec 18, 2015 at 3:33 AM, Sahil Sareen <sareen...@gmail.com>
>>> wrote:
>>>
>>>> Spark 1.5.2
>>>>
>>>> dfOld.registerTempTable("oldTableName")
>>>> sqlContext.cacheTable("oldTableName")
>>>> // 
>>>> // do something
>>>> // 
>>>> dfNew.registerTempTable("oldTableName")
>>>> sqlContext.cacheTable("oldTableName")
>>>>
>>>>
>>>> Now when I use the "oldTableName" table I do get the latest contents
>>>> from dfNew but do the contents of dfOld get removed from the memory?
>>>>
>>>> Or is the right usage to do this:
>>>> dfOld.registerTempTable("oldTableName")
>>>> sqlContext.cacheTable("oldTableName")
>>>> // 
>>>> // do something
>>>> // 
>>>> dfNew.registerTempTable("oldTableName")
>>>> sqlContext.uncacheTable("oldTableName") <== uncache the old
>>>> contents first
>>>> sqlContext.cacheTable("oldTableName")
>>>>
>>>> -Sahil
>>>>
>>>>
>>>>
>>>
>>
>


Re: Does calling sqlContext.cacheTable("oldTableName") remove the cached contents of the oldTable

2015-12-18 Thread Sahil Sareen
From the UI I see two rows for this in a streaming application:

RDD Name                       Storage Level                      Cached Partitions  Fraction Cached  Size in Memory  Size in ExternalBlockStore  Size on Disk
In-memory table myColorsTable  Memory Deserialized 1x Replicated  2                  100%             728.2 KB        0.0 B                       0.0 B
In-memory table myColorsTable  Memory Deserialized 1x Replicated  2                  100%             728.2 KB        0.0 B                       0.0 B

This means it wasn't overwritten :(
My question now is, if only the latest table is going to be used, why isn't
the earlier version auto cleared?

On Fri, Dec 18, 2015 at 11:44 PM, Sahil Sareen <sareen...@gmail.com> wrote:

> So I looked at the function, my only worry is that the cache should be
> cleared if I'm overwriting the cache with the same table name. I did this
> experiment and the cache shows as table not cached but want to confirm
> this. In addition to not using the old table values is it actually
> removed/overwritten in memory?
>
> scala> df.collect
> res54: Array[org.apache.spark.sql.Row] = Array([blue,#0033FF], [red,#FF], 
> [green,#FSKA])  <=== 3 rows
>
> scala> df2.collect
> res55: Array[org.apache.spark.sql.Row] = Array([blue,#0033FF], [red,#FF]) 
>  <=== 2 rows
>
> scala> df.registerTempTable("myColorsTable")
>
> scala> sqlContext.isCached("myColorsTable")
> res58: Boolean = false
>
> scala> sqlContext.cacheTable("myColorsTable") <=== cache table in df(3 rows)
>
> scala> sqlContext.isCached("myColorsTable")
> res60: Boolean = true
>
> scala> sqlContext.sql("select * from myColorsTable").foreach(println) <=== 
> sql is running on df(3 rows)
> [blue,#0033FF]
> [red,#FF]
> [green,#FSKA]
>
> scala> df2.registerTempTable("myColorsTable") <=== register another table 
> with the same table name
> scala> sqlContext.isCached("myColorsTable")
> res63: Boolean = false
>
> scala> sqlContext.sql("select * from myColorsTable").foreach(println) <=== 
> sql is running on df2(2 rows)
> [blue,#0033FF]
> [red,#FF]
>
>
> On Fri, Dec 18, 2015 at 11:17 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> This method in CacheManager:
>>   private[sql] def lookupCachedData(plan: LogicalPlan):
>> Option[CachedData] = readLock {
>> cachedData.find(cd => plan.sameResult(cd.plan))
>>
>> led me to the following in
>> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
>> :
>>
>>   def sameResult(plan: LogicalPlan): Boolean = {
>>
>> There is detailed comment above this method which should give some idea.
>>
>> Cheers
>>
>> On Fri, Dec 18, 2015 at 9:21 AM, Sahil Sareen <sareen...@gmail.com>
>> wrote:
>>
>>> Thanks Ted!
>>>
>>> Yes, The schema might be different or the same.
>>> What would be the answer for each situation?
>>>
>>> On Fri, Dec 18, 2015 at 6:02 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> CacheManager#cacheQuery() is called where:
>>>>   * Caches the data produced by the logical representation of the given
>>>> [[Queryable]].
>>>> ...
>>>> val planToCache = query.queryExecution.analyzed
>>>> if (lookupCachedData(planToCache).nonEmpty) {
>>>>
>>>> Is the schema for dfNew different from that of dfOld ?
>>>>
>>>> Cheers
>>>>
>>>> On Fri, Dec 18, 2015 at 3:33 AM, Sahil Sareen <sareen...@gmail.com>
>>>> wrote:
>>>>
>>>>> Spark 1.5.2
>>>>>
>>>>> dfOld.registerTempTable("oldTableName")
>>>>> sqlContext.cacheTable("oldTableName")
>>>>> // 
>>>>> // do something
>>>>> // 
>>>>> dfNew.registerTempTable("oldTableName")
>>>>> sqlContext.cacheTable("oldTableName")
>>>>>
>>>>>
>>>>> Now when I use the "oldTableName" table I do get the latest contents
>>>>> from dfNew but do the contents of dfOld get removed from the memory?
>>>>>
>>>>> Or is the right usage to do this:
>>>>> dfOld.registerTempTable("oldTableName")
>>>>> sqlContext.cacheTable("oldTableName")
>>>>> // 
>>>>> // do something
>>>>> // 
>>>>> dfNew.registerTempTable("oldTableName")
>>>>> sqlContext.uncacheTable("oldTableName") <== uncache the old
>>>>> contents first
>>>>> sqlContext.cacheTable("oldTableName")
>>>>>
>>>>> -Sahil
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


Using TestHiveContext/HiveContext in unit tests

2015-12-11 Thread Sahil Sareen
I'm trying to do this in unit tests:

val sConf = new SparkConf()
  .setAppName("RandomAppName")
  .setMaster("local")
val sc = new SparkContext(sConf)
val sqlContext = new TestHiveContext(sc)  // tried new HiveContext(sc)
as well


But I get this:

[scalatest] Exception encountered when invoking run on a nested suite -
java.lang.RuntimeException: Unable to instantiate
org.apache.hadoop.hive.metastore.HiveMetaStoreClient *** ABORTED 

[scalatest]   java.lang.RuntimeException: java.lang.RuntimeException:
Unable to instantiate
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
[scalatest]   at
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:346)
[scalatest]   at
org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:120)
[scalatest]   at
org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:163)
[scalatest]   at
org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:161)
[scalatest]   at
org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:168)
[scalatest]   at
org.apache.spark.sql.hive.test.TestHiveContext.<init>(TestHive.scala:72)
[scalatest]   at mypackage.NewHiveTest.beforeAll(NewHiveTest.scala:48)
[scalatest]   at
org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
[scalatest]   at mypackage.NewHiveTest.beforeAll(NewHiveTest.scala:35)
[scalatest]   at
org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
[scalatest]   at mypackage.NewHiveTest.run(NewHiveTest.scala:35)
[scalatest]   at
org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1491)

The code works perfectly when I run using spark-submit, but not in unit
tests. Any inputs??
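
For reference, the full shape of the test is below; a minimal, self-contained
sketch (assumes ScalaTest on the test classpath; the suite and test names are
made up):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.test.TestHiveContext
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class NewHiveTest extends FunSuite with BeforeAndAfterAll {

  @transient private var sc: SparkContext = _
  @transient private var sqlContext: TestHiveContext = _

  override def beforeAll(): Unit = {
    val sConf = new SparkConf()
      .setAppName("RandomAppName")
      .setMaster("local")
    sc = new SparkContext(sConf)
    sqlContext = new TestHiveContext(sc)  // this is the line that blows up
  }

  override def afterAll(): Unit = {
    if (sc != null) sc.stop()             // also releases the embedded Derby metastore
  }

  test("hive context comes up") {
    sqlContext.sql("SHOW TABLES").collect()  // just prove the metastore is usable
  }
}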

-Sahil


Re: Column Aliases are Ignored in callUDF while using struct()

2015-12-03 Thread Sahil Sareen
Attaching the JIRA as well for completeness:
https://issues.apache.org/jira/browse/SPARK-12117

On Thu, Dec 3, 2015 at 4:13 PM, Sachin Aggarwal 
wrote:

>
> Hi All,
>
> Need help, guys: I need a workaround for this situation.
>
> *case where this works:*
>
> val TestDoc1 = sqlContext.createDataFrame(Seq(("sachin aggarwal", "1"),
> ("Rishabh", "2"))).toDF("myText", "id")
>
> TestDoc1.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show
>
>
> steps to reproduce error case:
> 1) create a file copy following text--filename(a.json)
>
> { "myText": "Sachin Aggarwal", "id": "1"}
> { "myText": "Rishabh", "id": "2"}
>
> 2) define a simple UDF
> def mydef(r:Row)={println(r.schema); r.getAs("Text").asInstanceOf[String]}
>
> 3) register the udf
>  sqlContext.udf.register("mydef" ,mydef _)
>
> 4) read the input file
> val TestDoc2=sqlContext.read.json("/tmp/a.json")
>
> 5) make a call to UDF
>
> TestDoc2.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show
>
> ERROR received:
> java.lang.IllegalArgumentException: Field "Text" does not exist.
>  at
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
>  at
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
>  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>  at scala.collection.AbstractMap.getOrElse(Map.scala:58)
>  at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:233)
>  at
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.fieldIndex(rows.scala:212)
>  at org.apache.spark.sql.Row$class.getAs(Row.scala:325)
>  at
> org.apache.spark.sql.catalyst.expressions.GenericRow.getAs(rows.scala:191)
>  at
> $line414.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.mydef(:107)
>  at
> $line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
>  at
> $line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
>  at
> org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:75)
>  at
> org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:74)
>  at
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:964)
>  at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
> Source)
>  at
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:55)
>  at
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:53)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>  at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>  at
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>  at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>  at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>  at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>  at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>  at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>  at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>  at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
>  at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
>  at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
>  at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>  at org.apache.spark.scheduler.Task.run(Task.scala:88)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>  at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1177)
>  at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
>  at java.lang.Thread.run(Thread.java:857)
>
> --
>
> Thanks & Regards
>
> Sachin Aggarwal
> 7760502772
>


Re: spark1.4.1 extremely slow for take(1) or head() or first() or show

2015-12-03 Thread Sahil Sareen
"select  'uid',max(length(uid)),count(distinct(uid)),count(uid),sum(case
when uid is null then 0 else 1 end),sum(case when uid is null then 1 else 0
end),sum(case when uid is null then 1 else 0 end)/count(uid) from tb"

Is this as is, or did you use a UDF here?
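
Also, one thing worth ruling out (a sketch, reusing the df/sqlContext from the
snippet below): cache() is lazy, so the first action on the DataFrame pays the
full scan-and-cache cost.

df.cache()
df.count()  // first action: scans the Hive table and fills the cache
// ... now time b.show separately, without the initial caching cost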

-Sahil

On Thu, Dec 3, 2015 at 4:06 PM, Mich Talebzadeh  wrote:

> Can you try running it directly on Hive to see the timing, or through
> spark-sql maybe?
>
>
>
> Spark does what Hive does, that is, processing large sets of data, but it
> attempts to do the intermediate iterations in memory if it can (i.e., if
> there is enough memory available to keep the data set in memory); otherwise
> it will have to use disk space. So it boils down to how much memory you
> have.
>
>
>
> HTH
>
>
>
> Mich Talebzadeh
>
>
>
>
>
> *From:* hxw黄祥为 [mailto:huang...@ctrip.com]
> *Sent:* 03 December 2015 10:29
> *To:* user@spark.apache.org
> *Subject:* spark1.4.1 extremely slow for take(1) or head() or first() or
> show
>
>
>
> Dear All,
>
>
>
> I have a Hive table with 100 million rows, and I just ran some very simple
> operations on this dataset, like:
>
>
>
>   val df = sqlContext.sql("select * from user ").toDF
>
>   df.cache
>
>   df.registerTempTable("tb")
>
>   val b=sqlContext.sql("select
> 'uid',max(length(uid)),count(distinct(uid)),count(uid),sum(case when uid is
> null then 0 else 1 end),sum(case when uid is null then 1 else 0
> end),sum(case when uid is null then 1 else 0 end)/count(uid) from tb")
>
>   b.show  // the result is just one line, but this step is extremely slow
>
>
>
> Is this expected? Why is show so slow for a DataFrame? Is it a bug in the
> optimizer, or did I do something wrong?
>
>
>
>
>
> Best Regards,
>
> tylor
>


Re: spark sql cli query results written to file ?

2015-12-02 Thread Sahil Sareen
Yeah, that's the example from the link I just posted.

-Sahil

On Thu, Dec 3, 2015 at 11:41 AM, Akhil Das 
wrote:

> Something like this?
>
> val df = sqlContext.read.load("examples/src/main/resources/users.parquet")
> df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
>
>
> It will save the name, favorite_color columns to a parquet file. You can
> read more information over here
> http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
>
>
>
> Thanks
> Best Regards
>
> On Thu, Dec 3, 2015 at 11:35 AM, fightf...@163.com 
> wrote:
>
>> Hi,
>> How could I save the results of queries run from the Spark SQL CLI and write
>> them to some local file?
>> Is there any available command?
>>
>> Thanks,
>> Sun.
>>
>> --
>> fightf...@163.com
>>
>
>


Re: Improve saveAsTextFile performance

2015-12-02 Thread Sahil Sareen
PTAL:
http://stackoverflow.com/questions/29213404/how-to-split-an-rdd-into-multiple-smaller-rdds-given-a-max-number-of-rows-per
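
If the imbalance comes from how the data is currently partitioned, an explicit
repartition before the save is the blunt version of what that answer describes;
a sketch (the RDD name, partition count, and output path are placeholders):

// repartition() forces a full shuffle that spreads records roughly evenly
// across the new partitions, at the cost of that extra shuffle.
val balanced = myRdd.repartition(100)
balanced.saveAsTextFile("hdfs:///tmp/output")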

-Sahil

On Thu, Dec 3, 2015 at 9:18 AM, Ram VISWANADHA <
ram.viswana...@dailymotion.com> wrote:

> Yes. That did not help.
>
> Best Regards,
> Ram
> From: Ted Yu 
> Date: Wednesday, December 2, 2015 at 3:25 PM
> To: Ram VISWANADHA 
> Cc: user 
> Subject: Re: Improve saveAsTextFile performance
>
> Have you tried calling coalesce() before saveAsTextFile ?
>
> Cheers
>
> On Wed, Dec 2, 2015 at 3:15 PM, Ram VISWANADHA <
> ram.viswana...@dailymotion.com> wrote:
>
>> JavaRDD.saveAsTextFile is taking a long time to succeed. There are 10
>> tasks; the first 9 complete in a reasonable time, but the last task is
>> taking a long time to complete. The last task contains the maximum number
>> of records, around 90% of the total. Is there any way to
>> parallelize the execution by increasing the number of tasks or evenly
>> distributing the records across tasks?
>>
>> Thanks in advance.
>>
>> Best Regards,
>> Ram
>>
>
>


Re: spark sql cli query results written to file ?

2015-12-02 Thread Sahil Sareen
Did you see: http://spark.apache.org/docs/latest/sql-programming-guide.html

-Sahil

On Thu, Dec 3, 2015 at 11:35 AM, fightf...@163.com 
wrote:

> Hi,
> How could I save the results of queries run from the Spark SQL CLI and write
> them to some local file?
> Is there any available command?
>
> Thanks,
> Sun.
>
> --
> fightf...@163.com
>


Re: Multiplication on decimals in a dataframe query

2015-12-02 Thread Sahil Sareen
+1, looks like a bug.

I think referencing trades() twice in a multiplication is broken:

scala> trades.select(trades("quantity")*trades("quantity")).show

+---------------------+
|(quantity * quantity)|
+---------------------+
|                 null|
|                 null|

scala> sqlContext.sql("select price*price as PP from trades").show

+----+
|  PP|
+----+
|null|
|null|


-Sahil

On Thu, Dec 3, 2015 at 12:02 PM, Akhil Das 
wrote:

> Not quite sure what's happening, but it's not an issue with multiplication, I
> guess, as the following query worked for me:
>
> trades.select(trades("price")*9.5).show
> +-------------+
> |(price * 9.5)|
> +-------------+
> |        199.5|
> |        228.0|
> |        190.0|
> |        199.5|
> |        190.0|
> |        256.5|
> |        218.5|
> |        275.5|
> |        218.5|
> ..
> ..
>
>
> Could it be an issue with the precision? CCing the dev list; maybe you can open up a
> JIRA for this as it seems to be a bug.
>
> Thanks
> Best Regards
>
> On Mon, Nov 30, 2015 at 12:41 AM, Philip Dodds 
> wrote:
>
>> I hit a weird issue when I tried to multiply two decimals in a select
>> (either in Scala or as SQL), and I'm assuming I must be missing the point.
>>
>> The issue is fairly easy to recreate with something like the following:
>>
>>
>> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>> import sqlContext.implicits._
>> import org.apache.spark.sql.types.Decimal
>>
>> case class Trade(quantity: Decimal,price: Decimal)
>>
>> val data = Seq.fill(100) {
>>   val price = Decimal(20+scala.util.Random.nextInt(10))
>> val quantity = Decimal(20+scala.util.Random.nextInt(10))
>>
>>   Trade(quantity, price)
>> }
>>
>> val trades = sc.parallelize(data).toDF()
>> trades.registerTempTable("trades")
>>
>> trades.select(trades("price")*trades("quantity")).show
>>
>> sqlContext.sql("select
>> price/quantity,price*quantity,price+quantity,price-quantity from
>> trades").show
>>
>> The odd part is that if you run it you will see that the addition, division, and
>> subtraction work, but the multiplication returns null.
>>
>> Tested on 1.5.1/1.5.2 (Scala 2.10 and 2.11)
>>
>> ie.
>>
>> +------------------+
>> |(price * quantity)|
>> +------------------+
>> |              null|
>> |              null|
>> |              null|
>> |              null|
>> |              null|
>> +------------------+
>>
>>
>> +--------------------+----+--------+--------+
>> |                 _c0| _c1|     _c2|     _c3|
>> +--------------------+----+--------+--------+
>> |0.952380952380952381|null|41.00...|-1.00...|
>> |1.380952380952380952|null|50.00...|    8.00|
>> |1.272727272727272727|null|50.00...|    6.00|
>> |                0.83|null|44.00...|-4.00...|
>> |                1.00|null|58.00...|   0E-18|
>> +--------------------+----+--------+--------+
>>
>>
>> Just keen to know what I did wrong?
>>
>>
>> Cheers
>>
>> P
>>
>> --
>> Philip Dodds
>>
>>
>>
>


Re: how to skip headers when reading multiple files

2015-12-02 Thread Sahil Sareen
You could use "filter" to eliminate headers from your text file RDD while
going over each line.
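
A minimal sketch of that, assuming every input file shares the same single
header line (the path is a placeholder):

val lines  = sc.textFile("/data/input/*.csv")
val header = lines.first()             // header text taken from the first file
val data   = lines.filter(_ != header) // drop the header wherever it appears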

-Sahil

On Thu, Dec 3, 2015 at 9:37 AM, Jeff Zhang  wrote:

> Are you reading a CSV file? If so, you can use spark-csv, which supports skipping the
> header.
>
> http://spark-packages.org/package/databricks/spark-csv
>
>
>
> On Thu, Dec 3, 2015 at 10:52 AM, Divya Gehlot 
> wrote:
>
>> Hi,
>> I am a newbie to Spark and Scala.
>> One of my requirements is to read and process multiple text files with
>> headers using the DataFrame API.
>> How can I skip headers when processing data with the DataFrame API?
>>
>> Thanks in advance .
>> Regards,
>> Divya
>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


java.io.FileNotFoundException: Job aborted due to stage failure

2015-11-26 Thread Sahil Sareen
I'm using Spark 1.4.2 with Hadoop 2.7. I tried increasing
spark.shuffle.io.maxRetries to 10, but it didn't help.

Any ideas on what could be causing this?

This is the exception that I am getting:

[MySparkApplication] WARN : Failed to execute SQL statement select *
from TableS s join TableC c on s.property = c.property from X YZ
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 4 in stage 5710.0 failed 4 times, most recent failure: Lost task
4.3 in stage 5710.0 (TID 341269,
ip-10-0-1-80.us-west-2.compute.internal):
java.io.FileNotFoundException:
/mnt/md0/var/lib/spark/spark-549f7d96-82da-4b8d-b9fe-7f6fe8238478/blockmgr-
f44be41a-9036-4b93-8608-4a8b2fabbc06/0b/shuffle_3257_4_0.data
(Permission denied)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.open(
BlockObjectWriter.scala:128)
at org.apache.spark.storage.DiskBlockObjectWriter.write(
BlockObjectWriter.scala:203)
at org.apache.spark.util.collection.WritablePartitionedIterator$$
anon$3.writeNext(WritablePartitionedPairCollection.scala:104)
at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(
ExternalSorter.scala:757)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(
SortShuffleWriter.scala:70)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(
ShuffleMapTask.scala:70)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(
ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(
ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(
ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$
scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1276)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(
DAGScheduler.scala:1267)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(
DAGScheduler.scala:1266)
at scala.collection.mutable.ResizableArray$class.foreach(
ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(
DAGScheduler.scala:1266)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$
handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$
handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(
DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.
onReceive(DAGScheduler.scala:1460)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.
onReceive(DAGScheduler.scala:1421)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

Thanks
Sahil


Spark 1.4.2- java.io.FileNotFoundException: Job aborted due to stage failure

2015-11-24 Thread Sahil Sareen
I tried increasing spark.shuffle.io.maxRetries to 10, but it didn't help.

This is the exception that I am getting:

[MySparkApplication] WARN : Failed to execute SQL statement select *
from TableS s join TableC c on s.property = c.property from X YZ
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 4 in stage 5710.0 failed 4 times, most recent failure: Lost task
4.3 in stage 5710.0 (TID 341269,
ip-10-0-1-80.us-west-2.compute.internal):
java.io.FileNotFoundException:
/mnt/md0/var/lib/spark/spark-549f7d96-82da-4b8d-b9fe-7f6fe8238478/blockmgr-f44be41a-9036-4b93-8608-4a8b2fabbc06/0b/shuffle_3257_4_0.data
(Permission denied)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at 
org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:128)
at 
org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:203)
at 
org.apache.spark.util.collection.WritablePartitionedIterator$$anon$3.writeNext(WritablePartitionedPairCollection.scala:104)
at 
org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:757)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:70)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1276)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1267)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1266)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1266)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1460)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1421)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
