Re: Repartitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-02 Thread Ravindra
Either increase the overall executor memory if you have scope, or try to give
a larger percentage to overhead memory than the default of .07.

Read this for more details.
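
For example, a minimal sketch of the submit-time settings (the numbers here
are placeholders for illustration, not a recommendation):

spark-submit \
  --executor-memory 16g \
  --conf spark.yarn.executor.memoryOverhead=3072 \
  --num-executors 8 \
  --executor-cores 8 \
  ...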


On Wed, Aug 2, 2017 at 11:03 PM Chetan Khatri wrote:

> Can anyone please guide me with the above issue?
>
>
> On Wed, Aug 2, 2017 at 6:28 PM, Chetan Khatri wrote:
>
>> Hello Spark Users,
>>
>> I have an HBase table that I read from and write to a Hive managed table, where I
>> applied partitioning by a date column. That worked fine, but it has generated
>> a large number of files across almost 700 partitions, so I wanted to use
>> repartition to reduce file I/O by reducing the number of files inside each
>> partition.
>>
>> *But I ended up with the below exception:*
>>
>> ExecutorLostFailure (executor 11 exited caused by one of the running
>> tasks) Reason: Container killed by YARN for exceeding memory limits. 14.0
>> GB of 14 GB physical memory used. Consider boosting
>> spark.yarn.executor.memoryOverhead.
>>
>> Driver memory=4g, executor mem=12g, num-executors=8, executor core=8
>>
>> Do you think the below settings can help me to overcome the above issue:
>>
>> spark.default.parallelism=1000
>> spark.sql.shuffle.partitions=1000
>>
>> Because the default max number of partitions is 1000.
>>
>>
>>
>
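
As an illustration, a minimal Scala sketch of the repartition-before-write
approach discussed above; the DataFrame, column, and table names are
hypothetical:

import org.apache.spark.sql.functions.col

// Reduce the number of files per Hive partition by repartitioning on the
// partition column before writing, so each partition's data is written by
// fewer tasks.
df.repartition(col("event_date"))
  .write
  .mode("overwrite")
  .partitionBy("event_date")
  .saveAsTable("target_hive_table")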


Re: Spark 2.0.2 : Hang at "org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:623)"

2017-03-24 Thread Ravindra
Also noticed that there are 8 "dispatcher-event-loop-0..7" threads and 8
"map-output-dispatcher-0..7" threads, all waiting at the same location in the
code, that is -
*sun.misc.Unsafe.park(Native Method)*
*java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)*
*java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)*
*java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)*
org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:338)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)

So clearly there is a race condition. Maybe the only option is to avoid it...
but how?


On Fri, Mar 24, 2017 at 5:40 PM Ravindra <ravindra.baj...@gmail.com> wrote:

> Hi All,
>
> My Spark job hangs here... Looking into the thread dump I noticed that it
> hangs (stack trace given below) on the count action on a dataframe
> (given below). The data is very small; it's actually not more than even 10 rows.
>
> I noticed some JIRAs about this issue but all are resolved-closed in
> previous versions.
>
> It's running with 1 executor. Also noticed that the Storage tab is empty, so
> no dataframe is cached.
>
> Looking into the DAGScheduler, I notice it's stuck at runJob; probably it's
> trying to run tasks concurrently and waiting here
>
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
> *org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:623)*
> org.apache.spark.SparkContext.runJob(SparkContext.scala:1873)
> org.apache.spark.SparkContext.runJob(SparkContext.scala:1886)
> org.apache.spark.SparkContext.runJob(SparkContext.scala:1899)
> org.apache.spark.SparkContext.runJob(SparkContext.scala:1913)
> org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:912)
>
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
> org.apache.spark.rdd.RDD.collect(RDD.scala:911)
>
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:290)
>
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2193)
>
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
> org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2546)
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2192)
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2199)
> org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2227)
> org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2226)
> org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2559)
> org.apache.spark.sql.Dataset.count(Dataset.scala:2226)
>
>
> Spark Properties
> Name  Value
> spark.app.id local-1490350724879
> spark.app.name TestJob
> spark.default.parallelism 1
> spark.driver.allowMultipleContexts true
> spark.driver.host localhost
> spark.driver.memory 4g
> spark.driver.port 63405
> spark.executor.id driver
> spark.executor.memory 4g
> spark.hadoop.validateOutputSpecs false
> spark.master local[2]
> spark.scheduler.mode FIFO
> spark.sql.catalogImplementation hive
> spark.sql.crossJoin.enabled true
> spark.sql.shuffle.partitions 1
> spark.sql.warehouse.dir /tmp/hive/spark-warehouse
> spark.ui.enabled true
> spark.yarn.executor.memoryOverhead 2048
>
> It's a count action on the dataframe given below -
> == Parsed Logical Plan ==
> Project [hash_composite_keys#27440L, id#27441, name#27442, master#27439,
> created_time#27448]
> +- Filter (rn#27493 = 1)
>+- Project [hash_composite_keys#27440L, id#27441, name#27442,
> master#27439, created_time#27448, rn#27493]
>   +- Project [hash_composite_keys#27440L, id#27441, name#27442,
> master#27439, created_time#27448, rn#27493, rn#27493]
> 

Spark 2.0.2 : Hang at "org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:623)"

2017-03-24 Thread Ravindra
Hi All,

My Spark job hangs here... Looking into the thread dump I noticed that it
hangs (stack trace given below) on the count action on a dataframe
(given below). The data is very small; it's actually not more than even 10 rows.

I noticed some JIRAs about this issue but all are resolved-closed in
previous versions.

It's running with 1 executor. Also noticed that the Storage tab is empty, so
no dataframe is cached.

Looking into the DAGScheduler, I notice it's stuck at runJob; probably it's
trying to run tasks concurrently and waiting here

sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
*org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:623)*
org.apache.spark.SparkContext.runJob(SparkContext.scala:1873)
org.apache.spark.SparkContext.runJob(SparkContext.scala:1886)
org.apache.spark.SparkContext.runJob(SparkContext.scala:1899)
org.apache.spark.SparkContext.runJob(SparkContext.scala:1913)
org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:912)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
org.apache.spark.rdd.RDD.collect(RDD.scala:911)
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:290)
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2193)
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2546)
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2192)
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2199)
org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2227)
org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2226)
org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2559)
org.apache.spark.sql.Dataset.count(Dataset.scala:2226)


Spark Properties
Name  Value
spark.app.id local-1490350724879
spark.app.name TestJob
spark.default.parallelism 1
spark.driver.allowMultipleContexts true
spark.driver.host localhost
spark.driver.memory 4g
spark.driver.port 63405
spark.executor.id driver
spark.executor.memory 4g
spark.hadoop.validateOutputSpecs false
spark.master local[2]
spark.scheduler.mode FIFO
spark.sql.catalogImplementation hive
spark.sql.crossJoin.enabled true
spark.sql.shuffle.partitions 1
spark.sql.warehouse.dir /tmp/hive/spark-warehouse
spark.ui.enabled true
spark.yarn.executor.memoryOverhead 2048

It's a count action on the dataframe given below -
== Parsed Logical Plan ==
Project [hash_composite_keys#27440L, id#27441, name#27442, master#27439,
created_time#27448]
+- Filter (rn#27493 = 1)
   +- Project [hash_composite_keys#27440L, id#27441, name#27442,
master#27439, created_time#27448, rn#27493]
  +- Project [hash_composite_keys#27440L, id#27441, name#27442,
master#27439, created_time#27448, rn#27493, rn#27493]
 +- Window [rownumber()
windowspecdefinition(hash_composite_keys#27440L, created_time#27448 ASC,
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rn#27493],
[hash_composite_keys#27440L], [created_time#27448 ASC]
+- Project [hash_composite_keys#27440L, id#27441, name#27442,
master#27439, created_time#27448]
   +- Union
  :- Project [hash_composite_keys#27440L, id#27441,
name#27442, master#27439, cast(0 as timestamp) AS created_time#27448]
  :  +- Project [hash_composite_keys#27440L, id#27441,
name#27442, master#27439]
  : +- SubqueryAlias table1
  :+-
Relation[hash_composite_keys#27440L,id#27441,name#27442,master#27439]
parquet
  +- Project [hash_composite_keys#27391L, id#27397,
name#27399, master#27401, created_time#27414]
 +- Project [id#27397, name#27399,
hash_composite_keys#27391L, master#27401, cast(if
(isnull(created_time#27407L)) null else UDF(created_time#27407L) as
timestamp) AS created_time#27414]
+- Project [id#27397, name#27399,
hash_composite_keys#27391L, master#27401, 1490350732895 AS
created_time#27407L]
   +- Project [id#27397, name#27399,
hash_composite_keys#27391L, master AS master#27401]
  +- Aggregate 
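
For reference, a minimal Scala sketch of the kind of window-based dedup query
that produces a plan like the one above; the column names come from the plan,
while the two input DataFrames (existing and incoming) are assumed stand-ins:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Union the two sources and keep one row per composite key, picking the row
// that sorts first by created_time (as the rownumber window in the plan does).
val combined = existing
  .withColumn("created_time", lit(0).cast("timestamp"))
  .union(incoming)

val w = Window.partitionBy(col("hash_composite_keys")).orderBy(col("created_time").asc)

val deduped = combined
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")

deduped.count()   // the action reported as hanging above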

Re: Spark 2.0.2 - hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count()

2017-03-17 Thread Ravindra
Thanks a lot Yong for the explanation. But it sounds like an API behaviour
change. For now I do the count != 0 check on both dataframes before these
operations. Not good from a performance point of view, hence I have created a
JIRA (SPARK-20008) to track it.

Thanks,
Ravindra.

On Fri, Mar 17, 2017 at 8:51 PM Yong Zhang <java8...@hotmail.com> wrote:

> Starting from Spark 2, these kinds of operations are implemented as a left
> anti join, instead of using RDD operations directly.
>
>
> Same issue also on sqlContext.
>
>
> scala> spark.version
> res25: String = 2.0.2
>
>
> spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true)
>
> == Physical Plan ==
> *HashAggregate(keys=[], functions=[], output=[])
> +- Exchange SinglePartition
>+- *HashAggregate(keys=[], functions=[], output=[])
>   +- BroadcastNestedLoopJoin BuildRight, *LeftAnti*, false
>  :- Scan ExistingRDD[]
>  +- BroadcastExchange IdentityBroadcastMode
> +- Scan ExistingRDD[]
>
> This arguably means a bug. But my guess is that it is the logic of comparing
> NULL = NULL (should it return true or false?) that causes this kind of
> confusion.
>
> Yong
>
> --
> *From:* Ravindra <ravindra.baj...@gmail.com>
> *Sent:* Friday, March 17, 2017 4:30 AM
> *To:* user@spark.apache.org
> *Subject:* Spark 2.0.2 -
> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count()
>
> Can someone please explain why
>
> println ( " Empty count " +
> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() )
>
> *prints* -  Empty count 1
>
> This was not the case in Spark 1.5.2... I am upgrading to Spark 2.0.2 and
> found this. This causes my tests to fail. Is there another way to check
> full equality of 2 dataframes?
>
> Thanks,
> Ravindra.
>


Spark 2.0.2 - hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count()

2017-03-17 Thread Ravindra
Can someone please explain why

println ( " Empty count " +
hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() )

*prints* -  Empty count 1

This was not the case in Spark 1.5.2... I am upgrading to Spark 2.0.2 and
found this. This causes my tests to fail. Is there another way to check
full equality of 2 dataframes?

Thanks,
Ravindra.
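
As an illustration only, a hedged sketch of one way to check full equality of
two DataFrames with the same schema, guarding the empty case separately
because of the behaviour reported above:

import org.apache.spark.sql.DataFrame

// Returns true when neither side has rows the other lacks (ignores duplicate
// counts, since except() de-duplicates). Written against Spark 2.0 APIs.
def sameRows(a: DataFrame, b: DataFrame): Boolean = {
  val aCount = a.count()
  val bCount = b.count()
  if (aCount == 0 || bCount == 0) {
    aCount == bCount                     // both empty => equal
  } else {
    aCount == bCount &&
      a.except(b).count() == 0 &&        // rows in a that are missing from b
      b.except(a).count() == 0           // rows in b that are missing from a
  }
}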


Re: Spark Task is not created

2016-06-26 Thread Ravindra
I have a lot of Spark tests, and the failure is not deterministic. It can
happen at any action that I do, but the logs given below are common. I
overcome it using repartitioning, coalescing etc. so that I don't get the
"Submitting 2 missing tasks from ShuffleMapStage" message, basically ensuring
that there is only one task.
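
For illustration, a minimal sketch of that workaround, assuming a hypothetical
DataFrame df from the test suite:

// Force the shuffle stage down to a single task before the action.
val singleTask = df.coalesce(1)   // or df.repartition(1) to force a full shuffle
singleTask.count()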

I doubt if this has anything to do with some property.

In the UI I don't see any failure, just that a few tasks have completed and
the last one is yet to be created.

Thanks,

Ravi


On Sun, Jun 26, 2016, 11:33 Akhil Das <ak...@hacked.work> wrote:

> Would be good if you can paste the piece of code that you are executing.
>
> On Sun, Jun 26, 2016 at 11:21 AM, Ravindra <ravindra.baj...@gmail.com>
> wrote:
>
>> Hi All,
>>
>> Maybe I just need to set some property, or it's a known issue. My Spark
>> application hangs in the test environment whenever I see the following message -
>>
>> 16/06/26 11:13:34 INFO DAGScheduler: *Submitting 2 missing tasks from
>> ShuffleMapStage* 145 (MapPartitionsRDD[590] at rdd at
>> WriteDataFramesDecorator.scala:61)
>> 16/06/26 11:13:34 INFO TaskSchedulerImpl: Adding task set 145.0 with 2
>> tasks
>> 16/06/26 11:13:34 INFO TaskSetManager: Starting task 0.0 in stage 145.0
>> (TID 186, localhost, PROCESS_LOCAL, 2389 bytes)
>> 16/06/26 11:13:34 INFO Executor: Running task 0.0 in stage 145.0 (TID 186)
>> 16/06/26 11:13:34 INFO BlockManager: Found block rdd_575_0 locally
>> 16/06/26 11:13:34 INFO GenerateMutableProjection: Code generated in 3.796
>> ms
>> 16/06/26 11:13:34 INFO Executor: Finished task 0.0 in stage 145.0 (TID
>> 186). 2578 bytes result sent to driver
>> 16/06/26 11:13:34 INFO TaskSetManager: Finished task 0.0 in stage 145.0
>> (TID 186) in 24 ms on localhost (1/2)
>>
>> It happens with any action. The application works fine whenever I notice
>> "*Submitting 1 missing tasks from ShuffleMapStage*". For this I need to tweak
>> the plan, e.g. using repartition, coalesce etc., but this also doesn't always
>> help.
>>
>> Some of the Spark properties are as given below -
>>
>> Name  Value
>> spark.app.id  local-1466914377931
>> spark.app.name  SparkTest
>> spark.cores.max  3
>> spark.default.parallelism  1
>> spark.driver.allowMultipleContexts  true
>> spark.executor.id  driver
>> spark.externalBlockStore.folderName
>> spark-050049bd-c058-4035-bc3d-2e73a08e8d0c
>> spark.master  local[2]
>> spark.scheduler.mode  FIFO
>> spark.ui.enabled  true
>>
>>
>> Thanks,
>> Ravi.
>>
>>
>
>
> --
> Cheers!
>
>


Spark Task is not created

2016-06-25 Thread Ravindra
Hi All,

Maybe I just need to set some property, or it's a known issue. My Spark
application hangs in the test environment whenever I see the following message -

16/06/26 11:13:34 INFO DAGScheduler: *Submitting 2 missing tasks from
ShuffleMapStage* 145 (MapPartitionsRDD[590] at rdd at
WriteDataFramesDecorator.scala:61)
16/06/26 11:13:34 INFO TaskSchedulerImpl: Adding task set 145.0 with 2 tasks
16/06/26 11:13:34 INFO TaskSetManager: Starting task 0.0 in stage 145.0
(TID 186, localhost, PROCESS_LOCAL, 2389 bytes)
16/06/26 11:13:34 INFO Executor: Running task 0.0 in stage 145.0 (TID 186)
16/06/26 11:13:34 INFO BlockManager: Found block rdd_575_0 locally
16/06/26 11:13:34 INFO GenerateMutableProjection: Code generated in 3.796 ms
16/06/26 11:13:34 INFO Executor: Finished task 0.0 in stage 145.0 (TID
186). 2578 bytes result sent to driver
16/06/26 11:13:34 INFO TaskSetManager: Finished task 0.0 in stage 145.0
(TID 186) in 24 ms on localhost (1/2)

It happens with any action. The application works fine whenever I notice
"*Submitting 1 missing tasks from ShuffleMapStage*". For this I need to tweak
the plan, e.g. using repartition, coalesce etc., but this also doesn't always help.

Some of the Spark properties are as given below -

Name  Value
spark.app.id  local-1466914377931
spark.app.name  SparkTest
spark.cores.max  3
spark.default.parallelism  1
spark.driver.allowMultipleContexts  true
spark.executor.id  driver
spark.externalBlockStore.folderName
spark-050049bd-c058-4035-bc3d-2e73a08e8d0c
spark.master  local[2]
spark.scheduler.mode  FIFO
spark.ui.enabled  true


Thanks,
Ravi.


Re: Forcing data from disk to memory

2016-03-25 Thread Ravindra
Yup, cache is a transformation and hence lazy; you need to run an action to
get the data into it.

http://apache-spark-user-list.1001560.n3.nabble.com/How-to-enforce-RDD-to-be-cached-td20230.html
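
A minimal sketch of that advice, assuming sc is an existing SparkContext and
the input path is hypothetical:

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs:///path/to/data")
rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd.count()   // cache is lazy, so an action is needed to materialize it

// There is no direct API to promote disk blocks into memory; unpersist and
// re-cache with a memory-only storage level instead.
rdd.unpersist(blocking = true)
rdd.persist(StorageLevel.MEMORY_ONLY)
rdd.count()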


On Fri, Mar 25, 2016 at 2:32 PM Jörn Franke wrote:

> I am not 100% sure of the root cause, but if you need rdd caching then
> look at Apache Ignite or similar.
>
> On 24 Mar 2016, at 16:22, Daniel Imberman wrote:
>
> Hi Takeshi,
>
> Thank you for getting back to me. If this is not possible then perhaps you
> can help me with the root problem that caused me to ask this question.
>
> Basically I have a job where I'm loading/persisting an RDD and running
> queries against it. The problem I'm having is that even though there is
> plenty of space in memory, the RDD is not fully persisting. Once I run
> multiple queries against it the RDD fully persists, but this means that the
> first 4/5 queries I run are extremely slow.
>
> Is there any way I can make sure that the entire RDD ends up in memory the
> first time I load it?
>
> Thank you
> On Thu, Mar 24, 2016 at 1:21 AM Takeshi Yamamuro wrote:
>
>> just re-sent,
>>
>>
>> -- Forwarded message --
>> From: Takeshi Yamamuro 
>> Date: Thu, Mar 24, 2016 at 5:19 PM
>> Subject: Re: Forcing data from disk to memory
>> To: Daniel Imberman 
>>
>>
>> Hi,
>>
>> We have no direct approach; we need to unpersist cached data, then
>> re-cache data as MEMORY_ONLY.
>>
>> // maropu
>>
>> On Thu, Mar 24, 2016 at 8:22 AM, Daniel Imberman <
>> daniel.imber...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> So I have a question about persistence. Let's say I have an RDD that's
>>> persisted MEMORY_AND_DISK, and I know that I now have enough memory space
>>> cleared up that I can force the data on disk into memory. Is it possible
>>> to
>>> tell spark to re-evaluate the open RDD memory and move that information?
>>>
>>> Thank you
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Forcing-data-from-disk-to-memory-tp26585.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com
>>> .
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>


Spark 1.6.1 : SPARK-12089 : java.lang.NegativeArraySizeException

2016-03-13 Thread Ravindra Rawat
Greetings,

I am getting the following exception on joining a few parquet files.
The SPARK-12089 description has details of the overflow condition, which is
marked as fixed in 1.6.1. I recall seeing another issue related to CSV
files creating the same exception.

Any pointers on how to debug this or possible workarounds? Google
searches and JIRA comments point to either a > 2GB record size (less
likely) or RDD sizes being too large.

I had upgraded to Spark 1.6.1 due to Serialization errors from
Catalyst while reading Parquet files.

Related JIRA Issue => https://issues.apache.org/jira/browse/SPARK-12089

Related PR => https://github.com/apache/spark/pull/10142


java.lang.NegativeArraySizeException
at 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:45)
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:196)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
Source)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


Thanks.

-- 
Regards
Ravindra


Re: About Huawei-Spark/Spark-SQL-on-HBase

2015-12-19 Thread Ravindra Pesala
Hi censj,

Please try the new repo at https://github.com/HuaweiBigData/astro ; we are not
maintaining the old repo. Please let me know if you still get the error.
You can also contact me at my personal mail, ravi.pes...@gmail.com.

Thanks,
Ravindra.

On Sat 19 Dec, 2015 2:45 pm censj <ce...@lotuseed.com> wrote:

> I use Huawei-Spark/Spark-SQL-on-HBase, but running ./bin/hbase-sql throws the following.
> 5/12/19 16:59:34 INFO storage.BlockManagerMaster: Registered BlockManager
> Exception in thread "main" java.lang.NoSuchMethodError:
> jline.Terminal.getTerminal()Ljline/Terminal;
> at jline.ConsoleReader.<init>(ConsoleReader.java:191)
> at jline.ConsoleReader.<init>(ConsoleReader.java:186)
> at jline.ConsoleReader.<init>(ConsoleReader.java:174)
> at
> org.apache.spark.sql.hbase.HBaseSQLCliDriver$.main(HBaseSQLCliDriver.scala:55)
> at
> org.apache.spark.sql.hbase.HBaseSQLCliDriver.main(HBaseSQLCliDriver.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 15/12/19 16:59:35 INFO spark.SparkContext: Invoking stop() from shutdown
> hook
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Best practices to handle corrupted records

2015-10-16 Thread Ravindra
+1 Erwan..

Maybe a trivial solution like this -

// assuming a simple record type such as:
case class Record(key: String, value1: String, value2: String)

class Result(msg: String, record: Record)

class Success(msgSuccess: String, val msg: String, val record: Record)
  extends Result(msg, record)

class Failure(msgFailure: String, val msg: String, val record: Record)
  extends Result(msg, record)

trait Warning

class SuccessWithWarning(msgWarning: String, val msgSuccess: String,
    override val msg: String, override val record: Record)
  extends Success(msgSuccess, msg, record) with Warning

val record1 = new Record("k1", "val11", "val21")
val record2 = new Record("k2", "val12", "val22")
val record3 = new Record("k3", "val13", "val23")
val record4 = new Record("k4", "val14", "val24")
val records: List[Record] = List(record1, record2, record3, record4)

def processRecord(record: Record): Either[Result, Result] = {
  val result: Either[Result, Result] =
    if (record.key.equals("k1"))
      Left(new Failure("failed", "result", record))
    else
      successHandler(record)
  result
}

def successHandler(record: Record): Either[Result, Result] = {
  val result: Either[Result, Result] =
    if (record.key.equals("k2"))
      Left(new Success("success", "result", record))
    else
      Right(new SuccessWithWarning("warning", "success", "result", record))
  result
}

for (record <- records) {
  println(processRecord(record))
}


On Fri, Oct 16, 2015 at 1:45 PM Erwan ALLAIN 
wrote:

> Either[FailureResult[T], Either[SuccessWithWarnings[T],
> SuccessResult[T]]]  maybe ?
>
>
> On Thu, Oct 15, 2015 at 5:31 PM, Antonio Murgia <
> antonio.murg...@studio.unibo.it> wrote:
>
>> 'Either' does not cover the case where the outcome was successful but
>> generated warnings. I already looked into it and also at 'Try' from which I
>> got inspired. Thanks for pointing it out anyway!
>>
>> #A.M.
>>
>> Il giorno 15 ott 2015, alle ore 16:19, Erwan ALLAIN <
>> eallain.po...@gmail.com> ha scritto:
>>
>> What about http://www.scala-lang.org/api/2.9.3/scala/Either.html ?
>>
>>
>> On Thu, Oct 15, 2015 at 2:57 PM, Roberto Congiu wrote:
>>
>>> I came to a similar solution to a similar problem. I deal with a lot of
>>> CSV files from many different sources and they are often malformed.
>>> However, I just have success/failure. Maybe you should make
>>> SuccessWithWarnings a subclass of Success, or get rid of it altogether,
>>> making the warnings optional.
>>> I was thinking of making this cleaning/conforming library open source if
>>> you're interested.
>>>
>>> R.
>>>
>>> 2015-10-15 5:28 GMT-07:00 Antonio Murgia <
>>> antonio.murg...@studio.unibo.it>:
>>>
 Hello,
 I looked around on the web and I couldn’t find any way to deal in a
 structured way with malformed/faulty records during computation. All I was
 able to find was the flatMap/Some/None technique + logging.
 I’m facing this problem because I have a processing algorithm that
 extracts more than one value from each record, but can fail in extracting
 one of those multiple values, and I want to keep track of them. Logging is
 not feasible because this “warning” happens so frequently that the logs
 would become overwhelming and impossible to read.
 Since I have 3 different possible outcomes from my processing I modeled
 it with this class hierarchy:
 That holds result and/or warnings.
 Since Result implements Traversable it can be used in a flatMap,
 discarding all warnings and failure results; on the other hand, if we want
 to keep track of warnings, we can elaborate them and output them if we
 need.

 Kind Regards
 #A.M.

>>>
>>>
>>>
>>> --
>>> --
>>> "Good judgment comes from experience.
>>> Experience comes from bad judgment"
>>> --
>>>
>>
>>
>


Re: Research ideas using spark

2015-07-15 Thread Ravindra
Look at this :
http://www.forbes.com/sites/lisabrownlee/2015/07/10/the-11-trillion-internet-of-things-big-data-and-pattern-of-life-pol-analytics/

On Wed, Jul 15, 2015 at 10:19 PM shahid ashraf sha...@trialx.com wrote:

 Sorry Guys!

 I mistakenly added my question to this thread (Research ideas using
 Spark). Moreover, people can ask any question; this Spark user group is for
 that.

 Cheers!
 

 On Wed, Jul 15, 2015 at 9:43 PM, Robin East robin.e...@xense.co.uk
 wrote:

 Well said Will. I would add that you might want to investigate GraphChi
 which claims to be able to run a number of large-scale graph processing
 tasks on a workstation much quicker than a very large Hadoop cluster. It
 would be interesting to know how widely applicable the approach GraphChi
 takes and what implications it has for parallel/distributed computing
 approaches. A rich seam to mine indeed.

 Robin

 On 15 Jul 2015, at 14:48, William Temperley willtemper...@gmail.com
 wrote:

 There seems to be a bit of confusion here - the OP (doing the PhD) had
 the thread hijacked by someone with a similar name asking a mundane
 question.

 It would be a shame to send someone away so rudely, who may do valuable
 work on Spark.

 Sashidar (not Sashid!) I'm personally interested in running graph
 algorithms for image segmentation using MLlib and Spark.  I've got many
 questions though - like is it even going to give me a speed-up?  (
 http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html)

 It's not obvious to me which classes of graph algorithms can be
 implemented correctly and efficiently in a highly parallel manner.  There's
 tons of work to be done here, I'm sure. Also, look at parallel geospatial
 algorithms - there's a lot of work being done on this.

 Best, Will



 On 15 July 2015 at 09:01, Vineel Yalamarthy vineelyalamar...@gmail.com
 wrote:

 Hi Daniel

 Well said

 Regards
 Vineel

 On Tue, Jul 14, 2015, 6:11 AM Daniel Darabos 
 daniel.dara...@lynxanalytics.com wrote:

 Hi Shahid,
 To be honest I think this question is better suited for Stack Overflow
 than for a PhD thesis.

 On Tue, Jul 14, 2015 at 7:42 AM, shahid ashraf sha...@trialx.com
 wrote:

 hi

 I have a 10 node cluster. I loaded the data onto HDFS, so the number of
 partitions I get is 9. I am running a Spark application and it gets stuck on
 one of the tasks; looking at the UI it seems the application is not using all
 nodes to do calculations. Attached is the screenshot of tasks; it seems tasks
 are put on each node more than once. Looking at the tasks, 8 tasks get
 completed in under 7-8 minutes and one task takes around 30 minutes, causing
 the delay in results.


 On Tue, Jul 14, 2015 at 10:48 AM, Shashidhar Rao 
 raoshashidhar...@gmail.com wrote:

 Hi,

 I am doing my PhD thesis on large-scale machine learning, e.g. online
 learning, batch and mini-batch learning.

 Could somebody help me with ideas, especially in the context of Spark
 and the above learning methods.

 Some ideas like improvements to existing algorithms, implementing new
 features (especially for the above learning methods), and algorithms that
 have not been implemented yet, etc.

 If somebody could help me with some ideas it would really accelerate
 my work.

 Plus a few ideas on research papers regarding Spark or Mahout.

 Thanks in advance.

 Regards




 --
 with Regards
 Shahid Ashraf


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org







 --
 with Regards
 Shahid Ashraf



Re: Spark can not access jar from HDFS !!

2015-05-11 Thread Ravindra
Hi All,

Thanks for the suggestions. What I tried is -
hiveContext.sql("ADD JAR ...") and that helps to complete the CREATE
TEMPORARY FUNCTION, but while using this function I get ClassNotFound for
the class handling this function. The same class is present in the jar
added.

Please note that the same works fine from the Hive Shell.

Is there an issue with Spark while distributing jars across workers? Maybe
that is causing the problem. Also, can you please suggest the manual way of
copying the jars to the workers; I just want to ascertain my assumption.

Thanks,
Ravi

On Sun, May 10, 2015 at 1:40 AM Michael Armbrust mich...@databricks.com
wrote:

 That code path is entirely delegated to hive.  Does hive support this?
 You might try instead using sparkContext.addJar.

 On Sat, May 9, 2015 at 12:32 PM, Ravindra ravindra.baj...@gmail.com
 wrote:

 Hi All,

 I am trying to create custom udfs with hiveContext as given below -
 scala> hiveContext.sql("CREATE TEMPORARY FUNCTION sample_to_upper AS
 'com.abc.api.udf.MyUpper' USING JAR
 'hdfs:///users/ravindra/customUDF2.jar'")

 I have put the udf jar in the hdfs at the path given above. The same
 command works well in the hive shell but failing here in the spark shell.
 And it fails as given below. -
 15/05/10 00:41:51 ERROR Task: FAILED:
 org.apache.hadoop.hive.ql.metadata.HiveException: Unable to load JAR
 hdfs:///users/ravindra/customUDF2.jar
 15/05/10 00:41:51 INFO FunctionTask: create function:
 org.apache.hadoop.hive.ql.metadata.HiveException: Unable to load JAR
 hdfs:///users/ravindra/customUDF2.jar
 at
 org.apache.hadoop.hive.ql.exec.FunctionTask.addFunctionResources(FunctionTask.java:305)
 at
 org.apache.hadoop.hive.ql.exec.FunctionTask.createTemporaryFunction(FunctionTask.java:179)
 at
 org.apache.hadoop.hive.ql.exec.FunctionTask.execute(FunctionTask.java:81)
 at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
 at
 org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
 at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1503)
 at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1270)
 at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
 at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:305)
 at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)
 at
 org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)
 at
 org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)
 at org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
 at
 org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:30)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
 at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
 at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:108)
 at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:94)
 at $line17.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:18)
 at $line17.$read$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:23)
 at $line17.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:25)
 at $line17.$read$$iwC$$iwC$$iwC.<init>(<console>:27)
 at $line17.$read$$iwC$$iwC.<init>(<console>:29)
 at $line17.$read$$iwC.<init>(<console>:31)
 at $line17.$read.<init>(<console>:33)
 at $line17.$read$.<init>(<console>:37)
 at $line17.$read$.<clinit>(<console>)
 at $line17.$eval$.<init>(<console>:7)
 at $line17.$eval$.<clinit>(<console>)
 at $line17.$eval.$print(<console>)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
 at
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
 at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)
 at
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828)
 at
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873)
 at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785)
 at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:628)
 at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:636)
 at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:641)
 at
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:968)
 at
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916

Re: Spark can not access jar from HDFS !!

2015-05-11 Thread Ravindra
After upgrading to Spark 1.3, these statements on HiveContext are working
fine. Thanks
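
For reference, a minimal sketch of that sequence on Spark 1.3+; the jar path
and UDF class are the ones from the original mail, while the table and column
in the test query are hypothetical:

hiveContext.sql("ADD JAR hdfs:///users/ravindra/customUDF2.jar")
hiveContext.sql(
  "CREATE TEMPORARY FUNCTION sample_to_upper AS 'com.abc.api.udf.MyUpper' " +
    "USING JAR 'hdfs:///users/ravindra/customUDF2.jar'")
// Exercise the function; the table and column names are assumptions.
hiveContext.sql("SELECT sample_to_upper(name) FROM some_table").show()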

On Mon, May 11, 2015, 12:15 Ravindra ravindra.baj...@gmail.com wrote:

 Hi All,

 Thanks for suggestions. What I tried is -
 hiveContext.sql (add jar ) and that helps to complete the create
 temporary function but while using this function I get ClassNotFound for
 the class handling this function. The same class is present in the jar
 added .

 Please note that the same works fine from the Hive Shell.

 Is there an issue with Spark while distributing jars across workers? May
 be that is causing the problem. Also can you please suggest the manual way
 of copying the jars to the workers, I just want to ascertain my assumption.

 Thanks,
 Ravi

 On Sun, May 10, 2015 at 1:40 AM Michael Armbrust mich...@databricks.com
 wrote:

 That code path is entirely delegated to hive.  Does hive support this?
 You might try instead using sparkContext.addJar.

 On Sat, May 9, 2015 at 12:32 PM, Ravindra ravindra.baj...@gmail.com
 wrote:

 Hi All,

 I am trying to create custom udfs with hiveContext as given below -
 scala> hiveContext.sql("CREATE TEMPORARY FUNCTION sample_to_upper AS
 'com.abc.api.udf.MyUpper' USING JAR
 'hdfs:///users/ravindra/customUDF2.jar'")

 I have put the udf jar in the hdfs at the path given above. The same
 command works well in the hive shell but failing here in the spark shell.
 And it fails as given below. -
 15/05/10 00:41:51 ERROR Task: FAILED:
 org.apache.hadoop.hive.ql.metadata.HiveException: Unable to load JAR
 hdfs:///users/ravindra/customUDF2.jar
 15/05/10 00:41:51 INFO FunctionTask: create function:
 org.apache.hadoop.hive.ql.metadata.HiveException: Unable to load JAR
 hdfs:///users/ravindra/customUDF2.jar
 at
 org.apache.hadoop.hive.ql.exec.FunctionTask.addFunctionResources(FunctionTask.java:305)
 at
 org.apache.hadoop.hive.ql.exec.FunctionTask.createTemporaryFunction(FunctionTask.java:179)
 at
 org.apache.hadoop.hive.ql.exec.FunctionTask.execute(FunctionTask.java:81)
 at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
 at
 org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
 at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1503)
 at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1270)
 at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
 at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
 at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:305)
 at
 org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)
 at
 org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)
 at
 org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)
 at
 org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
 at
 org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:30)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
 at
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
 at
 org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
 at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:108)
 at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:94)
 at $line17.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:18)
 at $line17.$read$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:23)
 at $line17.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:25)
 at $line17.$read$$iwC$$iwC$$iwC.<init>(<console>:27)
 at $line17.$read$$iwC$$iwC.<init>(<console>:29)
 at $line17.$read$$iwC.<init>(<console>:31)
 at $line17.$read.<init>(<console>:33)
 at $line17.$read$.<init>(<console>:37)
 at $line17.$read$.<clinit>(<console>)
 at $line17.$eval$.<init>(<console>:7)
 at $line17.$eval$.<clinit>(<console>)
 at $line17.$eval.$print(<console>)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
 at
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
 at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)
 at
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828)
 at
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873)
 at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785)
 at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:628)
 at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:636)
 at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:641

Spark can not access jar from HDFS !!

2015-05-09 Thread Ravindra
Hi All,

I am trying to create custom udfs with hiveContext as given below -
scala> hiveContext.sql("CREATE TEMPORARY FUNCTION sample_to_upper AS
'com.abc.api.udf.MyUpper' USING JAR
'hdfs:///users/ravindra/customUDF2.jar'")

I have put the udf jar in the hdfs at the path given above. The same
command works well in the hive shell but failing here in the spark shell.
And it fails as given below. -
15/05/10 00:41:51 ERROR Task: FAILED:
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to load JAR
hdfs:///users/ravindra/customUDF2.jar
15/05/10 00:41:51 INFO FunctionTask: create function:
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to load JAR
hdfs:///users/ravindra/customUDF2.jar
at
org.apache.hadoop.hive.ql.exec.FunctionTask.addFunctionResources(FunctionTask.java:305)
at
org.apache.hadoop.hive.ql.exec.FunctionTask.createTemporaryFunction(FunctionTask.java:179)
at org.apache.hadoop.hive.ql.exec.FunctionTask.execute(FunctionTask.java:81)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
at
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1503)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1270)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:305)
at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)
at
org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)
at
org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)
at org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
at
org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:30)
at
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
at
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:108)
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:94)
at $line17.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:18)
at $line17.$read$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:23)
at $line17.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:25)
at $line17.$read$$iwC$$iwC$$iwC.<init>(<console>:27)
at $line17.$read$$iwC$$iwC.<init>(<console>:29)
at $line17.$read$$iwC.<init>(<console>:31)
at $line17.$read.<init>(<console>:33)
at $line17.$read$.<init>(<console>:37)
at $line17.$read$.<clinit>(<console>)
at $line17.$eval$.<init>(<console>:7)
at $line17.$eval$.<clinit>(<console>)
at $line17.$eval.$print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
at
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828)
at
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:628)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:636)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:641)
at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:968)
at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)
at
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)
at
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:916)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1011)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

15/05/10 00:41:51 ERROR Driver: FAILED: Execution

Registering UDF from a different package fails

2015-03-20 Thread Ravindra
Hi All,

I have all my UDFs defined in the classes residing in a different package
than where I am instantiating my HiveContext.

I have a register function in my UDF class. I pass the HiveContext to this
function, and in this function I call
hiveContext.registerFunction("myudf", myudf _)

All goes well, but at runtime when I execute the query, val sqlresult =
hiveContext.sql(query), it doesn't work. sqlresult comes back as null. There
is no exception thrown by Spark and there are no proper logs indicating the
error.

But all goes well if I bring my UDF class into the same package where I am
instantiating hiveContext.
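
For context, a minimal sketch of the registration pattern described above
(names are hypothetical; registerFunction is the pre-1.3 SQLContext API,
later replaced by hiveContext.udf.register):

package com.example.udfs   // deliberately a different package from the driver code

import org.apache.spark.sql.hive.HiveContext

object MyUdfs {
  def myudf(s: String): String = s.toUpperCase

  // Register the UDF on a HiveContext passed in from the driver code.
  def register(hiveContext: HiveContext): Unit =
    hiveContext.registerFunction("myudf", myudf _)
}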

I dug more into the Spark code and found that (maybe I am wrong here)

./sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala

has a class named SimpleFunctionRegistry wherein lookupFunction does not
throw an error if it doesn't find a function.


Kindly help


Appreciate

Ravi.


Re: Writing wide parquet file in Spark SQL

2015-03-11 Thread Ravindra
I am also keen to learn an answer for this, but as an alternative you can use
Hive to create a table stored as parquet and then use it in Spark.
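
A minimal sketch of that alternative; the table and column names are
hypothetical:

// Define the parquet-backed table through Hive, then read/write it from Spark SQL.
hiveContext.sql(
  "CREATE TABLE IF NOT EXISTS wide_table (c1 STRING, c2 STRING) STORED AS PARQUET")
hiveContext.sql("INSERT OVERWRITE TABLE wide_table SELECT c1, c2 FROM staging_table")
val wide = hiveContext.sql("SELECT * FROM wide_table")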

On Wed, Mar 11, 2015 at 1:44 AM kpeng1 kpe...@gmail.com wrote:

 Hi All,

 I am currently trying to write a very wide file into parquet using spark
 sql.  I have 100K column records that I am trying to write out, but of
 course I am running into space issues(out of memory - heap space).  I was
 wondering if there are any tweaks or work arounds for this.

 I am basically calling saveAsParquetFile on the schemaRDD.





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Writing-wide-parquet-file-in-Spark-SQL-tp21995.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: ANSI Standard Supported by the Spark-SQL

2015-03-10 Thread Ravindra
Thanks Michael,

That helps. So, just to summarise: we should not make any assumption
about Spark being fully compliant with any SQL standard until announced by
the community, and should maintain the same status quo as you have suggested.

Regards,
Ravi.

On Tue, Mar 10, 2015 at 11:14 PM Michael Armbrust mich...@databricks.com
wrote:

 Spark SQL supports a subset of HiveQL:
 http://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive

 On Mon, Mar 9, 2015 at 11:32 PM, Ravindra ravindra.baj...@gmail.com
 wrote:

 From the archives in this user list, it seems that Spark SQL is yet to
 achieve the SQL-92 level. But there are a few things still not clear.
 1. This is from an old post dated Aug 09, 2014.
 2. It clearly says that it doesn't support DDL and DML operations. Does
 that mean all reads (SELECT) are SQL-92 compliant?

 Please clarify.

 Regards,
 Ravi

 On Tue, Mar 10, 2015 at 11:46 AM Ravindra ravindra.baj...@gmail.com
 wrote:

 Hi All,

 I am new to Spark and trying to understand what SQL standard is
 supported by Spark. I googled around a lot but didn't get a clear answer.

 Somewhere I saw that Spark supports SQL-92, and at another location I
 found that Spark is not fully compliant with SQL-92.

 I also noticed that using HiveContext I can execute almost all the
 queries supported by Hive, so does that mean that Spark is equivalent
 to Hive in terms of SQL standards, given that HiveContext is used for
 querying?

 Please suggest.

 Regards,
 Ravi





Re: ANSI Standard Supported by the Spark-SQL

2015-03-10 Thread Ravindra
From the archives in this user list, it seems that Spark SQL is yet to
achieve the SQL-92 level. But there are a few things still not clear.
1. This is from an old post dated Aug 09, 2014.
2. It clearly says that it doesn't support DDL and DML operations. Does
that mean all reads (SELECT) are SQL-92 compliant?

Please clarify.

Regards,
Ravi

On Tue, Mar 10, 2015 at 11:46 AM Ravindra ravindra.baj...@gmail.com wrote:

 Hi All,

 I am new to Spark and trying to understand what SQL standard is supported
 by Spark. I googled around a lot but didn't get a clear answer.

 Somewhere I saw that Spark supports SQL-92, and at another location I found
 that Spark is not fully compliant with SQL-92.

 I also noticed that using HiveContext I can execute almost all the
 queries supported by Hive, so does that mean that Spark is equivalent
 to Hive in terms of SQL standards, given that HiveContext is used for
 querying?

 Please suggest.

 Regards,
 Ravi