Re: Kryo Issue on Spark 1.0.1, Mesos 0.18.2

2014-07-25 Thread Gary Malouf
Maybe this is me misunderstanding the Spark system property behavior, but
I'm not clear why the class being loaded ends up having '/' rather than '.'
in its fully qualified name.  When I tested this out locally, the '/' characters
were preventing the class from being loaded.
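As an aside, the slash-separated form is enough by itself to break a reflective lookup, since the JVM expects dot-separated binary names. A minimal, Spark-independent sketch of that symptom (the class used here is just java.lang.String, not the registrator):

// Class.forName wants dot-separated binary names, so the slash-separated
// spelling of the same class is rejected with ClassNotFoundException.
object SlashedNameCheck {
  def main(args: Array[String]): Unit = {
    println(Class.forName("java.lang.String"))   // loads fine
    try {
      Class.forName("java/lang/String")          // same class, slash-separated
    } catch {
      case e: ClassNotFoundException => println("rejected: " + e)
    }
  }
}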


On Fri, Jul 25, 2014 at 2:27 PM, Gary Malouf  wrote:

> After upgrading to Spark 1.0.1 from 0.9.1 everything seemed to be going
> well.  Looking at the Mesos slave logs, I noticed:
>
> ERROR KryoSerializer: Failed to run spark.kryo.registrator
> java.lang.ClassNotFoundException:
> com/mediacrossing/verrazano/kryo/MxDataRegistrator
>
> My spark-env.sh has the following when I run the Spark Shell:
>
> export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
>
> export MASTER=mesos://zk://n-01:2181,n-02:2181,n-03:2181/masters
>
> export ADD_JARS=/opt/spark/mx-lib/verrazano-assembly.jar
>
>
> # -XX:+UseCompressedOops must be disabled to use more than 32GB RAM
>
> SPARK_JAVA_OPTS="-Xss2m -XX:+UseCompressedOops
> -Dspark.local.dir=/opt/mesos-tmp -Dspark.executor.memory=4g
>  -Dspark.serializer=org.apache.spark.serializer.KryoSerializer
> -Dspark.kryo.registrator=com.mediacrossing.verrazano.kryo.MxDataRegistrator
> -Dspark.kryoserializer.buffer.mb=16 -Dspark.akka.askTimeout=30"
>
>
> I was able to verify that our custom jar was being copied to each worker,
> but for some reason it is not finding my registrator class.  Is anyone else
> struggling with Kryo on 1.0.x branch?
>


Re: spark streaming actor receiver doesn't play well with kryoserializer

2014-07-25 Thread Alan Ngai
The stack trace was from running the Actor count sample directly, without a 
spark cluster, so I guess the logs would be from both?  I enabled more logging 
and got this stack trace

14/07/25 17:55:26 [INFO] SecurityManager: Changing view acls to: alan
 14/07/25 17:55:26 [INFO] SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(alan)
 14/07/25 17:55:26 [DEBUG] AkkaUtils: In createActorSystem, requireCookie is: 
off
 14/07/25 17:55:26 [INFO] Slf4jLogger: Slf4jLogger started
 14/07/25 17:55:27 [INFO] Remoting: Starting remoting
 14/07/25 17:55:27 [INFO] Remoting: Remoting started; listening on addresses 
:[akka.tcp://spark@leungshwingchun:52156]
 14/07/25 17:55:27 [INFO] Remoting: Remoting now listens on addresses: 
[akka.tcp://spark@leungshwingchun:52156]
 14/07/25 17:55:27 [INFO] SparkEnv: Registering MapOutputTracker
 14/07/25 17:55:27 [INFO] SparkEnv: Registering BlockManagerMaster
 14/07/25 17:55:27 [DEBUG] DiskBlockManager: Creating local directories at root 
dirs '/var/folders/fq/fzcyqkcx7rgbycg4kr8z3m18gn/T/'
 14/07/25 17:55:27 [INFO] DiskBlockManager: Created local directory at 
/var/folders/fq/fzcyqkcx7rgbycg4kr8z3m18gn/T/spark-local-20140725175527-32f2
 14/07/25 17:55:27 [INFO] MemoryStore: MemoryStore started with capacity 297.0 
MB.
 14/07/25 17:55:27 [INFO] ConnectionManager: Bound socket to port 52157 with id 
= ConnectionManagerId(leungshwingchun,52157)
 14/07/25 17:55:27 [INFO] BlockManagerMaster: Trying to register BlockManager
 14/07/25 17:55:27 [INFO] BlockManagerInfo: Registering block manager 
leungshwingchun:52157 with 297.0 MB RAM
 14/07/25 17:55:27 [INFO] BlockManagerMaster: Registered BlockManager
 14/07/25 17:55:27 [INFO] HttpServer: Starting HTTP Server
 14/07/25 17:55:27 [DEBUG] HttpServer: HttpServer is not using security
 14/07/25 17:55:27 [INFO] HttpBroadcast: Broadcast server started at 
http://192.168.1.233:52158
 14/07/25 17:55:27 [INFO] HttpFileServer: HTTP File server directory is 
/var/folders/fq/fzcyqkcx7rgbycg4kr8z3m18gn/T/spark-5254ba11-037b-4761-b92a-3b18d42762de
 14/07/25 17:55:27 [INFO] HttpServer: Starting HTTP Server
 14/07/25 17:55:27 [DEBUG] HttpServer: HttpServer is not using security
 14/07/25 17:55:27 [DEBUG] HttpFileServer: HTTP file server started at: 
http://192.168.1.233:52159
 14/07/25 17:55:27 [INFO] SparkUI: Started SparkUI at 
http://leungshwingchun:4040
 14/07/25 17:55:27 [DEBUG] MutableMetricsFactory: field 
org.apache.hadoop.metrics2.lib.MutableRate 
org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with 
annotation @org.apache.hadoop.metrics2.annotation.Metric(valueName=Time, 
about=, value=[Rate of successful kerberos logins and latency (milliseconds)], 
always=false, type=DEFAULT, sampleName=Ops)
 14/07/25 17:55:27 [DEBUG] MutableMetricsFactory: field 
org.apache.hadoop.metrics2.lib.MutableRate 
org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with 
annotation @org.apache.hadoop.metrics2.annotation.Metric(valueName=Time, 
about=, value=[Rate of failed kerberos logins and latency (milliseconds)], 
always=false, type=DEFAULT, sampleName=Ops)
 14/07/25 17:55:27 [DEBUG] MetricsSystemImpl: UgiMetrics, User and group 
related metrics
 2014-07-25 17:55:27.796 java[79107:1703] Unable to load realm info from 
SCDynamicStore
14/07/25 17:55:27 [DEBUG] KerberosName: Kerberos krb5 configuration not found, 
setting default realm to empty
 14/07/25 17:55:27 [DEBUG] Groups:  Creating new Groups object
 14/07/25 17:55:27 [DEBUG] NativeCodeLoader: Trying to load the custom-built 
native-hadoop library...
 14/07/25 17:55:27 [DEBUG] NativeCodeLoader: Failed to load native-hadoop with 
error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
 14/07/25 17:55:27 [DEBUG] NativeCodeLoader: java.library.path=
 14/07/25 17:55:27 [WARN] NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
 14/07/25 17:55:27 [DEBUG] JniBasedUnixGroupsMappingWithFallback: Falling back 
to shell based
 14/07/25 17:55:27 [DEBUG] JniBasedUnixGroupsMappingWithFallback: Group mapping 
impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping
 14/07/25 17:55:27 [DEBUG] Groups: Group mapping 
impl=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback; 
cacheTimeout=30
 14/07/25 17:55:28 [INFO] SparkContext: Added JAR 
file:/Users/alan/dev/spark-dev/examples/target/scala-2.10/spark-examples-1.0.1-hadoop2.2.0.jar
 at http://192.168.1.233:52159/jars/spark-examples-1.0.1-hadoop2.2.0.jar with 
timestamp 1406336128212
 14/07/25 17:55:28 [DEBUG] JobScheduler: Starting JobScheduler
 14/07/25 17:55:28 [INFO] ReceiverTracker: ReceiverTracker started
 14/07/25 17:55:28 [INFO] ForEachDStream: metadataCleanupDelay = -1
 14/07/25 17:55:28 [INFO] ShuffledDStream: metadataCleanupDelay = -1
 14/07/25 17:55:28 [INFO] MappedDStream: metadataCleanupDelay = -1
 14/07/25 17:55:28 [INFO] FlatMappedDStream: met

Re: Initial job has not accepted any resources (but workers are in UI)

2014-07-25 Thread Navicore
Solution: opened all ports on the EC2 machine that the driver was running on.
Still need to narrow down which ports Akka wants, but the issue is solved.
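A partial mitigation sketch, assuming this Spark version honours spark.driver.port (other services may still bind to ephemeral ports, so this alone may not cover everything). The app name and port number below are arbitrary:

// Pin the driver's Akka port so the security group only needs that one port
// opened towards the workers, instead of all ports.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.driver.port", "7078")
val sc = new SparkContext(conf)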



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Initial-job-has-not-accepted-any-resources-but-workers-are-in-UI-tp10659p10707.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


RE: Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

2014-07-25 Thread Andrew Lee
Hi Jianshi,
Could you provide which HBase version you're using?
Also, a quick sanity check: can the workers access HBase?
Were you able to manually write one record to HBase with the serialize
function? Hardcode and test it?
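A minimal standalone sketch of such a sanity check, assuming the HBase 0.94/0.96-era client API; the table name, column family, qualifier and value below are hypothetical stand-ins for the real serialize function:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseSanityCheck {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()       // picks up hbase-site.xml from the classpath
    val table = new HTable(conf, "test_table")   // hypothetical table name
    val put = new Put(Bytes.toBytes("row-1"))
    // hardcoded family / qualifier / value instead of the real serialized KeyValue
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("cq"), Bytes.toBytes("hello"))
    table.put(put)
    table.close()
  }
}

Running this directly on a worker node (outside Spark) separates HBase connectivity problems from Spark serialization problems.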

From: jianshi.hu...@gmail.com
Date: Fri, 25 Jul 2014 15:12:18 +0800
Subject: Re: Need help, got java.lang.ExceptionInInitializerError in 
Yarn-Client/Cluster mode
To: user@spark.apache.org

I nailed it down to a union operation, here's my code snippet:
val properties: RDD[((String, String, String), Externalizer[KeyValue])] =
  vertices.map { ve =>
    val (vertices, dsName) = ve

    val rval = GraphConfig.getRval(datasetConf, Constants.VERTICES, dsName)
    val (_, rvalAsc, rvalType) = rval
    println(s"Table name: $dsName, Rval: $rval")

    println(vertices.toDebugString)
    vertices.map { v =>
      val rk = appendHash(boxId(v.id)).getBytes
      val cf = PROP_BYTES
      val cq = boxRval(v.rval, rvalAsc, rvalType).getBytes
      val value = Serializer.serialize(v.properties)
      ((new String(rk), new String(cf), new String(cq)),
       Externalizer(put(rk, cf, cq, value)))
    }
  }.reduce(_.union(_)).sortByKey(numPartitions = 32)

Basically I read data from multiple tables (Seq[RDD[(key, value)]]) and they're
transformed into KeyValues to be inserted in HBase, so I need to do a
.reduce(_.union(_)) to combine them into one RDD[(key, value)].
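A minimal spark-shell sketch of that combining step with made-up data (sc is the shell's SparkContext); for many inputs, SparkContext.union produces the same result without nesting the lineage one union at a time:

import org.apache.spark.SparkContext._   // pair-RDD functions such as sortByKey (Spark 1.0)

val tables = Seq(
  sc.parallelize(Seq(("a", 1), ("c", 3))),
  sc.parallelize(Seq(("b", 2))))

val combinedPairwise = tables.reduce(_ union _)   // as in the snippet above
val combinedFlat = sc.union(tables)               // single n-way union
combinedFlat.sortByKey(numPartitions = 2).collect()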


I cannot see what's wrong in my code.
Jianshi


On Fri, Jul 25, 2014 at 12:24 PM, Jianshi Huang  wrote:


I can successfully run my code in local mode using spark-submit (--master 
local[4]), but I got ExceptionInInitializerError errors in Yarn-client mode.


Any hints what is the problem? Is it a closure serialization problem? How can I 
debug it? Your answers would be very helpful. 

14/07/25 11:48:14 WARN scheduler.TaskSetManager: Loss was due to java.lang.ExceptionInInitializerError
java.lang.ExceptionInInitializerError
at com.paypal.risk.rds.granada.storage.hbase.HBaseStore$$anonfun$1$$anonfun$apply$1.apply(HBaseStore.scala:40)
at com.paypal.risk.rds.granada.storage.hbase.HBaseStore$$anonfun$1$$anonfun$apply$1.apply(HBaseStore.scala:36)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1016)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:847)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:847)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)



-- 
Jianshi Huang

LinkedIn: jianshi

Twitter: @jshuang
Github & Blog: http://huangjs.github.com/





Re: saveAsTextFiles file not found exception

2014-07-25 Thread Bill Jay
I just saw another error after my job was run for 2 hours:

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
No lease on /apps/data/vddil/real-time/checkpoint/temp: File does not
exist. Holder DFSClient_NONMAPREDUCE_327993456_13 does not have any
open files.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2946)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2766)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2674)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:584)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:440)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

at org.apache.hadoop.ipc.Client.call(Client.java:1410)
at org.apache.hadoop.ipc.Client.call(Client.java:1363)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy14.addBlock(Unknown Source)
at sun.reflect.GeneratedMethodAccessor146.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
at com.sun.proxy.$Proxy14.addBlock(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:361)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1439)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1261)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:525)
14/07/25 14:45:12 WARN CheckpointWriter: Error in attempt 1 of writing
checkpoint to 
hdfs://gnosis-01-01-01.crl.samsung.com/apps/data/vddil/real-time/checkpoint/checkpoint-140632470



All my jobs use the same parameter for the checkpoint function. Could that be
the reason for the error?
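If several concurrently running jobs really do point at the same checkpoint directory, giving each its own directory is worth trying; a minimal sketch with hypothetical paths and batch interval:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("job-a")
val ssc = new StreamingContext(conf, Seconds(60))
ssc.checkpoint("hdfs:///apps/data/real-time/checkpoint/job-a")   // unique per job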

I will post the stack trace of the other error after it appears again.
Thanks!


Bill


On Fri, Jul 25, 2014 at 2:57 PM, Tathagata Das 
wrote:

> Can you give a stack trace and logs of the exception? It's hard to say
> anything without any associated stack trace and logs.
>
> TD
>
>
> On Fri, Jul 25, 2014 at 1:32 PM, Bill Jay 
> wrote:
>
>> Hi,
>>
>> I am running a Spark Streaming job that uses saveAsTextFiles to save
>> results into hdfs files. However, it has an exception after 20 batches
>>
>>
>> result-140631234/_temporary/0/task_201407251119__m_03 does not 
>> exist.
>>
>>
>> When the job is running, I do not change any file in the folder. Does
>> anyone know why the file cannot be found?
>>
>> Thanks!
>>
>> Bill
>>
>
>


RE: Spark SQL and Hive tables

2014-07-25 Thread Andrew Lee
Hi Michael,
If I understand correctly, the assembly JAR file is deployed onto HDFS under
/user/$USER/.stagingSpark, where it is used by all computing (worker) nodes
when people run in yarn-cluster mode.
Could you elaborate on what the document means by this? It is a bit
misleading, and I guess it only applies to standalone mode?
Andrew L

Date: Fri, 25 Jul 2014 15:25:42 -0700
Subject: RE: Spark SQL and Hive tables
From: ssti...@live.com
To: user@spark.apache.org






Thanks!  Will do.







Sent via the Samsung GALAXY S®4, an AT&T 4G LTE smartphone





 Original message 
From: Michael Armbrust 
Date:07/25/2014 3:24 PM (GMT-08:00) 
To: user@spark.apache.org 
Subject: Re: Spark SQL and Hive tables 






[S]ince Hive has a large number of dependencies, it is not included in the
default Spark assembly. In order to use Hive you must first run
‘SPARK_HIVE=true sbt/sbt assembly/assembly’ (or use -Phive for maven). This
command builds a new assembly jar that includes Hive. Note that this Hive
assembly jar must also be present on all of the worker nodes, as they will
need access to the Hive serialization and deserialization libraries (SerDes)
in order to access data stored in Hive.





On Fri, Jul 25, 2014 at 3:20 PM, Sameer Tilak 
 wrote:



Hi Jerry,




I am having trouble with this. May be something wrong with my import or version 
etc. 



scala> import org.apache.spark.sql._;
import org.apache.spark.sql._



scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
<console>:24: error: object hive is not a member of package org.apache.spark.sql
   val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
  ^
Here is what I see for autocompletion:



scala> org.apache.spark.sql.
Row SQLContext  SchemaRDD   SchemaRDDLike   api
catalystcolumnarexecution   package parquet
test







Date: Fri, 25 Jul 2014 17:48:27 -0400


Subject: Re: Spark SQL and Hive tables


From: chiling...@gmail.com

To: user@spark.apache.org





Hi Sameer,



The blog post you referred to is about Spark SQL. I don't think the intent of 
the article is meant to guide you how to read data from Hive via Spark SQL. So 
don't worry too much about the blog post. 



The programming guide I referred to demonstrate how to read data from Hive 
using Spark SQL. It is a good starting point.



Best Regards,



Jerry





On Fri, Jul 25, 2014 at 5:38 PM, Sameer Tilak  wrote:



Hi Michael,
Thanks. I am not creating HiveContext, I am creating SQLContext. I am using CDH 
5.1. Can you please let me know which conf/ directory you are talking about? 





From: mich...@databricks.com

Date: Fri, 25 Jul 2014 14:34:53 -0700


Subject: Re: Spark SQL and Hive tables


To: user@spark.apache.org





In particular, have you put your hive-site.xml in the conf/ directory?  Also, 
are you creating a HiveContext instead of a SQLContext?




On Fri, Jul 25, 2014 at 2:27 PM, Jerry Lam  wrote:


Hi Sameer,



Maybe this page will help you: 
https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables



Best Regards,



Jerry

On Fri, Jul 25, 2014 at 5:25 PM, Sameer Tilak  wrote:




Hi All,
I am trying to load data from Hive tables using Spark SQL. I am using 
spark-shell. Here is what I see: 



val trainingDataTable = sql("""SELECT prod.prod_num, demographics.gender, 
demographics.birth_year, demographics.income_group  FROM prod p JOIN 
demographics d ON d.user_id = p.user_id""")



14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch 
MultiInstanceRelations
14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch 
CaseInsensitiveAttributeReferences
java.lang.RuntimeException: Table Not Found: prod.



I have these tables in hive. I used show tables command to confirm this. Can 
someone please let me know how do I make them accessible here?


RE: Spark SQL and Hive tables

2014-07-25 Thread sstilak
Thanks!  Will do.


Sent via the Samsung GALAXY S®4, an AT&T 4G LTE smartphone

 Original message From: Michael Armbrust 
 Date:07/25/2014  3:24 PM  (GMT-08:00) 
To: user@spark.apache.org Subject: Re: Spark SQL and Hive 
tables 

>
> [S]ince Hive has a large number of dependencies, it is not included in the
> default Spark assembly. In order to use Hive you must first run 
> ‘SPARK_HIVE=true
> sbt/sbt assembly/assembly’ (or use -Phive for maven). This command builds
> a new assembly jar that includes Hive. Note that this Hive assembly jar
> must also be present on all of the worker nodes, as they will need access
> to the Hive serialization and deserialization libraries (SerDes) in order
> to access data stored in Hive.



On Fri, Jul 25, 2014 at 3:20 PM, Sameer Tilak  wrote:

> Hi Jerry,
>
> I am having trouble with this. May be something wrong with my import or
> version etc.
>
> scala> import org.apache.spark.sql._;
> import org.apache.spark.sql._
>
> scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> <console>:24: error: object hive is not a member of package
> org.apache.spark.sql
>val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>   ^
> Here is what I see for autocompletion:
>
> scala> org.apache.spark.sql.
> Row SQLContext  SchemaRDD   SchemaRDDLike   api
> catalystcolumnarexecution   package parquet
> test
>
>
> --
> Date: Fri, 25 Jul 2014 17:48:27 -0400
>
> Subject: Re: Spark SQL and Hive tables
> From: chiling...@gmail.com
> To: user@spark.apache.org
>
>
> Hi Sameer,
>
> The blog post you referred to is about Spark SQL. I don't think the intent
> of the article is meant to guide you how to read data from Hive via Spark
> SQL. So don't worry too much about the blog post.
>
> The programming guide I referred to demonstrate how to read data from Hive
> using Spark SQL. It is a good starting point.
>
> Best Regards,
>
> Jerry
>
>
> On Fri, Jul 25, 2014 at 5:38 PM, Sameer Tilak  wrote:
>
> Hi Michael,
> Thanks. I am not creating HiveContext, I am creating SQLContext. I am
> using CDH 5.1. Can you please let me know which conf/ directory you are
> talking about?
>
> --
> From: mich...@databricks.com
> Date: Fri, 25 Jul 2014 14:34:53 -0700
>
> Subject: Re: Spark SQL and Hive tables
> To: user@spark.apache.org
>
>
> In particular, have you put your hive-site.xml in the conf/ directory?
>  Also, are you creating a HiveContext instead of a SQLContext?
>
>
> On Fri, Jul 25, 2014 at 2:27 PM, Jerry Lam  wrote:
>
> Hi Sameer,
>
> Maybe this page will help you:
> https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
>
> Best Regards,
>
> Jerry
>
>
>
> On Fri, Jul 25, 2014 at 5:25 PM, Sameer Tilak  wrote:
>
> Hi All,
> I am trying to load data from Hive tables using Spark SQL. I am using
> spark-shell. Here is what I see:
>
> val trainingDataTable = sql("""SELECT prod.prod_num, demographics.gender,
> demographics.birth_year, demographics.income_group  FROM prod p JOIN
> demographics d ON d.user_id = p.user_id""")
>
> 14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch
> MultiInstanceRelations
> 14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch
> CaseInsensitiveAttributeReferences
> java.lang.RuntimeException: Table Not Found: prod.
>
> I have these tables in hive. I used show tables command to confirm this.
> Can someone please let me know how do I make them accessible here?
>
>
>
>
>


Re: Spark SQL and Hive tables

2014-07-25 Thread Michael Armbrust
>
> [S]ince Hive has a large number of dependencies, it is not included in the
> default Spark assembly. In order to use Hive you must first run 
> ‘SPARK_HIVE=true
> sbt/sbt assembly/assembly’ (or use -Phive for maven). This command builds
> a new assembly jar that includes Hive. Note that this Hive assembly jar
> must also be present on all of the worker nodes, as they will need access
> to the Hive serialization and deserialization libraries (SerDes) in order
> to access data stored in Hive.



On Fri, Jul 25, 2014 at 3:20 PM, Sameer Tilak  wrote:

> Hi Jerry,
>
> I am having trouble with this. May be something wrong with my import or
> version etc.
>
> scala> import org.apache.spark.sql._;
> import org.apache.spark.sql._
>
> scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> <console>:24: error: object hive is not a member of package
> org.apache.spark.sql
>val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>   ^
> Here is what I see for autocompletion:
>
> scala> org.apache.spark.sql.
> Row SQLContext  SchemaRDD   SchemaRDDLike   api
> catalystcolumnarexecution   package parquet
> test
>
>
> --
> Date: Fri, 25 Jul 2014 17:48:27 -0400
>
> Subject: Re: Spark SQL and Hive tables
> From: chiling...@gmail.com
> To: user@spark.apache.org
>
>
> Hi Sameer,
>
> The blog post you referred to is about Spark SQL. I don't think the intent
> of the article is meant to guide you how to read data from Hive via Spark
> SQL. So don't worry too much about the blog post.
>
> The programming guide I referred to demonstrate how to read data from Hive
> using Spark SQL. It is a good starting point.
>
> Best Regards,
>
> Jerry
>
>
> On Fri, Jul 25, 2014 at 5:38 PM, Sameer Tilak  wrote:
>
> Hi Michael,
> Thanks. I am not creating HiveContext, I am creating SQLContext. I am
> using CDH 5.1. Can you please let me know which conf/ directory you are
> talking about?
>
> --
> From: mich...@databricks.com
> Date: Fri, 25 Jul 2014 14:34:53 -0700
>
> Subject: Re: Spark SQL and Hive tables
> To: user@spark.apache.org
>
>
> In particular, have you put your hive-site.xml in the conf/ directory?
>  Also, are you creating a HiveContext instead of a SQLContext?
>
>
> On Fri, Jul 25, 2014 at 2:27 PM, Jerry Lam  wrote:
>
> Hi Sameer,
>
> Maybe this page will help you:
> https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
>
> Best Regards,
>
> Jerry
>
>
>
> On Fri, Jul 25, 2014 at 5:25 PM, Sameer Tilak  wrote:
>
> Hi All,
> I am trying to load data from Hive tables using Spark SQL. I am using
> spark-shell. Here is what I see:
>
> val trainingDataTable = sql("""SELECT prod.prod_num, demographics.gender,
> demographics.birth_year, demographics.income_group  FROM prod p JOIN
> demographics d ON d.user_id = p.user_id""")
>
> 14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch
> MultiInstanceRelations
> 14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch
> CaseInsensitiveAttributeReferences
> java.lang.RuntimeException: Table Not Found: prod.
>
> I have these tables in hive. I used show tables command to confirm this.
> Can someone please let me know how do I make them accessible here?
>
>
>
>
>


Re: spark streaming actor receiver doesn't play well with kryoserializer

2014-07-25 Thread Tathagata Das
Is this error on the executor or on the driver? Can you provide a larger
snippet of the logs, from the driver as well as, if possible, from the executors?

TD


On Thu, Jul 24, 2014 at 10:28 PM, Alan Ngai  wrote:

> bump.  any ideas?
>
> On Jul 24, 2014, at 3:09 AM, Alan Ngai  wrote:
>
> it looks like when you configure sparkconfig to use the kryoserializer in
> combination of using an ActorReceiver, bad things happen.  I modified the
> ActorWordCount example program from
>
> val sparkConf = new SparkConf().setAppName("ActorWordCount")
>
> to
>
> val sparkConf = new SparkConf()
>   .setAppName("ActorWordCount")
>   .set("spark.serializer",
> "org.apache.spark.serializer.KryoSerializer”)
>
> and I get the stack trace below.  I figured it might be that Kryo doesn’t
> know how to serialize/deserialize the actor so I added a registry.  I also
> added a default empty constructor to SampleActorReceiver just for kicks
>
> class SerializationRegistry extends KryoRegistrator {
>   override def registerClasses(kryo: Kryo) {
> kryo.register(classOf[SampleActorReceiver])
>   }
> }
>
> …
>
> case class SampleActorReceiver[T: ClassTag](var urlOfPublisher: String)
> extends Actor with ActorHelper {
>   def this() = this("")
>   ...
> }
>
> ...
> val sparkConf = new SparkConf()
>   .setAppName("ActorWordCount")
>   .set("spark.serializer",
> "org.apache.spark.serializer.KryoSerializer")
>   .set("spark.kryo.registrator",
> "org.apache.spark.examples.streaming.SerializationRegistry")
>
>
> None of this worked, same stack trace.  Any idea what’s going on?  Is this
> a known issue and is there a workaround?
>
> 14/07/24 02:58:19 [ERROR] OneForOneStrategy: configuration problem while
> creating [akka://spark/user/Supervisor0/SampleReceiver] with dispatcher
> [akka.actor.default-dispatcher] and mailbox [akka.actor.default-mailbox]
>  akka.actor.ActorInitializationException: exception during creation
>  at akka.actor.ActorInitializationException$.apply(Actor.scala:218)
> at akka.actor.ActorCell.create(ActorCell.scala:578)
>  at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:425)
>  at akka.actor.ActorCell.systemInvoke(ActorCell.scala:447)
> at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:262)
>  at akka.dispatch.Mailbox.run(Mailbox.scala:218)
>  at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>  at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>  at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: akka.ConfigurationException: configuration problem while
> creating [akka://spark/user/Supervisor0/SampleReceiver] with dispatcher
> [akka.actor.default-dispatcher] and mailbox [akka.actor.default-mailbox]
>  at akka.actor.LocalActorRefProvider.actorOf(ActorRefProvider.scala:723)
>  at
> akka.remote.RemoteActorRefProvider.actorOf(RemoteActorRefProvider.scala:296)
>  at akka.actor.dungeon.Children$class.makeChild(Children.scala:191)
> at akka.actor.dungeon.Children$class.actorOf(Children.scala:38)
>  at akka.actor.ActorCell.actorOf(ActorCell.scala:338)
>  at
> org.apache.spark.streaming.receiver.ActorReceiver$Supervisor.(ActorReceiver.scala:152)
>  at
> org.apache.spark.streaming.receiver.ActorReceiver$$anonfun$supervisor$1.apply(ActorReceiver.scala:145)
>  at
> org.apache.spark.streaming.receiver.ActorReceiver$$anonfun$supervisor$1.apply(ActorReceiver.scala:145)
>  at akka.actor.TypedCreatorFunctionConsumer.produce(Props.scala:401)
> at akka.actor.Props.newActor(Props.scala:339)
>  at akka.actor.ActorCell.newActor(ActorCell.scala:534)
>  at akka.actor.ActorCell.create(ActorCell.scala:560)
> ... 9 more
> Caused by: java.lang.IllegalArgumentException: constructor public
> akka.actor.CreatorFunctionConsumer(scala.Function0) is incompatible with
> arguments [class java.lang.Class, class
> org.apache.spark.examples.streaming.ActorWordCount$$anonfun$2]
>  at akka.util.Reflect$.instantiate(Reflect.scala:69)
>  at akka.actor.Props.cachedActorClass(Props.scala:203)
> at akka.actor.Props.actorClass(Props.scala:327)
>  at akka.dispatch.Mailboxes.getMailboxType(Mailboxes.scala:124)
>  at akka.actor.LocalActorRefProvider.actorOf(ActorRefProvider.scala:718)
> ... 20 more
> Caused by: java.lang.IllegalArgumentException: wrong number of arguments
>  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>  at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
>  at akka.util.Reflect$.instantiate(Reflect.scala:65)
> ... 24 more
>
>
>


RE: Spark SQL and Hive tables

2014-07-25 Thread Sameer Tilak
Hi Jerry,
I am having trouble with this. Maybe something is wrong with my import or
version etc.

scala> import org.apache.spark.sql._;
import org.apache.spark.sql._

scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
<console>:24: error: object hive is not a member of package org.apache.spark.sql
       val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
                                                  ^
Here is what I see for autocompletion:

scala> org.apache.spark.sql.
Row             SQLContext      SchemaRDD       SchemaRDDLike   api
catalyst        columnar        execution       package         parquet
test

Date: Fri, 25 Jul 2014 17:48:27 -0400
Subject: Re: Spark SQL and Hive tables
From: chiling...@gmail.com
To: user@spark.apache.org

Hi Sameer,
The blog post you referred to is about Spark SQL. I don't think the intent of 
the article is meant to guide you how to read data from Hive via Spark SQL. So 
don't worry too much about the blog post. 

The programming guide I referred to demonstrate how to read data from Hive 
using Spark SQL. It is a good starting point.
Best Regards,
Jerry


On Fri, Jul 25, 2014 at 5:38 PM, Sameer Tilak  wrote:




Hi Michael,
Thanks. I am not creating HiveContext, I am creating SQLContext. I am using
CDH 5.1. Can you please let me know which conf/ directory you are talking
about?

From: mich...@databricks.com

Date: Fri, 25 Jul 2014 14:34:53 -0700
Subject: Re: Spark SQL and Hive tables
To: user@spark.apache.org


In particular, have you put your hive-site.xml in the conf/ directory?  Also, 
are you creating a HiveContext instead of a SQLContext?

On Fri, Jul 25, 2014 at 2:27 PM, Jerry Lam  wrote:



Hi Sameer,
Maybe this page will help you: 
https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables




Best Regards,
Jerry


On Fri, Jul 25, 2014 at 5:25 PM, Sameer Tilak  wrote:







Hi All,
I am trying to load data from Hive tables using Spark SQL. I am using
spark-shell. Here is what I see:

val trainingDataTable = sql("""SELECT prod.prod_num, demographics.gender,
demographics.birth_year, demographics.income_group  FROM prod p JOIN
demographics d ON d.user_id = p.user_id""")

14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch
MultiInstanceRelations
14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch
CaseInsensitiveAttributeReferences
java.lang.RuntimeException: Table Not Found: prod.

I have these tables in hive. I used show tables command to confirm this. Can
someone please let me know how do I make them accessible here?

Re: saveAsTextFiles file not found exception

2014-07-25 Thread Tathagata Das
Can you give a stack trace and logs of the exception? It's hard to say
anything without any associated stack trace and logs.

TD


On Fri, Jul 25, 2014 at 1:32 PM, Bill Jay 
wrote:

> Hi,
>
> I am running a Spark Streaming job that uses saveAsTextFiles to save
> results into hdfs files. However, it has an exception after 20 batches
>
>
> result-140631234/_temporary/0/task_201407251119__m_03 does not 
> exist.
>
>
> When the job is running, I do not change any file in the folder. Does
> anyone know why the file cannot be found?
>
> Thanks!
>
> Bill
>


Re: Using Spark Streaming with Kafka 0.7.2

2014-07-25 Thread Tathagata Das
Spark Streaming is built as part of the whole Spark repository, so follow
Spark's building instructions to build Spark Streaming along with Spark.
Spark Streaming 0.8.1 was built with Kafka 0.7.2, so you can take a look at
that version. If necessary, I recommend modifying the current Kafka receiver
based on the 0.8.1 Kafka receiver.

TD


On Fri, Jul 25, 2014 at 10:16 AM, maddenpj  wrote:

> Hi all,
>
> Currently we have Kafka 0.7.2 running in production and can't upgrade for
> external reasons however spark streaming (1.0.1) was built with Kafka
> 0.8.0.
> What is the best way to use spark streaming with older versions of Kafka.
> Currently I'm investigating trying to build spark streaming myself but I
> can't find any documentation specifically for building spark streaming.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Using-Spark-Streaming-with-Kafka-0-7-2-tp10674.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


RE: Spark SQL and Hive tables

2014-07-25 Thread Sameer Tilak

Thanks, Michael.
From: mich...@databricks.com
Date: Fri, 25 Jul 2014 14:49:00 -0700
Subject: Re: Spark SQL and Hive tables
To: user@spark.apache.org

>From the programming guide:
When working with Hive one must construct a HiveContext, which inherits from 
SQLContext, and adds support for finding tables in the MetaStore and writing
queries using HiveQL.


 conf/ is a top level directory in the spark distribution that you downloaded.

On Fri, Jul 25, 2014 at 2:35 PM, Sameer Tilak  wrote:





Hi Jerry,
Thanks for your reply. I was following the steps in this programming guide.
It does not mention anything about creating HiveContext or HQL explicitly.



http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html


Users(userId INT, name String, email STRING, age INT, latitude: DOUBLE,
longitude: DOUBLE, subscribed: BOOLEAN)
Events(userId INT, action INT)

Given the data stored in these tables, one might want to build a model that
will predict which users are good targets for a new campaign, based on users
that are similar.

// Data can easily be extracted from existing sources,
// such as Apache Hive.
val trainingDataTable = sql("""
  SELECT e.action
 u.age,
 u.latitude,
 u.logitude
  FROM Users u
  JOIN Events e
  ON u.userId = e.userId""")




Date: Fri, 25 Jul 2014 17:27:17 -0400
Subject: Re: Spark SQL and Hive tables
From: chiling...@gmail.com


To: user@spark.apache.org

Hi Sameer,
Maybe this page will help you: 
https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables



Best Regards,
Jerry


On Fri, Jul 25, 2014 at 5:25 PM, Sameer Tilak  wrote:






Hi All,
I am trying to load data from Hive tables using Spark SQL. I am using
spark-shell. Here is what I see:

val trainingDataTable = sql("""SELECT prod.prod_num, demographics.gender,
demographics.birth_year, demographics.income_group  FROM prod p JOIN
demographics d ON d.user_id = p.user_id""")

14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch
MultiInstanceRelations
14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch
CaseInsensitiveAttributeReferences
java.lang.RuntimeException: Table Not Found: prod.

I have these tables in hive. I used show tables command to confirm this. Can
someone please let me know how do I make them accessible here?

RE: Spark SQL and Hive tables

2014-07-25 Thread Sameer Tilak
Thanks, Jerry.

Date: Fri, 25 Jul 2014 17:48:27 -0400
Subject: Re: Spark SQL and Hive tables
From: chiling...@gmail.com
To: user@spark.apache.org

Hi Sameer,
The blog post you referred to is about Spark SQL. I don't think the intent of 
the article is meant to guide you how to read data from Hive via Spark SQL. So 
don't worry too much about the blog post. 

The programming guide I referred to demonstrate how to read data from Hive 
using Spark SQL. It is a good starting point.
Best Regards,
Jerry


On Fri, Jul 25, 2014 at 5:38 PM, Sameer Tilak  wrote:




Hi Michael,
Thanks. I am not creating HiveContext, I am creating SQLContext. I am using
CDH 5.1. Can you please let me know which conf/ directory you are talking
about?

From: mich...@databricks.com

Date: Fri, 25 Jul 2014 14:34:53 -0700
Subject: Re: Spark SQL and Hive tables
To: user@spark.apache.org


In particular, have you put your hive-site.xml in the conf/ directory?  Also, 
are you creating a HiveContext instead of a SQLContext?

On Fri, Jul 25, 2014 at 2:27 PM, Jerry Lam  wrote:



Hi Sameer,
Maybe this page will help you: 
https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables




Best Regards,
Jerry


On Fri, Jul 25, 2014 at 5:25 PM, Sameer Tilak  wrote:







Hi All,
I am trying to load data from Hive tables using Spark SQL. I am using
spark-shell. Here is what I see:

val trainingDataTable = sql("""SELECT prod.prod_num, demographics.gender,
demographics.birth_year, demographics.income_group  FROM prod p JOIN
demographics d ON d.user_id = p.user_id""")

14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch
MultiInstanceRelations
14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch
CaseInsensitiveAttributeReferences
java.lang.RuntimeException: Table Not Found: prod.

I have these tables in hive. I used show tables command to confirm this. Can
someone please let me know how do I make them accessible here?

Re: Spark SQL and Hive tables

2014-07-25 Thread Michael Armbrust
>From the programming guide:

When working with Hive one must construct a HiveContext, which inherits
> from SQLContext, and adds support for finding tables in the MetaStore
> and writing queries using HiveQL.


 conf/ is a top level directory in the spark distribution that you
downloaded.
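A minimal spark-shell sketch of that setup against a Hive-enabled assembly (sc is the shell's SparkContext; the table name here is hypothetical), using the hql method as it exists in the Spark 1.0.x API:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext._                       // brings hql(...) into scope

val rows = hql("SELECT prod_num FROM prod LIMIT 10")
rows.collect().foreach(println)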


On Fri, Jul 25, 2014 at 2:35 PM, Sameer Tilak  wrote:

> Hi Jerry,
> Thanks for your reply. I was following the steps in this programming
> guide. It does not mention anything about creating HiveContext or HQL
> explicitly.
>
>
>
> http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html
>
>
>- Users(userId INT, name String, email STRING,
>age INT, latitude: DOUBLE, longitude: DOUBLE,
>subscribed: BOOLEAN)
>- Events(userId INT, action INT)
>
> Given the data stored in these tables, one might want to build a model
> that will predict which users are good targets for a new campaign, based on
> users that are similar.
>
> // Data can easily be extracted from existing sources,
> // such as Apache Hive.
> val trainingDataTable = sql("""
>   SELECT e.action
>  u.age,
>  u.latitude,
>  u.logitude
>   FROM Users u
>   JOIN Events e
>   ON u.userId = e.userId""")
>
>
>
> --
> Date: Fri, 25 Jul 2014 17:27:17 -0400
> Subject: Re: Spark SQL and Hive tables
> From: chiling...@gmail.com
> To: user@spark.apache.org
>
>
> Hi Sameer,
>
> Maybe this page will help you:
> https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
>
> Best Regards,
>
> Jerry
>
>
>
> On Fri, Jul 25, 2014 at 5:25 PM, Sameer Tilak  wrote:
>
> Hi All,
> I am trying to load data from Hive tables using Spark SQL. I am using
> spark-shell. Here is what I see:
>
> val trainingDataTable = sql("""SELECT prod.prod_num, demographics.gender,
> demographics.birth_year, demographics.income_group  FROM prod p JOIN
> demographics d ON d.user_id = p.user_id""")
>
> 14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch
> MultiInstanceRelations
> 14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch
> CaseInsensitiveAttributeReferences
> java.lang.RuntimeException: Table Not Found: prod.
>
> I have these tables in hive. I used show tables command to confirm this.
> Can someone please let me know how do I make them accessible here?
>
>
>


Re: Spark SQL and Hive tables

2014-07-25 Thread Jerry Lam
Hi Sameer,

The blog post you referred to is about Spark SQL. I don't think the intent
of the article is meant to guide you how to read data from Hive via Spark
SQL. So don't worry too much about the blog post.

The programming guide I referred to demonstrate how to read data from Hive
using Spark SQL. It is a good starting point.

Best Regards,

Jerry


On Fri, Jul 25, 2014 at 5:38 PM, Sameer Tilak  wrote:

> Hi Michael,
> Thanks. I am not creating HiveContext, I am creating SQLContext. I am
> using CDH 5.1. Can you please let me know which conf/ directory you are
> talking about?
>
> --
> From: mich...@databricks.com
> Date: Fri, 25 Jul 2014 14:34:53 -0700
>
> Subject: Re: Spark SQL and Hive tables
> To: user@spark.apache.org
>
>
> In particular, have you put your hive-site.xml in the conf/ directory?
>  Also, are you creating a HiveContext instead of a SQLContext?
>
>
> On Fri, Jul 25, 2014 at 2:27 PM, Jerry Lam  wrote:
>
> Hi Sameer,
>
> Maybe this page will help you:
> https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
>
> Best Regards,
>
> Jerry
>
>
>
> On Fri, Jul 25, 2014 at 5:25 PM, Sameer Tilak  wrote:
>
> Hi All,
> I am trying to load data from Hive tables using Spark SQL. I am using
> spark-shell. Here is what I see:
>
> val trainingDataTable = sql("""SELECT prod.prod_num, demographics.gender,
> demographics.birth_year, demographics.income_group  FROM prod p JOIN
> demographics d ON d.user_id = p.user_id""")
>
> 14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch
> MultiInstanceRelations
> 14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch
> CaseInsensitiveAttributeReferences
> java.lang.RuntimeException: Table Not Found: prod.
>
> I have these tables in hive. I used show tables command to confirm this.
> Can someone please let me know how do I make them accessible here?
>
>
>
>


RE: Spark SQL and Hive tables

2014-07-25 Thread Sameer Tilak
Hi Michael,
Thanks. I am not creating HiveContext, I am creating SQLContext. I am using
CDH 5.1. Can you please let me know which conf/ directory you are talking
about?

From: mich...@databricks.com
Date: Fri, 25 Jul 2014 14:34:53 -0700
Subject: Re: Spark SQL and Hive tables
To: user@spark.apache.org

In particular, have you put your hive-site.xml in the conf/ directory?  Also, 
are you creating a HiveContext instead of a SQLContext?

On Fri, Jul 25, 2014 at 2:27 PM, Jerry Lam  wrote:


Hi Sameer,
Maybe this page will help you: 
https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables



Best Regards,
Jerry


On Fri, Jul 25, 2014 at 5:25 PM, Sameer Tilak  wrote:






Hi All,
I am trying to load data from Hive tables using Spark SQL. I am using
spark-shell. Here is what I see:

val trainingDataTable = sql("""SELECT prod.prod_num, demographics.gender,
demographics.birth_year, demographics.income_group  FROM prod p JOIN
demographics d ON d.user_id = p.user_id""")

14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch
MultiInstanceRelations
14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch
CaseInsensitiveAttributeReferences
java.lang.RuntimeException: Table Not Found: prod.

I have these tables in hive. I used show tables command to confirm this. Can
someone please let me know how do I make them accessible here?

Re: Emacs Setup Anyone?

2014-07-25 Thread Andrei
I have never tried Spark REPL from within Emacs, but I remember that
switching from normal Python to Pyspark was as simple as changing
interpreter name at the beginning of session. Seems like ensime [1]
(together with ensime-emacs [2]) should be a good point to start. For
example, take a look at ensime-sbt.el [3] that defines a number of
Scala/SBT commands.

[1]: https://github.com/ensime/ensime-server
[2]: https://github.com/ensime/ensime-emacs
[3]: https://github.com/ensime/ensime-emacs/blob/master/ensime-sbt.el




On Thu, Jul 24, 2014 at 10:14 PM, Steve Nunez 
wrote:

> Anyone out there have a good configuration for emacs? Scala-mode sort of
> works, but I’d love to see a fully-supported spark-mode with an inferior
> shell. Searching didn’t turn up much of anything.
>
> Any emacs users out there? What setup are you using?
>
> Cheers,
> - SteveN
>
>
>
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity
> to which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.


Re: Spark SQL and Hive tables

2014-07-25 Thread Michael Armbrust
In particular, have you put your hive-site.xml in the conf/ directory?
 Also, are you creating a HiveContext instead of a SQLContext?


On Fri, Jul 25, 2014 at 2:27 PM, Jerry Lam  wrote:

> Hi Sameer,
>
> Maybe this page will help you:
> https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
>
> Best Regards,
>
> Jerry
>
>
>
> On Fri, Jul 25, 2014 at 5:25 PM, Sameer Tilak  wrote:
>
>> Hi All,
>> I am trying to load data from Hive tables using Spark SQL. I am using
>> spark-shell. Here is what I see:
>>
>> val trainingDataTable = sql("""SELECT prod.prod_num, demographics.gender,
>> demographics.birth_year, demographics.income_group  FROM prod p JOIN
>> demographics d ON d.user_id = p.user_id""")
>>
>> 14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch
>> MultiInstanceRelations
>> 14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch
>> CaseInsensitiveAttributeReferences
>> java.lang.RuntimeException: Table Not Found: prod.
>>
>> I have these tables in hive. I used show tables command to confirm this.
>> Can someone please let me know how do I make them accessible here?
>>
>
>


RE: Spark SQL and Hive tables

2014-07-25 Thread Sameer Tilak
Hi Jerry,
Thanks for your reply. I was following the steps in this programming guide.
It does not mention anything about creating HiveContext or HQL explicitly.

http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html

Users(userId INT, name String, email STRING, age INT, latitude: DOUBLE,
longitude: DOUBLE, subscribed: BOOLEAN)
Events(userId INT, action INT)

Given the data stored in these tables, one might want to build a model that
will predict which users are good targets for a new campaign, based on users
that are similar.

// Data can easily be extracted from existing sources,
// such as Apache Hive.
val trainingDataTable = sql("""
  SELECT e.action
 u.age,
 u.latitude,
 u.logitude
  FROM Users u
  JOIN Events e
  ON u.userId = e.userId""")


Date: Fri, 25 Jul 2014 17:27:17 -0400
Subject: Re: Spark SQL and Hive tables
From: chiling...@gmail.com
To: user@spark.apache.org

Hi Sameer,
Maybe this page will help you: 
https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables

Best Regards,
Jerry


On Fri, Jul 25, 2014 at 5:25 PM, Sameer Tilak  wrote:




Hi All,
I am trying to load data from Hive tables using Spark SQL. I am using
spark-shell. Here is what I see:

val trainingDataTable = sql("""SELECT prod.prod_num, demographics.gender,
demographics.birth_year, demographics.income_group  FROM prod p JOIN
demographics d ON d.user_id = p.user_id""")

14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch
MultiInstanceRelations
14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch
CaseInsensitiveAttributeReferences
java.lang.RuntimeException: Table Not Found: prod.

I have these tables in hive. I used show tables command to confirm this. Can
someone please let me know how do I make them accessible here?

Re: Spark SQL and Hive tables

2014-07-25 Thread Jerry Lam
Hi Sameer,

Maybe this page will help you:
https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables

Best Regards,

Jerry



On Fri, Jul 25, 2014 at 5:25 PM, Sameer Tilak  wrote:

> Hi All,
> I am trying to load data from Hive tables using Spark SQL. I am using
> spark-shell. Here is what I see:
>
> val trainingDataTable = sql("""SELECT prod.prod_num, demographics.gender,
> demographics.birth_year, demographics.income_group  FROM prod p JOIN
> demographics d ON d.user_id = p.user_id""")
>
> 14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch
> MultiInstanceRelations
> 14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch
> CaseInsensitiveAttributeReferences
> java.lang.RuntimeException: Table Not Found: prod.
>
> I have these tables in hive. I used show tables command to confirm this.
> Can someone please let me know how do I make them accessible here?
>


Spark SQL and Hive tables

2014-07-25 Thread Sameer Tilak
Hi All,
I am trying to load data from Hive tables using Spark SQL. I am using
spark-shell. Here is what I see:

val trainingDataTable = sql("""SELECT prod.prod_num, demographics.gender,
demographics.birth_year, demographics.income_group  FROM prod p JOIN
demographics d ON d.user_id = p.user_id""")

14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch
MultiInstanceRelations
14/07/25 14:18:46 INFO Analyzer: Max iterations (2) reached for batch
CaseInsensitiveAttributeReferences
java.lang.RuntimeException: Table Not Found: prod.

I have these tables in hive. I used show tables command to confirm this. Can
someone please let me know how do I make them accessible here?

saveAsTextFiles file not found exception

2014-07-25 Thread Bill Jay
Hi,

I am running a Spark Streaming job that uses saveAsTextFiles to save
results into hdfs files. However, it has an exception after 20 batches

result-140631234/_temporary/0/task_201407251119__m_03 does
not exist.


When the job is running, I do not change any file in the folder. Does
anyone know why the file cannot be found?

Thanks!

Bill


Re: Support for Percentile and Variance Aggregation functions in Spark with HiveContext

2014-07-25 Thread Michael Armbrust
Hmm, in general we try to support all the UDAFs, but this one must be using
a different base class that we don't have a wrapper for.  JIRA here:
https://issues.apache.org/jira/browse/SPARK-2693
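Until that JIRA is resolved, one workaround (not the HiveContext path) is to compute the percentile per group directly on an RDD; a rough spark-shell sketch with made-up data (sc is the shell's SparkContext):

import org.apache.spark.SparkContext._   // pair-RDD functions (Spark 1.0)

val rows = sc.parallelize(Seq(
  ((2014, 7, 25), 1.0), ((2014, 7, 25), 3.0), ((2014, 7, 25), 2.0),
  ((2014, 7, 26), 10.0), ((2014, 7, 26), 20.0)))

// Naive percentile over an in-memory group; fine as long as each group fits in memory.
def percentile(xs: Seq[Double], p: Double): Double = {
  val sorted = xs.sorted
  sorted(math.min(sorted.size - 1, (p * sorted.size).toInt))
}

val p50ByDay = rows.groupByKey().mapValues(vs => percentile(vs.toSeq, 0.5))
p50ByDay.collect().foreach(println)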


On Fri, Jul 25, 2014 at 8:06 AM,  wrote:

>
> Hi all,
>
> I am using Spark 1.0.0 with CDH 5.1.0.
>
> I want to aggregate the data in a raw table using a simple query like below
>
> *SELECT MIN(field1), MAX(field2), AVG(field3), PERCENTILE(field4),
> year,month,day FROM  raw_data_table  GROUP BY year, month, day*
>
> MIN, MAX and AVG functions work fine for me, but with PERCENTILE, I get an
> error as shown below.
>
> Exception in thread "main" java.lang.RuntimeException: No handler for udf
> class org.apache.hadoop.hive.ql.udf.UDAFPercentile
> at scala.sys.package$.error(package.scala:27)
> at
> org.apache.spark.sql.hive.HiveFunctionRegistry$.lookupFunction(hiveUdfs.scala:69)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:115)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:113)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
>
> I have read in the documentation that with HiveContext Spark SQL supports
> all the UDFs supported in Hive.
>
> I want to know if there is anything else I need to follow to use
> Percentile with Spark SQL..?? Or .. Are there any limitations still in
> Spark SQL with respect to UDFs and UDAFs in the version I am using..??
>
>
>
>
>
> Thanks and regards
>
> Vinay Kashyap
>


Re: sparkcontext stop and then start again

2014-07-25 Thread Davies Liu
Hey Mohit,

Behind pyspark.SparkContext there is a SparkContext in the JVM, so the
overhead of creating a SparkContext is pretty high. Also, starting and
stopping a SparkContext involves lots of things that need to be set up and
released, and maybe there are some corner cases that make it not solid
enough.

So it's better to re-use a single SparkContext for all the jobs in one Python
script. In the meantime, you will have one WebUI to see all the job histories.

Davies


On Fri, Jul 25, 2014 at 10:21 AM, Mohit Jaggi  wrote:
> Folks,
> I had some pyspark code which used to hang with no useful debug logs. It got
> fixed when I changed my code to keep the sparkcontext forever instead of
> stopping it and then creating another one later. Is this a bug or expected
> behavior?
>
> Mohit.


Kryo Issue on Spark 1.0.1, Mesos 0.18.2

2014-07-25 Thread Gary Malouf
After upgrading to Spark 1.0.1 from 0.9.1 everything seemed to be going
well.  Looking at the Mesos slave logs, I noticed:

ERROR KryoSerializer: Failed to run spark.kryo.registrator
java.lang.ClassNotFoundException:
com/mediacrossing/verrazano/kryo/MxDataRegistrator

My spark-env.sh has the following when I run the Spark Shell:

export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so

export MASTER=mesos://zk://n-01:2181,n-02:2181,n-03:2181/masters

export ADD_JARS=/opt/spark/mx-lib/verrazano-assembly.jar


# -XX:+UseCompressedOops must be disabled to use more than 32GB RAM

SPARK_JAVA_OPTS="-Xss2m -XX:+UseCompressedOops
-Dspark.local.dir=/opt/mesos-tmp -Dspark.executor.memory=4g
 -Dspark.serializer=org.apache.spark.serializer.KryoSerializer
-Dspark.kryo.registrator=com.mediacrossing.verrazano.kryo.MxDataRegistrator
-Dspark.kryoserializer.buffer.mb=16 -Dspark.akka.askTimeout=30"


I was able to verify that our custom jar was being copied to each worker,
but for some reason it is not finding my registrator class.  Is anyone else
struggling with Kryo on 1.0.x branch?
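For comparison, a minimal sketch of what the registrator wiring looks like in Spark 1.0 (the real MxDataRegistrator is not shown in this thread, so the class and the type it registers below are hypothetical stand-ins):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

case class MxData(id: Long, payload: Array[Byte])

class MxDataRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MxData])
  }
}

// The registrator is referenced by its dot-separated fully qualified name, and
// the jar that contains it has to be visible on every executor's classpath.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[MxDataRegistrator].getName)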


Re: Issue submitting spark job to yarn

2014-07-25 Thread Ron Gonzalez
Turns out that it had the Spark assembly in the target/dependency dir, so
resource localization failed: Spark was uploading both the assembly from the
dependency dir and the assembly it localizes as part of the main code path.
YARN doesn't like it if the file is overwritten in HDFS after it's been
registered as a local resource. Node manager logs are your friend!

Just sharing in case other folks run into the same problem.

Thanks,
Ron

Sent from my iPhone

> On Jul 25, 2014, at 9:36 AM, Ron Gonzalez  wrote:
> 
> Folks,
>   I've been able to submit simple jobs to yarn thus far. However, when I did 
> something more complicated that added 194 dependency jars using --addJars, 
> the job fails in YARN with no logs. What ends up happening is that no 
> container logs get created (app master or executor). If I add just a couple 
> of dependencies, it works, so this is clearly a case of too many dependencies 
> passed into the invocation.
> 
>   Not sure if this means that no container was created at all, but bottom 
> line is that I get no logs that can help me determine what's wrong. Because 
> of the large number of jars, I figured it might have been a permgen issue so 
> I added these options. However, that didn't help. It seems as if the actual 
> submission wasn't even spawned since no container was created or no log was 
> found.
> 
>   Any ideas for this?
> 
> Thanks,
> Ron


Re: Hadoop client protocol mismatch with spark 1.0.1, cdh3u5

2014-07-25 Thread Bharath Ravi Kumar
That's right, I'm looking to depend on spark in general and change only the
hadoop client deps. The spark master and slaves use the
spark-1.0.1-bin-hadoop1 binaries from the downloads page.  The relevant
snippet from the app's maven pom is as follows:


<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.0.1</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>0.20.2-cdh3u5</version>
  <type>jar</type>
</dependency>

<repositories>
  <repository>
    <name>Cloudera repository</name>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
  </repository>
  <repository>
    <name>Akka repository</name>
    <url>http://repo.akka.io/releases</url>
  </repository>
</repositories>

Thanks,
Bharath


On Fri, Jul 25, 2014 at 10:29 PM, Sean Owen  wrote:

> If you link against the pre-built binary, that's for Hadoop 1.0.4. Can
> you show your deps to clarify what you are depending on? Building
> custom Spark and depending on it is a different thing from depending
> on plain Spark and changing its deps. I think you want the latter.
>
> On Fri, Jul 25, 2014 at 5:46 PM, Bharath Ravi Kumar 
> wrote:
> > Thanks for responding. I used the pre built spark binaries meant for
> > hadoop1,cdh3u5. I do not intend to build spark against a specific
> > distribution. Irrespective of whether I build my app with the explicit
> cdh
> > hadoop client dependency,  I get the same error message. I also verified
> > that my  app's uber jar had pulled in the cdh hadoop client dependencies.
> >
> > On 25-Jul-2014 9:26 pm, "Sean Owen"  wrote:
> >>
> >> This indicates your app is not actually using the version of the HDFS
> >> client you think. You built Spark from source with the right deps it
> >> seems, but are you sure you linked to your build in your app?
> >>
> >> On Fri, Jul 25, 2014 at 4:32 PM, Bharath Ravi Kumar <
> reachb...@gmail.com>
> >> wrote:
> >> > Any suggestions to  work around this issue ? The pre built spark
> >> > binaries
> >> > don't appear to work against cdh as documented, unless there's a build
> >> > issue, which seems unlikely.
> >> >
> >> > On 25-Jul-2014 3:42 pm, "Bharath Ravi Kumar" 
> >> > wrote:
> >> >>
> >> >>
> >> >> I'm encountering a hadoop client protocol mismatch trying to read
> from
> >> >> HDFS (cdh3u5) using the pre-build spark from the downloads page
> (linked
> >> >> under "For Hadoop 1 (HDP1, CDH3)"). I've also  followed the
> >> >> instructions at
> >> >>
> >> >>
> http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html
> >> >> (i.e. building the app against hadoop-client 0.20.2-cdh3u5), but
> >> >> continue to
> >> >> see the following error regardless of whether I link the app with the
> >> >> cdh
> >> >> client:
> >> >>
> >> >> 14/07/25 09:53:43 INFO client.AppClient$ClientActor: Executor
> updated:
> >> >> app-20140725095343-0016/1 is now RUNNING
> >> >> 14/07/25 09:53:43 WARN util.NativeCodeLoader: Unable to load
> >> >> native-hadoop
> >> >> library for your platform... using builtin-java classes where
> >> >> applicable
> >> >> 14/07/25 09:53:43 WARN snappy.LoadSnappy: Snappy native library not
> >> >> loaded
> >> >> Exception in thread "main" org.apache.hadoop.ipc.RPC$VersionMismatch:
> >> >> Protocol org.apache.hadoop.hdfs.protocol.ClientProtocol version
> >> >> mismatch.
> >> >> (client = 61, server = 63)
> >> >> at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:401)
> >> >> at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
> >> >>
> >> >>
> >> >> While I can build spark against the exact hadoop distro version, I'd
> >> >> rather work with the standard prebuilt binaries, making additional
> >> >> changes
> >> >> while building the app if necessary. Any workarounds/recommendations?
> >> >>
> >> >> Thanks,
> >> >> Bharath
>


Re: Decision tree classifier in MLlib

2014-07-25 Thread Evan R. Sparks
Can you share the dataset via a gist or something and we can take a look at
what's going on?


On Fri, Jul 25, 2014 at 10:51 AM, SK  wrote:

> yes, the output  is continuous. So I used a threshold to get binary labels.
> If prediction < threshold, then class is 0 else 1. I use this binary label
> to then compute the accuracy. Even with this binary transformation, the
> accuracy with decision tree model is low compared to LR or SVM (for the
> specific dataset I used).
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Decision-tree-classifier-in-MLlib-tp9457p10678.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: memory leak query

2014-07-25 Thread Rico
Hi Michael, 

I had a similar question before. My problem was that my data was too large to be
cached in memory because of serialization.

But I tried to reproduce your test and I did not experience any memory problem.
First, since count operates on the same RDD, it should not increase memory usage.
Second, since you do not cache the RDD, each new action such as count will simply
reload the data.

I am not sure how much memory you have on your machine, but by default Spark
allocates 512 MB for each executor and spark.storage.memoryFraction is set to 0.6,
which means you effectively have about 360 MB in reality. If you are running your
app on a local machine, you can monitor it by opening the web UI in your browser
at localhost:4040.
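
To illustrate the point above, here is a minimal Scala sketch (the file path and
local master are hypothetical) showing that repeated count() calls re-read the
source unless the RDD is cached, with progress visible on the web UI at localhost:4040:

import org.apache.spark.SparkContext

val sc = new SparkContext("local[2]", "count-reload")   // illustrative local setup
val rdd = sc.textFile("data.txt")                       // hypothetical input file

// Without cache(), each count() action re-reads data.txt from the source.
rdd.count()
rdd.count()

// With cache(), the first action materializes the partitions in memory (if they fit)
// and later actions reuse them instead of reloading.
val cached = rdd.cache()
cached.count()
cached.count()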



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/memory-leak-query-tp8961p10679.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Decision tree classifier in MLlib

2014-07-25 Thread SK
yes, the output  is continuous. So I used a threshold to get binary labels.
If prediction < threshold, then class is 0 else 1. I use this binary label
to then compute the accuracy. Even with this binary transformation, the
accuracy with decision tree model is low compared to LR or SVM (for the
specific dataset I used). 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Decision-tree-classifier-in-MLlib-tp9457p10678.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Caching issue with msg: RDD block could not be dropped from memory as it does not exist

2014-07-25 Thread Rico
I figured out the issue. I had not realized before that when data is loaded into
memory it is deserialized. As a result, what seems to be a 21 GB dataset occupies
77 GB in memory.

Details about this are clearly explained in the tuning guide's sections on
serialization and memory tuning.
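
As a hedged illustration of that guide's advice (file path and local master are
hypothetical), a serialized storage level keeps cached partitions as byte arrays
and avoids most of the deserialization blow-up described above:

import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext("local[2]", "ser-cache")      // illustrative local setup

// MEMORY_ONLY (what cache() uses) stores deserialized Java objects, which is what
// can inflate a ~21 GB on-disk dataset to ~77 GB of heap.
val asObjects = sc.textFile("big-dataset.txt").persist(StorageLevel.MEMORY_ONLY)

// MEMORY_ONLY_SER keeps each partition as a serialized byte array, trading extra
// CPU on access for a much smaller memory footprint (smaller still with Kryo).
val asBytes = sc.textFile("big-dataset.txt").persist(StorageLevel.MEMORY_ONLY_SER)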




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Caching-issue-with-msg-RDD-block-could-not-be-dropped-from-memory-as-it-does-not-exist-tp10248p10677.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Are all transformations lazy?

2014-07-25 Thread Rico
It may be confusing at first but there is also an important difference
between reduce and reduceByKey operations. 

reduce is an action on an RDD. Hence, it will trigger evaluation of the
transformations that produced the RDD.

In contrast, reduceByKey is a transformation on PairRDDs, not an action.
Therefore, distinct is implemented as a chain of transformations as below: 

map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
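
A minimal Scala sketch of that distinction (illustrative data only): reduceByKey
merely extends the lineage, while reduce is what actually runs the job:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // pair-RDD functions such as reduceByKey

val sc = new SparkContext("local[2]", "lazy-check")
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Transformation only: nothing is computed yet.
val summed = pairs.reduceByKey(_ + _)

// Action: this call triggers evaluation of the whole chain and returns 6.
val total = summed.map(_._2).reduce(_ + _)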



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Are-all-transformations-lazy-tp2582p10675.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


sparkcontext stop and then start again

2014-07-25 Thread Mohit Jaggi
Folks,
I had some pyspark code which used to hang with no useful debug logs. It
got fixed when I changed my code to keep the sparkcontext forever instead
of stopping it and then creating another one later. Is this a bug or
expected behavior?

Mohit.


Using Spark Streaming with Kafka 0.7.2

2014-07-25 Thread maddenpj
Hi all,

Currently we have Kafka 0.7.2 running in production and can't upgrade for
external reasons however spark streaming (1.0.1) was built with Kafka 0.8.0.
What is the best way to use Spark Streaming with older versions of Kafka?
Currently I'm investigating building Spark Streaming myself, but I can't find any
documentation specifically for building Spark Streaming.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-Spark-Streaming-with-Kafka-0-7-2-tp10674.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Hadoop client protocol mismatch with spark 1.0.1, cdh3u5

2014-07-25 Thread Sean Owen
If you link against the pre-built binary, that's for Hadoop 1.0.4. Can
you show your deps to clarify what you are depending on? Building
custom Spark and depending on it is a different thing from depending
on plain Spark and changing its deps. I think you want the latter.

On Fri, Jul 25, 2014 at 5:46 PM, Bharath Ravi Kumar  wrote:
> Thanks for responding. I used the pre built spark binaries meant for
> hadoop1,cdh3u5. I do not intend to build spark against a specific
> distribution. Irrespective of whether I build my app with the explicit cdh
> hadoop client dependency,  I get the same error message. I also verified
> that my  app's uber jar had pulled in the cdh hadoop client dependencies.
>
> On 25-Jul-2014 9:26 pm, "Sean Owen"  wrote:
>>
>> This indicates your app is not actually using the version of the HDFS
>> client you think. You built Spark from source with the right deps it
>> seems, but are you sure you linked to your build in your app?
>>
>> On Fri, Jul 25, 2014 at 4:32 PM, Bharath Ravi Kumar 
>> wrote:
>> > Any suggestions to  work around this issue ? The pre built spark
>> > binaries
>> > don't appear to work against cdh as documented, unless there's a build
>> > issue, which seems unlikely.
>> >
>> > On 25-Jul-2014 3:42 pm, "Bharath Ravi Kumar" 
>> > wrote:
>> >>
>> >>
>> >> I'm encountering a hadoop client protocol mismatch trying to read from
>> >> HDFS (cdh3u5) using the pre-build spark from the downloads page (linked
>> >> under "For Hadoop 1 (HDP1, CDH3)"). I've also  followed the
>> >> instructions at
>> >>
>> >> http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html
>> >> (i.e. building the app against hadoop-client 0.20.2-cdh3u5), but
>> >> continue to
>> >> see the following error regardless of whether I link the app with the
>> >> cdh
>> >> client:
>> >>
>> >> 14/07/25 09:53:43 INFO client.AppClient$ClientActor: Executor updated:
>> >> app-20140725095343-0016/1 is now RUNNING
>> >> 14/07/25 09:53:43 WARN util.NativeCodeLoader: Unable to load
>> >> native-hadoop
>> >> library for your platform... using builtin-java classes where
>> >> applicable
>> >> 14/07/25 09:53:43 WARN snappy.LoadSnappy: Snappy native library not
>> >> loaded
>> >> Exception in thread "main" org.apache.hadoop.ipc.RPC$VersionMismatch:
>> >> Protocol org.apache.hadoop.hdfs.protocol.ClientProtocol version
>> >> mismatch.
>> >> (client = 61, server = 63)
>> >> at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:401)
>> >> at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
>> >>
>> >>
>> >> While I can build spark against the exact hadoop distro version, I'd
>> >> rather work with the standard prebuilt binaries, making additional
>> >> changes
>> >> while building the app if necessary. Any workarounds/recommendations?
>> >>
>> >> Thanks,
>> >> Bharath


Re: Hadoop client protocol mismatch with spark 1.0.1, cdh3u5

2014-07-25 Thread Bharath Ravi Kumar
Thanks for responding. I used the pre built spark binaries meant for
hadoop1,cdh3u5. I do not intend to build spark against a specific
distribution. Irrespective of whether I build my app with the explicit cdh
hadoop client dependency,  I get the same error message. I also verified
that my  app's uber jar had pulled in the cdh hadoop client dependencies.
On 25-Jul-2014 9:26 pm, "Sean Owen"  wrote:

> This indicates your app is not actually using the version of the HDFS
> client you think. You built Spark from source with the right deps it
> seems, but are you sure you linked to your build in your app?
>
> On Fri, Jul 25, 2014 at 4:32 PM, Bharath Ravi Kumar 
> wrote:
> > Any suggestions to  work around this issue ? The pre built spark binaries
> > don't appear to work against cdh as documented, unless there's a build
> > issue, which seems unlikely.
> >
> > On 25-Jul-2014 3:42 pm, "Bharath Ravi Kumar" 
> wrote:
> >>
> >>
> >> I'm encountering a hadoop client protocol mismatch trying to read from
> >> HDFS (cdh3u5) using the pre-build spark from the downloads page (linked
> >> under "For Hadoop 1 (HDP1, CDH3)"). I've also  followed the
> instructions at
> >>
> http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html
> >> (i.e. building the app against hadoop-client 0.20.2-cdh3u5), but
> continue to
> >> see the following error regardless of whether I link the app with the
> cdh
> >> client:
> >>
> >> 14/07/25 09:53:43 INFO client.AppClient$ClientActor: Executor updated:
> >> app-20140725095343-0016/1 is now RUNNING
> >> 14/07/25 09:53:43 WARN util.NativeCodeLoader: Unable to load
> native-hadoop
> >> library for your platform... using builtin-java classes where applicable
> >> 14/07/25 09:53:43 WARN snappy.LoadSnappy: Snappy native library not
> loaded
> >> Exception in thread "main" org.apache.hadoop.ipc.RPC$VersionMismatch:
> >> Protocol org.apache.hadoop.hdfs.protocol.ClientProtocol version
> mismatch.
> >> (client = 61, server = 63)
> >> at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:401)
> >> at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
> >>
> >>
> >> While I can build spark against the exact hadoop distro version, I'd
> >> rather work with the standard prebuilt binaries, making additional
> changes
> >> while building the app if necessary. Any workarounds/recommendations?
> >>
> >> Thanks,
> >> Bharath
>


Re: Initial job has not accepted any resources (but workers are in UI)

2014-07-25 Thread Navicore
thx for the reply,

the UI says my application has cores and mem

ID: app-20140725164107-0001
Name: SectionsAndSeamsPipeline
Cores: 6
Memory per Node: 512.0 MB
Submitted Time: 2014/07/25 16:41:07
User: tercel
State: RUNNING
Duration: 21 s



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Initial-job-has-not-accepted-any-resources-but-workers-are-in-UI-tp10659p10671.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Issue submitting spark job to yarn

2014-07-25 Thread Ron Gonzalez
Folks,
  I've been able to submit simple jobs to yarn thus far. However, when I did 
something more complicated that added 194 dependency jars using --addJars, the 
job fails in YARN with no logs. What ends up happening is that no container 
logs get created (app master or executor). If I add just a couple of 
dependencies, it works, so this is clearly a case of too many dependencies 
passed into the invocation.

  Not sure if this means that no container was created at all, but bottom line 
is that I get no logs that can help me determine what's wrong. Because of the 
large number of jars, I figured it might have been a permgen issue so I added 
these options. However, that didn't help. It seems as if the actual submission 
wasn't even spawned since no container was created or no log was found.

  Any ideas for this?

Thanks,
Ron

Re: Questions about disk IOs

2014-07-25 Thread Charles Li
Hi Xiangrui,

I have 16 * 40 cpu cores in total. But I am only using 200 partitions on the 
200 executors. I use coalesce without shuffle to reduce the default partition 
of RDD.

The shuffle size from the WebUI is nearly 100m.

On Jul 25, 2014, at 23:51, Xiangrui Meng  wrote:

> How many partitions did you use and how many CPU cores in total? The
> former shouldn't be much larger than the latter. Could you also check
> the shuffle size from the WebUI? -Xiangrui
> 
> On Fri, Jul 25, 2014 at 4:10 AM, Charles Li  wrote:
>> Hi Xiangrui,
>> 
>> Thanks for your treeAggregate patch. It is very helpful.
>> After applying your patch in my local repos, the new spark can handle more 
>> partition than before.
>> But after some iteration(mapPartition + reduceByKey), the reducer seems 
>> become more slower and finally hang.
>> 
>> The logs shows there always 1 message pending in the outbox, and we are 
>> waiting for it. Are you aware this kind issue?
>> How can I know which message is pending?  Where is it supposed to go?
>> 
>> Log:
>> 
>> 14/07/25 17:49:54 INFO storage.BlockManager: Found block rdd_2_158 locally
>> 14/07/25 17:50:03 INFO spark.SparkRegistry: Using kryo with register
>> 14/07/25 17:50:03 INFO spark.SparkRegistry: Using kryo with register
>> 14/07/25 17:50:03 INFO spark.SparkRegistry: Using kryo with register
>> 14/07/25 17:50:03 INFO executor.Executor: Serialized size of result for 302 
>> is 752
>> 14/07/25 17:50:03 INFO executor.Executor: Sending result for 302 directly to 
>> driver
>> 14/07/25 17:50:03 INFO executor.Executor: Finished task ID 302
>> 14/07/25 17:50:34 INFO network.ConnectionManager: Accepted connection from 
>> [*/**]
>> 14/07/25 17:50:34 INFO network.SendingConnection: Initiating connection to 
>> [/]
>> 14/07/25 17:50:34 INFO network.SendingConnection: Connected to 
>> [/], 1 messages pending
>> 14/07/25 17:51:28 INFO storage.ShuffleBlockManager: Deleted all files for 
>> shuffle 0
>> 14/07/25 17:51:37 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
>> task 742
>> 14/07/25 17:51:37 INFO executor.Executor: Running task ID 742
>> 14/07/25 17:51:37 INFO storage.BlockManager: Found block broadcast_1 locally
>> 14/07/25 17:51:38 INFO spark.MapOutputTrackerWorker: Updating epoch to 1 and 
>> clearing cache
>> 14/07/25 17:51:38 INFO spark.SparkRegistry: Using kryo with register
>> 14/07/25 17:51:38 INFO storage.BlockManager: Found block rdd_2_158 locally
>> 14/07/25 17:51:48 INFO spark.SparkRegistry: Using kryo with register
>> 14/07/25 17:51:48 INFO spark.SparkRegistry: Using kryo with register
>> 14/07/25 17:51:48 INFO spark.SparkRegistry: Using kryo with register
>> 14/07/25 17:51:48 INFO executor.Executor: Serialized size of result for 742 
>> is 752
>> 14/07/25 17:51:48 INFO executor.Executor: Sending result for 742 directly to 
>> driver
>> 14/07/25 17:51:48 INFO executor.Executor: Finished task ID 742
>> <—— I have shutdown the App
>> 14/07/25 18:16:36 INFO executor.CoarseGrainedExecutorBackend: Driver 
>> commanded a shutdown
>> 
>> On Jul 2, 2014, at 0:08, Xiangrui Meng  wrote:
>> 
>>> Try to reduce number of partitions to match the number of cores. We
>>> will add treeAggregate to reduce the communication cost.
>>> 
>>> PR: https://github.com/apache/spark/pull/1110
>>> 
>>> -Xiangrui
>>> 
>>> On Tue, Jul 1, 2014 at 12:55 AM, Charles Li  wrote:
 Hi Spark,
 
 I am running LBFGS on our user data. The data size with Kryo serialisation 
 is about 210G. The weight size is around 1,300,000. I am quite confused 
 that the performance is very close whether the data is cached or not.
 
 The program is simple:
 points = sc.hadoopFIle(int, SequenceFileInputFormat.class …..)
 points.persist(StorageLevel.Memory_AND_DISK_SER()) // comment it if not 
 cached
 gradient = new LogisticGrandient();
 updater = new SquaredL2Updater();
 initWeight = Vectors.sparse(size, new int[]{}, new double[]{})
 result = LBFGS.runLBFGS(points.rdd(), grandaunt, updater, numCorrections, 
 convergeTol, maxIter, regParam, initWeight);
 
 I have 13 machines with 16 cpus, 48G RAM each. Spark is running on its 
 cluster mode. Below are some arguments I am using:
 —executor-memory 10G
 —num-executors 50
 —executor-cores 2
 
 Storage Using:
 When caching:
 Cached Partitions 951
 Fraction Cached 100%
 Size in Memory 215.7GB
 Size in Tachyon 0.0B
 Size on Disk 1029.7MB
 
 The time cost by every aggregate is around 5 minutes with cache enabled. 
 Lots of disk IOs can be seen on the hadoop node. I have the same result 
 with cache disabled.
 
 Should data points caching improve the performance? Should caching 
 decrease the disk IO?
 
 Thanks in advance.
>> 



Re: Questions about disk IOs

2014-07-25 Thread Xiangrui Meng
How many partitions did you use and how many CPU cores in total? The
former shouldn't be much larger than the latter. Could you also check
the shuffle size from the WebUI? -Xiangrui

On Fri, Jul 25, 2014 at 4:10 AM, Charles Li  wrote:
> Hi Xiangrui,
>
> Thanks for your treeAggregate patch. It is very helpful.
> After applying your patch in my local repos, the new spark can handle more 
> partition than before.
> But after some iteration(mapPartition + reduceByKey), the reducer seems 
> become more slower and finally hang.
>
> The logs shows there always 1 message pending in the outbox, and we are 
> waiting for it. Are you aware this kind issue?
> How can I know which message is pending?  Where is it supposed to go?
>
> Log:
>
> 14/07/25 17:49:54 INFO storage.BlockManager: Found block rdd_2_158 locally
> 14/07/25 17:50:03 INFO spark.SparkRegistry: Using kryo with register
> 14/07/25 17:50:03 INFO spark.SparkRegistry: Using kryo with register
> 14/07/25 17:50:03 INFO spark.SparkRegistry: Using kryo with register
> 14/07/25 17:50:03 INFO executor.Executor: Serialized size of result for 302 
> is 752
> 14/07/25 17:50:03 INFO executor.Executor: Sending result for 302 directly to 
> driver
> 14/07/25 17:50:03 INFO executor.Executor: Finished task ID 302
> 14/07/25 17:50:34 INFO network.ConnectionManager: Accepted connection from 
> [*/**]
> 14/07/25 17:50:34 INFO network.SendingConnection: Initiating connection to 
> [/]
> 14/07/25 17:50:34 INFO network.SendingConnection: Connected to 
> [/], 1 messages pending
> 14/07/25 17:51:28 INFO storage.ShuffleBlockManager: Deleted all files for 
> shuffle 0
> 14/07/25 17:51:37 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 742
> 14/07/25 17:51:37 INFO executor.Executor: Running task ID 742
> 14/07/25 17:51:37 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/07/25 17:51:38 INFO spark.MapOutputTrackerWorker: Updating epoch to 1 and 
> clearing cache
> 14/07/25 17:51:38 INFO spark.SparkRegistry: Using kryo with register
> 14/07/25 17:51:38 INFO storage.BlockManager: Found block rdd_2_158 locally
> 14/07/25 17:51:48 INFO spark.SparkRegistry: Using kryo with register
> 14/07/25 17:51:48 INFO spark.SparkRegistry: Using kryo with register
> 14/07/25 17:51:48 INFO spark.SparkRegistry: Using kryo with register
> 14/07/25 17:51:48 INFO executor.Executor: Serialized size of result for 742 
> is 752
> 14/07/25 17:51:48 INFO executor.Executor: Sending result for 742 directly to 
> driver
> 14/07/25 17:51:48 INFO executor.Executor: Finished task ID 742
> <—— I have shutdown the App
> 14/07/25 18:16:36 INFO executor.CoarseGrainedExecutorBackend: Driver 
> commanded a shutdown
>
> On Jul 2, 2014, at 0:08, Xiangrui Meng  wrote:
>
>> Try to reduce number of partitions to match the number of cores. We
>> will add treeAggregate to reduce the communication cost.
>>
>> PR: https://github.com/apache/spark/pull/1110
>>
>> -Xiangrui
>>
>> On Tue, Jul 1, 2014 at 12:55 AM, Charles Li  wrote:
>>> Hi Spark,
>>>
>>> I am running LBFGS on our user data. The data size with Kryo serialisation 
>>> is about 210G. The weight size is around 1,300,000. I am quite confused 
>>> that the performance is very close whether the data is cached or not.
>>>
>>> The program is simple:
>>> points = sc.hadoopFIle(int, SequenceFileInputFormat.class …..)
>>> points.persist(StorageLevel.Memory_AND_DISK_SER()) // comment it if not 
>>> cached
>>> gradient = new LogisticGrandient();
>>> updater = new SquaredL2Updater();
>>> initWeight = Vectors.sparse(size, new int[]{}, new double[]{})
>>> result = LBFGS.runLBFGS(points.rdd(), grandaunt, updater, numCorrections, 
>>> convergeTol, maxIter, regParam, initWeight);
>>>
>>> I have 13 machines with 16 cpus, 48G RAM each. Spark is running on its 
>>> cluster mode. Below are some arguments I am using:
>>> —executor-memory 10G
>>> —num-executors 50
>>> —executor-cores 2
>>>
>>> Storage Using:
>>> When caching:
>>> Cached Partitions 951
>>> Fraction Cached 100%
>>> Size in Memory 215.7GB
>>> Size in Tachyon 0.0B
>>> Size on Disk 1029.7MB
>>>
>>> The time cost by every aggregate is around 5 minutes with cache enabled. 
>>> Lots of disk IOs can be seen on the hadoop node. I have the same result 
>>> with cache disabled.
>>>
>>> Should data points caching improve the performance? Should caching decrease 
>>> the disk IO?
>>>
>>> Thanks in advance.
>


Re: Hadoop client protocol mismatch with spark 1.0.1, cdh3u5

2014-07-25 Thread Sean Owen
This indicates your app is not actually using the version of the HDFS
client you think. You built Spark from source with the right deps it
seems, but are you sure you linked to your build in your app?

On Fri, Jul 25, 2014 at 4:32 PM, Bharath Ravi Kumar  wrote:
> Any suggestions to  work around this issue ? The pre built spark binaries
> don't appear to work against cdh as documented, unless there's a build
> issue, which seems unlikely.
>
> On 25-Jul-2014 3:42 pm, "Bharath Ravi Kumar"  wrote:
>>
>>
>> I'm encountering a hadoop client protocol mismatch trying to read from
>> HDFS (cdh3u5) using the pre-build spark from the downloads page (linked
>> under "For Hadoop 1 (HDP1, CDH3)"). I've also  followed the instructions at
>> http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html
>> (i.e. building the app against hadoop-client 0.20.2-cdh3u5), but continue to
>> see the following error regardless of whether I link the app with the cdh
>> client:
>>
>> 14/07/25 09:53:43 INFO client.AppClient$ClientActor: Executor updated:
>> app-20140725095343-0016/1 is now RUNNING
>> 14/07/25 09:53:43 WARN util.NativeCodeLoader: Unable to load native-hadoop
>> library for your platform... using builtin-java classes where applicable
>> 14/07/25 09:53:43 WARN snappy.LoadSnappy: Snappy native library not loaded
>> Exception in thread "main" org.apache.hadoop.ipc.RPC$VersionMismatch:
>> Protocol org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch.
>> (client = 61, server = 63)
>> at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:401)
>> at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
>>
>>
>> While I can build spark against the exact hadoop distro version, I'd
>> rather work with the standard prebuilt binaries, making additional changes
>> while building the app if necessary. Any workarounds/recommendations?
>>
>> Thanks,
>> Bharath


sharing spark context among machines

2014-07-25 Thread myxjtu
Is it possible now to share spark context among machines (through
serialization or some other ways)? I am looking for possible ways to make
the spark job submission to be HA (high availability).  For example, if a
job submitted to machine A fails in the middle (due to machine A crash), I
want this job automatically re-run on machine B.

Thanks




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/sharing-spark-context-among-machines-tp10665.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark got stuck with a loop

2014-07-25 Thread Denis RP
Can anyone help? I'm using Spark 1.0.1.

What confuses me is: if the block is found, why are no non-empty blocks fetched,
and why does the process keep going forever?

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-got-stuck-with-a-loop-tp10590p10663.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: NMF implementation in Spark

2014-07-25 Thread Xiangrui Meng
It is ALS with setNonnegative. -Xiangrui

On Fri, Jul 25, 2014 at 7:38 AM, Aureliano Buendia  wrote:
> Hi,
>
> Is there an implementation for Nonnegative Matrix Factorization in Spark? I
> understand that MLlib comes with matrix factorization, but it does not seem
> to cover the nonnegative case.
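
For reference, a minimal Scala sketch of that approach with made-up ratings data;
the setNonnegative setter is assumed to be available in the MLlib build being used:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val sc = new SparkContext("local[2]", "nmf-sketch")

// Hypothetical tiny ratings matrix; each entry is (row, column, value).
val ratings = sc.parallelize(Seq(
  Rating(0, 0, 5.0), Rating(0, 1, 1.0),
  Rating(1, 0, 4.0), Rating(1, 1, 1.0)))

// ALS with nonnegativity constraints on both factor matrices, i.e. a nonnegative
// factorization of the ratings matrix into userFeatures * productFeatures^T.
val model = new ALS()
  .setRank(2)
  .setIterations(10)
  .setLambda(0.01)
  .setNonnegative(true)   // assumption: a Spark build where this setter exists
  .run(ratings)

model.userFeatures.collect().foreach { case (id, f) => println(s"$id -> ${f.mkString(",")}") }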


Re: Hadoop client protocol mismatch with spark 1.0.1, cdh3u5

2014-07-25 Thread Bharath Ravi Kumar
Any suggestions to  work around this issue ? The pre built spark binaries
don't appear to work against cdh as documented, unless there's a build
issue, which seems unlikely.
On 25-Jul-2014 3:42 pm, "Bharath Ravi Kumar"  wrote:

>
> I'm encountering a hadoop client protocol mismatch trying to read from
> HDFS (cdh3u5) using the pre-build spark from the downloads page (linked
> under "For Hadoop 1 (HDP1, CDH3)"). I've also  followed the instructions at
> http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html
> (i.e. building the app against hadoop-client 0.20.2-cdh3u5), but continue
> to see the following error regardless of whether I link the app with the
> cdh client:
>
> 14/07/25 09:53:43 INFO client.AppClient$ClientActor: Executor updated:
> app-20140725095343-0016/1 is now RUNNING
> 14/07/25 09:53:43 WARN util.NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 14/07/25 09:53:43 WARN snappy.LoadSnappy: Snappy native library not loaded
> Exception in thread "main" org.apache.hadoop.ipc.RPC$VersionMismatch:
> Protocol org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch.
> (client = 61, server = 63)
> at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:401)
> at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
>
>
> While I can build spark against the exact hadoop distro version, I'd
> rather work with the standard prebuilt binaries, making additional changes
> while building the app if necessary. Any workarounds/recommendations?
>
> Thanks,
> Bharath
>


Re: Down-scaling Spark on EC2 cluster

2014-07-25 Thread Nicholas Chammas
No idea. Right now implementing this is up for grabs by the community.


On Fri, Jul 25, 2014 at 5:40 AM, Shubhabrata  wrote:

> Any idea about the probable dates for this implementation. I believe it
> would
> be a wonderful (and essential) functionality to gain more acceptance in the
> community.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Down-scaling-Spark-on-EC2-cluster-tp10494p10639.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Initial job has not accepted any resources (but workers are in UI)

2014-07-25 Thread Ed Sweeney
Hi all,

Amazon Linux, AWS, Spark 1.0.1 reading a file.

The UI shows there are workers and shows this app context with the 2
tasks waiting.  All the hostnames resolve properly so I am guessing
the message is correct and that the workers won't accept the job for
mem reasons.

What params do I tweak to know for sure?

Thanks for any help, -Ed

ps. most of the searches for this error end up being about workers not
connecting, but our instance's UI shows the workers.

46692 [pool-10-thread-1] INFO org.apache.hadoop.mapred.FileInputFormat
- Total input paths to process : 1
46709 [pool-10-thread-1] INFO org.apache.spark.SparkContext - Starting
job: count at DListRDD.scala:19
46728 [spark-akka.actor.default-dispatcher-5] INFO
org.apache.spark.scheduler.DAGScheduler - Got job 0 (count at
DListRDD.scala:19) with 2 output partitions (allowLocal=false)
46728 [spark-akka.actor.default-dispatcher-5] INFO
org.apache.spark.scheduler.DAGScheduler - Final stage: Stage 0(count
at DListRDD.scala:19)
46729 [spark-akka.actor.default-dispatcher-5] INFO
org.apache.spark.scheduler.DAGScheduler - Parents of final stage:
List()
46753 [spark-akka.actor.default-dispatcher-5] INFO
org.apache.spark.scheduler.DAGScheduler - Missing parents: List()
46764 [spark-akka.actor.default-dispatcher-5] INFO
org.apache.spark.scheduler.DAGScheduler - Submitting Stage 0
(FilteredRDD[6] at filter at DListRDD.scala:13), which has no missing
parents
46765 [pool-38-thread-1] INFO org.apache.hadoop.mapred.FileInputFormat
- Total input paths to process : 1
46772 [pool-38-thread-1] INFO org.apache.spark.SparkContext - Starting
job: count at DListRDD.scala:19
46832 [spark-akka.actor.default-dispatcher-5] INFO
org.apache.spark.scheduler.DAGScheduler - Submitting 2 missing tasks
from Stage 0 (FilteredRDD[6] at filter at DListRDD.scala:13)
46834 [spark-akka.actor.default-dispatcher-5] INFO
org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 0.0
with 2 tasks
46845 [spark-akka.actor.default-dispatcher-5] INFO
org.apache.spark.scheduler.FairSchedulableBuilder - Added task set
TaskSet_0 tasks to pool default
46852 [spark-akka.actor.default-dispatcher-5] INFO
org.apache.spark.scheduler.DAGScheduler - Got job 1 (count at
DListRDD.scala:19) with 2 output partitions (allowLocal=false)
46852 [spark-akka.actor.default-dispatcher-5] INFO
org.apache.spark.scheduler.DAGScheduler - Final stage: Stage 1(count
at DListRDD.scala:19)
46853 [spark-akka.actor.default-dispatcher-5] INFO
org.apache.spark.scheduler.DAGScheduler - Parents of final stage:
List()
46855 [spark-akka.actor.default-dispatcher-5] INFO
org.apache.spark.scheduler.DAGScheduler - Missing parents: List()
46856 [spark-akka.actor.default-dispatcher-5] INFO
org.apache.spark.scheduler.DAGScheduler - Submitting Stage 1
(FilteredRDD[6] at filter at DListRDD.scala:13), which has no missing
parents
46860 [spark-akka.actor.default-dispatcher-5] INFO
org.apache.spark.scheduler.DAGScheduler - Submitting 2 missing tasks
from Stage 1 (FilteredRDD[6] at filter at DListRDD.scala:13)
46860 [spark-akka.actor.default-dispatcher-5] INFO
org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 1.0
with 2 tasks
46861 [spark-akka.actor.default-dispatcher-5] INFO
org.apache.spark.scheduler.FairSchedulableBuilder - Added task set
TaskSet_1 tasks to pool default
2014-07-25 01:25:09,616 [Thread-2] DEBUG
falkonry.commons.service.ServiceHandler - Listening...
61847 [Timer-0] WARN org.apache.spark.scheduler.TaskSchedulerImpl -
Initial job has not accepted any resources; check your cluster UI to
ensure that workers are registered and have sufficient memory
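
That warning usually means no worker can satisfy the requested executor memory or
cores. A hedged sketch of the standalone-mode settings involved (property names
from the Spark 1.0 configuration docs; the master URL and values are illustrative,
not taken from this cluster):

import org.apache.spark.{SparkConf, SparkContext}

// Ask for less memory and fewer cores than each worker advertises in the UI,
// otherwise no worker can accept the executors and tasks sit unscheduled.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")   // hypothetical master URL
  .setAppName("resource-check")
  .set("spark.executor.memory", "512m")    // must fit within a worker's free memory
  .set("spark.cores.max", "2")             // cap on total cores requested from the cluster
val sc = new SparkContext(conf)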


Support for Percentile and Variance Aggregation functions in Spark with HiveContext

2014-07-25 Thread vinay . kashyap





Hi all,

I am using Spark 1.0.0 with CDH 5.1.0. I want to aggregate the data in a raw table
using a simple query like the one below:

SELECT MIN(field1), MAX(field2), AVG(field3), PERCENTILE(field4), year, month, day
FROM raw_data_table GROUP BY year, month, day

MIN, MAX and AVG functions work fine for me, but with PERCENTILE I get an error as
shown below:

Exception in thread "main" java.lang.RuntimeException: No handler for udf class
org.apache.hadoop.hive.ql.udf.UDAFPercentile
    at scala.sys.package$.error(package.scala:27)
    at org.apache.spark.sql.hive.HiveFunctionRegistry$.lookupFunction(hiveUdfs.scala:69)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:115)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:113)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)

I have read in the documentation that with HiveContext, Spark SQL supports all the
UDFs supported in Hive. I want to know if there is anything else I need to do to use
PERCENTILE with Spark SQL, or whether there are still limitations in Spark SQL with
respect to UDFs and UDAFs in the version I am using.

Thanks and regards,
Vinay Kashyap
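
For context, a minimal Scala sketch of issuing such a query through HiveContext in
Spark 1.0 (hql was the entry point at the time; note that Hive's percentile UDAF
takes the target percentile as a second argument, and the PERCENTILE call is exactly
what trips the "No handler for udf class" error above):

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext("local[2]", "percentile-check")   // illustrative local setup
val hiveContext = new HiveContext(sc)

// MIN/MAX/AVG resolve through Spark SQL's Hive function registry; PERCENTILE is the
// UDAF that this Spark version cannot yet handle, per the stack trace in the thread.
val result = hiveContext.hql(
  """SELECT MIN(field1), MAX(field2), AVG(field3), PERCENTILE(field4, 0.5),
    |       year, month, day
    |FROM raw_data_table
    |GROUP BY year, month, day""".stripMargin)
result.collect().foreach(println)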


Re: mapToPair vs flatMapToPair vs flatMap function usage.

2014-07-25 Thread Daniel Siegmann
The map and flatMap methods have a similar purpose, but map is 1 to 1,
while flatMap is 1 to 0-N (outputting 0 is similar to a filter, except of
course it could be outputting a different type).
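
A small Scala sketch of that difference (toy data; the Java mapToPair/flatMapToPair
variants correspond to producing key/value tuples, which in Scala is just a map to
a Tuple2):

import org.apache.spark.SparkContext

val sc = new SparkContext("local[2]", "map-vs-flatmap")
val lines = sc.parallelize(Seq("a b", "", "c"))

// map: exactly one output element per input element (1 to 1).
val lengths = lines.map(_.length)                           // 3, 0, 1

// flatMap: zero or more output elements per input element (1 to 0..N).
val words = lines.flatMap(_.split(" ").filter(_.nonEmpty))  // "a", "b", "c"

// Pair variant: map each word to a (key, value) tuple.
val pairs = words.map(w => (w, 1))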


On Thu, Jul 24, 2014 at 6:41 PM, abhiguruvayya 
wrote:

> Can any one help me understand the key difference between mapToPair vs
> flatMapToPair vs flatMap functions and also when to apply these functions
> in
> particular.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/mapToPair-vs-flatMapToPair-vs-flatMap-function-usage-tp10617.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>



-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


Re: Strange exception on coalesce()

2014-07-25 Thread Sean Owen
I'm pretty sure this was already fixed last week in SPARK-2414:
https://github.com/apache/spark/commit/7c23c0dc3ed721c95690fc49f435d9de6952523c

On Fri, Jul 25, 2014 at 1:34 PM, innowireless TaeYun Kim
 wrote:
> Hi,
> I'm using Spark 1.0.0.
>
> On filter() - map() - coalesce() - saveAsText() sequence, the following
> exception is thrown.
>
> Exception in thread "main" java.util.NoSuchElementException: None.get
> at scala.None$.get(Option.scala:313)
> at scala.None$.get(Option.scala:311)
> at
> org.apache.spark.rdd.PartitionCoalescer.setupGroups(CoalescedRDD.scala:270)
> at org.apache.spark.rdd.PartitionCoalescer.run(CoalescedRDD.scala:337)
> at
> org.apache.spark.rdd.CoalescedRDD.getPartitions(CoalescedRDD.scala:83)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
> at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1086)
> at
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.s
> cala:788)
> at
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scal
> a:674)
> at
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scal
> a:593)
> at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068)
> at
> org.apache.spark.api.java.JavaRDDLike$class.saveAsTextFile(JavaRDDLike.scala
> :436)
> at org.apache.spark.api.java.JavaRDD.saveAsTextFile(JavaRDD.scala:29)
>
> The partition count of the original rdd is 306.
>
> When the argument of coalesce() is one of 59, 60, 61, 62, 63, the exception
> above is thrown.
>
> But the argument is one of 50, 55, 58, 64, 65, 80, 100, the exception is not
> thrown. (I've not tried other values, I think that they will be ok.)
>
> Is there any magic number for the argument of coalesce() ?
>
> Thanks.
>
>
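
A minimal sketch of the two call forms (illustrative RDD, not the poster's data).
coalesce(n) without shuffle goes through PartitionCoalescer, the code path in the
stack trace; coalesce(n, shuffle = true), equivalent to repartition(n), takes the
shuffle-based path instead. Whether that avoids this particular bug is an
assumption, not something confirmed in the thread:

import org.apache.spark.SparkContext

val sc = new SparkContext("local[4]", "coalesce-sketch")
val rdd = sc.parallelize(1 to 1000, numSlices = 306)   // same partition count as the report

val narrow = rdd.coalesce(60)                    // no shuffle: uses PartitionCoalescer
val shuffled = rdd.coalesce(60, shuffle = true)  // same as rdd.repartition(60)

println(narrow.partitions.length + " / " + shuffled.partitions.length)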


NMF implementation in Spark

2014-07-25 Thread Aureliano Buendia
Hi,

Is there an implementation for Nonnegative Matrix Factorization in Spark? I
understand that MLlib comes with matrix factorization, but it does not seem
to cover the nonnegative case.


How to pass additional options to Mesos when submitting job?

2014-07-25 Thread Krisztián Szűcs
Hi,

We're trying to use Docker containerization within Mesos via Deimos. We're
submitting Spark jobs from localhost to our cluster. We've managed to get it to
work (with a fixed Deimos configuration), but we have issues with passing some
options (like a job-dependent container image) in TaskInfo to Mesos when
submitting a Spark job.
Example: https://github.com/mesosphere/deimos#passing-parameters-to-docker

Is there any way to achieve this?

Thanks in advance,
Krisztian Szucs

Re: EOFException when I list all files in hdfs directory

2014-07-25 Thread Bertrand Dechoux
Well, anyone can open an account on apache jira and post a new
ticket/enhancement/issue/bug...

Bertrand Dechoux


On Fri, Jul 25, 2014 at 4:07 PM, Sparky  wrote:

> Thanks for the suggestion.  I can confirm that my problem is I have files
> with zero bytes.  It's a known bug and is marked as a high priority:
>
> https://issues.apache.org/jira/browse/SPARK-1960
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/EOFException-when-I-list-all-files-in-hdfs-directory-tp10648p10651.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: EOFException when I list all files in hdfs directory

2014-07-25 Thread Sparky
Thanks for the suggestion.  I can confirm that my problem is I have files
with zero bytes.  It's a known bug and is marked as a high priority:

https://issues.apache.org/jira/browse/SPARK-1960



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/EOFException-when-I-list-all-files-in-hdfs-directory-tp10648p10651.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: EOFException when I list all files in hdfs directory

2014-07-25 Thread Akhil Das
Try without the *

val avroRdd = sc.newAPIHadoopFile("hdfs://:8020//",
classOf[AvroSequenceFileInputFormat[AvroKey[GenericRecord],NullWritable]],
classOf[AvroKey[GenericRecord]], classOf[NullWritable])
avroRdd.collect()



Thanks
Best Regards


On Fri, Jul 25, 2014 at 7:22 PM, Sparky  wrote:

> I'm pretty sure my problem is related to this unresolved bug regarding
> files
> with size zero:
>
> https://issues.apache.org/jira/browse/SPARK-1960
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/EOFException-when-I-list-all-files-in-hdfs-directory-tp10648p10649.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: EOFException when I list all files in hdfs directory

2014-07-25 Thread Sparky
I'm pretty sure my problem is related to this unresolved bug regarding files
with size zero:

https://issues.apache.org/jira/browse/SPARK-1960



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/EOFException-when-I-list-all-files-in-hdfs-directory-tp10648p10649.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


EOFException when I list all files in hdfs directory

2014-07-25 Thread Sparky
I'm trying to list and then process all files in an hdfs directory.

I'm able to run the code below when I supply a specific AvroSequence file,
but if I use a wildcard to get all Avro sequence files in the directory it
fails. 

Anyone know how to do this?

val avroRdd = sc.newAPIHadoopFile("hdfs://:8020//*", 
classOf[AvroSequenceFileInputFormat[AvroKey[GenericRecord],NullWritable]],
classOf[AvroKey[GenericRecord]], classOf[NullWritable])
avroRdd.collect()

org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in
stage 1.0 failed 1 times, most recent failure: Lost task 8.0 in stage 1.0
(TID 20, localhost): java.io.EOFException (null)
        java.io.DataInputStream.readFully(DataInputStream.java:197)
        java.io.DataInputStream.readFully(DataInputStream.java:169)
        org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1800)
        org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1765)
        org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1714)
        org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1728)
        org.apache.avro.hadoop.io.AvroSequenceFile.getMetadata(AvroSequenceFile.java:727)
        org.apache.avro.hadoop.io.AvroSequenceFile.access$100(AvroSequenceFile.java:71)
        org.apache.avro.hadoop.io.AvroSequenceFile$Reader$Options.getConfigurationWithAvroSerialization(AvroSequenceFile.java:672)
        org.apache.avro.hadoop.io.AvroSequenceFile$Reader.<init>(AvroSequenceFile.java:709)
        org.apache.avro.mapreduce.AvroSequenceFileInputFormat$AvroSequenceFileRecordReader.initialize(AvroSequenceFileInputFormat.java:86)
        org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:114)
        org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:100)
        org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:62)
        org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
        org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
        org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:112)
        org.apache.spark.scheduler.Task.run(Task.scala:51)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:189)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:744)




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/EOFException-when-I-list-all-files-in-hdfs-directory-tp10648.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


RE: Strange exception on coalesce()

2014-07-25 Thread innowireless TaeYun Kim
(Sorry for resending, I've reformatted the text as HTML.)

Hi,
I'm using Spark 1.0.0.

On filter() - map() - coalesce() - saveAsText() sequence, the following
exception is thrown.

Exception in thread "main" java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:313)
at scala.None$.get(Option.scala:311)
at org.apache.spark.rdd.PartitionCoalescer.setupGroups(CoalescedRDD.scala:270)
at org.apache.spark.rdd.PartitionCoalescer.run(CoalescedRDD.scala:337)
at org.apache.spark.rdd.CoalescedRDD.getPartitions(CoalescedRDD.scala:83)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1086)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:788)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:674)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:593)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068)
at org.apache.spark.api.java.JavaRDDLike$class.saveAsTextFile(JavaRDDLike.scala:436)
at org.apache.spark.api.java.JavaRDD.saveAsTextFile(JavaRDD.scala:29)

The partition count of the original rdd is 306.

When the argument of coalesce() is one of 59, 60, 61, 62, 63, the exception
above is thrown.

But when the argument is one of 50, 55, 58, 64, 65, 80, 100, the exception is not
thrown. (I've not tried other values; I think they will be ok.)

Is there any magic number for the argument of coalesce()?

Thanks.

 



Re: Bad Digest error while doing aws s3 put

2014-07-25 Thread Akhil Das
Bad Digest error means the file you are trying to upload actually changed
while uploading. If you can make a temporary copy of the file before
uploading then you won't face this problem.


Thanks
Best Regards


On Fri, Jul 25, 2014 at 5:34 PM, lmk 
wrote:

> Can someone look into this and help me resolve this error pls..
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Bad-Digest-error-while-doing-aws-s3-put-tp10036p10644.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Strange exception on coalesce()

2014-07-25 Thread innowireless TaeYun Kim
Hi,
I'm using Spark 1.0.0.

On filter() - map() - coalesce() - saveAsText() sequence, the following
exception is thrown.

Exception in thread "main" java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:313)
at scala.None$.get(Option.scala:311)
at
org.apache.spark.rdd.PartitionCoalescer.setupGroups(CoalescedRDD.scala:270)
at org.apache.spark.rdd.PartitionCoalescer.run(CoalescedRDD.scala:337)
at
org.apache.spark.rdd.CoalescedRDD.getPartitions(CoalescedRDD.scala:83)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1086)
at
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.s
cala:788)
at
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scal
a:674)
at
org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scal
a:593)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068)
at
org.apache.spark.api.java.JavaRDDLike$class.saveAsTextFile(JavaRDDLike.scala
:436)
at org.apache.spark.api.java.JavaRDD.saveAsTextFile(JavaRDD.scala:29)

The partition count of the original rdd is 306.

When the argument of coalesce() is one of 59, 60, 61, 62, 63, the exception
above is thrown.

But when the argument is one of 50, 55, 58, 64, 65, 80, 100, the exception is not
thrown. (I've not tried other values; I think they will be ok.)

Is there any magic number for the argument of coalesce() ?

Thanks.




Re: Bad Digest error while doing aws s3 put

2014-07-25 Thread lmk
Can someone look into this and help me resolve this error pls..



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Bad-Digest-error-while-doing-aws-s3-put-tp10036p10644.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: data locality

2014-07-25 Thread Tsai Li Ming
Hi,

In the standalone mode, how can we check data locality is working as expected 
when tasks are assigned?

Thanks!


On 23 Jul, 2014, at 12:49 am, Sandy Ryza  wrote:

> On standalone there is still special handling for assigning tasks within 
> executors.  There just isn't special handling for where to place executors, 
> because standalone generally places an executor on every node.
> 
> 
> On Mon, Jul 21, 2014 at 7:42 PM, Haopu Wang  wrote:
> Sandy,
> 
>  
> 
> I just tried the standalone cluster and didn't have chance to try Yarn yet.
> 
> So if I understand correctly, there are *no* special handling of task 
> assignment according to the HDFS block's location when Spark is running as a 
> *standalone* cluster.
> 
> Please correct me if I'm wrong. Thank you for your patience!
> 
>  
> 
> From: Sandy Ryza [mailto:sandy.r...@cloudera.com] 
> Sent: 2014年7月22日 9:47
> 
> 
> To: user@spark.apache.org
> Subject: Re: data locality
> 
>  
> 
> This currently only works for YARN.  The standalone default is to place an 
> executor on every node for every job.
> 
>  
> 
> The total number of executors is specified by the user.
> 
>  
> 
> -Sandy
> 
>  
> 
> On Fri, Jul 18, 2014 at 2:00 AM, Haopu Wang  wrote:
> 
> Sandy,
> 
>  
> 
> Do you mean the “preferred location” is working for standalone cluster also? 
> Because I check the code of SparkContext and see comments as below:
> 
>  
> 
>   // This is used only by YARN for now, but should be relevant to other 
> cluster types (Mesos,
> 
>   // etc) too. This is typically generated from 
> InputFormatInfo.computePreferredLocations. It
> 
>   // contains a map from hostname to a list of input format splits on the 
> host.
> 
>   private[spark] var preferredNodeLocationData: Map[String, Set[SplitInfo]] = 
> Map()
> 
>  
> 
> BTW, even with the preferred hosts, how does Spark decide how many total 
> executors to use for this application?
> 
>  
> 
> Thanks again!
> 
>  
> 
> From: Sandy Ryza [mailto:sandy.r...@cloudera.com] 
> Sent: Friday, July 18, 2014 3:44 PM
> To: user@spark.apache.org
> Subject: Re: data locality
> 
>  
> 
> Hi Haopu,
> 
>  
> 
> Spark will ask HDFS for file block locations and try to assign tasks based on 
> these.
> 
>  
> 
> There is a snag.  Spark schedules its tasks inside of "executor" processes 
> that stick around for the lifetime of a Spark application.  Spark requests 
> executors before it runs any jobs, i.e. before it has any information about 
> where the input data for the jobs is located.  If the executors occupy 
> significantly fewer nodes than exist in the cluster, it can be difficult for 
> Spark to achieve data locality.  The workaround for this is an API that 
> allows passing in a set of preferred locations when instantiating a Spark 
> context.  This API is currently broken in Spark 1.0, and will likely changed 
> to be something a little simpler in a future release.
> 
>  
> 
> val locData = InputFormatInfo.computePreferredLocations
> 
>   (Seq(new InputFormatInfo(conf, classOf[TextInputFormat], new 
> Path(“myfile.txt”)))
> 
>  
> 
> val sc = new SparkContext(conf, locData)
> 
>  
> 
> -Sandy
> 
>  
> 
>  
> 
> On Fri, Jul 18, 2014 at 12:35 AM, Haopu Wang  wrote:
> 
> I have a standalone spark cluster and a HDFS cluster which share some of 
> nodes.
> 
>  
> 
> When reading HDFS file, how does spark assign tasks to nodes? Will it ask 
> HDFS the location for each file block in order to get a right worker node?
> 
>  
> 
> How about a spark cluster on Yarn?
> 
>  
> 
> Thank you very much!
> 
>  
> 
>  
> 
>  
> 
> 



Re: Questions about disk IOs

2014-07-25 Thread Charles Li
Hi Xiangrui,

Thanks for your treeAggregate patch. It is very helpful.
After applying your patch in my local repo, the new Spark can handle more 
partitions than before.
But after some iterations (mapPartition + reduceByKey), the reducers seem to become 
slower and slower and finally hang.

The logs show there is always 1 message pending in the outbox, and we are waiting 
for it. Are you aware of this kind of issue?
How can I know which message is pending? Where is it supposed to go?

Log:

14/07/25 17:49:54 INFO storage.BlockManager: Found block rdd_2_158 locally
14/07/25 17:50:03 INFO spark.SparkRegistry: Using kryo with register
14/07/25 17:50:03 INFO spark.SparkRegistry: Using kryo with register
14/07/25 17:50:03 INFO spark.SparkRegistry: Using kryo with register
14/07/25 17:50:03 INFO executor.Executor: Serialized size of result for 302 is 
752
14/07/25 17:50:03 INFO executor.Executor: Sending result for 302 directly to 
driver
14/07/25 17:50:03 INFO executor.Executor: Finished task ID 302
14/07/25 17:50:34 INFO network.ConnectionManager: Accepted connection from 
[*/**]
14/07/25 17:50:34 INFO network.SendingConnection: Initiating connection to 
[/]
14/07/25 17:50:34 INFO network.SendingConnection: Connected to 
[/], 1 messages pending
14/07/25 17:51:28 INFO storage.ShuffleBlockManager: Deleted all files for 
shuffle 0
14/07/25 17:51:37 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 
742
14/07/25 17:51:37 INFO executor.Executor: Running task ID 742
14/07/25 17:51:37 INFO storage.BlockManager: Found block broadcast_1 locally
14/07/25 17:51:38 INFO spark.MapOutputTrackerWorker: Updating epoch to 1 and 
clearing cache
14/07/25 17:51:38 INFO spark.SparkRegistry: Using kryo with register
14/07/25 17:51:38 INFO storage.BlockManager: Found block rdd_2_158 locally
14/07/25 17:51:48 INFO spark.SparkRegistry: Using kryo with register
14/07/25 17:51:48 INFO spark.SparkRegistry: Using kryo with register
14/07/25 17:51:48 INFO spark.SparkRegistry: Using kryo with register
14/07/25 17:51:48 INFO executor.Executor: Serialized size of result for 742 is 
752
14/07/25 17:51:48 INFO executor.Executor: Sending result for 742 directly to 
driver
14/07/25 17:51:48 INFO executor.Executor: Finished task ID 742
<—— I have shutdown the App
14/07/25 18:16:36 INFO executor.CoarseGrainedExecutorBackend: Driver commanded 
a shutdown

On Jul 2, 2014, at 0:08, Xiangrui Meng  wrote:

> Try to reduce number of partitions to match the number of cores. We
> will add treeAggregate to reduce the communication cost.
> 
> PR: https://github.com/apache/spark/pull/1110
> 
> -Xiangrui
> 
> On Tue, Jul 1, 2014 at 12:55 AM, Charles Li  wrote:
>> Hi Spark,
>> 
>> I am running LBFGS on our user data. The data size with Kryo serialisation 
>> is about 210G. The weight size is around 1,300,000. I am quite confused that 
>> the performance is very close whether the data is cached or not.
>> 
>> The program is simple:
>> points = sc.hadoopFIle(int, SequenceFileInputFormat.class …..)
>> points.persist(StorageLevel.Memory_AND_DISK_SER()) // comment it if not 
>> cached
>> gradient = new LogisticGrandient();
>> updater = new SquaredL2Updater();
>> initWeight = Vectors.sparse(size, new int[]{}, new double[]{})
>> result = LBFGS.runLBFGS(points.rdd(), grandaunt, updater, numCorrections, 
>> convergeTol, maxIter, regParam, initWeight);
>> 
>> I have 13 machines with 16 cpus, 48G RAM each. Spark is running on its 
>> cluster mode. Below are some arguments I am using:
>> —executor-memory 10G
>> —num-executors 50
>> —executor-cores 2
>> 
>> Storage Using:
>> When caching:
>> Cached Partitions 951
>> Fraction Cached 100%
>> Size in Memory 215.7GB
>> Size in Tachyon 0.0B
>> Size on Disk 1029.7MB
>> 
>> The time cost by every aggregate is around 5 minutes with cache enabled. 
>> Lots of disk IOs can be seen on the hadoop node. I have the same result with 
>> cache disabled.
>> 
>> Should data points caching improve the performance? Should caching decrease 
>> the disk IO?
>> 
>> Thanks in advance.
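
Since the quoted program mixes Java and Scala and contains autocorrect typos
("grandaunt", "LogisticGrandient" for gradient/LogisticGradient), here is a
self-contained Scala sketch of the same flow using MLlib's optimization API; the
input path and parameter values are hypothetical:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext("local[4]", "lbfgs-sketch")

// Load (label, features) pairs; the libsvm path is a stand-in for the real input.
val examples = MLUtils.loadLibSVMFile(sc, "data.libsvm")
val points: RDD[(Double, Vector)] = examples.map(p => (p.label, p.features))
points.persist(StorageLevel.MEMORY_AND_DISK_SER)   // comment out to compare uncached runs

val numFeatures = examples.first().features.size
val gradient = new LogisticGradient()
val updater = new SquaredL2Updater()
val initialWeights = Vectors.sparse(numFeatures, Array.empty[Int], Array.empty[Double])

val (weights, lossHistory) = LBFGS.runLBFGS(
  points, gradient, updater,
  10,      // number of corrections
  1e-4,    // convergence tolerance
  20,      // max iterations
  0.1,     // regularization parameter
  initialWeights)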



Hadoop client protocol mismatch with spark 1.0.1, cdh3u5

2014-07-25 Thread Bharath Ravi Kumar
I'm encountering a Hadoop client protocol mismatch when trying to read from HDFS
(cdh3u5) using the pre-built Spark from the downloads page (linked under
"For Hadoop 1 (HDP1, CDH3)"). I've also followed the instructions at
http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html
(i.e. building the app against hadoop-client 0.20.2-cdh3u5), but I continue
to see the following error regardless of whether I link the app with the
CDH client:

14/07/25 09:53:43 INFO client.AppClient$ClientActor: Executor updated: app-20140725095343-0016/1 is now RUNNING
14/07/25 09:53:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/07/25 09:53:43 WARN snappy.LoadSnappy: Snappy native library not loaded
Exception in thread "main" org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol org.apache.hadoop.hdfs.protocol.ClientProtocol version mismatch. (client = 61, server = 63)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:401)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)


While I could build Spark against the exact Hadoop distro version, I'd rather
work with the standard prebuilt binaries, making any additional changes while
building the app if necessary. Any workarounds or recommendations?

Thanks,
Bharath
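
For reference, a hedged sketch of what the app-side pinning usually looks like in an sbt build; the Cloudera repository URL and the spark-core coordinates are assumptions, not taken from the setup above. A protocol mismatch like this usually means the HDFS client on the Spark side (driver and executors) still doesn't match the cdh3u5 namenode, so the Spark assembly deployed on the cluster generally needs to be built against the same HDFS version as well.

// build.sbt sketch (assumed coordinates): pin the CDH3u5 Hadoop client in the
// application build; spark-core is marked "provided" so the cluster's own
// Spark assembly supplies it at runtime.
resolvers += "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.0.1" % "provided",
  "org.apache.hadoop" % "hadoop-client" % "0.20.2-cdh3u5"
)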


Re: Down-scaling Spark on EC2 cluster

2014-07-25 Thread Shubhabrata
Any idea of the probable dates for this implementation? I believe it would be
a wonderful (and essential) piece of functionality for gaining wider acceptance
in the community.





Re: GraphX Pragel implementation

2014-07-25 Thread Arun Kumar
Hi

Thanks for the quick response. I am new to Scala and will need some help.

Regards
-Arun


On Fri, Jul 25, 2014 at 10:37 AM, Ankur Dave  wrote:

> On Thu, Jul 24, 2014 at 9:52 AM, Arun Kumar  wrote:
>
>> While using the Pregel API for iterations, how can I figure out which superstep
>> the iteration is currently in?
>
>
> The Pregel API doesn't currently expose this, but it's very
> straightforward to modify Pregel.scala to do so. Let me know if you'd like
> help doing this.
>
> Ankur 
>
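
One hedged workaround that needs no changes to GraphX itself is to carry a superstep counter in the vertex attribute and bump it inside the vertex program; the max-propagation logic below is only a placeholder computation, not part of the Pregel API:

import org.apache.spark.graphx._

// Vertex attribute that piggybacks a superstep counter next to the real value.
case class Attr(value: Double, step: Int)

def withSuperstepCounter(graph: Graph[Double, Double]): Graph[Attr, Double] = {
  val g = graph.mapVertices((_, v) => Attr(v, 0))
  g.pregel(Double.NegativeInfinity)(
    // vprog runs once per superstep for every vertex that received a message
    // (plus once at superstep 0), so attr.step approximates the current
    // superstep as seen by that vertex.
    (id, attr, msg) => Attr(math.max(attr.value, msg), attr.step + 1),
    triplet =>
      if (triplet.srcAttr.value > triplet.dstAttr.value)
        Iterator((triplet.dstId, triplet.srcAttr.value))
      else
        Iterator.empty,
    (a, b) => math.max(a, b)
  )
}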


Re: Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

2014-07-25 Thread Jianshi Huang
I nailed it down to a union operation, here's my code snippet:

val properties: RDD[((String, String, String), Externalizer[KeyValue])] =
  vertices.map { ve =>
    val (vertices, dsName) = ve
    val rval = GraphConfig.getRval(datasetConf, Constants.VERTICES, dsName)
    val (_, rvalAsc, rvalType) = rval

    println(s"Table name: $dsName, Rval: $rval")
    println(vertices.toDebugString)

    vertices.map { v =>
      val rk = appendHash(boxId(v.id)).getBytes
      val cf = PROP_BYTES
      val cq = boxRval(v.rval, rvalAsc, rvalType).getBytes
      val value = Serializer.serialize(v.properties)

      ((new String(rk), new String(cf), new String(cq)),
       Externalizer(put(rk, cf, cq, value)))
    }
  }.reduce(_.union(_)).sortByKey(numPartitions = 32)

Basically I read data from multiple tables (Seq[RDD[(key, value)]]) and
transform them into KeyValues to be inserted into HBase, so I need to do a
.reduce(_.union(_)) to combine them into one RDD[(key, value)].

I cannot see what's wrong with my code.
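
Not necessarily related to the error, but here is a hedged sketch of the same combine step written with SparkContext.union, which keeps the lineage flat instead of nesting one UnionRDD per table; the helper name and type parameters are hypothetical:

import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD functions such as sortByKey (1.0.x style)
import org.apache.spark.rdd.RDD

// Hypothetical helper: one flat union over all per-table RDDs, then sort,
// equivalent to reduce(_.union(_)).sortByKey(...) above but with a single UnionRDD.
def combineAndSort[K: Ordering: ClassTag, V: ClassTag](
    sc: SparkContext, parts: Seq[RDD[(K, V)]]): RDD[(K, V)] =
  sc.union(parts).sortByKey(numPartitions = 32)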

Jianshi



On Fri, Jul 25, 2014 at 12:24 PM, Jianshi Huang 
wrote:

> I can successfully run my code in local mode using spark-submit (--master
> local[4]), but I got ExceptionInInitializerError errors in Yarn-client mode.
>
> Any hints as to what the problem might be? Is it a closure serialization problem? How
> can I debug it? Your answers would be very helpful.
>
> 14/07/25 11:48:14 WARN scheduler.TaskSetManager: Loss was due to java.lang.ExceptionInInitializerError
> java.lang.ExceptionInInitializerError
>     at com.paypal.risk.rds.granada.storage.hbase.HBaseStore$$anonfun$1$$anonfun$apply$1.apply(HBaseStore.scala:40)
>     at com.paypal.risk.rds.granada.storage.hbase.HBaseStore$$anonfun$1$$anonfun$apply$1.apply(HBaseStore.scala:36)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>     at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1016)
>     at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:847)
>     at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:847)
>     at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
>     at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1080)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>     at org.apache.spark.scheduler.Task.run(Task.scala:51)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/