Re: Serialization Problem in Spark Program

2015-03-25 Thread Akhil Das
Try registering your MyObject[] with Kryo.
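For example, a minimal sketch of the registration (MyObject as in the code below; registering the array class is what the error message is asking for):

conf.registerKryoClasses(Array(
  classOf[MyObject],
  classOf[Array[MyObject]]))  // also covers dhao.test.Serialization.MyObject[]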
On 25 Mar 2015 13:17, "donhoff_h" <165612...@qq.com> wrote:

> Hi, experts
>
> I wrote a very simple spark program to test the KryoSerialization
> function. The codes are as following:
>
> object TestKryoSerialization {
>   def main(args: Array[String]) {
>     val conf = new SparkConf()
>     conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>     conf.set("spark.kryo.registrationRequired", "true")  // I use this statement to force checking registration.
>     conf.registerKryoClasses(Array(classOf[MyObject]))
>
>     val sc = new SparkContext(conf)
>     val rdd = sc.textFile("hdfs://dhao.hrb:8020/user/spark/tstfiles/charset/A_utf8.txt")
>     val objs = rdd.map(new MyObject(_, 1)).collect()
>     for (x <- objs) {
>       x.printMyObject
>     }
>   }
> }
>
> The class MyObject is also a very simple class, used only to test the
> serialization function:
>
> class MyObject {
>   var myStr : String = ""
>   var myInt : Int = 0
>   def this(inStr : String, inInt : Int) {
>     this()
>     this.myStr = inStr
>     this.myInt = inInt
>   }
>   def printMyObject {
>     println("MyString is : " + myStr + "\tMyInt is : " + myInt)
>   }
> }
>
> But when I ran the application, it reported the following error:
> java.lang.IllegalArgumentException: Class is not registered:
> dhao.test.Serialization.MyObject[]
> Note: To register this class use:
> kryo.register(dhao.test.Serialization.MyObject[].class);
> at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442)
> at
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:565)
> at
> org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:161)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> I don't understand what causes this problem. I have used
> "conf.registerKryoClasses" to register my class. Could anyone help me?
> Thanks
>
> By the way, the spark version is 1.3.0.
>
>


Re: How to troubleshoot server.TransportChannelHandler Exception

2015-03-25 Thread Akhil Das
What's your Spark version? Not quite sure, but you could be hitting this
issue: https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-4516
On 26 Mar 2015 11:01, "Xi Shen"  wrote:

> Hi,
>
> My environment is Windows 64bit, Spark + YARN. I had a job that takes a
> long time. It starts well, but it ended with below exception:
>
> 15/03/25 12:39:09 WARN server.TransportChannelHandler: Exception in
> connection from
> headnode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net/100.72.68.34:58507
> java.io.IOException: An existing connection was forcibly closed by the
> remote host
> at sun.nio.ch.SocketDispatcher.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at sun.nio.ch.IOUtil.read(IOUtil.java:192)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
> at
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
> at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
> at
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
> at
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> at
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
> at java.lang.Thread.run(Thread.java:745)
> 15/03/25 12:39:09 ERROR executor.CoarseGrainedExecutorBackend: Driver
> Disassociated [akka.tcp://
> sparkexecu...@workernode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net:65469]
> -> [akka.tcp://
> sparkdri...@headnode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net:58467]
> disassociated! Shutting down.
> 15/03/25 12:39:09 WARN remote.ReliableDeliverySupervisor: Association with
> remote system [akka.tcp://
> sparkdri...@headnode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net:58467]
> has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
>
> Interestingly, the job is shown as Succeeded in the RM. I checked the
> application log; it is miles long, and this is the only exception I found.
> It is not very useful for helping me pinpoint the problem.
>
> Any idea what would be the cause?
>
>
> Thanks,
>
>
> Xi Shen
> about.me/davidshen
>


Re: OOM for HiveFromSpark example

2015-03-25 Thread Akhil Das
Try to give the complete path to the file kv1.txt.
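For example, a sketch of the change, assuming a HiveContext named hiveContext and using the absolute path from the directory listing below:

hiveContext.sql("LOAD DATA LOCAL INPATH " +
  "'/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/examples/src/main/resources/kv1.txt' " +
  "INTO TABLE src")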
On 26 Mar 2015 11:48, "ÐΞ€ρ@Ҝ (๏̯͡๏)"  wrote:

> I am now seeing this error.
>
>
>
>
>
> 15/03/25 19:44:03 ERROR yarn.ApplicationMaster: User class threw
> exception: FAILED: SemanticException Line 1:23 Invalid path
> ''examples/src/main/resources/kv1.txt'': No files matching path
> file:/hadoop/10/scratch/local/usercache/dvasthimal/appcache/application_1426715280024_89893/container_1426715280024_89893_01_02/examples/src/main/resources/kv1.txt
>
> org.apache.spark.sql.execution.QueryExecutionException: FAILED:
> SemanticException Line 1:23 Invalid path
> ''examples/src/main/resources/kv1.txt'': No files matching path
> file:/hadoop/10/scratch/local/usercache/dvasthimal/appcache/application_1426715280024_89893/container_1426715280024_89893_01_02/examples/src/main/resources/kv1.txt
>
> at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:312)
>
> at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:280)
>
>
>
>
> -sh-4.1$ pwd
>
> /home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4
>
> -sh-4.1$ ls examples/src/main/resources/kv1.txt
>
> examples/src/main/resources/kv1.txt
>
> -sh-4.1$
>
>
>
> On Thu, Mar 26, 2015 at 8:08 AM, Zhan Zhang 
> wrote:
>
>>  You can do it in $SPARK_HOME/conf/spark-defaults.conf
>>
>>  spark.driver.extraJavaOptions -XX:MaxPermSize=512m
>>
>>  Thanks.
>>
>>  Zhan Zhang
>>
>>
>>  On Mar 25, 2015, at 7:25 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:
>>
>>  Where and how do i pass this or other JVM argument ?
>> -XX:MaxPermSize=512m
>>
>> On Wed, Mar 25, 2015 at 11:36 PM, Zhan Zhang 
>> wrote:
>>
>>> I solved this by increasing the PermGen memory size in the driver.
>>>
>>>  -XX:MaxPermSize=512m
>>>
>>>  Thanks.
>>>
>>>  Zhan Zhang
>>>
>>>  On Mar 25, 2015, at 10:54 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) 
>>> wrote:
>>>
>>>  I am facing same issue, posted a new thread. Please respond.
>>>
>>> On Wed, Jan 14, 2015 at 4:38 AM, Zhan Zhang 
>>> wrote:
>>>
 Hi Folks,

 I am trying to run hive context in yarn-cluster mode, but hit some
 errors. Does anybody know what causes the issue?

 I use following cmd to build the distribution:

  ./make-distribution.sh -Phive -Phive-thriftserver  -Pyarn  -Phadoop-2.4

 15/01/13 17:59:42 INFO cluster.YarnClusterScheduler:
 YarnClusterScheduler.postStartHook done
 15/01/13 17:59:42 INFO storage.BlockManagerMasterActor: Registering
 block manager cn122-10.l42scl.hortonworks.com:56157 with 1589.8 MB
 RAM, BlockManagerId(2, cn122-10.l42scl.hortonworks.com, 56157)
 15/01/13 17:59:43 INFO parse.ParseDriver: Parsing command: CREATE TABLE
 IF NOT EXISTS src (key INT, value STRING)
 15/01/13 17:59:43 INFO parse.ParseDriver: Parse Completed
 15/01/13 17:59:44 INFO metastore.HiveMetaStore: 0: Opening raw store
 with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
 15/01/13 17:59:44 INFO metastore.ObjectStore: ObjectStore, initialize
 called
 15/01/13 17:59:44 INFO DataNucleus.Persistence: Property
 datanucleus.cache.level2 unknown - will be ignored
 15/01/13 17:59:44 INFO DataNucleus.Persistence: Property
 hive.metastore.integral.jdo.pushdown unknown - will be ignored
 15/01/13 17:59:44 WARN DataNucleus.Connection: BoneCP specified but not
 present in CLASSPATH (or one of dependencies)
 15/01/13 17:59:44 WARN DataNucleus.Connection: BoneCP specified but not
 present in CLASSPATH (or one of dependencies)
 15/01/13 17:59:52 INFO metastore.ObjectStore: Setting MetaStore object
 pin classes with
 hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
 15/01/13 17:59:52 INFO metastore.MetaStoreDirectSql: MySQL check
 failed, assuming we are not on mysql: Lexical error at line 1, column 5.
 Encountered: "@" (64), after : "".
 15/01/13 17:59:53 INFO DataNucleus.Datastore: The class
 "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
 "embedded-only" so does not have its own datastore table.
 15/01/13 17:59:53 INFO DataNucleus.Datastore: The class
 "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
 "embedded-only" so does not have its own datastore table.
 15/01/13 17:59:59 INFO DataNucleus.Datastore: The class
 "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
 "embedded-only" so does not have its own datastore table.
 15/01/13 17:59:59 INFO DataNucleus.Datastore: The class
 "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
 "embedded-only" so does not have its own datastore table.
 15/01/13 18:00:00 INFO metastore.ObjectStore: Initialized ObjectStore
 15/01/13 18:00:00 WARN metastore.ObjectStore: Version information not
 found in metastore. hive.metastore.schema.verification is not enabled so
 recording the schema version 0.13.1aa
 15/01/13 18:00:01 INFO metastore.HiveMetaStore: Added admin role in
>>>

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-25 Thread Nick Pentreath
From a quick look at this link -
http://accumulo.apache.org/1.6/accumulo_user_manual.html#_mapreduce - it
seems you need to call some static methods on AccumuloInputFormat in order
to set the auth, table, and range settings. Try setting these config
options first and then call newAPIHadoopRDD?
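For example, a rough sketch of that wiring in Scala (static method names as listed in the Accumulo 1.6 manual linked above; treat the exact signatures as an assumption to check against your Accumulo version, and the connection values come from the scanner example further down the thread):

import org.apache.accumulo.core.client.ClientConfiguration
import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat
import org.apache.accumulo.core.client.security.tokens.PasswordToken
import org.apache.accumulo.core.data.{Key, Value}
import org.apache.accumulo.core.security.Authorizations
import org.apache.hadoop.mapreduce.Job

// The Job object is only used as a carrier for the InputFormat settings.
val job = Job.getInstance()
AccumuloInputFormat.setZooKeeperInstance(job,
  new ClientConfiguration().withInstance("accumulo").withZkHosts("localhost:2181"))
AccumuloInputFormat.setConnectorInfo(job, "root", new PasswordToken("password"))
AccumuloInputFormat.setInputTableName(job, "batchtest1")
AccumuloInputFormat.setScanAuthorizations(job, new Authorizations("exampleVis"))

val rdd2 = sparkContext.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[AccumuloInputFormat],
  classOf[Key],
  classOf[Value])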

On Thu, Mar 26, 2015 at 2:34 AM, David Holiday 
wrote:

>  hi Irfan,
>
>  thanks for getting back to me - i'll try the accumulo list to be sure.
> What is the normal use case for Spark, though? I'm surprised that hooking it
> into something as common and popular as Accumulo isn't more of an everyday
> task.
>
> DAVID HOLIDAY
>  Software Engineer
>  760 607 3300 | Office
>  312 758 8385 | Mobile
>  dav...@annaisystems.com 
>
>
>
> www.AnnaiSystems.com
>
>  On Mar 25, 2015, at 5:27 PM, Irfan Ahmad  wrote:
>
>  Hmmm this seems very accumulo-specific, doesn't it? Not sure how to
> help with that.
>
>
>  *Irfan Ahmad*
> CTO | Co-Founder | *CloudPhysics* 
> Best of VMworld Finalist
>  Best Cloud Management Award
>  NetworkWorld 10 Startups to Watch
> EMA Most Notable Vendor
>
> On Tue, Mar 24, 2015 at 4:09 PM, David Holiday 
> wrote:
>
>> hi all,
>>
>>  got a vagrant image with spark notebook, spark, accumulo, and hadoop
>> all running. from notebook I can manually create a scanner and pull test
>> data from a table I created using one of the accumulo examples:
>>
>> val instanceNameS = "accumulo"
>> val zooServersS = "localhost:2181"
>> val instance: Instance = new ZooKeeperInstance(instanceNameS, zooServersS)
>> val connector: Connector = instance.getConnector("root", new PasswordToken("password"))
>> val auths = new Authorizations("exampleVis")
>> val scanner = connector.createScanner("batchtest1", auths)
>>
>> scanner.setRange(new Range("row_00", "row_10"))
>> for (entry: Entry[Key, Value] <- scanner) {
>>   println(entry.getKey + " is " + entry.getValue)
>> }
>>
>> will give the first ten rows of table data. when I try to create the RDD
>> thusly:
>>
>> val rdd2 =
>>   sparkContext.newAPIHadoopRDD (
>> new Configuration(),
>> classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat],
>> classOf[org.apache.accumulo.core.data.Key],
>> classOf[org.apache.accumulo.core.data.Value]
>>   )
>>
>> I get an RDD returned to me that I can't do much with due to the
>> following error:
>>
>> java.io.IOException: Input info has not been set.
>> at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630)
>> at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:343)
>> at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:538)
>> at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
>> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
>> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
>> at scala.Option.getOrElse(Option.scala:120)
>> at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1367)
>> at org.apache.spark.rdd.RDD.count(RDD.scala:927)
>>
>> which totally makes sense in light of the fact that I haven't specified
>> any parameters as to which table to connect with, what the auths are, etc.
>>
>> so my question is: what do I need to do from here to get those first ten
>> rows of table data into my RDD?
>>
>>
>>
>>  DAVID HOLIDAY
>>  Software Engineer
>>  760 607 3300 | Office
>>  312 758 8385 | Mobile
>>  dav...@annaisystems.com 
>>
>>
>> 
>> www.AnnaiSystems.com 
>>
>>   On Mar 19, 2015, at 11:25 AM, David Holiday 
>> wrote:
>>
>>  kk - I'll put something together and get back to you with more :-)
>>
>> DAVID HOLIDAY
>>  Software Engineer
>>  760 607 3300 | Office
>>  312 758 8385 | Mobile
>>  dav...@annaisystems.com 
>>
>>
>> 
>> www.AnnaiSystems.com 
>>
>>  On Mar 19, 2015, at 10:59 AM, Irfan Ahmad 
>> wrote:
>>
>>  Once you setup spark-notebook, it'll handle the submits for interactive
>> work. Non-interactive is not handled by it. For that spark-kernel could be
>> used.
>>
>>  Give it a shot ... it only takes 5 minutes to get it running in
>> local-mode.
>>
>>
>>  *Irfan Ahmad*
>> CTO | Co-Founder | *CloudPhysics* 
>> Best of VMworld Finalist
>>  Best Cloud Management Award
>>  NetworkWorld 10 Startups to Watch
>> EMA Most Notable Vendor
>>
>> On Thu, Mar 19, 2015 at 9:51 AM, David Holiday 
>> wrote:
>>
>>> hi all - thx for the alacritous replies! so regarding how to get things
>>> from notebook to spark and back, am I correct that spark-submit is the way
>>> to go?
>>>
>>> DAVID HOLIDAY
>>>  Software Engineer
>>>  760 607 3300 | Office
>>>  312 758 8385 | Mobile
>>>  dav...@annaisystems.com 
>>>
>>>
>>> 
>>> www.AnnaiSystems.com 
>>>
>>>  On Mar 19, 20

Re: OOM for HiveFromSpark example

2015-03-25 Thread ๏̯͡๏
I am now seeing this error.





15/03/25 19:44:03 ERROR yarn.ApplicationMaster: User class threw exception:
FAILED: SemanticException Line 1:23 Invalid path
''examples/src/main/resources/kv1.txt'': No files matching path
file:/hadoop/10/scratch/local/usercache/dvasthimal/appcache/application_1426715280024_89893/container_1426715280024_89893_01_02/examples/src/main/resources/kv1.txt

org.apache.spark.sql.execution.QueryExecutionException: FAILED:
SemanticException Line 1:23 Invalid path
''examples/src/main/resources/kv1.txt'': No files matching path
file:/hadoop/10/scratch/local/usercache/dvasthimal/appcache/application_1426715280024_89893/container_1426715280024_89893_01_02/examples/src/main/resources/kv1.txt

at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:312)

at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:280)




-sh-4.1$ pwd

/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4

-sh-4.1$ ls examples/src/main/resources/kv1.txt

examples/src/main/resources/kv1.txt

-sh-4.1$



On Thu, Mar 26, 2015 at 8:08 AM, Zhan Zhang  wrote:

>  You can do it in $SPARK_HOME/conf/spark-defaults.conf
>
>  spark.driver.extraJavaOptions -XX:MaxPermSize=512m
>
>  Thanks.
>
>  Zhan Zhang
>
>
>  On Mar 25, 2015, at 7:25 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:
>
>  Where and how do i pass this or other JVM argument ?
> -XX:MaxPermSize=512m
>
> On Wed, Mar 25, 2015 at 11:36 PM, Zhan Zhang 
> wrote:
>
>> I solved this by increasing the PermGen memory size in the driver.
>>
>>  -XX:MaxPermSize=512m
>>
>>  Thanks.
>>
>>  Zhan Zhang
>>
>>  On Mar 25, 2015, at 10:54 AM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:
>>
>>  I am facing same issue, posted a new thread. Please respond.
>>
>> On Wed, Jan 14, 2015 at 4:38 AM, Zhan Zhang 
>> wrote:
>>
>>> Hi Folks,
>>>
>>> I am trying to run hive context in yarn-cluster mode, but hit some
>>> errors. Does anybody know what causes the issue?
>>>
>>> I use following cmd to build the distribution:
>>>
>>>  ./make-distribution.sh -Phive -Phive-thriftserver  -Pyarn  -Phadoop-2.4
>>>
>>> 15/01/13 17:59:42 INFO cluster.YarnClusterScheduler:
>>> YarnClusterScheduler.postStartHook done
>>> 15/01/13 17:59:42 INFO storage.BlockManagerMasterActor: Registering
>>> block manager cn122-10.l42scl.hortonworks.com:56157 with 1589.8 MB RAM,
>>> BlockManagerId(2, cn122-10.l42scl.hortonworks.com, 56157)
>>> 15/01/13 17:59:43 INFO parse.ParseDriver: Parsing command: CREATE TABLE
>>> IF NOT EXISTS src (key INT, value STRING)
>>> 15/01/13 17:59:43 INFO parse.ParseDriver: Parse Completed
>>> 15/01/13 17:59:44 INFO metastore.HiveMetaStore: 0: Opening raw store
>>> with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
>>> 15/01/13 17:59:44 INFO metastore.ObjectStore: ObjectStore, initialize
>>> called
>>> 15/01/13 17:59:44 INFO DataNucleus.Persistence: Property
>>> datanucleus.cache.level2 unknown - will be ignored
>>> 15/01/13 17:59:44 INFO DataNucleus.Persistence: Property
>>> hive.metastore.integral.jdo.pushdown unknown - will be ignored
>>> 15/01/13 17:59:44 WARN DataNucleus.Connection: BoneCP specified but not
>>> present in CLASSPATH (or one of dependencies)
>>> 15/01/13 17:59:44 WARN DataNucleus.Connection: BoneCP specified but not
>>> present in CLASSPATH (or one of dependencies)
>>> 15/01/13 17:59:52 INFO metastore.ObjectStore: Setting MetaStore object
>>> pin classes with
>>> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
>>> 15/01/13 17:59:52 INFO metastore.MetaStoreDirectSql: MySQL check failed,
>>> assuming we are not on mysql: Lexical error at line 1, column 5.
>>> Encountered: "@" (64), after : "".
>>> 15/01/13 17:59:53 INFO DataNucleus.Datastore: The class
>>> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
>>> "embedded-only" so does not have its own datastore table.
>>> 15/01/13 17:59:53 INFO DataNucleus.Datastore: The class
>>> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
>>> "embedded-only" so does not have its own datastore table.
>>> 15/01/13 17:59:59 INFO DataNucleus.Datastore: The class
>>> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
>>> "embedded-only" so does not have its own datastore table.
>>> 15/01/13 17:59:59 INFO DataNucleus.Datastore: The class
>>> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
>>> "embedded-only" so does not have its own datastore table.
>>> 15/01/13 18:00:00 INFO metastore.ObjectStore: Initialized ObjectStore
>>> 15/01/13 18:00:00 WARN metastore.ObjectStore: Version information not
>>> found in metastore. hive.metastore.schema.verification is not enabled so
>>> recording the schema version 0.13.1aa
>>> 15/01/13 18:00:01 INFO metastore.HiveMetaStore: Added admin role in
>>> metastore
>>> 15/01/13 18:00:01 INFO metastore.HiveMetaStore: Added public role in
>>> metastore
>>> 15/01/13 18:00:01 INFO metastore.HiveMetaStore: No user is added in
>>> admin role, since config is empty
>>> 15/01/13 18:00:01 INFO session

Re: Which OutputCommitter to use for S3?

2015-03-25 Thread Pei-Lun Lee
I updated the PR for SPARK-6352 to be more like SPARK-3595.
I added a new setting, "spark.sql.parquet.output.committer.class", in the Hadoop
configuration to allow a custom implementation of ParquetOutputCommitter.
Can someone take a look at the PR?
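For anyone following along, a hypothetical sketch of how the proposed setting would be applied (the setting name is from this thread; the fully qualified committer class name is a guess and should be checked against the PR):

sc.hadoopConfiguration.set(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")  // class name per the PR description, not a released API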

On Mon, Mar 16, 2015 at 5:23 PM, Pei-Lun Lee  wrote:

> Hi,
>
> I created a JIRA and PR for supporting a s3 friendly output committer for
> saveAsParquetFile:
> https://issues.apache.org/jira/browse/SPARK-6352
> https://github.com/apache/spark/pull/5042
>
> My approach is to add a DirectParquetOutputCommitter class to the spark-sql
> package and use a boolean config variable,
> spark.sql.parquet.useDirectParquetOutputCommitter, to choose between it and
> the default output committer.
> This may not be the smartest solution, but it works for me.
> Tested on spark 1.1, 1.3 with hadoop 1.0.4.
>
>
> On Thu, Mar 5, 2015 at 4:32 PM, Aaron Davidson  wrote:
>
>> Yes, unfortunately that direct dependency makes this injection much more
>> difficult for saveAsParquetFile.
>>
>> On Thu, Mar 5, 2015 at 12:28 AM, Pei-Lun Lee  wrote:
>>
>>> Thanks for the DirectOutputCommitter example.
>>> However I found it only works for saveAsHadoopFile. What about
>>> saveAsParquetFile?
>>> It looks like SparkSQL is using ParquetOutputCommitter, which is a subclass
>>> of FileOutputCommitter.
>>>
>>> On Fri, Feb 27, 2015 at 1:52 AM, Thomas Demoor <
>>> thomas.dem...@amplidata.com>
>>> wrote:
>>>
>>> > FYI. We're currently addressing this at the Hadoop level in
>>> > https://issues.apache.org/jira/browse/HADOOP-9565
>>> >
>>> >
>>> > Thomas Demoor
>>> >
>>> > On Mon, Feb 23, 2015 at 10:16 PM, Darin McBeath <
>>> > ddmcbe...@yahoo.com.invalid> wrote:
>>> >
>>> >> Just to close the loop in case anyone runs into the same problem I
>>> had.
>>> >>
>>> >> By setting --hadoop-major-version=2 when using the ec2 scripts,
>>> >> everything worked fine.
>>> >>
>>> >> Darin.
>>> >>
>>> >>
>>> >> - Original Message -
>>> >> From: Darin McBeath 
>>> >> To: Mingyu Kim ; Aaron Davidson <
>>> ilike...@gmail.com>
>>> >> Cc: "user@spark.apache.org" 
>>> >> Sent: Monday, February 23, 2015 3:16 PM
>>> >> Subject: Re: Which OutputCommitter to use for S3?
>>> >>
>>> >> Thanks.  I think my problem might actually be the other way around.
>>> >>
>>> >> I'm compiling with hadoop 2,  but when I startup Spark, using the ec2
>>> >> scripts, I don't specify a
>>> >> -hadoop-major-version and the default is 1.   I'm guessing that if I
>>> make
>>> >> that a 2 that it might work correctly.  I'll try it and post a
>>> response.
>>> >>
>>> >>
>>> >> - Original Message -
>>> >> From: Mingyu Kim 
>>> >> To: Darin McBeath ; Aaron Davidson <
>>> >> ilike...@gmail.com>
>>> >> Cc: "user@spark.apache.org" 
>>> >> Sent: Monday, February 23, 2015 3:06 PM
>>> >> Subject: Re: Which OutputCommitter to use for S3?
>>> >>
>>> >> Cool, we will start from there. Thanks Aaron and Josh!
>>> >>
>>> >> Darin, it's likely because the DirectOutputCommitter is compiled with
>>> >> Hadoop 1 classes and you're running it with Hadoop 2.
>>> >> org.apache.hadoop.mapred.JobContext used to be a class in Hadoop 1,
>>> >> and it became an interface in Hadoop 2.
>>> >>
>>> >> Mingyu
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> On 2/23/15, 11:52 AM, "Darin McBeath" 
>>> >> wrote:
>>> >>
>>> >> >Aaron.  Thanks for the class. Since I'm currently writing Java based
>>> >> >Spark applications, I tried converting your class to Java (it seemed
>>> >> >pretty straightforward).
>>> >> >
>>> >> >I set up the use of the class as follows:
>>> >> >
>>> >> >SparkConf conf = new SparkConf()
>>> >> >.set("spark.hadoop.mapred.output.committer.class",
>>> >> >"com.elsevier.common.DirectOutputCommitter");
>>> >> >
>>> >> >And I then try and save a file to S3 (which I believe should use the
>>> old
>>> >> >hadoop apis).
>>> >> >
>>> >> >JavaPairRDD newBaselineRDDWritable =
>>> >> >reducedhsfPairRDD.mapToPair(new ConvertToWritableTypes());
>>> >> >newBaselineRDDWritable.saveAsHadoopFile(baselineOutputBucketFile,
>>> >> >Text.class, Text.class, SequenceFileOutputFormat.class,
>>> >> >org.apache.hadoop.io.compress.GzipCodec.class);
>>> >> >
>>> >> >But, I get the following error message.
>>> >> >
>>> >> >Exception in thread "main" java.lang.IncompatibleClassChangeError:
>>> Found
>>> >> >class org.apache.hadoop.mapred.JobContext, but interface was expected
>>> >> >at
>>> >>
>>> >>
>>> >com.elsevier.common.DirectOutputCommitter.commitJob(DirectOutputCommitter.
>>> >> >java:68)
>>> >> >at
>>> >>
>>> >org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:127)
>>> >> >at
>>> >>
>>> >>
>>> >org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions
>>> >> >.scala:1075)
>>> >> >at
>>> >>
>>> >>
>>> >org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.sc
>>> >> >ala:940)
>>> >> >at
>>> >>
>>> >>
>>> >org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.sc
>>> >> >ala:902)
>>> >> >at
>>> >>
>>> >>
>>> >org.apache.spark.api.java.

How to troubleshoot server.TransportChannelHandler Exception

2015-03-25 Thread Xi Shen
Hi,

My environment is Windows 64bit, Spark + YARN. I had a job that takes a
long time. It starts well, but it ended with below exception:

15/03/25 12:39:09 WARN server.TransportChannelHandler: Exception in
connection from
headnode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net/100.72.68.34:58507
java.io.IOException: An existing connection was forcibly closed by the
remote host
at sun.nio.ch.SocketDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at
io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at
io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
at
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at java.lang.Thread.run(Thread.java:745)
15/03/25 12:39:09 ERROR executor.CoarseGrainedExecutorBackend: Driver
Disassociated [akka.tcp://
sparkexecu...@workernode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net:65469]
-> [akka.tcp://
sparkdri...@headnode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net:58467]
disassociated! Shutting down.
15/03/25 12:39:09 WARN remote.ReliableDeliverySupervisor: Association with
remote system [akka.tcp://
sparkdri...@headnode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net:58467]
has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

Interestingly, the job is shown as Succeeded in the RM. I checked the
application log; it is miles long, and this is the only exception I found.
It is not very useful for helping me pinpoint the problem.

Any idea what would be the cause?


Thanks,


Xi Shen
about.me/davidshen


Re: How to deploy binary dependencies to workers?

2015-03-25 Thread Jörn Franke
You probably need to add the DLL directory to the PATH (not the classpath!)
environment variable on all nodes.
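For example, a minimal sketch on a Windows node (C:\openblas comes from the spark.yarn.dist.files value quoted below; run in an elevated prompt and restart the Spark/YARN services so the new PATH is picked up):

REM append the OpenBLAS DLL directory to the machine-wide PATH
setx /M PATH "%PATH%;C:\openblas"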
On 26 Mar 2015 06:23, "Xi Shen"  wrote:

> Of course not... all machines in HDInsight are Windows 64-bit servers. And I
> have made sure all my DLLs are for 64-bit machines. I have managed to get
> those DLLs loaded on my local machine, which is also Windows 64-bit.
>
>
>
>
> Xi Shen
> about.me/davidshen
>
> On Thu, Mar 26, 2015 at 11:11 AM, DB Tsai  wrote:
>
>> Are you deploying the windows dll to linux machine?
>>
>> Sincerely,
>>
>> DB Tsai
>> ---
>> Blog: https://www.dbtsai.com
>>
>>
>> On Wed, Mar 25, 2015 at 3:57 AM, Xi Shen  wrote:
>> > I think you meant to use "--files" to deploy the DLLs. I gave it a try,
>> > but it did not work.
>> >
>> > From the Spark UI, Environment tab, I can see
>> >
>> > spark.yarn.dist.files
>> >
>> >
>> file:/c:/openblas/libgcc_s_seh-1.dll,file:/c:/openblas/libblas3.dll,file:/c:/openblas/libgfortran-3.dll,file:/c:/openblas/liblapack3.dll,file:/c:/openblas/libquadmath-0.dll
>> >
>> > I think my DLLs are all deployed. But I still got the warning message that
>> > the native BLAS library cannot be loaded.
>> >
>> > Any idea?
>> >
>> >
>> > Thanks,
>> > David
>> >
>> >
>> > On Wed, Mar 25, 2015 at 5:40 AM DB Tsai  wrote:
>> >>
>> >> I would recommend uploading those jars to HDFS and using the add-jars
>> >> option in spark-submit with a URI from HDFS instead of a URI from the
>> >> local filesystem. That avoids the problem of fetching jars from the
>> >> driver, which can be a bottleneck.
>> >>
>> >> Sincerely,
>> >>
>> >> DB Tsai
>> >> ---
>> >> Blog: https://www.dbtsai.com
>> >>
>> >>
>> >> On Tue, Mar 24, 2015 at 4:13 AM, Xi Shen 
>> wrote:
>> >> > Hi,
>> >> >
>> >> > I am doing ML using Spark mllib. However, I do not have full control
>> to
>> >> > the
>> >> > cluster. I am using Microsoft Azure HDInsight
>> >> >
>> >> > I want to deploy the BLAS or whatever required dependencies to
>> >> > accelerate
>> >> > the computation. But I don't know how to deploy those DLLs when I
>> submit
>> >> > my
>> >> > JAR to the cluster.
>> >> >
>> >> > I know how to pack those DLLs into a jar. The real challenge is how
>> to
>> >> > let
>> >> > the system find them...
>> >> >
>> >> >
>> >> > Thanks,
>> >> > David
>> >> >
>>
>
>


RE: JavaKinesisWordCountASLYARN Example not working on EMR

2015-03-25 Thread Ankur Jain
I had installed spark via bootstrap in EMR.

https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark

However, when I run Spark without YARN (local), it works fine.

Thanks
Ankur

From: Arush Kharbanda [mailto:ar...@sigmoidanalytics.com]
Sent: Wednesday, March 25, 2015 7:31 PM
To: Ankur Jain
Cc: user@spark.apache.org
Subject: Re: JavaKinesisWordCountASLYARN Example not working on EMR

Did you build it for Kinesis using the -Pkinesis-asl profile?
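For example, a sketch of a build with that profile enabled (assuming you build your own distribution, along the lines of make-distribution.sh):

./make-distribution.sh -Pyarn -Phadoop-2.4 -Pkinesis-asl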

On Wed, Mar 25, 2015 at 7:18 PM, ankur.jain <ankur.j...@yash.com> wrote:
Hi,
I am trying to run a Spark on YARN program provided by Spark in the examples
directory using Amazon Kinesis on an EMR cluster:
I am using Spark 1.3.0 and EMR AMI: 3.5.0

I've setup the Credentials
export AWS_ACCESS_KEY_ID=XX
export AWS_SECRET_KEY=XXX

*A) This is the Kinesis Word Count Producer which ran Successfully : *
run-example org.apache.spark.examples.streaming.KinesisWordCountProducerASL
mySparkStream https://kinesis.us-east-1.amazonaws.com 1 5

*B) This one is the Normal Consumer using Spark Streaming which also ran
Successfully: *
run-example org.apache.spark.examples.streaming.JavaKinesisWordCountASL
mySparkStream https://kinesis.us-east-1.amazonaws.com

*C) And this is the YARN based program which is NOT WORKING: *
run-example org.apache.spark.examples.streaming.JavaKinesisWordCountASLYARN
mySparkStream https://kinesis.us-east-1.amazonaws.com\
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
15/03/25 11:52:45 INFO spark.SparkContext: Running Spark version 1.3.0
15/03/25 11:52:45 WARN spark.SparkConf:
SPARK_CLASSPATH was detected (set to
'/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/:/home/hadoop/spark/classpath/emrfs/:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar').
This is deprecated in Spark 1.0+.
Please instead use:
•   ./spark-submit with --driver-class-path to augment the driver classpath
•   spark.executor.extraClassPath to augment the executor classpath
15/03/25 11:52:45 WARN spark.SparkConf: Setting
'spark.executor.extraClassPath' to
'/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/:/home/hadoop/spark/classpath/emrfs/:/home/hadoop/share/hadoop/common/lib/:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar'
as a work-around.
15/03/25 11:52:45 WARN spark.SparkConf: Setting
'spark.driver.extraClassPath' to
'/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/:/home/hadoop/spark/classpath/emrfs/:/home/hadoop/share/hadoop/common/lib/:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar'
as a work-around.
15/03/25 11:52:46 INFO spark.SecurityManager: Changing view acls to: hadoop
15/03/25 11:52:46 INFO spark.SecurityManager: Changing modify acls to:
hadoop
15/03/25 11:52:46 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(hadoop); users with modify permissions: Set(hadoop)
15/03/25 11:52:47 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/03/25 11:52:48 INFO Remoting: Starting remoting
15/03/25 11:52:48 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkDriver@ip-10-80-175-92.ec2.internal:59504]
15/03/25 11:52:48 INFO util.Utils: Successfully started service
'sparkDriver' on port 59504.
15/03/25 11:52:48 INFO spark.SparkEnv: Registering MapOutputTracker
15/03/25 11:52:48 INFO spark.SparkEnv: Registering BlockManagerMaster
15/03/25 11:52:48 INFO storage.DiskBlockManager: Created local directory at
/mnt/spark/spark-120befbc-6dae-4751-b41f-dbf7b3d97616/blockmgr-d339d180-36f5-465f-bda3-cecccb23b1d3
15/03/25 11:52:48 INFO storage.MemoryStore: MemoryStore started with
capacity 265.4 MB
15/03/25 11:52:48 INFO spark.HttpFileServer: HTTP File server directory is
/mnt/spark/spark-85e88478-3dad-4fcf-a43a-efd15166bef3/httpd-6115870a-0d90-44df-aa7c-a6bd1a47e107
15/03/25 11:52:48 INFO spark.HttpServer: Starting HTTP Server
15/03/25 11:52:49 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/03/25 11:52:49 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:44879
15/03/25 11:52:49 INFO util.Utils: Successfully started service 'HTTP file
server' on port 44879.
15/03/25 11:52:49 INFO spark.SparkEnv: Registering OutputCommitCoordinator
15/03/25 11:52:49 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/03/25 11:52:49 INFO server.AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040
15/03/25 11:52:49 INFO util.Utils: Successfully started service 'SparkUI' on
port 4040.
15/03/25 11:52:49 INFO ui.SparkUI: Started SparkUI at
http://ip-10-80-175-92.ec2.internal:4040
15/03/25 11:52:50 INFO spark.SparkContext: Added JAR
file:/home/hadoop/spark/lib/spark-examples-1.3.0-hadoop2.4.0.jar at
http://10.80.175.92:44879/jars/spark-examples-1.3.0-hadoop2.4.0.jar with
timestamp 1427284370358
15/03/25 11:52:50 INFO cluster.YarnClusterSche

Re: How to deploy binary dependencies to workers?

2015-03-25 Thread Xi Shen
Of course not... all machines in HDInsight are Windows 64-bit servers. And I
have made sure all my DLLs are for 64-bit machines. I have managed to get
those DLLs loaded on my local machine, which is also Windows 64-bit.




Xi Shen
about.me/davidshen

On Thu, Mar 26, 2015 at 11:11 AM, DB Tsai  wrote:

> Are you deploying the windows dll to linux machine?
>
> Sincerely,
>
> DB Tsai
> ---
> Blog: https://www.dbtsai.com
>
>
> On Wed, Mar 25, 2015 at 3:57 AM, Xi Shen  wrote:
> > I think you meant to use "--files" to deploy the DLLs. I gave it a try,
> > but it did not work.
> >
> > From the Spark UI, Environment tab, I can see
> >
> > spark.yarn.dist.files
> >
> >
> file:/c:/openblas/libgcc_s_seh-1.dll,file:/c:/openblas/libblas3.dll,file:/c:/openblas/libgfortran-3.dll,file:/c:/openblas/liblapack3.dll,file:/c:/openblas/libquadmath-0.dll
> >
> > I think my DLLs are all deployed. But I still got the warning message that
> > the native BLAS library cannot be loaded.
> >
> > Any idea?
> >
> >
> > Thanks,
> > David
> >
> >
> > On Wed, Mar 25, 2015 at 5:40 AM DB Tsai  wrote:
> >>
> >> I would recommend uploading those jars to HDFS and using the add-jars
> >> option in spark-submit with a URI from HDFS instead of a URI from the
> >> local filesystem. That avoids the problem of fetching jars from the
> >> driver, which can be a bottleneck.
> >>
> >> Sincerely,
> >>
> >> DB Tsai
> >> ---
> >> Blog: https://www.dbtsai.com
> >>
> >>
> >> On Tue, Mar 24, 2015 at 4:13 AM, Xi Shen  wrote:
> >> > Hi,
> >> >
> >> > I am doing ML using Spark mllib. However, I do not have full control
> to
> >> > the
> >> > cluster. I am using Microsoft Azure HDInsight
> >> >
> >> > I want to deploy the BLAS or whatever required dependencies to
> >> > accelerate
> >> > the computation. But I don't know how to deploy those DLLs when I
> submit
> >> > my
> >> > JAR to the cluster.
> >> >
> >> > I know how to pack those DLLs into a jar. The real challenge is how to
> >> > let
> >> > the system find them...
> >> >
> >> >
> >> > Thanks,
> >> > David
> >> >
>


SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-25 Thread Pei-Lun Lee
Hi,

When I save a parquet file with SaveMode.Overwrite, it never generates
_common_metadata, whether it overwrites an existing dir or not.
Is this expected behavior?
And what is the benefit of _common_metadata? Will reading perform better
when it is present?

Thanks,
--
Pei-Lun


Re: [SparkSQL] How to calculate stddev on a DataFrame?

2015-03-25 Thread Denny Lee
Perhaps this email reference may be able to help from a DataFrame
perspective:
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201503.mbox/%3CCALte62ztepahF=5hk9rcfbnyk4z43wkcq4fkdcbwmgf_3_o...@mail.gmail.com%3E


On Wed, Mar 25, 2015 at 7:29 PM Haopu Wang  wrote:

>  Hi,
>
>
>
> I have a DataFrame object and I want to do types of aggregations like
> count, sum, variance, stddev, etc.
>
>
>
> DataFrame has DSL to do simple aggregations like count and sum.
>
>
>
> How about variance and stddev?
>
>
>
> Thank you for any suggestions!
>
>
>


Re: [SparkSQL] How to calculate stddev on a DataFrame?

2015-03-25 Thread Corey Nolet
I would keep a running sum of squares. This allows you to maintain an ongoing
value as an associative operation (in an aggregator) and then calculate the
variance and standard deviation after the fact.
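For example, a minimal sketch of that approach with the Spark 1.3 DataFrame functions API (assuming a DataFrame named df with a numeric column "x" stored as doubles; this computes the population variance):

import org.apache.spark.sql.functions._

val row = df.agg(count("x"), sum("x"), sum(col("x") * col("x"))).first()
val n  = row.getLong(0).toDouble
val s  = row.getDouble(1)          // sum of x
val ss = row.getDouble(2)          // sum of x squared
val variance = ss / n - math.pow(s / n, 2)   // E[x^2] - E[x]^2
val stddev   = math.sqrt(variance)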

On Wed, Mar 25, 2015 at 10:28 PM, Haopu Wang  wrote:

>  Hi,
>
>
>
> I have a DataFrame object and I want to do types of aggregations like
> count, sum, variance, stddev, etc.
>
>
>
> DataFrame has DSL to do simple aggregations like count and sum.
>
>
>
> How about variance and stddev?
>
>
>
> Thank you for any suggestions!
>
>
>


Re: Spark-sql query got exception.Help

2015-03-25 Thread Saisai Shao
Would you mind running it again to see if this exception can be reproduced?
Exceptions in MapOutputTracker seldom occur, so maybe some other exception
led to this error.

Thanks
Jerry

2015-03-26 10:55 GMT+08:00 李铖 :

> One more exception. How do I fix it? Anybody help me, please.
>
>
> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
> location for shuffle 0
> at
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:386)
> at
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:383)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> at
> org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:382)
> at
> org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:178)
> at
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
> at
> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
> at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:120)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> at
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
> at
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
> at
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
> at
> org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
>
>
> 2015-03-26 10:39 GMT+08:00 李铖 :
>
>> Yes, it works after I appended the two properties in spark-defaults.conf.
>>
>> As I use Python programming on the Spark platform, the Python API does not
>> have the SparkConf API.
>>
>> Thanks.
>>
>> 2015-03-25 21:07 GMT+08:00 Cheng Lian :
>>
>>>  Oh, just noticed that you were calling sc.setSystemProperty. Actually
>>> you need to set this property in SparkConf or in spark-defaults.conf. And
>>> there are two configurations related to Kryo buffer size,
>>>
>>>- spark.kryoserializer.buffer.mb, which is the initial size, and
>>>- spark.kryoserializer.buffer.max.mb, which is the max buffer size.
>>>
>>> Make sure the 2nd one is larger (it seems that Kryo doesn’t check for
>>> it).
>>>
>>> Cheng
>>>
>>> On 3/25/15 7:31 PM, 李铖 wrote:
>>>
>>>   Here is the full track
>>>
>>>  15/03/25 17:48:34 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID
>>> 1, cloud1): com.esotericsoftware.kryo.KryoException: Buffer overflow.
>>> Available: 0, required: 39135
>>>  at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
>>>  at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
>>>  at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
>>>  at
>>> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
>>>  at
>>> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:18)
>>>  at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
>>>  at
>>> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:312)
>>>  at
>>> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
>>>  at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
>>>  at
>>> org.apache.spark.serializer.Kr

Re: Spark-sql query got exception.Help

2015-03-25 Thread 李铖
One more exception. How do I fix it? Anybody help me, please.


org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
location for shuffle 0
at
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:386)
at
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:383)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at
org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:382)
at
org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:178)
at
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
at
org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:120)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
at
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
at
org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)


2015-03-26 10:39 GMT+08:00 李铖 :

> Yes, it works after I appended the two properties in spark-defaults.conf.
>
> As I use Python programming on the Spark platform, the Python API does not
> have the SparkConf API.
>
> Thanks.
>
> 2015-03-25 21:07 GMT+08:00 Cheng Lian :
>
>>  Oh, just noticed that you were calling sc.setSystemProperty. Actually
>> you need to set this property in SparkConf or in spark-defaults.conf. And
>> there are two configurations related to Kryo buffer size,
>>
>>- spark.kryoserializer.buffer.mb, which is the initial size, and
>>- spark.kryoserializer.buffer.max.mb, which is the max buffer size.
>>
>> Make sure the 2nd one is larger (it seems that Kryo doesn’t check for it).
>>
>> Cheng
>>
>> On 3/25/15 7:31 PM, 李铖 wrote:
>>
>>   Here is the full track
>>
>>  15/03/25 17:48:34 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID
>> 1, cloud1): com.esotericsoftware.kryo.KryoException: Buffer overflow.
>> Available: 0, required: 39135
>>  at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
>>  at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
>>  at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
>>  at
>> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
>>  at
>> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:18)
>>  at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
>>  at
>> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:312)
>>  at
>> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
>>  at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
>>  at
>> org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:165)
>>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:206)
>>  at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>  at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>  at java.lang.Thread.run(Thread.java:745)
>>
>> 2015-03-25 19:05 G

The dreaded broadcast error Error: Failed to get broadcast_0_piece0 of broadcast_0

2015-03-25 Thread rkgurram
val transArray: RDD[Flow]     <-- Flow is my custom class; it has the following methods:
  getFlowStart()              <-- returns a Double (start time)
  getFlowEnd()                <-- returns a Double (end time)
  addTransaction(tranID: Int) <-- adds the input tranID to an Array in the Flow object
                                  (keeping track of the transactions this flow belongs to)

My problem:
1) I need to go to each Flow object in the RDD and pick out its "FlowStart"
   and "FlowEnd" times.
2) Each "FlowStart"/"FlowEnd" pair I treat as one transaction.
3) Any other flow in the same RDD that has (startTime >= currFlow.startTime
   and endTime <= currFlow.endTime) I treat as belonging to the same
   transaction. (Basically, take the RDD and group all Flows into
   "Transactions"; each transaction might have one or more flows, and note
   that a flow might belong to more than one transaction.)

I am trying to do the following but it fails...

/* Go thro the RDD and spit out each "startTime" "endTime" pair 
 * and collect all elements in a array. This shoud give me an array of
 * transactions
 */
val transArray = flowRDD.map {
  case (x) =>
(x.getFlowStart, x.getFlowEnd)
}.collect()

logger.info("tranArray size:" + transArray.size) <-- This prints out
correctly (97144)

/*
 * I loop thro the above array and for each transaction, I go thro the
RDD[Flow]
 * and try to find out which flow objects meet my criteria and call the
"addTransaction"
 * with the transactionID 
 */
var tranID = 0
for((start,end) <- transArray){
  flowRDD.foreach{
case(x) =>
  if ((start >= x.getFlowStart) && (end <= x.getFlowEnd)){
x.addTransaction(tranID)
  }
  }
  tranID += 1
}

I am expecting that at the end of these two loops I get an RDD of flow
objects with each flow object having an Array[Int] that contains all the
transactionIDs that it belongs to

The app runs for a while, like 30 mins, and then fails with the broadcast
error, which I believe means Spark is giving up. :(

I dread this error because I have hit it on two other occasions, not related
to this problem, and have found no way around it.

The exact error is

15/03/26 08:01:04 INFO ApplicationDriver$: No. flows detected:[97144]
15/03/26 08:01:05 INFO ApplicationDriver$: tranArray size:97144
15/03/26 08:17:26 ERROR Executor: Exception in task 0.0 in stage 1202.0 (TID
2404)
java.io.IOException: org.apache.spark.SparkException: Failed to get
broadcast_0_piece0 of broadcast_0
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1155)
at
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
at
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:140)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:212)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to get broadcast_0_piece0
of broadcast_0
at
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
at
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadc

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Dean Wampler
You could stop the running the processes and run the same processes using
the new version, starting with the master and then the slaves. You would
have to snoop around a bit to get the command-line arguments right, but
it's doable. Use `ps -efw` to find the command-lines used. Be sure to rerun
them as the same user. Or look at what the EC2 scripts do.
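For example, a minimal sketch of that inspection step (assuming the standalone daemons started by the EC2 scripts):

# show the full java command lines of the running master and worker daemons
ps -efw | grep org.apache.spark.deploy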

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
 (O'Reilly)
Typesafe 
@deanwampler 
http://polyglotprogramming.com

On Wed, Mar 25, 2015 at 4:54 PM, roni  wrote:

> Is there any way that I can install the new one and remove the previous
> version?
> I installed Spark 1.3 on my EC2 master and set the Spark home to the new
> one.
> But when I start the spark-shell I get:
>  java.lang.UnsatisfiedLinkError:
> org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative()V
> at
> org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative(Native
> Method)
>
> Is there no way to upgrade without creating a new cluster?
> Thanks
> Roni
>
>
>
> On Wed, Mar 25, 2015 at 1:18 PM, Dean Wampler 
> wrote:
>
>> Yes, that's the problem. The RDD class exists in both binary jar files,
>> but the signatures probably don't match. The bottom line, as always for
>> tools like this, is that you can't mix versions.
>>
>> Dean Wampler, Ph.D.
>> Author: Programming Scala, 2nd Edition
>>  (O'Reilly)
>> Typesafe 
>> @deanwampler 
>> http://polyglotprogramming.com
>>
>> On Wed, Mar 25, 2015 at 3:13 PM, roni  wrote:
>>
>>> My cluster is still on spark 1.2 and in SBT I am using 1.3.
>>> So probably it is compiling with 1.3 but running with 1.2 ?
>>>
>>> On Wed, Mar 25, 2015 at 12:34 PM, Dean Wampler 
>>> wrote:
>>>
 Weird. Are you running using SBT console? It should have the spark-core
 jar on the classpath. Similarly, spark-shell or spark-submit should work,
 but be sure you're using the same version of Spark when running as when
 compiling. Also, you might need to add spark-sql to your SBT dependencies,
 but that shouldn't be this issue.

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
  (O'Reilly)
 Typesafe 
 @deanwampler 
 http://polyglotprogramming.com

 On Wed, Mar 25, 2015 at 12:09 PM, roni  wrote:

> Thanks Dean and Nick.
> So, I removed the ADAM and H2o from my SBT as I was not using them.
> I got the code to compile - only for it to fail while running with:
> SparkContext: Created broadcast 1 from textFile at
> kmerIntersetion.scala:21
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/spark/rdd/RDD$
> at preDefKmerIntersection$.main(kmerIntersetion.scala:26)
>
> This line is where I do a "JOIN" operation.
> val hgPair = hgfasta.map(_.split (",")).map(a=> (a(0),
> a(1).trim().toInt))
>  val filtered = hgPair.filter(kv => kv._2 == 1)
>  val bedPair = bedFile.map(_.split (",")).map(a=>
> (a(0), a(1).trim().toInt))
> * val joinRDD = bedPair.join(filtered)   *
> Any idea whats going on?
> I have data on the EC2 so I am avoiding creating the new cluster , but
> just upgrading and changing the code to use 1.3 and Spark SQL
> Thanks
> Roni
>
>
>
> On Wed, Mar 25, 2015 at 9:50 AM, Dean Wampler 
> wrote:
>
>> For the Spark SQL parts, 1.3 breaks backwards compatibility, because
>> before 1.3, Spark SQL was considered experimental where API changes were
>> allowed.
>>
>> So, H2O and ADA compatible with 1.2.X might not work with 1.3.
>>
>> dean
>>
>> Dean Wampler, Ph.D.
>> Author: Programming Scala, 2nd Edition
>>  (O'Reilly)
>> Typesafe 
>> @deanwampler 
>> http://polyglotprogramming.com
>>
>> On Wed, Mar 25, 2015 at 9:39 AM, roni  wrote:
>>
>>> Even if H2O and ADAM are dependent on 1.2.1, they should be backward
>>> compatible, right?
>>> So using 1.3 should not break them.
>>> And the code is not using the classes from those libs.
>>> I tried sbt clean compile .. same error
>>> Thanks
>>> _R
>>>
>>> On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath <
>>> nick.pentre...@gmail.com> wrote:
>>>
 What version of Spark do the other dependencies rely on (Adam and
 H2O?) - that could be it

 Or try sbt clean compile

 —
 Sent from Mailbox 


 On Wed, Mar 25, 2015 at 5:58 PM, roni 
 wrote:
>

Re: Spark-sql query got exception.Help

2015-03-25 Thread 李铖
Yes, it works after I append the two properties in spark-defaults.conf.

As I use Python programming on the Spark platform, the Python API does not have a
SparkConf API.

Thanks.
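
For reference, a minimal Scala-side sketch of the two settings discussed below
(the 256/512 values are only illustrative assumptions):

  // Equivalent spark-defaults.conf entries:
  //   spark.kryoserializer.buffer.mb      256
  //   spark.kryoserializer.buffer.max.mb  512
  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryoserializer.buffer.mb", "256")      // initial buffer size
    .set("spark.kryoserializer.buffer.max.mb", "512")  // max size; keep this larger
  val sc = new SparkContext(conf)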

2015-03-25 21:07 GMT+08:00 Cheng Lian :

>  Oh, just noticed that you were calling sc.setSystemProperty. Actually
> you need to set this property in SparkConf or in spark-defaults.conf. And
> there are two configurations related to Kryo buffer size,
>
>- spark.kryoserializer.buffer.mb, which is the initial size, and
>- spark.kryoserializer.buffer.max.mb, which is the max buffer size.
>
> Make sure the 2nd one is larger (it seems that Kryo doesn’t check for it).
>
> Cheng
>
> On 3/25/15 7:31 PM, 李铖 wrote:
>
>   Here is the full track
>
>  15/03/25 17:48:34 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID
> 1, cloud1): com.esotericsoftware.kryo.KryoException: Buffer overflow.
> Available: 0, required: 39135
>  at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
>  at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
>  at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
>  at
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
>  at
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:18)
>  at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
>  at
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:312)
>  at
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
>  at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
>  at
> org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:165)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:206)
>  at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  at java.lang.Thread.run(Thread.java:745)
>
> 2015-03-25 19:05 GMT+08:00 Cheng Lian :
>
>>  Could you please provide the full stack trace?
>>
>>
>> On 3/25/15 6:26 PM, 李铖 wrote:
>>
>>  It is ok when I do query data from a small hdfs file.
>> But if the hdfs file is 152m, I got this exception.
>> I tried this code:
>> sc.setSystemProperty("spark.kryoserializer.buffer.mb", "256")
>> but the error persists.
>>
>>  ```
>> com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0,
>> required: 39135
>>  at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
>>  at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:220)
>>  at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:206)
>>  at
>> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:29)
>>  at
>>
>>
>> ```
>>
>>
>>
>​
>


Re: OOM for HiveFromSpark example

2015-03-25 Thread Zhan Zhang
You can do it in $SPARK_HOME/conf/spark-defaults.conf:

spark.driver.extraJavaOptions -XX:MaxPermSize=512m

Thanks.

Zhan Zhang


On Mar 25, 2015, at 7:25 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) 
mailto:deepuj...@gmail.com>> wrote:

Where and how do I pass this or other JVM arguments?
-XX:MaxPermSize=512m

On Wed, Mar 25, 2015 at 11:36 PM, Zhan Zhang 
mailto:zzh...@hortonworks.com>> wrote:
I solved this by increasing the PermGen memory size in the driver.

-XX:MaxPermSize=512m

Thanks.

Zhan Zhang

On Mar 25, 2015, at 10:54 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) 
mailto:deepuj...@gmail.com>> wrote:

I am facing same issue, posted a new thread. Please respond.

On Wed, Jan 14, 2015 at 4:38 AM, Zhan Zhang 
mailto:zzh...@hortonworks.com>> wrote:
Hi Folks,

I am trying to run hive context in yarn-cluster mode, but met some error. Does 
anybody know what causes the issue?

I use following cmd to build the distribution:

 ./make-distribution.sh -Phive -Phive-thriftserver  -Pyarn  -Phadoop-2.4

15/01/13 17:59:42 INFO cluster.YarnClusterScheduler: 
YarnClusterScheduler.postStartHook done
15/01/13 17:59:42 INFO storage.BlockManagerMasterActor: Registering block 
manager 
cn122-10.l42scl.hortonworks.com:56157
 with 1589.8 MB RAM, BlockManagerId(2, 
cn122-10.l42scl.hortonworks.com, 56157)
15/01/13 17:59:43 INFO parse.ParseDriver: Parsing command: CREATE TABLE IF NOT 
EXISTS src (key INT, value STRING)
15/01/13 17:59:43 INFO parse.ParseDriver: Parse Completed
15/01/13 17:59:44 INFO metastore.HiveMetaStore: 0: Opening raw store with 
implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/01/13 17:59:44 INFO metastore.ObjectStore: ObjectStore, initialize called
15/01/13 17:59:44 INFO DataNucleus.Persistence: Property 
datanucleus.cache.level2 unknown - will be ignored
15/01/13 17:59:44 INFO DataNucleus.Persistence: Property 
hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/01/13 17:59:44 WARN DataNucleus.Connection: BoneCP specified but not present 
in CLASSPATH (or one of dependencies)
15/01/13 17:59:44 WARN DataNucleus.Connection: BoneCP specified but not present 
in CLASSPATH (or one of dependencies)
15/01/13 17:59:52 INFO metastore.ObjectStore: Setting MetaStore object pin 
classes with 
hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
15/01/13 17:59:52 INFO metastore.MetaStoreDirectSql: MySQL check failed, 
assuming we are not on mysql: Lexical error at line 1, column 5.  Encountered: 
"@" (64), after : "".
15/01/13 17:59:53 INFO DataNucleus.Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
"embedded-only" so does not have its own datastore table.
15/01/13 17:59:53 INFO DataNucleus.Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so 
does not have its own datastore table.
15/01/13 17:59:59 INFO DataNucleus.Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
"embedded-only" so does not have its own datastore table.
15/01/13 17:59:59 INFO DataNucleus.Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so 
does not have its own datastore table.
15/01/13 18:00:00 INFO metastore.ObjectStore: Initialized ObjectStore
15/01/13 18:00:00 WARN metastore.ObjectStore: Version information not found in 
metastore. hive.metastore.schema.verification is not enabled so recording the 
schema version 0.13.1aa
15/01/13 18:00:01 INFO metastore.HiveMetaStore: Added admin role in metastore
15/01/13 18:00:01 INFO metastore.HiveMetaStore: Added public role in metastore
15/01/13 18:00:01 INFO metastore.HiveMetaStore: No user is added in admin role, 
since config is empty
15/01/13 18:00:01 INFO session.SessionState: No Tez session required at this 
point. hive.execution.engine=mr.
15/01/13 18:00:02 INFO log.PerfLogger: 
15/01/13 18:00:02 INFO log.PerfLogger: 
15/01/13 18:00:02 INFO ql.Driver: Concurrency mode is disabled, not creating a 
lock manager
15/01/13 18:00:02 INFO log.PerfLogger: 
15/01/13 18:00:03 INFO log.PerfLogger: 
15/01/13 18:00:03 INFO parse.ParseDriver: Parsing command: CREATE TABLE IF NOT 
EXISTS src (key INT, value STRING)
15/01/13 18:00:03 INFO parse.ParseDriver: Parse Completed
15/01/13 18:00:03 INFO log.PerfLogger: 
15/01/13 18:00:03 INFO log.PerfLogger: 
15/01/13 18:00:03 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
15/01/13 18:00:03 INFO parse.SemanticAnalyzer: Creating table src position=27
15/01/13 18:00:03 INFO metastore.HiveMetaStore: 0: get_table : db=default 
tbl=src
15/01/13 18:00:03 INFO HiveMetaStore.audit: ugi=zzhang  ip=unknown-ip-addr  
cmd=get_table : db=default tbl=src
15/01/13 18:00:03 INFO metastore.HiveMetaStore: 0: get_database: default
15/01/13 18:00:03 INFO HiveMetaStore.audit: ugi=zzhang  ip=unknown-ip-addr  
cmd=get_database: default
15/01/13 18:00:03 INFO ql.Driver: Semantic Analy

[SparkSQL] How to calculate stddev on a DataFrame?

2015-03-25 Thread Haopu Wang
Hi,

 

I have a DataFrame object and I want to do types of aggregations like
count, sum, variance, stddev, etc.

 

DataFrame has DSL to do simple aggregations like count and sum.

 

How about variance and stddev?

 

Thank you for any suggestions!
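
(For reference, a workaround sketch until dedicated functions exist in the DSL --
assuming Spark 1.3, a DataFrame named df and a numeric double column "x", both
hypothetical names. Variance and stddev can be derived from sum, sum of squares
and count, which the DSL already provides.)

  import org.apache.spark.sql.functions._

  // One pass over the data: sum, sum of squares and count of column "x".
  val row = df.agg(sum(df("x")), sum(df("x") * df("x")), count(df("x"))).first()
  val (s, ss, n) = (row.getDouble(0), row.getDouble(1), row.getLong(2))
  val mean     = s / n
  val variance = ss / n - mean * mean      // population variance
  val stddev   = math.sqrt(variance)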

 



Re: OOM for HiveFromSpark example

2015-03-25 Thread ๏̯͡๏
Where and how do I pass this or other JVM arguments?
-XX:MaxPermSize=512m

On Wed, Mar 25, 2015 at 11:36 PM, Zhan Zhang  wrote:

>  I solved this by increasing the PermGen memory size in the driver.
>
>  -XX:MaxPermSize=512m
>
>  Thanks.
>
>  Zhan Zhang
>
>  On Mar 25, 2015, at 10:54 AM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:
>
>  I am facing same issue, posted a new thread. Please respond.
>
> On Wed, Jan 14, 2015 at 4:38 AM, Zhan Zhang 
> wrote:
>
>> Hi Folks,
>>
>> I am trying to run hive context in yarn-cluster mode, but met some error.
>> Does anybody know what causes the issue?
>>
>> I use following cmd to build the distribution:
>>
>>  ./make-distribution.sh -Phive -Phive-thriftserver  -Pyarn  -Phadoop-2.4
>>
>> 15/01/13 17:59:42 INFO cluster.YarnClusterScheduler:
>> YarnClusterScheduler.postStartHook done
>> 15/01/13 17:59:42 INFO storage.BlockManagerMasterActor: Registering block
>> manager cn122-10.l42scl.hortonworks.com:56157 with 1589.8 MB RAM,
>> BlockManagerId(2, cn122-10.l42scl.hortonworks.com, 56157)
>> 15/01/13 17:59:43 INFO parse.ParseDriver: Parsing command: CREATE TABLE
>> IF NOT EXISTS src (key INT, value STRING)
>> 15/01/13 17:59:43 INFO parse.ParseDriver: Parse Completed
>> 15/01/13 17:59:44 INFO metastore.HiveMetaStore: 0: Opening raw store with
>> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
>> 15/01/13 17:59:44 INFO metastore.ObjectStore: ObjectStore, initialize
>> called
>> 15/01/13 17:59:44 INFO DataNucleus.Persistence: Property
>> datanucleus.cache.level2 unknown - will be ignored
>> 15/01/13 17:59:44 INFO DataNucleus.Persistence: Property
>> hive.metastore.integral.jdo.pushdown unknown - will be ignored
>> 15/01/13 17:59:44 WARN DataNucleus.Connection: BoneCP specified but not
>> present in CLASSPATH (or one of dependencies)
>> 15/01/13 17:59:44 WARN DataNucleus.Connection: BoneCP specified but not
>> present in CLASSPATH (or one of dependencies)
>> 15/01/13 17:59:52 INFO metastore.ObjectStore: Setting MetaStore object
>> pin classes with
>> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
>> 15/01/13 17:59:52 INFO metastore.MetaStoreDirectSql: MySQL check failed,
>> assuming we are not on mysql: Lexical error at line 1, column 5.
>> Encountered: "@" (64), after : "".
>> 15/01/13 17:59:53 INFO DataNucleus.Datastore: The class
>> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
>> "embedded-only" so does not have its own datastore table.
>> 15/01/13 17:59:53 INFO DataNucleus.Datastore: The class
>> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
>> "embedded-only" so does not have its own datastore table.
>> 15/01/13 17:59:59 INFO DataNucleus.Datastore: The class
>> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
>> "embedded-only" so does not have its own datastore table.
>> 15/01/13 17:59:59 INFO DataNucleus.Datastore: The class
>> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
>> "embedded-only" so does not have its own datastore table.
>> 15/01/13 18:00:00 INFO metastore.ObjectStore: Initialized ObjectStore
>> 15/01/13 18:00:00 WARN metastore.ObjectStore: Version information not
>> found in metastore. hive.metastore.schema.verification is not enabled so
>> recording the schema version 0.13.1aa
>> 15/01/13 18:00:01 INFO metastore.HiveMetaStore: Added admin role in
>> metastore
>> 15/01/13 18:00:01 INFO metastore.HiveMetaStore: Added public role in
>> metastore
>> 15/01/13 18:00:01 INFO metastore.HiveMetaStore: No user is added in admin
>> role, since config is empty
>> 15/01/13 18:00:01 INFO session.SessionState: No Tez session required at
>> this point. hive.execution.engine=mr.
>> 15/01/13 18:00:02 INFO log.PerfLogger: > from=org.apache.hadoop.hive.ql.Driver>
>> 15/01/13 18:00:02 INFO log.PerfLogger: > from=org.apache.hadoop.hive.ql.Driver>
>> 15/01/13 18:00:02 INFO ql.Driver: Concurrency mode is disabled, not
>> creating a lock manager
>> 15/01/13 18:00:02 INFO log.PerfLogger: > from=org.apache.hadoop.hive.ql.Driver>
>> 15/01/13 18:00:03 INFO log.PerfLogger: > from=org.apache.hadoop.hive.ql.Driver>
>> 15/01/13 18:00:03 INFO parse.ParseDriver: Parsing command: CREATE TABLE
>> IF NOT EXISTS src (key INT, value STRING)
>> 15/01/13 18:00:03 INFO parse.ParseDriver: Parse Completed
>> 15/01/13 18:00:03 INFO log.PerfLogger: > start=1421190003030 end=1421190003031 duration=1
>> from=org.apache.hadoop.hive.ql.Driver>
>> 15/01/13 18:00:03 INFO log.PerfLogger: > from=org.apache.hadoop.hive.ql.Driver>
>> 15/01/13 18:00:03 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
>> 15/01/13 18:00:03 INFO parse.SemanticAnalyzer: Creating table src
>> position=27
>> 15/01/13 18:00:03 INFO metastore.HiveMetaStore: 0: get_table : db=default
>> tbl=src
>> 15/01/13 18:00:03 INFO HiveMetaStore.audit: ugi=zzhang
>> ip=unknown-ip-addr  cmd=get_table : db=default tbl=src
>> 15/01/13 18:00:03 INFO metastore.HiveMetaStore: 0: get_database: default
>> 15/01/13 18:00:03 INFO 

Re: Unable to Hive program from Spark Programming Guide (OutOfMemoryError)

2015-03-25 Thread ๏̯͡๏
Can someone please respond to this ?

On Wed, Mar 25, 2015 at 11:18 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:

> http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hive-tables
>
>
>
> I modified the Hive query but ran into the same error. (
> http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hive-tables)
>
>
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.sql("CREATE TABLE IF NOT EXISTS src_spark (key INT, value
> STRING)")
> sqlContext.sql("LOAD DATA LOCAL INPATH
> 'examples/src/main/resources/kv1.txt' INTO TABLE src")
>
> // Queries are expressed in HiveQL
> sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)
>
>
>
> Command
>
> ./bin/spark-submit -v --master yarn-cluster --jars
> /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar
> --num-executors 3 --driver-memory 8g --executor-memory 2g --executor-cores
> 1 --queue hdmi-express --class com.ebay.ep.poc.spark.reporting.SparkApp
> spark_reporting-1.0-SNAPSHOT.jar startDate=2015-02-16 endDate=2015-02-16
> input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro
> subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2
>
>
> Input
>
> -sh-4.1$ ls -l examples/src/main/resources/kv1.txt
> -rw-r--r-- 1 dvasthimal gid-dvasthimal 5812 Mar  5 17:31
> examples/src/main/resources/kv1.txt
> -sh-4.1$ head examples/src/main/resources/kv1.txt
> 238val_238
> 86val_86
> 311val_311
> 27val_27
> 165val_165
> 409val_409
> 255val_255
> 278val_278
> 98val_98
> 484val_484
> -sh-4.1$
>
> Log
>
> /apache/hadoop/bin/yarn logs -applicationId application_1426715280024_82757
>
> …
> …
> …
>
>
> 15/03/25 07:52:44 INFO metastore.HiveMetaStore: No user is added in admin
> role, since config is empty
> 15/03/25 07:52:44 INFO session.SessionState: No Tez session required at
> this point. hive.execution.engine=mr.
> 15/03/25 07:52:47 INFO parse.ParseDriver: Parsing command: CREATE TABLE IF
> NOT EXISTS src_spark (key INT, value STRING)
> 15/03/25 07:52:47 INFO parse.ParseDriver: Parse Completed
> 15/03/25 07:52:48 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/03/25 07:52:48 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/03/25 07:52:48 INFO ql.Driver: Concurrency mode is disabled, not
> creating a lock manager
> 15/03/25 07:52:48 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/03/25 07:52:48 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/03/25 07:52:48 INFO parse.ParseDriver: Parsing command: CREATE TABLE IF
> NOT EXISTS src_spark (key INT, value STRING)
> 15/03/25 07:52:48 INFO parse.ParseDriver: Parse Completed
> 15/03/25 07:52:48 INFO log.PerfLogger:  start=1427295168392 end=1427295168393 duration=1
> from=org.apache.hadoop.hive.ql.Driver>
> 15/03/25 07:52:48 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/03/25 07:52:48 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
> 15/03/25 07:52:48 INFO parse.SemanticAnalyzer: Creating table src_spark
> position=27
> 15/03/25 07:52:48 INFO metastore.HiveMetaStore: 0: get_table : db=default
> tbl=src_spark
> 15/03/25 07:52:48 INFO HiveMetaStore.audit: ugi=dvasthimal
> ip=unknown-ip-addr cmd=get_table : db=default tbl=src_spark
> 15/03/25 07:52:48 INFO metastore.HiveMetaStore: 0: get_database: default
> 15/03/25 07:52:48 INFO HiveMetaStore.audit: ugi=dvasthimal
> ip=unknown-ip-addr cmd=get_database: default
> 15/03/25 07:52:48 INFO ql.Driver: Semantic Analysis Completed
> 15/03/25 07:52:48 INFO log.PerfLogger:  start=1427295168393 end=1427295168595 duration=202
> from=org.apache.hadoop.hive.ql.Driver>
> 15/03/25 07:52:48 INFO ql.Driver: Returning Hive schema:
> Schema(fieldSchemas:null, properties:null)
> 15/03/25 07:52:48 INFO log.PerfLogger:  start=1427295168352 end=1427295168607 duration=255
> from=org.apache.hadoop.hive.ql.Driver>
> 15/03/25 07:52:48 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/03/25 07:52:48 INFO ql.Driver: Starting command: CREATE TABLE IF NOT
> EXISTS src_spark (key INT, value STRING)
> 15/03/25 07:52:48 INFO log.PerfLogger:  start=1427295168349 end=1427295168625 duration=276
> from=org.apache.hadoop.hive.ql.Driver>
> 15/03/25 07:52:48 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/03/25 07:52:48 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/03/25 07:52:51 INFO exec.DDLTask: Default to LazySimpleSerDe for table
> src_spark
> 15/03/25 07:52:52 INFO metastore.HiveMetaStore: 0: create_table:
> Table(tableName:src_spark, dbName:default, owner:dvasthimal, createTime:
> 1427295171, lastAccessTime:0, retention:0,
> sd:StorageDescriptor(cols:[FieldSchema(name:key, type:int, comment:null),
> FieldSchema(name:value, type:str

Re: Spark shell never leaves ACCEPTED state in YARN CDH5

2015-03-25 Thread Dean Chen
We had a similar problem. Turned out that the Spark driver was binding to
the external IP of the CLI node Spark shell was running on, causing
executors to fail to connect to the driver.

The solution was to override "export SPARK_LOCAL_IP=" in
spark-env.sh to the internal IP of the CLI node.


--
Dean Chen

On Wed, Mar 25, 2015 at 12:18 PM, Marcelo Vanzin 
wrote:

> The probably means there are not enough free resources in your cluster
> to run the AM for the Spark job. Check your RM's web ui to see the
> resources you have available.
>
> On Wed, Mar 25, 2015 at 12:08 PM, Khandeshi, Ami
>  wrote:
> > I am seeing the same behavior.  I have enough resources…..  How do I
> resolve
> > it?
> >
> >
> >
> > Thanks,
> >
> >
> >
> > Ami
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark shell never leaves ACCEPTED state in YARN CDH5

2015-03-25 Thread Tobias Pfeiffer
Hi,

On Thu, Mar 26, 2015 at 4:08 AM, Khandeshi, Ami <
ami.khande...@fmr.com.invalid> wrote:

> I am seeing the same behavior.  I have enough resources…..
>

CPU *and* memory are sufficient? No previous (unfinished) jobs eating them?

Tobias


Re: Serialization Problem in Spark Program

2015-03-25 Thread Imran Rashid
you also need to register *array*s of MyObject.  so change:

conf.registerKryoClasses(Array(classOf[MyObject]))

to

conf.registerKryoClasses(Array(classOf[MyObject], classOf[Array[MyObject]]))
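
(i.e., keeping the rest of the program below unchanged, the registration-related
lines become -- sketch:)

  conf.set("spark.kryo.registrationRequired", "true")
  conf.registerKryoClasses(Array(classOf[MyObject], classOf[Array[MyObject]]))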


On Wed, Mar 25, 2015 at 2:44 AM, donhoff_h <165612...@qq.com> wrote:

> Hi, experts
>
> I wrote a very simple spark program to test the KryoSerialization
> function. The codes are as following:
>
> object TestKryoSerialization {
>   def main(args: Array[String]) {
> val conf = new SparkConf()
> conf.set("spark.serializer",
> "org.apache.spark.serializer.KryoSerializer")
> conf.set("spark.kryo.registrationRequired","true")  //I use this
> statement to force checking registration.
> conf.registerKryoClasses(Array(classOf[MyObject]))
>
> val sc = new SparkContext(conf)
> val rdd =
> sc.textFile("hdfs://dhao.hrb:8020/user/spark/tstfiles/charset/A_utf8.txt")
> val objs = rdd.map(new MyObject(_,1)).collect()
> for (x <- objs ) {
>   x.printMyObject
> }
> }
>
> The class MyObject is also a very simple Class, which is only used to test
> the serialization function:
> class MyObject  {
>   var myStr : String = ""
>   var myInt : Int = 0
>   def this(inStr : String, inInt : Int) {
> this()
> this.myStr = inStr
> this.myInt = inInt
>   }
>   def printMyObject {
> println("MyString is : "+myStr+"\tMyInt is : "+myInt)
>   }
> }
>
> But when I ran the application, it reported the following error:
> java.lang.IllegalArgumentException: Class is not registered:
> dhao.test.Serialization.MyObject[]
> Note: To register this class use:
> kryo.register(dhao.test.Serialization.MyObject[].class);
> at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442)
> at
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:565)
> at
> org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:161)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> I don't understand what cause this problem. I have used the
> "conf.registerKryoClasses" to register my class. Could anyone help me ?
> Thanks
>
> By the way, the spark version is 1.3.0.
>
>


Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-25 Thread David Holiday
hi Irfan,

thanks for getting back to me - i'll try the accumulo list to be sure. what is 
the normal use case for spark though? I'm surprised that hooking it into 
something as common and popular as accumulo isn't more of an every-day task.

DAVID HOLIDAY
Software Engineer
760 607 3300 | Office
312 758 8385 | Mobile
dav...@annaisystems.com


www.AnnaiSystems.com

On Mar 25, 2015, at 5:27 PM, Irfan Ahmad 
mailto:ir...@cloudphysics.com>> wrote:

Hmmm this seems very accumulo-specific, doesn't it? Not sure how to help 
with that.


Irfan Ahmad
CTO | Co-Founder | CloudPhysics
Best of VMworld Finalist
Best Cloud Management Award
NetworkWorld 10 Startups to Watch
EMA Most Notable Vendor

On Tue, Mar 24, 2015 at 4:09 PM, David Holiday 
mailto:dav...@annaisystems.com>> wrote:
hi all,

got a vagrant image with spark notebook, spark, accumulo, and hadoop all 
running. from notebook I can manually create a scanner and pull test data from 
a table I created using one of the accumulo examples:

val instanceNameS = "accumulo"
val zooServersS = "localhost:2181"
val instance: Instance = new ZooKeeperInstance(instanceNameS, zooServersS)
val connector: Connector = instance.getConnector( "root", new 
PasswordToken("password"))
val auths = new Authorizations("exampleVis")
val scanner = connector.createScanner("batchtest1", auths)

scanner.setRange(new Range("row_00", "row_10"))

for(entry: Entry[Key, Value] <- scanner) {
  println(entry.getKey + " is " + entry.getValue)
}

will give the first ten rows of table data. when I try to create the RDD thusly:

val rdd2 =
  sparkContext.newAPIHadoopRDD (
new Configuration(),
classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat],
classOf[org.apache.accumulo.core.data.Key],
classOf[org.apache.accumulo.core.data.Value]
  )

I get an RDD returned to me that I can't do much with due to the following 
error:

java.io.IOException: Input info has not been set. at 
org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630)
 at 
org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:343)
 at 
org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:538)
 at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98) at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222) at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220) at 
scala.Option.getOrElse(Option.scala:120) at 
org.apache.spark.rdd.RDD.partitions(RDD.scala:220) at 
org.apache.spark.SparkContext.runJob(SparkContext.scala:1367) at 
org.apache.spark.rdd.RDD.count(RDD.scala:927)

which totally makes sense in light of the fact that I haven't specified any 
parameters as to which table to connect with, what the auths are, etc.

so my question is: what do I need to do from here to get those first ten rows 
of table data into my RDD?
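
(A hypothetical configuration sketch, assuming Accumulo 1.6: the setter names come from
the Accumulo mapreduce API, but exactly which class declares each static setter varies by
Accumulo version, so treat the class names below as assumptions. The idea is simply that
the connector, instance, table and auth info has to be serialized into the Hadoop
configuration before newAPIHadoopRDD can compute splits.)

import org.apache.accumulo.core.client.ClientConfiguration
import org.apache.accumulo.core.client.mapreduce.{AbstractInputFormat, AccumuloInputFormat, InputFormatBase}
import org.apache.accumulo.core.client.security.tokens.PasswordToken
import org.apache.accumulo.core.security.Authorizations
import org.apache.hadoop.mapreduce.Job

val job = Job.getInstance()
// same connection details as the scanner example above
AbstractInputFormat.setConnectorInfo(job, "root", new PasswordToken("password"))
AbstractInputFormat.setZooKeeperInstance(job,
  ClientConfiguration.loadDefault().withInstance(instanceNameS).withZkHosts(zooServersS))
AbstractInputFormat.setScanAuthorizations(job, new Authorizations("exampleVis"))
InputFormatBase.setInputTableName(job, "batchtest1")

val rdd2 = sparkContext.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[AccumuloInputFormat],
  classOf[org.apache.accumulo.core.data.Key],
  classOf[org.apache.accumulo.core.data.Value])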



DAVID HOLIDAY
Software Engineer
760 607 3300 | Office
312 758 8385 | Mobile
dav...@annaisystems.com



www.AnnaiSystems.com

On Mar 19, 2015, at 11:25 AM, David Holiday 
mailto:dav...@annaisystems.com>> wrote:

kk - I'll put something together and get back to you with more :-)

DAVID HOLIDAY
Software Engineer
760 607 3300 | Office
312 758 8385 | Mobile
dav...@annaisystems.com



www.AnnaiSystems.com

On Mar 19, 2015, at 10:59 AM, Irfan Ahmad 
mailto:ir...@cloudphysics.com>> wrote:

Once you setup spark-notebook, it'll handle the submits for interactive work. 
Non-interactive is not handled by it. For that spark-kernel could be used.

Give it a shot ... it only takes 5 minutes to get it running in local-mode.


Irfan Ahmad
CTO | Co-Founder | CloudPhysics
Best of VMworld Finalist
Best Cloud Management Award
NetworkWorld 10 Startups to Watch
EMA Most Notable Vendor

On Thu, Mar 19, 2015 at 9:51 AM, David Holiday 
mailto:dav...@annaisystems.com>> wrote:
hi all - thx for the alacritous replies! so regarding how to get things from 
notebook to spark and back, am I correct that spark-submit is the way to go?

DAVID HOLIDAY
Software Engineer
760 607 3300 | Office
312 758 8385 | Mobile
dav...@annaisystems.com



www.AnnaiSystems.com

On Mar 19, 2015, at 1:14 AM, Paolo Platter 
mailto:paolo.plat...@agilelab.it>> wrote:

Yes, I would suggest spark-notebook too.
It's very simple to setup and it's growing pretty fast.

Paolo

Sent from my Windows Phone

From: Irfan Ahmad
Sent: ‎19/‎03/‎2015 04:05
To: davidh
Cc: user@spark.a

Re: LogisticRegressionWithLBFGS shows ERRORs

2015-03-25 Thread DB Tsai
We fixed a couple of issues in the breeze LBFGS implementation. Can you try
Spark 1.3 and see if they still exist? Thanks.

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com
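
(For anyone wanting to reproduce the reported setup on 1.3, a minimal sketch --
it assumes an existing SparkContext named sc; the 0.3/0.5/-0.2 relationship is
the one quoted further down:)

  import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint

  // Random features x, y, z, i with a label defined by 0.3*x + 0.5*y - 0.2*z > 0.4.
  val data = sc.parallelize(1 to 1000).map { _ =>
    val Array(x, y, z, i) = Array.fill(4)(scala.util.Random.nextDouble())
    val label = if (0.3 * x + 0.5 * y - 0.2 * z > 0.4) 1.0 else 0.0
    LabeledPoint(label, Vectors.dense(x, y, z, i))
  }.cache()

  val model = new LogisticRegressionWithLBFGS().run(data)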


On Mon, Mar 16, 2015 at 12:48 PM, Chang-Jia Wang  wrote:
> I just used random numbers.
>
> (My ML lib was spark-mllib_2.10-1.2.1)
>
> Please see the attached log.  In the middle of the log, I dumped the data
> set before feeding into LogisticRegressionWithLBFGS.  The first column
> false/true was the label (attribute “a”), and columns 2-5 (attributes “x”,
> “y”, “z”, and “i”) were the features.  The 6th column was just row ID and
> was not used.
>
> The relationship was arbitrarily chosen: a = (0.3 * x + 0.5 * y - 0.2 * z > 0.4)
>
> After that you can find LBFGS was doing its job and then pumped out the
> error messages.
>
> The model showed coefficients:
>
> 396.57624765427323, x
> 662.7969020937115, y
> -259.0975519038385, z
> 12.352037503257826, i
> -538.8516249699426, @a
>
> The last one was the intercept.  As you can see, the model seemed close
> enough.
>
> After that I fed the same data back to the model to see how the predictions
> worked.   (here attribute “a” was the prediction and “aa” was the original
> label)  I only displayed 20 rows.
>
> The error rate showed 2 errors out of 1000.
>
> count(INTEGER), errorRate(DOUBLE), countDiff(INTEGER)
> key=[], rows=1
> 1000, 0.002000949949026, 2
>
> So, the algorithm worked, just spitting out the errors was kind of annoying.
> If this does not affect the result, maybe it should be a warning or info.
>
> C.J.
>
>
>
>
>
>
>
> On Mar 15, 2015, at 12:42 AM, DB Tsai  wrote:
>
> In LBFGS version of logistic regression, the data is properly
> standardized, so this should not happen. Can you provide a copy of
> your dataset to us so we can test it? If the dataset can not be
> public, can you have just send me a copy so I can dig into this? I'm
> the author of LORWithLBFGS. Thanks.
>
> Sincerely,
>
> DB Tsai
> ---
> Blog: https://www.dbtsai.com
>
>
> On Fri, Mar 13, 2015 at 2:41 PM, cjwang  wrote:
>
> I am running LogisticRegressionWithLBFGS.  I got these lines on my console:
>
> 2015-03-12 17:38:03,897 ERROR breeze.optimize.StrongWolfeLineSearch |
> Encountered bad values in function evaluation. Decreasing step size to 0.5
> 2015-03-12 17:38:03,967 ERROR breeze.optimize.StrongWolfeLineSearch |
> Encountered bad values in function evaluation. Decreasing step size to 0.25
> 2015-03-12 17:38:04,036 ERROR breeze.optimize.StrongWolfeLineSearch |
> Encountered bad values in function evaluation. Decreasing step size to 0.125
> 2015-03-12 17:38:04,105 ERROR breeze.optimize.StrongWolfeLineSearch |
> Encountered bad values in function evaluation. Decreasing step size to
> 0.0625
> 2015-03-12 17:38:04,176 ERROR breeze.optimize.StrongWolfeLineSearch |
> Encountered bad values in function evaluation. Decreasing step size to
> 0.03125
> 2015-03-12 17:38:04,247 ERROR breeze.optimize.StrongWolfeLineSearch |
> Encountered bad values in function evaluation. Decreasing step size to
> 0.015625
> 2015-03-12 17:38:04,317 ERROR breeze.optimize.StrongWolfeLineSearch |
> Encountered bad values in function evaluation. Decreasing step size to
> 0.0078125
> 2015-03-12 17:38:04,461 ERROR breeze.optimize.StrongWolfeLineSearch |
> Encountered bad values in function evaluation. Decreasing step size to
> 0.005859375
> 2015-03-12 17:38:04,605 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03-12 17:38:04,672 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03-12 17:38:04,747 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03-12 17:38:04,818 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03-12 17:38:04,890 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03-12 17:38:04,962 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03-12 17:38:05,038 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03-12 17:38:05,107 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03-12 17:38:05,186 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03-12 17:38:05,256 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03-12 17:38:05,257 ERROR breeze.optimize.LBFGS | Failure! Resetting
> history: breeze.optimize.FirstOrderException: Line search zoom failed
>
>
> What causes them and how do I fix them?
>
> I checked my data and there seemed nothing out of the ordinary.  The
> resulting prediction model seemed acceptable to me.  So, are these ERRORs
> actu

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-25 Thread Irfan Ahmad
Hmmm this seems very accumulo-specific, doesn't it? Not sure how to
help with that.


*Irfan Ahmad*
CTO | Co-Founder | *CloudPhysics* 
Best of VMworld Finalist
Best Cloud Management Award
NetworkWorld 10 Startups to Watch
EMA Most Notable Vendor

On Tue, Mar 24, 2015 at 4:09 PM, David Holiday 
wrote:

>  hi all,
>
>  got a vagrant image with spark notebook, spark, accumulo, and hadoop all
> running. from notebook I can manually create a scanner and pull test data
> from a table I created using one of the accumulo examples:
>
> val instanceNameS = "accumulo"
> val zooServersS = "localhost:2181"
> val instance: Instance = new ZooKeeperInstance(instanceNameS, zooServersS)
> val connector: Connector = instance.getConnector("root", new PasswordToken("password"))
> val auths = new Authorizations("exampleVis")
> val scanner = connector.createScanner("batchtest1", auths)
>
> scanner.setRange(new Range("row_00", "row_10"))
>
> for(entry: Entry[Key, Value] <- scanner) {
>   println(entry.getKey + " is " + entry.getValue)
> }
>
> will give the first ten rows of table data. when I try to create the RDD
> thusly:
>
> val rdd2 =
>   sparkContext.newAPIHadoopRDD (
> new Configuration(),
> classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat],
> classOf[org.apache.accumulo.core.data.Key],
> classOf[org.apache.accumulo.core.data.Value]
>   )
>
> I get an RDD returned to me that I can't do much with due to the following
> error:
>
> java.io.IOException: Input info has not been set. at
> org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630)
> at
> org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:343)
> at
> org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:538)
> at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222) at
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220) at
> scala.Option.getOrElse(Option.scala:120) at
> org.apache.spark.rdd.RDD.partitions(RDD.scala:220) at
> org.apache.spark.SparkContext.runJob(SparkContext.scala:1367) at
> org.apache.spark.rdd.RDD.count(RDD.scala:927)
>
> which totally makes sense in light of the fact that I haven't specified
> any parameters as to which table to connect with, what the auths are, etc.
>
> so my question is: what do I need to do from here to get those first ten
> rows of table data into my RDD?
>
>
>
>  DAVID HOLIDAY
>  Software Engineer
>  760 607 3300 | Office
>  312 758 8385 | Mobile
>  dav...@annaisystems.com 
>
>
>
> www.AnnaiSystems.com
>
>  On Mar 19, 2015, at 11:25 AM, David Holiday 
> wrote:
>
>  kk - I'll put something together and get back to you with more :-)
>
> DAVID HOLIDAY
>  Software Engineer
>  760 607 3300 | Office
>  312 758 8385 | Mobile
>  dav...@annaisystems.com 
>
>
> 
> www.AnnaiSystems.com 
>
>  On Mar 19, 2015, at 10:59 AM, Irfan Ahmad  wrote:
>
>  Once you setup spark-notebook, it'll handle the submits for interactive
> work. Non-interactive is not handled by it. For that spark-kernel could be
> used.
>
>  Give it a shot ... it only takes 5 minutes to get it running in
> local-mode.
>
>
>  *Irfan Ahmad*
> CTO | Co-Founder | *CloudPhysics* 
> Best of VMworld Finalist
>  Best Cloud Management Award
>  NetworkWorld 10 Startups to Watch
> EMA Most Notable Vendor
>
> On Thu, Mar 19, 2015 at 9:51 AM, David Holiday 
> wrote:
>
>> hi all - thx for the alacritous replies! so regarding how to get things
>> from notebook to spark and back, am I correct that spark-submit is the way
>> to go?
>>
>> DAVID HOLIDAY
>>  Software Engineer
>>  760 607 3300 | Office
>>  312 758 8385 | Mobile
>>  dav...@annaisystems.com 
>>
>>
>> 
>> www.AnnaiSystems.com 
>>
>>  On Mar 19, 2015, at 1:14 AM, Paolo Platter 
>> wrote:
>>
>>   Yes, I would suggest spark-notebook too.
>> It's very simple to setup and it's growing pretty fast.
>>
>> Paolo
>>
>> Sent from my Windows Phone
>>  --
>> From: Irfan Ahmad
>> Sent: ‎19/‎03/‎2015 04:05
>> To: davidh
>> Cc: user@spark.apache.org
>> Subject: Re: iPython Notebook + Spark + Accumulo -- best practice?
>>
>>  I forgot to mention that there is also Zeppelin and jove-notebook but I
>> haven't got any experience with those yet.
>>
>>
>>  *Irfan Ahmad*
>> CTO | Co-Founder | *CloudPhysics* 
>> Best of VMworld Finalist
>>  Best Cloud Management Award
>>  NetworkWorld 10 Startups to Watch
>> EMA Most Notable Vendor
>>
>> On Wed, Mar 18, 2015 at 8:01 PM, Irfan Ahmad 
>> wrote:
>>
>>> Hi David,
>>>
>>>  W00t indeed and great questions. On the notebook front, there are two
>>> options depending on what you are looking for. You can either go with
>>> iPython 3

Re: How to deploy binary dependencies to workers?

2015-03-25 Thread DB Tsai
Are you deploying the windows dll to linux machine?

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com


On Wed, Mar 25, 2015 at 3:57 AM, Xi Shen  wrote:
> I think you meant to use the "--files" option to deploy the DLLs. I gave it a try, but
> it did not work.
>
> From the Spark UI, Environment tab, I can see
>
> spark.yarn.dist.files
>
> file:/c:/openblas/libgcc_s_seh-1.dll,file:/c:/openblas/libblas3.dll,file:/c:/openblas/libgfortran-3.dll,file:/c:/openblas/liblapack3.dll,file:/c:/openblas/libquadmath-0.dll
>
> I think my DLLs are all deployed. But I still got the warning message that the
> native BLAS library cannot be loaded.
>
> Any idea?
>
>
> Thanks,
> David
>
>
> On Wed, Mar 25, 2015 at 5:40 AM DB Tsai  wrote:
>>
>> I would recommend to upload those jars to HDFS, and use add jars
>> option in spark-submit with URI from HDFS instead of URI from local
>> filesystem. Thus, it can avoid the problem of fetching jars from
>> driver which can be a bottleneck.
>>
>> Sincerely,
>>
>> DB Tsai
>> ---
>> Blog: https://www.dbtsai.com
>>
>>
>> On Tue, Mar 24, 2015 at 4:13 AM, Xi Shen  wrote:
>> > Hi,
>> >
>> > I am doing ML using Spark mllib. However, I do not have full control to
>> > the
>> > cluster. I am using Microsoft Azure HDInsight
>> >
>> > I want to deploy the BLAS or whatever required dependencies to
>> > accelerate
>> > the computation. But I don't know how to deploy those DLLs when I submit
>> > my
>> > JAR to the cluster.
>> >
>> > I know how to pack those DLLs into a jar. The real challenge is how to
>> > let
>> > the system find them...
>> >
>> >
>> > Thanks,
>> > David
>> >

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: filter expression in API document for DataFrame

2015-03-25 Thread Michael Armbrust
Yeah sorry, this is already fixed but we need to republish the docs. I'll
add that both of the following do work:

people.filter("age > 30")
people.filter(people("age") > 30)
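
For the column equality operator mentioned further down, a quick illustrative
sketch (using the same people DataFrame as above):

people.filter(people("age") === 30)   // column-to-value equality uses ===
people.filter(people("age") !== 30)   // and !== for inequality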



On Tue, Mar 24, 2015 at 7:11 PM, SK  wrote:

>
>
> The following statement appears in the Scala API example at
>
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame
>
> people.filter("age" > 30).
>
> I tried this example and it gave a compilation error. I think this needs to
> be changed to people.filter(people("age") > 30)
>
> Also, it would be good to add some examples for the new equality operator
> for columns (e.g. (people("age") === 30) ). The programming guide does not
> have an example for this in the DataFrame Operations section and it was not
> very obvious that we need to be using a different equality operator for
> columns.
>
>
> thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/filter-expression-in-API-document-for-DataFrame-tp22213.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: trouble with jdbc df in python

2015-03-25 Thread Michael Armbrust
Thanks for following up.  I'll fix the docs.

On Wed, Mar 25, 2015 at 4:04 PM, elliott cordo 
wrote:

> Thanks!.. the below worked:
>
> db = sqlCtx.load(source="jdbc",
> url="jdbc:postgresql://localhost/x?user=x&password=x",dbtable="mstr.d_customer")
>
> Note that
> https://spark.apache.org/docs/latest/sql-programming-guide.html#dataframe-operations
> needs to be updated:
>
>
> On Wed, Mar 25, 2015 at 6:12 PM, Michael Armbrust 
> wrote:
>
>> Try:
>>
>> db = sqlContext.load(source="jdbc", url="jdbc:postgresql://localhost/xx",
>> dbtables="mstr.d_customer")
>>
>>
>> On Wed, Mar 25, 2015 at 2:19 PM, elliott cordo 
>> wrote:
>>
>>> if i run the following:
>>>
>>> db = sqlContext.load("jdbc", url="jdbc:postgresql://localhost/xx",
>>> dbtables="mstr.d_customer")
>>>
>>> i get the error:
>>>
>>> py4j.protocol.Py4JJavaError: An error occurred while calling o28.load.
>>>
>>> : java.io.FileNotFoundException: File file:/Users/elliottcordo/jdbc does
>>> not exist
>>>
>>> Seems to think i'm trying to load a file called jdbc?  i'm setting the
>>> postgres driver in the SPARK_CLASSPATH as well (that doesn't seem to be the
>>> problem)..
>>>
>>
>>
>


Re: How to specify the port for AM Actor ...

2015-03-25 Thread Shixiong Zhu
There is no configuration for it now.

Best Regards,
Shixiong Zhu

2015-03-26 7:13 GMT+08:00 Manoj Samel :

> There may be firewall rules limiting the ports between host running spark
> and the hadoop cluster. In that case, not all ports are allowed.
>
> Can it be a range of ports that can be specified ?
>
> On Wed, Mar 25, 2015 at 4:06 PM, Shixiong Zhu  wrote:
>
>> It's a random port to avoid port conflicts, since multiple AMs can run in
>> the same machine. Why do you need a fixed port?
>>
>> Best Regards,
>> Shixiong Zhu
>>
>> 2015-03-26 6:49 GMT+08:00 Manoj Samel :
>>
>>> Spark 1.3, Hadoop 2.5, Kerbeors
>>>
>>> When running spark-shell in yarn client mode, it shows following message
>>> with a random port every time (44071 in example below). Is there a way to
>>> specify that port to a specific port ? It does not seem to be part of ports
>>> specified in http://spark.apache.org/docs/latest/configuration.html
>>> spark.xxx.port ...
>>>
>>> Thanks,
>>>
>>> 15/03/25 22:27:10 INFO Client: Application report for
>>> application_1427316153428_0014 (state: ACCEPTED)
>>> 15/03/25 22:27:10 INFO YarnClientSchedulerBackend: ApplicationMaster
>>> registered as Actor[akka.tcp://sparkYarnAM@xyz
>>> :44071/user/YarnAM#-1989273896]
>>>
>>
>>
>


Re: How to specify the port for AM Actor ...

2015-03-25 Thread Manoj Samel
There may be firewall rules limiting the ports between host running spark
and the hadoop cluster. In that case, not all ports are allowed.

Can it be a range of ports that can be specified ?

On Wed, Mar 25, 2015 at 4:06 PM, Shixiong Zhu  wrote:

> It's a random port to avoid port conflicts, since multiple AMs can run in
> the same machine. Why do you need a fixed port?
>
> Best Regards,
> Shixiong Zhu
>
> 2015-03-26 6:49 GMT+08:00 Manoj Samel :
>
>> Spark 1.3, Hadoop 2.5, Kerbeors
>>
>> When running spark-shell in yarn client mode, it shows following message
>> with a random port every time (44071 in example below). Is there a way to
>> specify that port to a specific port ? It does not seem to be part of ports
>> specified in http://spark.apache.org/docs/latest/configuration.html
>> spark.xxx.port ...
>>
>> Thanks,
>>
>> 15/03/25 22:27:10 INFO Client: Application report for
>> application_1427316153428_0014 (state: ACCEPTED)
>> 15/03/25 22:27:10 INFO YarnClientSchedulerBackend: ApplicationMaster
>> registered as Actor[akka.tcp://sparkYarnAM@xyz
>> :44071/user/YarnAM#-1989273896]
>>
>
>


Re: How to specify the port for AM Actor ...

2015-03-25 Thread Shixiong Zhu
It's a random port to avoid port conflicts, since multiple AMs can run in
the same machine. Why do you need a fixed port?

Best Regards,
Shixiong Zhu

2015-03-26 6:49 GMT+08:00 Manoj Samel :

> Spark 1.3, Hadoop 2.5, Kerbeors
>
> When running spark-shell in yarn client mode, it shows following message
> with a random port every time (44071 in example below). Is there a way to
> specify that port to a specific port ? It does not seem to be part of ports
> specified in http://spark.apache.org/docs/latest/configuration.html
> spark.xxx.port ...
>
> Thanks,
>
> 15/03/25 22:27:10 INFO Client: Application report for
> application_1427316153428_0014 (state: ACCEPTED)
> 15/03/25 22:27:10 INFO YarnClientSchedulerBackend: ApplicationMaster
> registered as Actor[akka.tcp://sparkYarnAM@xyz
> :44071/user/YarnAM#-1989273896]
>


Re: trouble with jdbc df in python

2015-03-25 Thread elliott cordo
Thanks!.. the below worked:

db = sqlCtx.load(source="jdbc",
url="jdbc:postgresql://localhost/x?user=x&password=x",dbtable="mstr.d_customer")

Note that
https://spark.apache.org/docs/latest/sql-programming-guide.html#dataframe-operations
needs to be updated:


On Wed, Mar 25, 2015 at 6:12 PM, Michael Armbrust 
wrote:

> Try:
>
> db = sqlContext.load(source="jdbc", url="jdbc:postgresql://localhost/xx",
> dbtables="mstr.d_customer")
>
>
> On Wed, Mar 25, 2015 at 2:19 PM, elliott cordo 
> wrote:
>
>> if i run the following:
>>
>> db = sqlContext.load("jdbc", url="jdbc:postgresql://localhost/xx",
>> dbtables="mstr.d_customer")
>>
>> i get the error:
>>
>> py4j.protocol.Py4JJavaError: An error occurred while calling o28.load.
>>
>> : java.io.FileNotFoundException: File file:/Users/elliottcordo/jdbc does
>> not exist
>>
>> Seems to think i'm trying to load a file called jdbc?  i'm setting the
>> postgres driver in the SPARK_CLASSPATH as well (that doesn't seem to be the
>> problem)..
>>
>
>


How to specify the port for AM Actor ...

2015-03-25 Thread Manoj Samel
Spark 1.3, Hadoop 2.5, Kerbeors

When running spark-shell in yarn client mode, it shows following message
with a random port every time (44071 in example below). Is there a way to
specify that port to a specific port ? It does not seem to be part of ports
specified in http://spark.apache.org/docs/latest/configuration.html
spark.xxx.port ...

Thanks,

15/03/25 22:27:10 INFO Client: Application report for
application_1427316153428_0014 (state: ACCEPTED)
15/03/25 22:27:10 INFO YarnClientSchedulerBackend: ApplicationMaster
registered as Actor[akka.tcp://sparkYarnAM@xyz
:44071/user/YarnAM#-1989273896]


Cross-compatibility of YARN shuffle service

2015-03-25 Thread Matt Cheah
Hi everyone,

I am considering moving from Spark-Standalone to YARN. The context is that
there are multiple Spark applications that are using different versions of
Spark that all want to use the same YARN cluster.

My question is: if I use a single Spark YARN shuffle service jar on the Node
Manager, will the service work properly with all of the Spark applications,
regardless of the specific versions of the applications? Or, is it the
case that, if I want to use the external shuffle service, I need to have all
of my applications using the same version of Spark?

Thanks,

-Matt Cheah






Re: Spark ML Pipeline inaccessible types

2015-03-25 Thread zapletal-martin
Thanks Peter,



I ended up doing something similar. However, I consider both of the approaches
you mentioned bad practice, which is why I was looking for a solution
directly supported by the current code.




I can work with that now, but it does not seem to be the proper solution.




Regards,

Martin





-- Original message --
From: Peter Rudenko
To: zapletal-mar...@email.cz, Sean Owen
Date: 25. 3. 2015 13:28:38
Subject: Re: Spark ML Pipeline inaccessible types



 Hi Martin, here are two possibilities to overcome this:

 1) Put your logic into org.apache.spark package in your project - then 
 everything would be accessible.
 2) Dirty trick:

 object SparkVector extends HashingTF {
   val VectorUDT: DataType = outputDataType
 }


 then you can do like this:

 StructType("vectorTypeColumn", 
SparkVector.VectorUDT, false))


 Thanks,
 Peter Rudenko

 On 2015-03-25 13:14, zapletal-mar...@email.cz wrote:

 



"Sean, 



thanks for your response. I am familiar with NoSuchMethodException in 
general, but I think it is not the case this time. The code actually 
attempts to get parameter by name using val m = this.getClass.getMethodName(paramName).




This may be a bug, but it is only a side effect caused by the real problem I
am facing. My issue is that VectorUDT is not accessible by user code and 
therefore it is not possible to use custom ML pipeline with the existing 
Predictors (see the last two paragraphs in my first email).




Best Regards,

Martin



-- Original message --
From: Sean Owen (so...@cloudera.com)
To: zapletal-mar...@email.cz
Date: 25. 3. 2015 11:05:54
Subject: Re: Spark ML Pipeline inaccessible types

"NoSuchMethodError in general means that your runtime and compile-time
environments are different. I think you need to first make sure you
don't have mismatching versions of Spark.

On Wed, Mar 25, 2015 at 11:00 AM, zapletal-mar...@email.cz wrote:
> Hi,
>
> I have started implementing a machine learning pipeline using Spark 1.3.0
> and the new pipelining API and DataFrames. I got to a point where I have my
> training data set prepared using a sequence of Transformers, but I am
> struggling to actually train a model and use it for predictions.
>
> I am getting a java.lang.NoSuchMethodException:
> org.apache.spark.ml.regression.LinearRegression.myFeaturesColumnName()
> exception thrown at checkInputColumn method in Params trait when using a
> Predictor (LinearRegression in my case, but that should not matter). This
> looks like a bug - the exception is thrown when executing getParam(colName)
> when the require(actualDataType.equals(datatype), ...) requirement is not
> met so the expected requirement failed exception is not thrown and is hidden
> by the unexpected NoSuchMethodException instead. I can raise a bug if this
> really is an issue and I am not using something incorrectly.
>
> The problem I am facing however is that the Predictor expects features to
> have VectorUDT type as defined in Predictor class (protected def
> featuresDataType: DataType = new VectorUDT). But since this type is
> private[spark] my Transformer can not prepare features with this type which
> then correctly results in the exception above when I use a different type.
>
> Is there a way to define a custom Pipeline that would be able to use the
> existing Predictors without having to bypass the access modifiers or
> reimplement something or is the pipelining API not yet expected to be used
> in this way?
>
> Thanks,
> Martin
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



 



"


Re: trouble with jdbc df in python

2015-03-25 Thread Michael Armbrust
Try:

db = sqlContext.load(source="jdbc", url="jdbc:postgresql://localhost/xx",
dbtables="mstr.d_customer")


On Wed, Mar 25, 2015 at 2:19 PM, elliott cordo 
wrote:

> if i run the following:
>
> db = sqlContext.load("jdbc", url="jdbc:postgresql://localhost/xx",
> dbtables="mstr.d_customer")
>
> i get the error:
>
> py4j.protocol.Py4JJavaError: An error occurred while calling o28.load.
>
> : java.io.FileNotFoundException: File file:/Users/elliottcordo/jdbc does
> not exist
>
> Seems to think i'm trying to load a file called jdbc?  i'm setting the
> postgres driver in the SPARK_CLASSPATH as well (that doesn't seem to be the
> problem)..
>


Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
Is there any way that I can install the new one and remove the previous version?
I installed spark 1.3 on my EC2 master and set the spark home to the new
one.
But when I start the spark-shell I get -
 java.lang.UnsatisfiedLinkError:
org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative()V
at
org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative(Native
Method)

Is there no way to upgrade without creating a new cluster?
Thanks
Roni



On Wed, Mar 25, 2015 at 1:18 PM, Dean Wampler  wrote:

> Yes, that's the problem. The RDD class exists in both binary jar files,
> but the signatures probably don't match. The bottom line, as always for
> tools like this, is that you can't mix versions.
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
>  (O'Reilly)
> Typesafe 
> @deanwampler 
> http://polyglotprogramming.com
>
> On Wed, Mar 25, 2015 at 3:13 PM, roni  wrote:
>
>> My cluster is still on spark 1.2 and in SBT I am using 1.3.
>> So probably it is compiling with 1.3 but running with 1.2 ?
>>
>> On Wed, Mar 25, 2015 at 12:34 PM, Dean Wampler 
>> wrote:
>>
>>> Weird. Are you running using SBT console? It should have the spark-core
>>> jar on the classpath. Similarly, spark-shell or spark-submit should work,
>>> but be sure you're using the same version of Spark when running as when
>>> compiling. Also, you might need to add spark-sql to your SBT dependencies,
>>> but that shouldn't be this issue.
>>>
>>> Dean Wampler, Ph.D.
>>> Author: Programming Scala, 2nd Edition
>>>  (O'Reilly)
>>> Typesafe 
>>> @deanwampler 
>>> http://polyglotprogramming.com
>>>
>>> On Wed, Mar 25, 2015 at 12:09 PM, roni  wrote:
>>>
 Thanks Dean and Nick.
 So, I removed the ADAM and H2o from my SBT as I was not using them.
 I got the code to compile - only for it to fail at runtime with -
 SparkContext: Created broadcast 1 from textFile at
 kmerIntersetion.scala:21
 Exception in thread "main" java.lang.NoClassDefFoundError:
 org/apache/spark/rdd/RDD$
 at preDefKmerIntersection$.main(kmerIntersetion.scala:26)

 This line is where I do a "JOIN" operation.
 val hgPair = hgfasta.map(_.split (",")).map(a=> (a(0),
 a(1).trim().toInt))
  val filtered = hgPair.filter(kv => kv._2 == 1)
  val bedPair = bedFile.map(_.split (",")).map(a=>
 (a(0), a(1).trim().toInt))
 * val joinRDD = bedPair.join(filtered)   *
 Any idea whats going on?
 I have data on the EC2 so I am avoiding creating the new cluster , but
 just upgrading and changing the code to use 1.3 and Spark SQL
 Thanks
 Roni



 On Wed, Mar 25, 2015 at 9:50 AM, Dean Wampler 
 wrote:

> For the Spark SQL parts, 1.3 breaks backwards compatibility, because
> before 1.3, Spark SQL was considered experimental where API changes were
> allowed.
>
> So, H2O and ADAM versions compatible with 1.2.x might not work with 1.3.
>
> dean
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
>  (O'Reilly)
> Typesafe 
> @deanwampler 
> http://polyglotprogramming.com
>
> On Wed, Mar 25, 2015 at 9:39 AM, roni  wrote:
>
>> Even if H2O and ADAM are dependent on 1.2.1, they should be backward
>> compatible, right?
>> So using 1.3 should not break them.
>> And the code is not using the classes from those libs.
>> I tried sbt clean compile .. same error
>> Thanks
>> _R
>>
>> On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath <
>> nick.pentre...@gmail.com> wrote:
>>
>>> What version of Spark do the other dependencies rely on (Adam and
>>> H2O?) - that could be it
>>>
>>> Or try sbt clean compile
>>>
>>> —
>>> Sent from Mailbox 
>>>
>>>
>>> On Wed, Mar 25, 2015 at 5:58 PM, roni  wrote:
>>>
 I have a EC2 cluster created using spark version 1.2.1.
 And I have a SBT project .
 Now I want to upgrade to spark 1.3 and use the new features.
 Below are issues .
 Sorry for the long post.
 Appreciate your help.
 Thanks
 -Roni

 Question - Do I have to create a new cluster using spark 1.3?

 Here is what I did -

 In my SBT file I  changed to -
 libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"

 But then I started getting compilation error. along with
 Here are some of the libraries that were evicted:
 [warn]  * org.apache.spark:spark-core_2.10:1.2.0 -> 1.3.0
 [warn]  * org.apac

Re: Can LBFGS be used on streaming data?

2015-03-25 Thread DB Tsai
Hi Arunkumar,

I think L-BFGS will not work, since the L-BFGS algorithm assumes that the
objective function is always the same (i.e., the data does not change) for
the entire optimization process, so that it can construct the approximated
Hessian matrix. In the streaming case the data keeps changing, which breaks
that assumption and will cause problems for the algorithm.
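
For reference, MLlib does ship a streaming-oriented regression that is built on
SGD rather than L-BFGS and is designed for data arriving in mini-batches. A
minimal sketch (the input path, batch interval and feature count below are
assumptions for illustration, not anything from this thread):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}

object StreamingRegressionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-linear-regression-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Hypothetical input: each line is in LabeledPoint.toString format, e.g. (1.0,[2.0,3.0,4.0])
    val training = ssc.textFileStream("hdfs:///tmp/training").map(LabeledPoint.parse)

    val numFeatures = 3 // assumption; must match the length of the feature vectors above
    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(numFeatures)) // the same weights are carried over and refined per batch

    model.trainOn(training) // runs SGD on each incoming RDD, updating one shared model
    model.predictOnValues(training.map(lp => (lp.label, lp.features))).print()

    ssc.start()
    ssc.awaitTermination()
  }
}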

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com


On Mon, Mar 16, 2015 at 3:19 PM, EcoMotto Inc.  wrote:
> Hello,
>
> I am new to spark streaming API.
>
> I wanted to ask if I can apply LBFGS (with LeastSquaresGradient) on
> streaming data? Currently I am using forecahRDD for parsing through DStream
> and I am generating a model based on each RDD. Am I doing anything logically
> wrong here?
> Thank you.
>
> Sample Code:
>
> val algorithm = new LBFGS(new LeastSquaresGradient(), new SimpleUpdater())
> var initialWeights =
> Vectors.dense(Array.fill(numFeatures)(scala.util.Random.nextDouble()))
> var isFirst = true
> var model = new LinearRegressionModel(null,1.0)
>
> parsedData.foreachRDD{rdd =>
>   if(isFirst) {
> val weights = algorithm.optimize(rdd, initialWeights)
> val w = weights.toArray
> val intercept = w.head
> model = new LinearRegressionModel(Vectors.dense(w.drop(1)), intercept)
> isFirst = false
>   }else{
> var ab = ArrayBuffer[Double]()
> ab.insert(0, model.intercept)
> ab.appendAll( model.weights.toArray)
> print("Intercept = "+model.intercept+" :: modelWeights =
> "+model.weights)
> initialWeights = Vectors.dense(ab.toArray)
> print("Initial Weights: "+ initialWeights)
> val weights = algorithm.optimize(rdd, initialWeights)
> val w = weights.toArray
> val intercept = w.head
> model = new LinearRegressionModel(Vectors.dense(w.drop(1)), intercept)
>   }
>
>
>
> Best Regards,
> Arunkumar




Exception Failed to add a datanode. User may turn off this feature by setting dfs.client.block.write.replace-datanode-on-failure.policy in configuration

2015-03-25 Thread varvind
Hi, I am running spark in mesos and getting this error. Can anyone help me
resolve this? Thanks

15/03/25 21:05:00 ERROR scheduler.LiveListenerBus: Listener EventLoggingListener threw an exception
java.lang.reflect.InvocationTargetException
        at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:203)
        at org.apache.spark.util.FileLogger$$anonfun$flush$2.apply(FileLogger.scala:203)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.util.FileLogger.flush(FileLogger.scala:203)
        at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:90)
        at org.apache.spark.scheduler.EventLoggingListener.onUnpersistRDD(EventLoggingListener.scala:121)
        at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$11.apply(SparkListenerBus.scala:66)
        at org.apache.spark.scheduler.SparkListenerBus$$anonfun$postToAll$11.apply(SparkListenerBus.scala:66)
        at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:83)
        at org.apache.spark.scheduler.SparkListenerBus$$anonfun$foreachListener$1.apply(SparkListenerBus.scala:81)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.SparkListenerBus$class.foreachListener(SparkListenerBus.scala:81)
        at org.apache.spark.scheduler.SparkListenerBus$class.postToAll(SparkListenerBus.scala:66)
        at org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:32)
        at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
        at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:56)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:56)
        at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
        at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply(LiveListenerBus.scala:47)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
        at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:46)
Caused by: java.io.IOException: Failed to add a datanode. User may turn off this
feature by setting dfs.client.block.write.replace-datanode-on-failure.policy in
configuration, where the current policy is DEFAULT. (Nodes:
current=[10.250.100.81:50010, 10.250.100.82:50010],
original=[10.250.100.81:50010, 10.250.100.82:50010])
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:792)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:852)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:958)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:755)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:424)
15/03/25 21:05:00 INFO scheduler.ReceivedBlockTracker: Deleting batches ArrayBuffer()
15/03/25 21:05:00 INFO scheduler.ReceivedBlockTracker: Deleting batches ArrayBuffer(142731690 ms)
15/03/25 21:05:00 INFO scheduler.JobGenerator: Checkpointing graph for time 142731750 ms
15/03/25 21:05:00 INFO streaming.DStreamGraph: Updating checkpoint data for time 142731750 ms
15/03/25 21:05:00 INFO streaming.DStreamGraph: Updated checkpoint data for time 142731750 ms




trouble with jdbc df in python

2015-03-25 Thread elliott cordo
If I run the following:

db = sqlContext.load("jdbc", url="jdbc:postgresql://localhost/xx",
dbtables="mstr.d_customer")

I get the error:

py4j.protocol.Py4JJavaError: An error occurred while calling o28.load.

: java.io.FileNotFoundException: File file:/Users/elliottcordo/jdbc does
not exist

It seems to think I'm trying to load a file called "jdbc". I'm setting the
postgres driver in the SPARK_CLASSPATH as well (that doesn't seem to be the
problem).


Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Dean Wampler
Yes, that's the problem. The RDD class exists in both binary jar files, but
the signatures probably don't match. The bottom line, as always for tools
like this, is that you can't mix versions.
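
One common way to avoid mixing versions is to compile against the exact Spark
version the cluster runs and mark it "provided", so the application jar never
bundles its own copy. A hedged build.sbt sketch (the 1.3.0 version string is an
assumption; substitute whatever the cluster actually runs):

// build.sbt sketch: keep compile-time and runtime Spark versions identical,
// and let spark-submit supply the Spark jars at runtime.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.3.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.3.0" % "provided"
)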

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
 (O'Reilly)
Typesafe 
@deanwampler 
http://polyglotprogramming.com

On Wed, Mar 25, 2015 at 3:13 PM, roni  wrote:

> My cluster is still on spark 1.2 and in SBT I am using 1.3.
> So probably it is compiling with 1.3 but running with 1.2 ?
>
> On Wed, Mar 25, 2015 at 12:34 PM, Dean Wampler 
> wrote:
>
>> Weird. Are you running using SBT console? It should have the spark-core
>> jar on the classpath. Similarly, spark-shell or spark-submit should work,
>> but be sure you're using the same version of Spark when running as when
>> compiling. Also, you might need to add spark-sql to your SBT dependencies,
>> but that shouldn't be this issue.
>>
>> Dean Wampler, Ph.D.
>> Author: Programming Scala, 2nd Edition
>>  (O'Reilly)
>> Typesafe 
>> @deanwampler 
>> http://polyglotprogramming.com
>>
>> On Wed, Mar 25, 2015 at 12:09 PM, roni  wrote:
>>
>>> Thanks Dean and Nick.
>>> So, I removed the ADAM and H2o from my SBT as I was not using them.
>>> I got the code to compile  - only for fail while running with -
>>> SparkContext: Created broadcast 1 from textFile at
>>> kmerIntersetion.scala:21
>>> Exception in thread "main" java.lang.NoClassDefFoundError:
>>> org/apache/spark/rdd/RDD$
>>> at preDefKmerIntersection$.main(kmerIntersetion.scala:26)
>>>
>>> This line is where I do a "JOIN" operation.
>>> val hgPair = hgfasta.map(_.split (",")).map(a=> (a(0),
>>> a(1).trim().toInt))
>>>  val filtered = hgPair.filter(kv => kv._2 == 1)
>>>  val bedPair = bedFile.map(_.split (",")).map(a=> (a(0),
>>> a(1).trim().toInt))
>>> * val joinRDD = bedPair.join(filtered)   *
>>> Any idea whats going on?
>>> I have data on the EC2 so I am avoiding creating the new cluster , but
>>> just upgrading and changing the code to use 1.3 and Spark SQL
>>> Thanks
>>> Roni
>>>
>>>
>>>
>>> On Wed, Mar 25, 2015 at 9:50 AM, Dean Wampler 
>>> wrote:
>>>
 For the Spark SQL parts, 1.3 breaks backwards compatibility, because
 before 1.3, Spark SQL was considered experimental where API changes were
 allowed.

 So, H2O and ADA compatible with 1.2.X might not work with 1.3.

 dean

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
  (O'Reilly)
 Typesafe 
 @deanwampler 
 http://polyglotprogramming.com

 On Wed, Mar 25, 2015 at 9:39 AM, roni  wrote:

> Even if H2o and ADA are dependent on 1.2.1 , it should be backword
> compatible, right?
> So using 1.3 should not break them.
> And the code is not using the classes from those libs.
> I tried sbt clean compile .. same errror
> Thanks
> _R
>
> On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath <
> nick.pentre...@gmail.com> wrote:
>
>> What version of Spark do the other dependencies rely on (Adam and
>> H2O?) - that could be it
>>
>> Or try sbt clean compile
>>
>> —
>> Sent from Mailbox 
>>
>>
>> On Wed, Mar 25, 2015 at 5:58 PM, roni  wrote:
>>
>>> I have a EC2 cluster created using spark version 1.2.1.
>>> And I have a SBT project .
>>> Now I want to upgrade to spark 1.3 and use the new features.
>>> Below are issues .
>>> Sorry for the long post.
>>> Appreciate your help.
>>> Thanks
>>> -Roni
>>>
>>> Question - Do I have to create a new cluster using spark 1.3?
>>>
>>> Here is what I did -
>>>
>>> In my SBT file I  changed to -
>>> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
>>>
>>> But then I started getting compilation error. along with
>>> Here are some of the libraries that were evicted:
>>> [warn]  * org.apache.spark:spark-core_2.10:1.2.0 -> 1.3.0
>>> [warn]  * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) ->
>>> 2.6.0
>>> [warn] Run 'evicted' to see detailed eviction warnings
>>>
>>>  constructor cannot be instantiated to expected type;
>>> [error]  found   : (T1, T2)
>>> [error]  required: org.apache.spark.sql.catalyst.expressions.Row
>>> [error] val ty =
>>> joinRDD.map{case(word, (file1Counts, file2Counts)) => KmerIntesect(word,
>>> file1Counts,"xyz")}
>>> [error]  ^
>>>
>>> Here is my SBT and code --
>>> SBT -
>>>
>>> version := "1.0"
>>

Re: Date and decimal datatype not working

2015-03-25 Thread Dean Wampler
Recall that the input isn't actually read until you do something that forces
evaluation, like calling saveAsTextFile. You didn't show the whole stack trace
here, but the exception probably occurred while parsing an input line where one
of your long fields is actually an empty string.

Because this is such a common problem, I usually define a "parse" method
that converts input text to the desired schema. It catches parse exceptions
like this and at least reports the bad line. If you can fall back to a default
long in this case, say 0, it is easier to always return something.
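
A minimal sketch of such a parse method (the Record fields, the 0 defaults, and
the error reporting below are illustrative assumptions, not the poster's actual
schema):

// Hypothetical record type covering just the first few fields of the poster's ROW_A.
case class Record(tstamp: Long, usidan: String, secnt: Int)

def parse(line: String): Record = {
  // Note: split("|") treats the argument as a regex, so an unescaped pipe splits
  // on every character; escape it to split on the literal delimiter.
  val p = line.split("\\|")
  def longOrDefault(s: String): Long =
    try s.trim.toLong
    catch { case _: NumberFormatException =>
      System.err.println(s"Bad long field in line: $line")
      0L
    }
  def intOrDefault(s: String): Int =
    try s.trim.toInt
    catch { case _: NumberFormatException =>
      System.err.println(s"Bad int field in line: $line")
      0
    }
  Record(longOrDefault(p(0)), p(1), intOrDefault(p(2)))
}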

dean



Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
 (O'Reilly)
Typesafe 
@deanwampler 
http://polyglotprogramming.com

On Wed, Mar 25, 2015 at 11:48 AM, BASAK, ANANDA  wrote:

>  Thanks. This library is only available with Spark 1.3. I am using
> version 1.2.1. Before I upgrade to 1.3, I want to try what can be done in
> 1.2.1.
>
>
>
> So I am using following:
>
> val MyDataset = sqlContext.sql("my select query”)
>
>
>
> MyDataset.map(t =>
> t(0)+"|"+t(1)+"|"+t(2)+"|"+t(3)+"|"+t(4)+"|"+t(5)).saveAsTextFile("/my_destination_path")
>
>
>
> But it is giving following error:
>
> 15/03/24 17:05:51 ERROR Executor: Exception in task 1.0 in stage 13.0 (TID
> 106)
>
> java.lang.NumberFormatException: For input string: ""
>
> at
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>
> at java.lang.Long.parseLong(Long.java:453)
>
> at java.lang.Long.parseLong(Long.java:483)
>
> at
> scala.collection.immutable.StringLike$class.toLong(StringLike.scala:230)
>
>
>
> is there something wrong with the TSTAMP field which is Long datatype?
>
>
>
> Thanks & Regards
>
> ---
>
> Ananda Basak
>
> Ph: 425-213-7092
>
>
>
> *From:* Yin Huai [mailto:yh...@databricks.com]
> *Sent:* Monday, March 23, 2015 8:55 PM
>
> *To:* BASAK, ANANDA
> *Cc:* user@spark.apache.org
> *Subject:* Re: Date and decimal datatype not working
>
>
>
> To store to csv file, you can use Spark-CSV
>  library.
>
>
>
> On Mon, Mar 23, 2015 at 5:35 PM, BASAK, ANANDA  wrote:
>
>  Thanks. This worked well as per your suggestions. I had to run following:
>
> val TABLE_A =
> sc.textFile("/Myhome/SPARK/files/table_a_file.txt").map(_.split("|")).map(p
> => ROW_A(p(0).trim.toLong, p(1), p(2).trim.toInt, p(3), BigDecimal(p(4)),
> BigDecimal(p(5)), BigDecimal(p(6
>
>
>
> Now I am stuck at another step. I have run a SQL query, where I am
> Selecting from all the fields with some where clause , TSTAMP filtered with
> date range and order by TSTAMP clause. That is running fine.
>
>
>
> Then I am trying to store the output in a CSV file. I am using
> saveAsTextFile(“filename”) function. But it is giving error. Can you please
> help me to write a proper syntax to store output in a CSV file?
>
>
>
>
>
> Thanks & Regards
>
> ---
>
> Ananda Basak
>
> Ph: 425-213-7092
>
>
>
> *From:* BASAK, ANANDA
> *Sent:* Tuesday, March 17, 2015 3:08 PM
> *To:* Yin Huai
> *Cc:* user@spark.apache.org
> *Subject:* RE: Date and decimal datatype not working
>
>
>
> Ok, thanks for the suggestions. Let me try and will confirm all.
>
>
>
> Regards
>
> Ananda
>
>
>
> *From:* Yin Huai [mailto:yh...@databricks.com]
> *Sent:* Tuesday, March 17, 2015 3:04 PM
> *To:* BASAK, ANANDA
> *Cc:* user@spark.apache.org
> *Subject:* Re: Date and decimal datatype not working
>
>
>
> p(0) is a String. So, you need to explicitly convert it to a Long. e.g.
> p(0).trim.toLong. You also need to do it for p(2). For those BigDecimals
> value, you need to create BigDecimal objects from your String values.
>
>
>
> On Tue, Mar 17, 2015 at 5:55 PM, BASAK, ANANDA  wrote:
>
>   Hi All,
>
> I am very new in Spark world. Just started some test coding from last
> week. I am using spark-1.2.1-bin-hadoop2.4 and scala coding.
>
> I am having issues while using Date and decimal data types. Following is
> my code that I am simply running on scala prompt. I am trying to define a
> table and point that to my flat file containing raw data (pipe delimited
> format). Once that is done, I will run some SQL queries and put the output
> data in to another flat file with pipe delimited format.
>
>
>
> ***
>
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>
> import sqlContext.createSchemaRDD
>
>
>
>
>
> // Define row and table
>
> case class ROW_A(
>
>   TSTAMP:   Long,
>
>   USIDAN: String,
>
>   SECNT:Int,
>
>   SECT:   String,
>
>   BLOCK_NUM:BigDecimal,
>
>   BLOCK_DEN:BigDecimal,
>
>   BLOCK_PCT:BigDecimal)
>
>
>
> val TABLE_A =
> sc.textFile("/Myhome/SPARK/files/table_a_file.txt").map(_.split("|")).map(p
> => ROW_A(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))
>
>
>
> TABLE_A.registerTempTable("TABLE_

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
My cluster is still on spark 1.2 and in SBT I am using 1.3.
So probably it is compiling with 1.3 but running with 1.2 ?

On Wed, Mar 25, 2015 at 12:34 PM, Dean Wampler 
wrote:

> Weird. Are you running using SBT console? It should have the spark-core
> jar on the classpath. Similarly, spark-shell or spark-submit should work,
> but be sure you're using the same version of Spark when running as when
> compiling. Also, you might need to add spark-sql to your SBT dependencies,
> but that shouldn't be this issue.
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
>  (O'Reilly)
> Typesafe 
> @deanwampler 
> http://polyglotprogramming.com
>
> On Wed, Mar 25, 2015 at 12:09 PM, roni  wrote:
>
>> Thanks Dean and Nick.
>> So, I removed the ADAM and H2o from my SBT as I was not using them.
>> I got the code to compile  - only for fail while running with -
>> SparkContext: Created broadcast 1 from textFile at
>> kmerIntersetion.scala:21
>> Exception in thread "main" java.lang.NoClassDefFoundError:
>> org/apache/spark/rdd/RDD$
>> at preDefKmerIntersection$.main(kmerIntersetion.scala:26)
>>
>> This line is where I do a "JOIN" operation.
>> val hgPair = hgfasta.map(_.split (",")).map(a=> (a(0), a(1).trim().toInt))
>>  val filtered = hgPair.filter(kv => kv._2 == 1)
>>  val bedPair = bedFile.map(_.split (",")).map(a=> (a(0),
>> a(1).trim().toInt))
>> * val joinRDD = bedPair.join(filtered)   *
>> Any idea whats going on?
>> I have data on the EC2 so I am avoiding creating the new cluster , but
>> just upgrading and changing the code to use 1.3 and Spark SQL
>> Thanks
>> Roni
>>
>>
>>
>> On Wed, Mar 25, 2015 at 9:50 AM, Dean Wampler 
>> wrote:
>>
>>> For the Spark SQL parts, 1.3 breaks backwards compatibility, because
>>> before 1.3, Spark SQL was considered experimental where API changes were
>>> allowed.
>>>
>>> So, H2O and ADA compatible with 1.2.X might not work with 1.3.
>>>
>>> dean
>>>
>>> Dean Wampler, Ph.D.
>>> Author: Programming Scala, 2nd Edition
>>>  (O'Reilly)
>>> Typesafe 
>>> @deanwampler 
>>> http://polyglotprogramming.com
>>>
>>> On Wed, Mar 25, 2015 at 9:39 AM, roni  wrote:
>>>
 Even if H2o and ADA are dependent on 1.2.1 , it should be backword
 compatible, right?
 So using 1.3 should not break them.
 And the code is not using the classes from those libs.
 I tried sbt clean compile .. same errror
 Thanks
 _R

 On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath <
 nick.pentre...@gmail.com> wrote:

> What version of Spark do the other dependencies rely on (Adam and
> H2O?) - that could be it
>
> Or try sbt clean compile
>
> —
> Sent from Mailbox 
>
>
> On Wed, Mar 25, 2015 at 5:58 PM, roni  wrote:
>
>> I have a EC2 cluster created using spark version 1.2.1.
>> And I have a SBT project .
>> Now I want to upgrade to spark 1.3 and use the new features.
>> Below are issues .
>> Sorry for the long post.
>> Appreciate your help.
>> Thanks
>> -Roni
>>
>> Question - Do I have to create a new cluster using spark 1.3?
>>
>> Here is what I did -
>>
>> In my SBT file I  changed to -
>> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
>>
>> But then I started getting compilation error. along with
>> Here are some of the libraries that were evicted:
>> [warn]  * org.apache.spark:spark-core_2.10:1.2.0 -> 1.3.0
>> [warn]  * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) ->
>> 2.6.0
>> [warn] Run 'evicted' to see detailed eviction warnings
>>
>>  constructor cannot be instantiated to expected type;
>> [error]  found   : (T1, T2)
>> [error]  required: org.apache.spark.sql.catalyst.expressions.Row
>> [error] val ty =
>> joinRDD.map{case(word, (file1Counts, file2Counts)) => KmerIntesect(word,
>> file1Counts,"xyz")}
>> [error]  ^
>>
>> Here is my SBT and code --
>> SBT -
>>
>> version := "1.0"
>>
>> scalaVersion := "2.10.4"
>>
>> resolvers += "Sonatype OSS Snapshots" at "
>> https://oss.sonatype.org/content/repositories/snapshots";;
>> resolvers += "Maven Repo1" at "https://repo1.maven.org/maven2";;
>> resolvers += "Maven Repo" at "
>> https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/
>> ";
>>
>> /* Dependencies - %% appends Scala version to artifactId */
>> libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.0"
>> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
>> libraryDepende

writing DStream RDDs to the same file

2015-03-25 Thread Adrian Mocanu
Hi
Is there a way to write all RDDs in a DStream to the same file?
I tried this and got an empty file. I think it's because the file is not closed properly, i.e.
ESMinibatchFunctions.writer.close() executes before the stream is created.

Here's my code
  myStream.foreachRDD(rdd => {
rdd.foreach(x => {
  ESMinibatchFunctions.writer.append(rdd.collect()(0).toString()+" the 
data ")})
//localRdd = localRdd.union(rdd)
//localArray = localArray ++ rdd.collect()
  } )

ESMinibatchFunctions.writer.close()

object ESMinibatchFunctions {
  val writer = new PrintWriter("c:/delme/exxx.txt")
}
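
For reference, a rough sketch of one way to append every batch to a single file
without closing anything before the stream starts: do the writing on the driver
inside foreachRDD and keep the streaming context alive until it terminates. This
assumes myStream is a DStream[String], that each batch is small enough to
collect to the driver, and that append-per-batch is the intended behaviour:

import java.io.{FileOutputStream, PrintWriter}

myStream.foreachRDD { rdd =>
  val batch = rdd.collect() // driver-side; only safe for small batches
  if (batch.nonEmpty) {
    // Open in append mode, write this batch, then close so nothing is lost on failure.
    val writer = new PrintWriter(new FileOutputStream("c:/delme/exxx.txt", true))
    try batch.foreach(line => writer.println(line)) finally writer.close()
  }
}

ssc.start()
ssc.awaitTermination() // don't close shared resources before this returns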


Re: newbie question - spark with mesos

2015-03-25 Thread Dean Wampler
I think the problem is the use the loopback address:

export SPARK_LOCAL_IP=127.0.0.1

In the stack trace from the slave, you see this:

...  Reason: Connection refused: localhost/127.0.0.1:51849
akka.actor.ActorNotFound: Actor not found for:
ActorSelection[Anchor(akka.tcp://sparkDriver@localhost:51849/),
Path(/user/MapOutputTracker)]

It's trying to connect to an Akka actor on itself, using the loopback
address.

Try changing SPARK_LOCAL_IP to the publicly routable IP address.

dean


Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
 (O'Reilly)
Typesafe 
@deanwampler 
http://polyglotprogramming.com

On Mon, Mar 23, 2015 at 7:37 PM, Anirudha Jadhav 
wrote:

> My bad there, I was using the correct link for docs. The spark shell runs
> correctly, the framework is registered fine on mesos.
>
> is there some setting i am missing:
> this is my spark-env.sh>>>
>
> export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
> export SPARK_EXECUTOR_URI=http://100.125.5.93/sparkx.tgz
> export SPARK_LOCAL_IP=127.0.0.1
>
>
>
> here is what i see on the slave node.
> 
> less
> 20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/stderr
> >
>
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0324 02:30:29.389225 27755 fetcher.cpp:76] Fetching URI '
> http://100.125.5.93/sparkx.tgz'
> I0324 02:30:29.389361 27755 fetcher.cpp:126] Downloading '
> http://100.125.5.93/sparkx.tgz' to
> '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/sparkx.tgz'
> I0324 02:30:35.353446 27755 fetcher.cpp:64] Extracted resource
> '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/sparkx.tgz'
> into
> '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56'
> Spark assembly has been built with Hive, including Datanucleus jars on
> classpath
> Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
> 15/03/24 02:30:37 INFO MesosExecutorBackend: Registered signal handlers
> for [TERM, HUP, INT]
> I0324 02:30:37.071077 27863 exec.cpp:132] Version: 0.21.1
> I0324 02:30:37.080971 27885 exec.cpp:206] Executor registered on slave
> 20150226-160708-78932-5050-8971-S0
> 15/03/24 02:30:37 INFO MesosExecutorBackend: Registered with Mesos as
> executor ID 20150226-160708-78932-5050-8971-S0 with 1 cpus
> 15/03/24 02:30:37 INFO SecurityManager: Changing view acls to: ubuntu
> 15/03/24 02:30:37 INFO SecurityManager: Changing modify acls to: ubuntu
> 15/03/24 02:30:37 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(ubuntu); users
> with modify permissions: Set(ubuntu)
> 15/03/24 02:30:37 INFO Slf4jLogger: Slf4jLogger started
> 15/03/24 02:30:37 INFO Remoting: Starting remoting
> 15/03/24 02:30:38 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://sparkexecu...@mesos-si2.dny1.bcpc.bloomberg.com:50542]
> 15/03/24 02:30:38 INFO Utils: Successfully started service 'sparkExecutor'
> on port 50542.
> 15/03/24 02:30:38 INFO AkkaUtils: Connecting to MapOutputTracker:
> akka.tcp://sparkDriver@localhost:51849/user/MapOutputTracker
> 15/03/24 02:30:38 WARN Remoting: Tried to associate with unreachable
> remote address [akka.tcp://sparkDriver@localhost:51849]. Address is now
> gated for 5000 ms, all messages to this address will be delivered to dead
> letters. Reason: Connection refused: localhost/127.0.0.1:51849
> akka.actor.ActorNotFound: Actor not found for:
> ActorSelection[Anchor(akka.tcp://sparkDriver@localhost:51849/),
> Path(/user/MapOutputTracker)]
> at
> akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
> at
> akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)
> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
> at
> akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
> at
> akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
> at
> akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
> at
> akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
> at
> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
> at
> akka.dispatch.Ba

Re: Can LBFGS be used on streaming data?

2015-03-25 Thread EcoMotto Inc.
Hello Jeremy,

Sorry for the delayed reply!

The first issue was resolved; I believe it was just a production and consumption
rate problem.

Regarding the second question, I am streaming the data from the file and
there are about 38k records. I am sending the streams in the same sequence
as I read them from the file, but I am getting different weights each time;
it may be something to do with how the DStreams are being processed.

Can you suggest a solution for this case? My requirement is that my
program must generate the same weights for both static and streaming data.
Thank you for your help!

Best Regards,
Arunkumar


On Thu, Mar 19, 2015 at 9:25 PM, Jeremy Freeman 
wrote:

> Regarding the first question, can you say more about how you are loading
> your data? And what is the size of the data set? And is that the only error
> you see, and do you only see it in the streaming version?
>
> For the second question, there are a couple reasons the weights might
> slightly differ, it depends on exactly how you set up the comparison. When
> you split it into 5, were those the same 5 chunks of data you used for the
> streaming case? And were they presented to the optimizer in the same order?
> Difference in either could produce small differences in the resulting
> weights, but that doesn’t mean it’s doing anything wrong.
>
> -
> jeremyfreeman.net
> @thefreemanlab
>
> On Mar 17, 2015, at 6:19 PM, EcoMotto Inc.  wrote:
>
> Hello Jeremy,
>
> Thank you for your reply.
>
> When I am running this code on the local machine on a streaming data, it
> keeps giving me this error:
> *WARN TaskSetManager: Lost task 2.0 in stage 211.0 (TID 4138, localhost):
> java.io.FileNotFoundException:
> /tmp/spark-local-20150316165742-9ac0/27/shuffle_102_2_0.data (No such file
> or directory) *
>
> And when I execute the same code on a static data after randomly splitting
> it into 5 sets, it gives me a little bit different weights (difference is
> in decimals). I am still trying to analyse why would this be happening.
> Any inputs, on why would this be happening?
>
> Best Regards,
> Arunkumar
>
>
> On Tue, Mar 17, 2015 at 11:32 AM, Jeremy Freeman  > wrote:
>
>> Hi Arunkumar,
>>
>> That looks like it should work. Logically, it’s similar to the
>> implementation used by StreamingLinearRegression and
>> StreamingLogisticRegression, see this class:
>>
>>
>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingLinearAlgorithm.scala
>>
>> which exposes the kind of operation your describing (for any linear
>> method).
>>
>> The nice thing about the gradient-based methods is that they can use
>> existing MLLib optimization routines in this fairly direct way. Other
>> methods (such as KMeans) require a bit more reengineering.
>>
>> — Jeremy
>>
>> -
>> jeremyfreeman.net
>> @thefreemanlab
>>
>> On Mar 16, 2015, at 6:19 PM, EcoMotto Inc.  wrote:
>>
>> Hello,
>>
>> I am new to spark streaming API.
>>
>> I wanted to ask if I can apply LBFGS (with LeastSquaresGradient) on
>> streaming data? Currently I am using forecahRDD for parsing through DStream
>> and I am generating a model based on each RDD. Am I doing anything
>> logically wrong here?
>> Thank you.
>>
>> Sample Code:
>>
>> val algorithm = new LBFGS(new LeastSquaresGradient(), new SimpleUpdater())
>> var initialWeights = 
>> Vectors.dense(Array.fill(numFeatures)(scala.util.Random.nextDouble()))
>> var isFirst = true
>> var model = new LinearRegressionModel(null,1.0)
>>
>> parsedData.foreachRDD{rdd =>
>>   if(isFirst) {
>> val weights = algorithm.optimize(rdd, initialWeights)
>> val w = weights.toArray
>> val intercept = w.head
>> model = new LinearRegressionModel(Vectors.dense(w.drop(1)), intercept)
>> isFirst = false
>>   }else{
>> var ab = ArrayBuffer[Double]()
>> ab.insert(0, model.intercept)
>> ab.appendAll( model.weights.toArray)
>> print("Intercept = "+model.intercept+" :: modelWeights = "+model.weights)
>> initialWeights = Vectors.dense(ab.toArray)
>> print("Initial Weights: "+ initialWeights)
>> val weights = algorithm.optimize(rdd, initialWeights)
>> val w = weights.toArray
>> val intercept = w.head
>> model = new LinearRegressionModel(Vectors.dense(w.drop(1)), intercept)
>>   }
>>
>>
>>
>> Best Regards,
>> Arunkumar
>>
>>
>>
>
>


Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Dean Wampler
Weird. Are you running using SBT console? It should have the spark-core jar
on the classpath. Similarly, spark-shell or spark-submit should work, but
be sure you're using the same version of Spark when running as when
compiling. Also, you might need to add spark-sql to your SBT dependencies,
but that shouldn't be this issue.

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
 (O'Reilly)
Typesafe 
@deanwampler 
http://polyglotprogramming.com

On Wed, Mar 25, 2015 at 12:09 PM, roni  wrote:

> Thanks Dean and Nick.
> So, I removed the ADAM and H2o from my SBT as I was not using them.
> I got the code to compile  - only for fail while running with -
> SparkContext: Created broadcast 1 from textFile at kmerIntersetion.scala:21
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/spark/rdd/RDD$
> at preDefKmerIntersection$.main(kmerIntersetion.scala:26)
>
> This line is where I do a "JOIN" operation.
> val hgPair = hgfasta.map(_.split (",")).map(a=> (a(0), a(1).trim().toInt))
>  val filtered = hgPair.filter(kv => kv._2 == 1)
>  val bedPair = bedFile.map(_.split (",")).map(a=> (a(0),
> a(1).trim().toInt))
> * val joinRDD = bedPair.join(filtered)   *
> Any idea whats going on?
> I have data on the EC2 so I am avoiding creating the new cluster , but
> just upgrading and changing the code to use 1.3 and Spark SQL
> Thanks
> Roni
>
>
>
> On Wed, Mar 25, 2015 at 9:50 AM, Dean Wampler 
> wrote:
>
>> For the Spark SQL parts, 1.3 breaks backwards compatibility, because
>> before 1.3, Spark SQL was considered experimental where API changes were
>> allowed.
>>
>> So, H2O and ADA compatible with 1.2.X might not work with 1.3.
>>
>> dean
>>
>> Dean Wampler, Ph.D.
>> Author: Programming Scala, 2nd Edition
>>  (O'Reilly)
>> Typesafe 
>> @deanwampler 
>> http://polyglotprogramming.com
>>
>> On Wed, Mar 25, 2015 at 9:39 AM, roni  wrote:
>>
>>> Even if H2o and ADA are dependent on 1.2.1 , it should be backword
>>> compatible, right?
>>> So using 1.3 should not break them.
>>> And the code is not using the classes from those libs.
>>> I tried sbt clean compile .. same errror
>>> Thanks
>>> _R
>>>
>>> On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath <
>>> nick.pentre...@gmail.com> wrote:
>>>
 What version of Spark do the other dependencies rely on (Adam and H2O?)
 - that could be it

 Or try sbt clean compile

 —
 Sent from Mailbox 


 On Wed, Mar 25, 2015 at 5:58 PM, roni  wrote:

> I have a EC2 cluster created using spark version 1.2.1.
> And I have a SBT project .
> Now I want to upgrade to spark 1.3 and use the new features.
> Below are issues .
> Sorry for the long post.
> Appreciate your help.
> Thanks
> -Roni
>
> Question - Do I have to create a new cluster using spark 1.3?
>
> Here is what I did -
>
> In my SBT file I  changed to -
> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
>
> But then I started getting compilation error. along with
> Here are some of the libraries that were evicted:
> [warn]  * org.apache.spark:spark-core_2.10:1.2.0 -> 1.3.0
> [warn]  * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) ->
> 2.6.0
> [warn] Run 'evicted' to see detailed eviction warnings
>
>  constructor cannot be instantiated to expected type;
> [error]  found   : (T1, T2)
> [error]  required: org.apache.spark.sql.catalyst.expressions.Row
> [error] val ty =
> joinRDD.map{case(word, (file1Counts, file2Counts)) => KmerIntesect(word,
> file1Counts,"xyz")}
> [error]  ^
>
> Here is my SBT and code --
> SBT -
>
> version := "1.0"
>
> scalaVersion := "2.10.4"
>
> resolvers += "Sonatype OSS Snapshots" at "
> https://oss.sonatype.org/content/repositories/snapshots";;
> resolvers += "Maven Repo1" at "https://repo1.maven.org/maven2";;
> resolvers += "Maven Repo" at "
> https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/";;
>
> /* Dependencies - %% appends Scala version to artifactId */
> libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.0"
> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
> libraryDependencies += "org.bdgenomics.adam" % "adam-core" % "0.16.0"
> libraryDependencies += "ai.h2o" % "sparkling-water-core_2.10" %
> "0.2.10"
>
>
> CODE --
> import org.apache.spark.{SparkConf, SparkContext}
> case class KmerIntesect(kmer: String, kCount: Int, fileName: String)
>
> object preDefKme

Re: What are the best options for quickly filtering a DataFrame on a single column?

2015-03-25 Thread Michael Armbrust
My example is a totally reasonable way to do it; it just requires
constructing strings.

In many cases you can also do it with column objects

df[df.name == "test"].collect()

Out[15]: [Row(name=u'test')]


You should check out:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column
and
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions

On Wed, Mar 25, 2015 at 11:39 AM, Stuart Layton 
wrote:

> Thanks for the response, I was using IN as an example of the type of
> operation I need to do. Is there another way to do this that lines up more
> naturally with the way things are supposed to be done in SparkSQL?
>
> On Wed, Mar 25, 2015 at 2:29 PM, Michael Armbrust 
> wrote:
>
>> The only way to do "in" using python currently is to use the string based
>> filter API (where you pass us an expression as a string, and we parse it
>> using our SQL parser).
>>
>> from pyspark.sql import Row
>> from pyspark.sql.functions import *
>>
>> df = sc.parallelize([Row(name="test")]).toDF()
>> df.filter("name in ('a', 'b')").collect()
>> Out[1]: []
>>
>> df.filter("name in ('test')").collect()
>> Out[2]: [Row(name=u'test')]
>>
>> In general you want to avoid lambda functions whenever you can do the
>> same thing a dataframe expression.  This is because your lambda function is
>> a black box that we cannot optimize (though you should certainly use them
>> for the advanced stuff that expressions can't handle).
>>
>> I opened SPARK-6536  to
>> provide a nicer interface for this.
>>
>>
>> On Wed, Mar 25, 2015 at 7:41 AM, Stuart Layton 
>> wrote:
>>
>>> I have a SparkSQL dataframe with a a few billion rows that I need to
>>> quickly filter down to a few hundred thousand rows, using an operation like
>>> (syntax may not be correct)
>>>
>>> df = df[ df.filter(lambda x: x.key_col in approved_keys)]
>>>
>>> I was thinking about serializing the data using parquet and saving it to
>>> S3, however as I want to optimize for filtering speed I'm not sure this is
>>> the best option.
>>>
>>> --
>>> Stuart Layton
>>>
>>
>>
>
>
> --
> Stuart Layton
>


Re: Spark shell never leaves ACCEPTED state in YARN CDH5

2015-03-25 Thread Marcelo Vanzin
That probably means there are not enough free resources in your cluster
to run the AM for the Spark job. Check your RM's web ui to see the
resources you have available.

On Wed, Mar 25, 2015 at 12:08 PM, Khandeshi, Ami
 wrote:
> I am seeing the same behavior.  I have enough resources…..  How do I resolve
> it?
>
>
>
> Thanks,
>
>
>
> Ami



-- 
Marcelo




Re: Can a DataFrame be saved to s3 directly using Parquet?

2015-03-25 Thread Michael Armbrust
Until then you can try

sql("SET spark.sql.parquet.useDataSourceApi=false")

On Wed, Mar 25, 2015 at 12:15 PM, Michael Armbrust 
wrote:

> This will be fixed in Spark 1.3.1:
> https://issues.apache.org/jira/browse/SPARK-6351
>
> and is fixed in master/branch-1.3 if you want to compile from source
>
> On Wed, Mar 25, 2015 at 11:59 AM, Stuart Layton 
> wrote:
>
>> I'm trying to save a dataframe to s3 as a parquet file but I'm getting
>> Wrong FS errors
>>
>> >>> df.saveAsParquetFile(parquetFile)
>> 15/03/25 18:56:10 INFO storage.MemoryStore: ensureFreeSpace(46645) called
>> with curMem=82744, maxMem=278302556
>> 15/03/25 18:56:10 INFO storage.MemoryStore: Block broadcast_5 stored as
>> values in memory (estimated size 45.6 KB, free 265.3 MB)
>> 15/03/25 18:56:10 INFO storage.MemoryStore: ensureFreeSpace(7078) called
>> with curMem=129389, maxMem=278302556
>> 15/03/25 18:56:10 INFO storage.MemoryStore: Block broadcast_5_piece0
>> stored as bytes in memory (estimated size 6.9 KB, free 265.3 MB)
>> 15/03/25 18:56:10 INFO storage.BlockManagerInfo: Added broadcast_5_piece0
>> in memory on ip-172-31-1-219.ec2.internal:58280 (size: 6.9 KB, free: 265.4
>> MB)
>> 15/03/25 18:56:10 INFO storage.BlockManagerMaster: Updated info of block
>> broadcast_5_piece0
>> 15/03/25 18:56:10 INFO spark.SparkContext: Created broadcast 5 from
>> textFile at JSONRelation.scala:98
>> Traceback (most recent call last):
>>   File "", line 1, in 
>>   File "/root/spark/python/pyspark/sql/dataframe.py", line 121, in
>> saveAsParquetFile
>> self._jdf.saveAsParquetFile(path)
>>   File
>> "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line
>> 538, in __call__
>>   File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>> line 300, in get_return_value
>> py4j.protocol.Py4JJavaError: An error occurred while calling
>> o22.saveAsParquetFile.
>> : java.lang.IllegalArgumentException: Wrong FS:
>> s3n://com.my.bucket/spark-testing/, expected: hdfs://
>> ec2-52-0-159-113.compute-1.amazonaws.com:9000
>>
>>
>> Is it possible to save a dataframe to s3 directly using parquet?
>>
>> --
>> Stuart Layton
>>
>
>


Re: Can a DataFrame be saved to s3 directly using Parquet?

2015-03-25 Thread Michael Armbrust
This will be fixed in Spark 1.3.1:
https://issues.apache.org/jira/browse/SPARK-6351

and is fixed in master/branch-1.3 if you want to compile from source

On Wed, Mar 25, 2015 at 11:59 AM, Stuart Layton 
wrote:

> I'm trying to save a dataframe to s3 as a parquet file but I'm getting
> Wrong FS errors
>
> >>> df.saveAsParquetFile(parquetFile)
> 15/03/25 18:56:10 INFO storage.MemoryStore: ensureFreeSpace(46645) called
> with curMem=82744, maxMem=278302556
> 15/03/25 18:56:10 INFO storage.MemoryStore: Block broadcast_5 stored as
> values in memory (estimated size 45.6 KB, free 265.3 MB)
> 15/03/25 18:56:10 INFO storage.MemoryStore: ensureFreeSpace(7078) called
> with curMem=129389, maxMem=278302556
> 15/03/25 18:56:10 INFO storage.MemoryStore: Block broadcast_5_piece0
> stored as bytes in memory (estimated size 6.9 KB, free 265.3 MB)
> 15/03/25 18:56:10 INFO storage.BlockManagerInfo: Added broadcast_5_piece0
> in memory on ip-172-31-1-219.ec2.internal:58280 (size: 6.9 KB, free: 265.4
> MB)
> 15/03/25 18:56:10 INFO storage.BlockManagerMaster: Updated info of block
> broadcast_5_piece0
> 15/03/25 18:56:10 INFO spark.SparkContext: Created broadcast 5 from
> textFile at JSONRelation.scala:98
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/root/spark/python/pyspark/sql/dataframe.py", line 121, in
> saveAsParquetFile
> self._jdf.saveAsParquetFile(path)
>   File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
> line 538, in __call__
>   File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
> line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling
> o22.saveAsParquetFile.
> : java.lang.IllegalArgumentException: Wrong FS:
> s3n://com.my.bucket/spark-testing/, expected: hdfs://
> ec2-52-0-159-113.compute-1.amazonaws.com:9000
>
>
> Is it possible to save a dataframe to s3 directly using parquet?
>
> --
> Stuart Layton
>


Re:

2015-03-25 Thread Himanish Kushary
It will only give (A,B). I am generating the pairs from combinations of
the strings A, B, C and D, so the pairs (ignoring order) would be

(A,B),(A,C),(A,D),(B,C),(B,D),(C,D)

On successful filtering using the original condition it will transform to
(A,B) and (C,D)
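
Since the "first use wins" rule is order-dependent and inherently sequential,
one simple (non-distributed) sketch is to walk the pairs with a mutable set of
already-used strings; this assumes the candidate pairs fit on the driver, e.g.
after a collect():

import scala.collection.mutable

// Keep a pair only if neither element has appeared in an earlier kept pair.
def greedyDistinctPairs(pairs: Seq[(String, String)]): Seq[(String, String)] = {
  val used = mutable.Set.empty[String]
  pairs.filter { case (a, b) =>
    if (used.contains(a) || used.contains(b)) false
    else { used += a; used += b; true }
  }
}

// The pairs from this thread, in generation order:
val input = Seq(("A", "B"), ("B", "C"), ("C", "D"), ("A", "D"), ("E", "F"), ("B", "F"))
greedyDistinctPairs(input) // List((A,B), (C,D), (E,F))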

On Wed, Mar 25, 2015 at 3:00 PM, Nathan Kronenfeld <
nkronenfeld@uncharted.software> wrote:

> What would it do with the following dataset?
>
> (A, B)
> (A, C)
> (B, D)
>
>
> On Wed, Mar 25, 2015 at 1:02 PM, Himanish Kushary 
> wrote:
>
>> Hi,
>>
>> I have a RDD of pairs of strings like below :
>>
>> (A,B)
>> (B,C)
>> (C,D)
>> (A,D)
>> (E,F)
>> (B,F)
>>
>> I need to transform/filter this into a RDD of pairs that does not repeat
>> a string once it has been used once. So something like ,
>>
>> (A,B)
>> (C,D)
>> (E,F)
>>
>> (B,C) is out because B has already ben used in (A,B), (A,D) is out
>> because A (and D) has been used etc.
>>
>> I was thinking of a option of using a shared variable to keep track of
>> what has already been used but that may only work for a single partition
>> and would not scale for larger dataset.
>>
>> Is there any other efficient way to accomplish this ?
>>
>> --
>> Thanks & Regards
>> Himanish
>>
>
>


-- 
Thanks & Regards
Himanish


Spark shell never leaves ACCEPTED state in YARN CDH5

2015-03-25 Thread Khandeshi, Ami
I am seeing the same behavior.  I have enough resources.  How do I resolve 
it?

Thanks,

Ami


Re: column expression in left outer join for DataFrame

2015-03-25 Thread Michael Armbrust
That's a good question.  In this particular example, it is really only
internal implementation details that make it ambiguous.  However, fixing
this was a very large change, so we have deferred it to Spark 1.4 and instead
print a warning now when you construct trivially equal expressions.  I can
try to explain a little bit about why solving this generally is (mostly)
impossible.  Consider the following:

val df = sqlContext.load(...)

val df1 = df
val df2 = df

df1.join(df2, df1("a") === df2("a"))

Compared with

"SELECT * FROM df df1 JOIN df df2 WHERE df1.a = df2.a"

In the first example, the assigning of df to df1 and df2 is completely
transparent to the catalyst optimizer as it is happening in Scala code.
This means that df1("a") and df2("a") are completely indistinguishable to
us (at least without crazy macro magic).  In contrast, the aliasing is
visible to the optimizer when are doing it in SQL instead of Scala and thus
we can differentiate.

In your case you are doing transformations, and we could assign new unique
ids each time a transformation is done.  However, we don't do this today,
and its a pretty big change.  There is a JIRA for this: SPARK-6231


On Wed, Mar 25, 2015 at 11:47 AM, S Krishna  wrote:

> Hi,
>
> Thanks for your response.  I am not clear about why the query is ambiguous.
>
> val both = df_2.join(df_1, df_2("country")===df_1("country"),
> "left_outer")
>
> I thought df_2("country")===df_1("country") indicates that the country
> field in the 2 dataframes should match and df_2("country") is the
> equivalent of df_2.country in SQL, while  df_1("country") is the
> equivalent of df_1.country in SQL. So I am not sure why it is ambiguous. In
> Spark 1.2.0 I have used the same logic using SparkSQL  and Tables ( e.g.
>  "WHERE tab1.country = tab2.country")  and had no problems getting the
> correct result.
>
> thanks
>
>
>
>
>
> On Wed, Mar 25, 2015 at 11:05 AM, Michael Armbrust  > wrote:
>
>> Unfortunately you are now hitting a bug (that is fixed in master and will
>> be released in 1.3.1 hopefully next week).  However, even with that your
>> query is still ambiguous and you will need to use aliases:
>>
>> val df_1 = df.filter( df("event") === 0)
>>   . select("country", "cnt").as("a")
>> val df_2 = df.filter( df("event") === 3)
>>   . select("country", "cnt").as("b")
>> val both = df_2.join(df_1, $"a.country" === $"b.country"), "left_outer")
>>
>>
>>
>> On Tue, Mar 24, 2015 at 11:57 PM, S Krishna 
>> wrote:
>>
>>> Hi,
>>>
>>> Thanks for your response. I modified my code as per your suggestion, but
>>> now I am getting a runtime error. Here's my code:
>>>
>>> val df_1 = df.filter( df("event") === 0)
>>>   . select("country", "cnt")
>>>
>>> val df_2 = df.filter( df("event") === 3)
>>>   . select("country", "cnt")
>>>
>>> df_1.show()
>>> //produces the following output :
>>> // countrycnt
>>> //   tw   3000
>>> //   uk   2000
>>> //   us   1000
>>>
>>> df_2.show()
>>> //produces the following output :
>>> // countrycnt
>>> //   tw   25
>>> //   uk   200
>>> //   us   95
>>>
>>> val both = df_2.join(df_1, df_2("country")===df_1("country"),
>>> "left_outer")
>>>
>>> I am getting the following error when executing the join statement:
>>>
>>> java.util.NoSuchElementException: next on empty iterator.
>>>
>>> This error seems to be originating at DataFrame.join (line 133 in
>>> DataFrame.scala).
>>>
>>> The show() results show that both dataframes do have columns named
>>> "country" and that they are non-empty. I also tried the simpler join ( i.e.
>>> df_2.join(df_1) ) and got the same error stated above.
>>>
>>> I would like to know what is wrong with the join statement above.
>>>
>>> thanks
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Mar 24, 2015 at 6:08 PM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
 You need to use `===`, so that you are constructing a column expression
 instead of evaluating the standard scala equality method.  Calling methods
 to access columns (i.e. df.county is only supported in python).

 val join_df =  df1.join( df2, df1("country") === df2("country"),
 "left_outer")

 On Tue, Mar 24, 2015 at 5:50 PM, SK  wrote:

> Hi,
>
> I am trying to port some code that was working in Spark 1.2.0 on the
> latest
> version, Spark 1.3.0. This code involves a left outer join between two
> SchemaRDDs which I am now trying to change to a left outer join
> between 2
> DataFrames. I followed the example  for left outer join of DataFrame at
>
> https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
>
> Here's my code, where df1 and df2 are the 2 dataframes I am joining on
> the
> "country" field:
>
>  v

Re:

2015-03-25 Thread Nathan Kronenfeld
What would it do with the following dataset?

(A, B)
(A, C)
(B, D)

On Wed, Mar 25, 2015 at 1:02 PM, Himanish Kushary 
wrote:

> Hi,
>
> I have a RDD of pairs of strings like below :
>
> (A,B)
> (B,C)
> (C,D)
> (A,D)
> (E,F)
> (B,F)
>
> I need to transform/filter this into a RDD of pairs that does not repeat a
> string once it has been used once. So something like ,
>
> (A,B)
> (C,D)
> (E,F)
>
> (B,C) is out because B has already ben used in (A,B), (A,D) is out because
> A (and D) has been used etc.
>
> I was thinking of a option of using a shared variable to keep track of
> what has already been used but that may only work for a single partition
> and would not scale for larger dataset.
>
> Is there any other efficient way to accomplish this ?
>
> --
> Thanks & Regards
> Himanish
>


Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
Thanks Dean and Nick.
So, I removed the ADAM and H2o from my SBT as I was not using them.
I got the code to compile - only to fail while running with -
SparkContext: Created broadcast 1 from textFile at kmerIntersetion.scala:21
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/spark/rdd/RDD$
at preDefKmerIntersection$.main(kmerIntersetion.scala:26)

This line is where I do a "JOIN" operation.
val hgPair = hgfasta.map(_.split (",")).map(a=> (a(0), a(1).trim().toInt))
 val filtered = hgPair.filter(kv => kv._2 == 1)
 val bedPair = bedFile.map(_.split (",")).map(a=> (a(0),
a(1).trim().toInt))
* val joinRDD = bedPair.join(filtered)   *
Any idea whats going on?
I have data on the EC2 so I am avoiding creating the new cluster , but just
upgrading and changing the code to use 1.3 and Spark SQL
Thanks
Roni



On Wed, Mar 25, 2015 at 9:50 AM, Dean Wampler  wrote:

> For the Spark SQL parts, 1.3 breaks backwards compatibility, because
> before 1.3, Spark SQL was considered experimental where API changes were
> allowed.
>
> So, H2O and ADA compatible with 1.2.X might not work with 1.3.
>
> dean
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
>  (O'Reilly)
> Typesafe 
> @deanwampler 
> http://polyglotprogramming.com
>
> On Wed, Mar 25, 2015 at 9:39 AM, roni  wrote:
>
>> Even if H2o and ADA are dependent on 1.2.1 , it should be backword
>> compatible, right?
>> So using 1.3 should not break them.
>> And the code is not using the classes from those libs.
>> I tried sbt clean compile .. same errror
>> Thanks
>> _R
>>
>> On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath > > wrote:
>>
>>> What version of Spark do the other dependencies rely on (Adam and H2O?)
>>> - that could be it
>>>
>>> Or try sbt clean compile
>>>
>>> —
>>> Sent from Mailbox 
>>>
>>>
>>> On Wed, Mar 25, 2015 at 5:58 PM, roni  wrote:
>>>
 I have a EC2 cluster created using spark version 1.2.1.
 And I have a SBT project .
 Now I want to upgrade to spark 1.3 and use the new features.
 Below are issues .
 Sorry for the long post.
 Appreciate your help.
 Thanks
 -Roni

 Question - Do I have to create a new cluster using spark 1.3?

 Here is what I did -

 In my SBT file I  changed to -
 libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"

 But then I started getting compilation error. along with
 Here are some of the libraries that were evicted:
 [warn]  * org.apache.spark:spark-core_2.10:1.2.0 -> 1.3.0
 [warn]  * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) ->
 2.6.0
 [warn] Run 'evicted' to see detailed eviction warnings

  constructor cannot be instantiated to expected type;
 [error]  found   : (T1, T2)
 [error]  required: org.apache.spark.sql.catalyst.expressions.Row
 [error] val ty = joinRDD.map{case(word,
 (file1Counts, file2Counts)) => KmerIntesect(word, file1Counts,"xyz")}
 [error]  ^

 Here is my SBT and code --
 SBT -

 version := "1.0"

 scalaVersion := "2.10.4"

 resolvers += "Sonatype OSS Snapshots" at "
 https://oss.sonatype.org/content/repositories/snapshots";;
 resolvers += "Maven Repo1" at "https://repo1.maven.org/maven2";;
 resolvers += "Maven Repo" at "
 https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/";;

 /* Dependencies - %% appends Scala version to artifactId */
 libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.0"
 libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
 libraryDependencies += "org.bdgenomics.adam" % "adam-core" % "0.16.0"
 libraryDependencies += "ai.h2o" % "sparkling-water-core_2.10" % "0.2.10"


 CODE --
 import org.apache.spark.{SparkConf, SparkContext}
 case class KmerIntesect(kmer: String, kCount: Int, fileName: String)

 object preDefKmerIntersection {
   def main(args: Array[String]) {

  val sparkConf = new SparkConf().setAppName("preDefKmer-intersect")
  val sc = new SparkContext(sparkConf)
 import sqlContext.createSchemaRDD
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 val bedFile = sc.textFile("s3n://a/b/c",40)
  val hgfasta = sc.textFile("hdfs://a/b/c",40)
  val hgPair = hgfasta.map(_.split (",")).map(a=> (a(0),
 a(1).trim().toInt))
  val filtered = hgPair.filter(kv => kv._2 == 1)
  val bedPair = bedFile.map(_.split (",")).map(a=>
 (a(0), a(1).trim().toInt))
  val joinRDD = bedPair.join(filtered)
 val ty = 

Can a DataFrame be saved to s3 directly using Parquet?

2015-03-25 Thread Stuart Layton
I'm trying to save a dataframe to s3 as a parquet file but I'm getting
Wrong FS errors

>>> df.saveAsParquetFile(parquetFile)
15/03/25 18:56:10 INFO storage.MemoryStore: ensureFreeSpace(46645) called
with curMem=82744, maxMem=278302556
15/03/25 18:56:10 INFO storage.MemoryStore: Block broadcast_5 stored as
values in memory (estimated size 45.6 KB, free 265.3 MB)
15/03/25 18:56:10 INFO storage.MemoryStore: ensureFreeSpace(7078) called
with curMem=129389, maxMem=278302556
15/03/25 18:56:10 INFO storage.MemoryStore: Block broadcast_5_piece0 stored
as bytes in memory (estimated size 6.9 KB, free 265.3 MB)
15/03/25 18:56:10 INFO storage.BlockManagerInfo: Added broadcast_5_piece0
in memory on ip-172-31-1-219.ec2.internal:58280 (size: 6.9 KB, free: 265.4
MB)
15/03/25 18:56:10 INFO storage.BlockManagerMaster: Updated info of block
broadcast_5_piece0
15/03/25 18:56:10 INFO spark.SparkContext: Created broadcast 5 from
textFile at JSONRelation.scala:98
Traceback (most recent call last):
  File "", line 1, in 
  File "/root/spark/python/pyspark/sql/dataframe.py", line 121, in
saveAsParquetFile
self._jdf.saveAsParquetFile(path)
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line
300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
o22.saveAsParquetFile.
: java.lang.IllegalArgumentException: Wrong FS:
s3n://com.my.bucket/spark-testing/, expected: hdfs://
ec2-52-0-159-113.compute-1.amazonaws.com:9000


Is it possible to save a dataframe to s3 directly using parquet?

-- 
Stuart Layton


Re: python : Out of memory: Kill process

2015-03-25 Thread Davies Liu
With batchSize = 1, I think it will become even worse.

I'd suggest going with 1.3 and getting a taste of the new DataFrame API.

On Wed, Mar 25, 2015 at 11:49 AM, Eduardo Cusa
 wrote:
> Hi Davies, I running 1.1.0.
>
> Now I'm following this thread that recommend use batchsize parameter = 1
>
>
> http://apache-spark-user-list.1001560.n3.nabble.com/pySpark-memory-usage-td3022.html
>
> if this does not work I will install  1.2.1 or  1.3
>
> Regards
>
>
>
>
>
>
> On Wed, Mar 25, 2015 at 3:39 PM, Davies Liu  wrote:
>>
>> What's the version of Spark you are running?
>>
>> There is a bug in SQL Python API [1], it's fixed in 1.2.1 and 1.3,
>>
>> [1] https://issues.apache.org/jira/browse/SPARK-6055
>>
>> On Wed, Mar 25, 2015 at 10:33 AM, Eduardo Cusa
>>  wrote:
>> > Hi Guys, I running the following function with spark-submmit and de SO
>> > is
>> > killing my process :
>> >
>> >
>> >   def getRdd(self,date,provider):
>> > path='s3n://'+AWS_BUCKET+'/'+date+'/*.log.gz'
>> > log2= self.sqlContext.jsonFile(path)
>> > log2.registerTempTable('log_test')
>> > log2.cache()
>>
>> You only visit the table once, cache does not help here.
>>
>> > out=self.sqlContext.sql("SELECT user, tax from log_test where
>> > provider =
>> > '"+provider+"'and country <> ''").map(lambda row: (row.user, row.tax))
>> > print "out1"
>> > return  map((lambda (x,y): (x, list(y))),
>> > sorted(out.groupByKey(2000).collect()))
>>
>> 100 partitions (or less) will be enough for 2G dataset.
>>
>> >
>> >
>> > The input dataset has 57 zip files (2 GB)
>> >
>> > The same process with a smaller dataset completed successfully
>> >
>> > Any ideas to debug is welcome.
>> >
>> > Regards
>> > Eduardo
>> >
>> >
>
>




Re: column expression in left outer join for DataFrame

2015-03-25 Thread S Krishna
Hi,

Thanks for your response.  I am not clear about why the query is ambiguous.

val both = df_2.join(df_1, df_2("country")===df_1("country"), "left_outer")

I thought df_2("country")===df_1("country") indicates that the country
field in the 2 dataframes should match and df_2("country") is the
equivalent of df_2.country in SQL, while  df_1("country") is the equivalent
of df_1.country in SQL. So I am not sure why it is ambiguous. In Spark
1.2.0 I have used the same logic using SparkSQL  and Tables ( e.g.  "WHERE
tab1.country = tab2.country")  and had no problems getting the correct
result.

thanks





On Wed, Mar 25, 2015 at 11:05 AM, Michael Armbrust 
wrote:

> Unfortunately you are now hitting a bug (that is fixed in master and will
> be released in 1.3.1 hopefully next week).  However, even with that your
> query is still ambiguous and you will need to use aliases:
>
> val df_1 = df.filter( df("event") === 0)
>   . select("country", "cnt").as("a")
> val df_2 = df.filter( df("event") === 3)
>   . select("country", "cnt").as("b")
> val both = df_2.join(df_1, $"a.country" === $"b.country", "left_outer")
>
>
>
> On Tue, Mar 24, 2015 at 11:57 PM, S Krishna  wrote:
>
>> Hi,
>>
>> Thanks for your response. I modified my code as per your suggestion, but
>> now I am getting a runtime error. Here's my code:
>>
>> val df_1 = df.filter( df("event") === 0)
>>   . select("country", "cnt")
>>
>> val df_2 = df.filter( df("event") === 3)
>>   . select("country", "cnt")
>>
>> df_1.show()
>> //produces the following output :
>> // countrycnt
>> //   tw   3000
>> //   uk   2000
>> //   us   1000
>>
>> df_2.show()
>> //produces the following output :
>> // countrycnt
>> //   tw   25
>> //   uk   200
>> //   us   95
>>
>> val both = df_2.join(df_1, df_2("country")===df_1("country"),
>> "left_outer")
>>
>> I am getting the following error when executing the join statement:
>>
>> java.util.NoSuchElementException: next on empty iterator.
>>
>> This error seems to be originating at DataFrame.join (line 133 in
>> DataFrame.scala).
>>
>> The show() results show that both dataframes do have columns named
>> "country" and that they are non-empty. I also tried the simpler join ( i.e.
>> df_2.join(df_1) ) and got the same error stated above.
>>
>> I would like to know what is wrong with the join statement above.
>>
>> thanks
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Mar 24, 2015 at 6:08 PM, Michael Armbrust > > wrote:
>>
>>> You need to use `===`, so that you are constructing a column expression
>>> instead of evaluating the standard scala equality method.  Calling methods
>>> to access columns (i.e. df.county is only supported in python).
>>>
>>> val join_df =  df1.join( df2, df1("country") === df2("country"),
>>> "left_outer")
>>>
>>> On Tue, Mar 24, 2015 at 5:50 PM, SK  wrote:
>>>
 Hi,

 I am trying to port some code that was working in Spark 1.2.0 on the
 latest
 version, Spark 1.3.0. This code involves a left outer join between two
 SchemaRDDs which I am now trying to change to a left outer join between
 2
 DataFrames. I followed the example  for left outer join of DataFrame at

 https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

 Here's my code, where df1 and df2 are the 2 dataframes I am joining on
 the
 "country" field:

  val join_df =  df1.join( df2,  df1.country == df2.country,
 "left_outer")

 But I got a compilation error that value  country is not a member of
 sql.DataFrame

 I  also tried the following:
  val join_df =  df1.join( df2, df1("country") == df2("country"),
 "left_outer")

 I got a compilation error that it is a Boolean whereas a Column is
 required.

 So what is the correct Column expression I need to provide for joining
 the 2
 dataframes on a specific field ?

 thanks








 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/column-expression-in-left-outer-join-for-DataFrame-tp22209.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>
>>
>


Re: python : Out of memory: Kill process

2015-03-25 Thread Eduardo Cusa
Hi Davies, I am running 1.1.0.

Now I'm following this thread, which recommends using the batchsize parameter = 1:


http://apache-spark-user-list.1001560.n3.nabble.com/pySpark-memory-usage-td3022.html

If this does not work I will install 1.2.1 or 1.3.

Regards






On Wed, Mar 25, 2015 at 3:39 PM, Davies Liu  wrote:

> What's the version of Spark you are running?
>
> There is a bug in SQL Python API [1], it's fixed in 1.2.1 and 1.3,
>
> [1] https://issues.apache.org/jira/browse/SPARK-6055
>
> On Wed, Mar 25, 2015 at 10:33 AM, Eduardo Cusa
>  wrote:
> > Hi Guys, I running the following function with spark-submmit and de SO is
> > killing my process :
> >
> >
> >   def getRdd(self,date,provider):
> > path='s3n://'+AWS_BUCKET+'/'+date+'/*.log.gz'
> > log2= self.sqlContext.jsonFile(path)
> > log2.registerTempTable('log_test')
> > log2.cache()
>
> You only visit the table once, cache does not help here.
>
> > out=self.sqlContext.sql("SELECT user, tax from log_test where
> provider =
> > '"+provider+"'and country <> ''").map(lambda row: (row.user, row.tax))
> > print "out1"
> > return  map((lambda (x,y): (x, list(y))),
> > sorted(out.groupByKey(2000).collect()))
>
> 100 partitions (or less) will be enough for 2G dataset.
>
> >
> >
> > The input dataset has 57 zip files (2 GB)
> >
> > The same process with a smaller dataset completed successfully
> >
> > Any ideas to debug is welcome.
> >
> > Regards
> > Eduardo
> >
> >
>


Re: java.lang.OutOfMemoryError: unable to create new native thread

2015-03-25 Thread Matt Silvey
This is a different kind of error.  Thomas' OOM error was specific to the
kernel refusing to create another thread/process for his application.

Matthew

On Wed, Mar 25, 2015 at 10:51 AM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:

> I have a YARN cluster where the max memory allowed is 16GB. I set 12G for
> my driver, however i see OutOFMemory error even for this program
> http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hive-tables
> . What do you suggest ?
>
> On Wed, Mar 25, 2015 at 8:23 AM, Thomas Gerber 
> wrote:
>
>> So,
>>
>> 1. I reduced my  -XX:ThreadStackSize to 5m (instead of 10m - default is
>> 1m), which is still OK for my need.
>> 2. I reduced the executor memory to 44GB for a 60GB machine (instead of
>> 49GB).
>>
>> This seems to have helped. Thanks to Matthew and Sean.
>>
>> Thomas
>>
>> On Tue, Mar 24, 2015 at 3:49 PM, Matt Silvey 
>> wrote:
>>
>>> My memory is hazy on this but aren't there "hidden" limitations to
>>> Linux-based threads?  I ran into some issues a couple of years ago where,
>>> and here is the fuzzy part, the kernel wants to reserve virtual memory per
>>> thread equal to the stack size.  When the total amount of reserved memory
>>> (not necessarily resident memory) exceeds the memory of the system it
>>> throws an OOM.  I'm looking for material to back this up.  Sorry for the
>>> initial vague response.
>>>
>>> Matthew
>>>
>>> On Tue, Mar 24, 2015 at 12:53 PM, Thomas Gerber <
>>> thomas.ger...@radius.com> wrote:
>>>
 Additional notes:
 I did not find anything wrong with the number of threads (ps -u USER -L
 | wc -l): around 780 on the master and 400 on executors. I am running on
 100 r3.2xlarge.

 On Tue, Mar 24, 2015 at 12:38 PM, Thomas Gerber <
 thomas.ger...@radius.com> wrote:

> Hello,
>
> I am seeing various crashes in spark on large jobs which all share a
> similar exception:
>
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:714)
>
> I increased nproc (i.e. ulimit -u) 10 fold, but it doesn't help.
>
> Does anyone know how to avoid those kinds of errors?
>
> Noteworthy: I added -XX:ThreadStackSize=10m on both driver and
> executor extra java options, which might have amplified the problem.
>
> Thanks for you help,
> Thomas
>


>>>
>>
>
>
> --
> Deepak
>
>


Re: What are the best options for quickly filtering a DataFrame on a single column?

2015-03-25 Thread Stuart Layton
Thanks for the response. I was using IN as an example of the type of
operation I need to do. Is there another way to do this that lines up more
naturally with the way things are supposed to be done in Spark SQL?

On Wed, Mar 25, 2015 at 2:29 PM, Michael Armbrust 
wrote:

> The only way to do "in" using python currently is to use the string based
> filter API (where you pass us an expression as a string, and we parse it
> using our SQL parser).
>
> from pyspark.sql import Row
> from pyspark.sql.functions import *
>
> df = sc.parallelize([Row(name="test")]).toDF()
> df.filter("name in ('a', 'b')").collect()
> Out[1]: []
>
> df.filter("name in ('test')").collect()
> Out[2]: [Row(name=u'test')]
>
> In general you want to avoid lambda functions whenever you can do the same
> thing a dataframe expression.  This is because your lambda function is a
> black box that we cannot optimize (though you should certainly use them for
> the advanced stuff that expressions can't handle).
>
> I opened SPARK-6536  to
> provide a nicer interface for this.
>
>
> On Wed, Mar 25, 2015 at 7:41 AM, Stuart Layton 
> wrote:
>
>> I have a SparkSQL dataframe with a a few billion rows that I need to
>> quickly filter down to a few hundred thousand rows, using an operation like
>> (syntax may not be correct)
>>
>> df = df[ df.filter(lambda x: x.key_col in approved_keys)]
>>
>> I was thinking about serializing the data using parquet and saving it to
>> S3, however as I want to optimize for filtering speed I'm not sure this is
>> the best option.
>>
>> --
>> Stuart Layton
>>
>
>


-- 
Stuart Layton


Re: python : Out of memory: Kill process

2015-03-25 Thread Davies Liu
What's the version of Spark you are running?

There is a bug in SQL Python API [1], it's fixed in 1.2.1 and 1.3,

[1] https://issues.apache.org/jira/browse/SPARK-6055

On Wed, Mar 25, 2015 at 10:33 AM, Eduardo Cusa
 wrote:
> Hi Guys, I running the following function with spark-submmit and de SO is
> killing my process :
>
>
>   def getRdd(self,date,provider):
> path='s3n://'+AWS_BUCKET+'/'+date+'/*.log.gz'
> log2= self.sqlContext.jsonFile(path)
> log2.registerTempTable('log_test')
> log2.cache()

You only visit the table once, cache does not help here.

> out=self.sqlContext.sql("SELECT user, tax from log_test where provider =
> '"+provider+"'and country <> ''").map(lambda row: (row.user, row.tax))
> print "out1"
> return  map((lambda (x,y): (x, list(y))),
> sorted(out.groupByKey(2000).collect()))

100 partitions (or less) will be enough for 2G dataset.

>
>
> The input dataset has 57 zip files (2 GB)
>
> The same process with a smaller dataset completed successfully
>
> Any ideas to debug is welcome.
>
> Regards
> Eduardo
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: What are the best options for quickly filtering a DataFrame on a single column?

2015-03-25 Thread Michael Armbrust
The only way to do "in" using python currently is to use the string based
filter API (where you pass us an expression as a string, and we parse it
using our SQL parser).

from pyspark.sql import Row
from pyspark.sql.functions import *

df = sc.parallelize([Row(name="test")]).toDF()
df.filter("name in ('a', 'b')").collect()
Out[1]: []

df.filter("name in ('test')").collect()
Out[2]: [Row(name=u'test')]

In general you want to avoid lambda functions whenever you can do the same
thing with a dataframe expression.  This is because your lambda function is a
black box that we cannot optimize (though you should certainly use them for
the advanced stuff that expressions can't handle).

I opened SPARK-6536  to
provide a nicer interface for this.
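
For what it's worth, both styles exist on the Scala side as well; a minimal sketch, assuming a DataFrame named df with the column used above:

// string-based filter, parsed by the SQL parser
df.filter("name in ('a', 'b')")
// column-expression filter, which the optimizer can also see through
df.filter(df("name") === "test")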


On Wed, Mar 25, 2015 at 7:41 AM, Stuart Layton 
wrote:

> I have a SparkSQL dataframe with a a few billion rows that I need to
> quickly filter down to a few hundred thousand rows, using an operation like
> (syntax may not be correct)
>
> df = df[ df.filter(lambda x: x.key_col in approved_keys)]
>
> I was thinking about serializing the data using parquet and saving it to
> S3, however as I want to optimize for filtering speed I'm not sure this is
> the best option.
>
> --
> Stuart Layton
>


Re: OutOfMemoryError when using DataFrame created by Spark SQL

2015-03-25 Thread Michael Armbrust
You should also try increasing the perm gen size: -XX:MaxPermSize=512m
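
One way to pass that flag through, as a sketch (the property names are from the standard Spark configuration; 512m is just the value suggested above):

# spark-defaults.conf
spark.driver.extraJavaOptions    -XX:MaxPermSize=512m
spark.executor.extraJavaOptions  -XX:MaxPermSize=512m

For the driver in client mode the same flag can also be passed on the command line via spark-submit's --driver-java-options.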

On Wed, Mar 25, 2015 at 2:37 AM, Ted Yu  wrote:

> Can you try giving Spark driver more heap ?
>
> Cheers
>
>
>
> On Mar 25, 2015, at 2:14 AM, Todd Leo  wrote:
>
> Hi,
>
> I am using *Spark SQL* to query on my *Hive cluster*, following Spark SQL
> and DataFrame Guide
>  step by
> step. However, my HiveQL via sqlContext.sql() fails and
> java.lang.OutOfMemoryError was raised. The expected result of such query is
> considered to be small (by adding limit 1000 clause). My code is shown
> below:
>
> scala> import sqlContext.implicits._
> scala> val df = sqlContext.sql("""select * from some_table where 
> logdate="2015-03-24" limit 1000""")
>
> and the error msg:
>
> [ERROR] [03/25/2015 16:08:22.379] [sparkDriver-scheduler-27] 
> [ActorSystem(sparkDriver)] Uncaught fatal error from thread 
> [sparkDriver-scheduler-27] shutting down ActorSystem [sparkDriver]
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>
> the master heap memory is set by -Xms512m -Xmx512m, while workers set by 
> -Xms4096M
> -Xmx4096M, which I presume sufficient for this trivial query.
>
> Additionally, after restarted the spark-shell and re-run the limit 5 query
> , the df object is returned and can be printed by df.show(), but other
> APIs fails on OutOfMemoryError, namely, df.count(),
> df.select("some_field").show() and so forth.
>
> I understand that the RDD can be collected to master hence further
> transmutations can be applied, as DataFrame has “richer optimizations under
> the hood” and the convention from an R/julia user, I really hope this error
> is able to be tackled, and DataFrame is robust enough to depend.
>
> Thanks in advance!
>
> REGARDS,
> Todd
> ​
>
> --
> View this message in context: OutOfMemoryError when using DataFrame
> created by Spark SQL
> 
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>
>


Re: OOM for HiveFromSpark example

2015-03-25 Thread Zhan Zhang
I solved this by increasing the PermGen memory size in the driver:

-XX:MaxPermSize=512m

Thanks.

Zhan Zhang

On Mar 25, 2015, at 10:54 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) 
mailto:deepuj...@gmail.com>> wrote:

I am facing same issue, posted a new thread. Please respond.

On Wed, Jan 14, 2015 at 4:38 AM, Zhan Zhang 
mailto:zzh...@hortonworks.com>> wrote:
Hi Folks,

I am trying to run hive context in yarn-cluster mode, but met some error. Does 
anybody know what cause the issue.

I use following cmd to build the distribution:

 ./make-distribution.sh -Phive -Phive-thriftserver  -Pyarn  -Phadoop-2.4

15/01/13 17:59:42 INFO cluster.YarnClusterScheduler: 
YarnClusterScheduler.postStartHook done
15/01/13 17:59:42 INFO storage.BlockManagerMasterActor: Registering block 
manager 
cn122-10.l42scl.hortonworks.com:56157
 with 1589.8 MB RAM, BlockManagerId(2, 
cn122-10.l42scl.hortonworks.com, 56157)
15/01/13 17:59:43 INFO parse.ParseDriver: Parsing command: CREATE TABLE IF NOT 
EXISTS src (key INT, value STRING)
15/01/13 17:59:43 INFO parse.ParseDriver: Parse Completed
15/01/13 17:59:44 INFO metastore.HiveMetaStore: 0: Opening raw store with 
implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/01/13 17:59:44 INFO metastore.ObjectStore: ObjectStore, initialize called
15/01/13 17:59:44 INFO DataNucleus.Persistence: Property 
datanucleus.cache.level2 unknown - will be ignored
15/01/13 17:59:44 INFO DataNucleus.Persistence: Property 
hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/01/13 17:59:44 WARN DataNucleus.Connection: BoneCP specified but not present 
in CLASSPATH (or one of dependencies)
15/01/13 17:59:44 WARN DataNucleus.Connection: BoneCP specified but not present 
in CLASSPATH (or one of dependencies)
15/01/13 17:59:52 INFO metastore.ObjectStore: Setting MetaStore object pin 
classes with 
hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
15/01/13 17:59:52 INFO metastore.MetaStoreDirectSql: MySQL check failed, 
assuming we are not on mysql: Lexical error at line 1, column 5.  Encountered: 
"@" (64), after : "".
15/01/13 17:59:53 INFO DataNucleus.Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
"embedded-only" so does not have its own datastore table.
15/01/13 17:59:53 INFO DataNucleus.Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so 
does not have its own datastore table.
15/01/13 17:59:59 INFO DataNucleus.Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
"embedded-only" so does not have its own datastore table.
15/01/13 17:59:59 INFO DataNucleus.Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so 
does not have its own datastore table.
15/01/13 18:00:00 INFO metastore.ObjectStore: Initialized ObjectStore
15/01/13 18:00:00 WARN metastore.ObjectStore: Version information not found in 
metastore. hive.metastore.schema.verification is not enabled so recording the 
schema version 0.13.1aa
15/01/13 18:00:01 INFO metastore.HiveMetaStore: Added admin role in metastore
15/01/13 18:00:01 INFO metastore.HiveMetaStore: Added public role in metastore
15/01/13 18:00:01 INFO metastore.HiveMetaStore: No user is added in admin role, 
since config is empty
15/01/13 18:00:01 INFO session.SessionState: No Tez session required at this 
point. hive.execution.engine=mr.
15/01/13 18:00:02 INFO log.PerfLogger: 
15/01/13 18:00:02 INFO log.PerfLogger: 
15/01/13 18:00:02 INFO ql.Driver: Concurrency mode is disabled, not creating a 
lock manager
15/01/13 18:00:02 INFO log.PerfLogger: 
15/01/13 18:00:03 INFO log.PerfLogger: 
15/01/13 18:00:03 INFO parse.ParseDriver: Parsing command: CREATE TABLE IF NOT 
EXISTS src (key INT, value STRING)
15/01/13 18:00:03 INFO parse.ParseDriver: Parse Completed
15/01/13 18:00:03 INFO log.PerfLogger: 
15/01/13 18:00:03 INFO log.PerfLogger: 
15/01/13 18:00:03 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
15/01/13 18:00:03 INFO parse.SemanticAnalyzer: Creating table src position=27
15/01/13 18:00:03 INFO metastore.HiveMetaStore: 0: get_table : db=default 
tbl=src
15/01/13 18:00:03 INFO HiveMetaStore.audit: ugi=zzhang  ip=unknown-ip-addr  
cmd=get_table : db=default tbl=src
15/01/13 18:00:03 INFO metastore.HiveMetaStore: 0: get_database: default
15/01/13 18:00:03 INFO HiveMetaStore.audit: ugi=zzhang  ip=unknown-ip-addr  
cmd=get_database: default
15/01/13 18:00:03 INFO ql.Driver: Semantic Analysis Completed
15/01/13 18:00:03 INFO log.PerfLogger: 
15/01/13 18:00:03 INFO ql.Driver: Returning Hive schema: 
Schema(fieldSchemas:null, properties:null)
15/01/13 18:00:03 INFO log.PerfLogger: 
15/01/13 18:00:03 INFO log.PerfLogger: 
15/01/13 18:00:03 INFO ql.Driver: Starting command: CREATE TABLE IF NOT EXISTS 
src (key INT, value STRING)
15/01/13 18:00:03 INFO log

Re: column expression in left outer join for DataFrame

2015-03-25 Thread Michael Armbrust
Unfortunately you are now hitting a bug (that is fixed in master and will
be released in 1.3.1 hopefully next week).  However, even with that your
query is still ambiguous and you will need to use aliases:

val df_1 = df.filter( df("event") === 0)
  . select("country", "cnt").as("a")
val df_2 = df.filter( df("event") === 3)
  . select("country", "cnt").as("b")
val both = df_2.join(df_1, $"a.country" === $"b.country", "left_outer")
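
(One assumption about the surrounding session: the $"..." column syntax comes from the SQLContext implicits, so they need to be in scope.)

import sqlContext.implicits._   // enables the $"column" interpolator in Spark 1.3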



On Tue, Mar 24, 2015 at 11:57 PM, S Krishna  wrote:

> Hi,
>
> Thanks for your response. I modified my code as per your suggestion, but
> now I am getting a runtime error. Here's my code:
>
> val df_1 = df.filter( df("event") === 0)
>   . select("country", "cnt")
>
> val df_2 = df.filter( df("event") === 3)
>   . select("country", "cnt")
>
> df_1.show()
> //produces the following output :
> // countrycnt
> //   tw   3000
> //   uk   2000
> //   us   1000
>
> df_2.show()
> //produces the following output :
> // countrycnt
> //   tw   25
> //   uk   200
> //   us   95
>
> val both = df_2.join(df_1, df_2("country")===df_1("country"), "left_outer")
>
> I am getting the following error when executing the join statement:
>
> java.util.NoSuchElementException: next on empty iterator.
>
> This error seems to be originating at DataFrame.join (line 133 in
> DataFrame.scala).
>
> The show() results show that both dataframes do have columns named
> "country" and that they are non-empty. I also tried the simpler join ( i.e.
> df_2.join(df_1) ) and got the same error stated above.
>
> I would like to know what is wrong with the join statement above.
>
> thanks
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> On Tue, Mar 24, 2015 at 6:08 PM, Michael Armbrust 
> wrote:
>
>> You need to use `===`, so that you are constructing a column expression
>> instead of evaluating the standard scala equality method.  Calling methods
>> to access columns (i.e. df.county is only supported in python).
>>
>> val join_df =  df1.join( df2, df1("country") === df2("country"),
>> "left_outer")
>>
>> On Tue, Mar 24, 2015 at 5:50 PM, SK  wrote:
>>
>>> Hi,
>>>
>>> I am trying to port some code that was working in Spark 1.2.0 on the
>>> latest
>>> version, Spark 1.3.0. This code involves a left outer join between two
>>> SchemaRDDs which I am now trying to change to a left outer join between 2
>>> DataFrames. I followed the example  for left outer join of DataFrame at
>>>
>>> https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
>>>
>>> Here's my code, where df1 and df2 are the 2 dataframes I am joining on
>>> the
>>> "country" field:
>>>
>>>  val join_df =  df1.join( df2,  df1.country == df2.country, "left_outer")
>>>
>>> But I got a compilation error that value  country is not a member of
>>> sql.DataFrame
>>>
>>> I  also tried the following:
>>>  val join_df =  df1.join( df2, df1("country") == df2("country"),
>>> "left_outer")
>>>
>>> I got a compilation error that it is a Boolean whereas a Column is
>>> required.
>>>
>>> So what is the correct Column expression I need to provide for joining
>>> the 2
>>> dataframes on a specific field ?
>>>
>>> thanks
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/column-expression-in-left-outer-join-for-DataFrame-tp22209.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: OutOfMemory : Java heap space error

2015-03-25 Thread ๏̯͡๏
I am facing the same issue and have posted a new thread. Please respond.

On Wed, Jul 9, 2014 at 1:56 AM, Rahul Bhojwani 
wrote:

> Hi,
>
> My code was running properly but then it suddenly gave this error. Can you
> just put some light on it.
>
> ###
> 0 KB, free: 38.7 MB)
> 14/07/09 01:46:12 INFO BlockManagerMaster: Updated info of block rdd_2212_4
> 14/07/09 01:46:13 INFO PythonRDD: Times: total = 1486, boot = 698, init =
> 626, finish = 162
> Exception in thread "stdin writer for python" 14/07/09 01:46:14 INFO
> MemoryStore: ensureFreeSpace(61480) called with cur
> Mem=270794224, maxMem=311387750
> java.lang.OutOfMemoryError: Java heap space
> at java.io.BufferedOutputStream.(Unknown Source)
> at
> org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:62)
> 14/07/09 01:46:15 INFO MemoryStore: Block rdd_2212_0 stored as values to
> memory (estimated size 60.0 KB, free 38.7 MB)
> Exception in thread "stdin writer for python" java.lang.OutOfMemoryError:
> Java heap space
> at java.io.BufferedOutputStream.(Unknown Source)
> at
> org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:62)
> 14/07/09 01:46:18 INFO BlockManagerMasterActor$BlockManagerInfo: Added
> rdd_2212_0 in memory on shawn-PC:51451 (size: 60.
> 0 KB, free: 38.7 MB)
> PySpark worker failed with exception:
> Traceback (most recent call last):
>   File "F:\spark-0.9.1\spark-0.9.1\bin\..\/python/pyspark/worker.py", line
> 50, in main
> split_index = read_int(infile)
>   File "F:\spark-0.9.1\spark-0.9.1\python\pyspark\serializers.py", line
> 328, in read_int
> raise EOFError
> EOFError
>
> 14/07/09 01:46:25 INFO BlockManagerMaster: Updated info of block rdd_2212_0
> Exception in thread "stdin writer for python" java.lang.OutOfMemoryError:
> Java heap space
> PySpark worker failed with exception:
> Traceback (most recent call last):
>   File "F:\spark-0.9.1\spark-0.9.1\bin\..\/python/pyspark/worker.py", line
> 50, in main
> split_index = read_int(infile)
>   File "F:\spark-0.9.1\spark-0.9.1\python\pyspark\serializers.py", line
> 328, in read_int
> raise EOFError
> EOFError
>
> Exception in thread "Executor task launch worker-3"
> java.lang.OutOfMemoryError: Java heap space
>
> Exception: java.lang.OutOfMemoryError thrown from the
> UncaughtExceptionHandler in thread "spark-akka.actor.default-dispa
> tcher-15"
> Exception in thread "Executor task launch worker-1" Exception in thread
> "Executor task launch worker-2" java.lang.OutOfM
> emoryError: Java heap space
> java.lang.OutOfMemoryError: Java heap space
> Exception in thread "Executor task launch worker-0" Exception in thread
> "Executor task launch worker-5" java.lang.OutOfM
> emoryError: Java heap space
> java.lang.OutOfMemoryError: Java heap space
> 14/07/09 01:46:52 WARN BlockManagerMaster: Error sending message to
> BlockManagerMaster in 1 attempts
> akka.pattern.AskTimeoutException:
> Recipient[Actor[akka://spark/user/BlockManagerMaster#920823400]] had
> already been term
> inated.
> at
> akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134)
> at
> org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:161)
> at
> org.apache.spark.storage.BlockManagerMaster.sendHeartBeat(BlockManagerMaster.scala:52)
> at org.apache.spark.storage.BlockManager.org
> $apache$spark$storage$BlockManager$$heartBeat(BlockManager.scala:97)
>
> at
> org.apache.spark.storage.BlockManager$$anonfun$initialize$1.apply$mcV$sp(BlockManager.scala:135)
> at akka.actor.Scheduler$$anon$9.run(Scheduler.scala:80)
> at
> akka.actor.LightArrayRevolverScheduler$$anon$3$$anon$2.run(Scheduler.scala:241)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
> Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
> Source)
> at java.lang.Thread.run(Unknown Source)
> 14/07/09 01:46:56 WARN BlockManagerMaster: Error sending message to
> BlockManagerMaster in 2 attempts
> akka.pattern.AskTimeoutException:
> Recipient[Actor[akka://spark/user/BlockManagerMaster#920823400]] had
> already been term
> inated.
> at
> akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134)
> at
> org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:161)
> at
> org.apache.spark.storage.BlockManagerMaster.sendHeartBeat(BlockManagerMaster.scala:52)
> at org.apache.spark.storage.BlockManager.org
> $apache$spark$storage$BlockManager$$heartBeat(BlockManager.scala:97)
>
> at
> org.apache.spark.storage.BlockManager$$anonfun$initialize$1.apply$mcV$sp(BlockManager.scala:135)
> at akka.actor.Scheduler$$anon$9.run(Scheduler.scala:80)
> at
> akka.actor.LightArrayRevolverScheduler$$anon$3$$anon$2.run(Scheduler.scala:241)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
> Source)
> at java.util.concurrent.ThreadPoolE

Re: OOM for HiveFromSpark example

2015-03-25 Thread ๏̯͡๏
I am facing the same issue and have posted a new thread. Please respond.

On Wed, Jan 14, 2015 at 4:38 AM, Zhan Zhang  wrote:

> Hi Folks,
>
> I am trying to run hive context in yarn-cluster mode, but met some error.
> Does anybody know what cause the issue.
>
> I use following cmd to build the distribution:
>
>  ./make-distribution.sh -Phive -Phive-thriftserver  -Pyarn  -Phadoop-2.4
>
> 15/01/13 17:59:42 INFO cluster.YarnClusterScheduler:
> YarnClusterScheduler.postStartHook done
> 15/01/13 17:59:42 INFO storage.BlockManagerMasterActor: Registering block
> manager cn122-10.l42scl.hortonworks.com:56157 with 1589.8 MB RAM,
> BlockManagerId(2, cn122-10.l42scl.hortonworks.com, 56157)
> 15/01/13 17:59:43 INFO parse.ParseDriver: Parsing command: CREATE TABLE IF
> NOT EXISTS src (key INT, value STRING)
> 15/01/13 17:59:43 INFO parse.ParseDriver: Parse Completed
> 15/01/13 17:59:44 INFO metastore.HiveMetaStore: 0: Opening raw store with
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 15/01/13 17:59:44 INFO metastore.ObjectStore: ObjectStore, initialize
> called
> 15/01/13 17:59:44 INFO DataNucleus.Persistence: Property
> datanucleus.cache.level2 unknown - will be ignored
> 15/01/13 17:59:44 INFO DataNucleus.Persistence: Property
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 15/01/13 17:59:44 WARN DataNucleus.Connection: BoneCP specified but not
> present in CLASSPATH (or one of dependencies)
> 15/01/13 17:59:44 WARN DataNucleus.Connection: BoneCP specified but not
> present in CLASSPATH (or one of dependencies)
> 15/01/13 17:59:52 INFO metastore.ObjectStore: Setting MetaStore object pin
> classes with
> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
> 15/01/13 17:59:52 INFO metastore.MetaStoreDirectSql: MySQL check failed,
> assuming we are not on mysql: Lexical error at line 1, column 5.
> Encountered: "@" (64), after : "".
> 15/01/13 17:59:53 INFO DataNucleus.Datastore: The class
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
> "embedded-only" so does not have its own datastore table.
> 15/01/13 17:59:53 INFO DataNucleus.Datastore: The class
> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
> "embedded-only" so does not have its own datastore table.
> 15/01/13 17:59:59 INFO DataNucleus.Datastore: The class
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
> "embedded-only" so does not have its own datastore table.
> 15/01/13 17:59:59 INFO DataNucleus.Datastore: The class
> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as
> "embedded-only" so does not have its own datastore table.
> 15/01/13 18:00:00 INFO metastore.ObjectStore: Initialized ObjectStore
> 15/01/13 18:00:00 WARN metastore.ObjectStore: Version information not
> found in metastore. hive.metastore.schema.verification is not enabled so
> recording the schema version 0.13.1aa
> 15/01/13 18:00:01 INFO metastore.HiveMetaStore: Added admin role in
> metastore
> 15/01/13 18:00:01 INFO metastore.HiveMetaStore: Added public role in
> metastore
> 15/01/13 18:00:01 INFO metastore.HiveMetaStore: No user is added in admin
> role, since config is empty
> 15/01/13 18:00:01 INFO session.SessionState: No Tez session required at
> this point. hive.execution.engine=mr.
> 15/01/13 18:00:02 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/01/13 18:00:02 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/01/13 18:00:02 INFO ql.Driver: Concurrency mode is disabled, not
> creating a lock manager
> 15/01/13 18:00:02 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/01/13 18:00:03 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/01/13 18:00:03 INFO parse.ParseDriver: Parsing command: CREATE TABLE IF
> NOT EXISTS src (key INT, value STRING)
> 15/01/13 18:00:03 INFO parse.ParseDriver: Parse Completed
> 15/01/13 18:00:03 INFO log.PerfLogger:  start=1421190003030 end=1421190003031 duration=1
> from=org.apache.hadoop.hive.ql.Driver>
> 15/01/13 18:00:03 INFO log.PerfLogger:  from=org.apache.hadoop.hive.ql.Driver>
> 15/01/13 18:00:03 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
> 15/01/13 18:00:03 INFO parse.SemanticAnalyzer: Creating table src
> position=27
> 15/01/13 18:00:03 INFO metastore.HiveMetaStore: 0: get_table : db=default
> tbl=src
> 15/01/13 18:00:03 INFO HiveMetaStore.audit: ugi=zzhang
> ip=unknown-ip-addr  cmd=get_table : db=default tbl=src
> 15/01/13 18:00:03 INFO metastore.HiveMetaStore: 0: get_database: default
> 15/01/13 18:00:03 INFO HiveMetaStore.audit: ugi=zzhang
> ip=unknown-ip-addr  cmd=get_database: default
> 15/01/13 18:00:03 INFO ql.Driver: Semantic Analysis Completed
> 15/01/13 18:00:03 INFO log.PerfLogger:  start=1421190003031 end=1421190003406 duration=375
> from=org.apache.hadoop.hive.ql.Driver>
> 15/01/13 18:00:03 INFO ql.Driver: Returning Hive schema:
> Schema(fieldSchemas:null, properties:null)
> 15/01/13 18:00

Re: java.lang.OutOfMemoryError: unable to create new native thread

2015-03-25 Thread ๏̯͡๏
I have a YARN cluster where the max memory allowed is 16 GB. I set 12 GB for
my driver; however, I see an OutOfMemory error even for this program:
http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hive-tables
What do you suggest?

On Wed, Mar 25, 2015 at 8:23 AM, Thomas Gerber 
wrote:

> So,
>
> 1. I reduced my  -XX:ThreadStackSize to 5m (instead of 10m - default is
> 1m), which is still OK for my need.
> 2. I reduced the executor memory to 44GB for a 60GB machine (instead of
> 49GB).
>
> This seems to have helped. Thanks to Matthew and Sean.
>
> Thomas
>
> On Tue, Mar 24, 2015 at 3:49 PM, Matt Silvey 
> wrote:
>
>> My memory is hazy on this but aren't there "hidden" limitations to
>> Linux-based threads?  I ran into some issues a couple of years ago where,
>> and here is the fuzzy part, the kernel wants to reserve virtual memory per
>> thread equal to the stack size.  When the total amount of reserved memory
>> (not necessarily resident memory) exceeds the memory of the system it
>> throws an OOM.  I'm looking for material to back this up.  Sorry for the
>> initial vague response.
>>
>> Matthew
>>
>> On Tue, Mar 24, 2015 at 12:53 PM, Thomas Gerber > > wrote:
>>
>>> Additional notes:
>>> I did not find anything wrong with the number of threads (ps -u USER -L
>>> | wc -l): around 780 on the master and 400 on executors. I am running on
>>> 100 r3.2xlarge.
>>>
>>> On Tue, Mar 24, 2015 at 12:38 PM, Thomas Gerber <
>>> thomas.ger...@radius.com> wrote:
>>>
 Hello,

 I am seeing various crashes in spark on large jobs which all share a
 similar exception:

 java.lang.OutOfMemoryError: unable to create new native thread
 at java.lang.Thread.start0(Native Method)
 at java.lang.Thread.start(Thread.java:714)

 I increased nproc (i.e. ulimit -u) 10 fold, but it doesn't help.

 Does anyone know how to avoid those kinds of errors?

 Noteworthy: I added -XX:ThreadStackSize=10m on both driver and executor
 extra java options, which might have amplified the problem.

 Thanks for you help,
 Thomas

>>>
>>>
>>
>


-- 
Deepak


Unable to Hive program from Spark Programming Guide (OutOfMemoryError)

2015-03-25 Thread ๏̯͡๏
http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hive-tables



I modified the Hive query but ran into the same error. (
http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#hive-tables)


val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("CREATE TABLE IF NOT EXISTS src_spark (key INT, value
STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH
'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)



Command

./bin/spark-submit -v --master yarn-cluster --jars
/home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar
--num-executors 3 --driver-memory 8g --executor-memory 2g --executor-cores
1 --queue hdmi-express --class com.ebay.ep.poc.spark.reporting.SparkApp
spark_reporting-1.0-SNAPSHOT.jar startDate=2015-02-16 endDate=2015-02-16
input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro
subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2


Input

-sh-4.1$ ls -l examples/src/main/resources/kv1.txt
-rw-r--r-- 1 dvasthimal gid-dvasthimal 5812 Mar  5 17:31
examples/src/main/resources/kv1.txt
-sh-4.1$ head examples/src/main/resources/kv1.txt
238val_238
86val_86
311val_311
27val_27
165val_165
409val_409
255val_255
278val_278
98val_98
484val_484
-sh-4.1$

Log

/apache/hadoop/bin/yarn logs -applicationId application_1426715280024_82757

…
…
…


15/03/25 07:52:44 INFO metastore.HiveMetaStore: No user is added in admin
role, since config is empty
15/03/25 07:52:44 INFO session.SessionState: No Tez session required at
this point. hive.execution.engine=mr.
15/03/25 07:52:47 INFO parse.ParseDriver: Parsing command: CREATE TABLE IF
NOT EXISTS src_spark (key INT, value STRING)
15/03/25 07:52:47 INFO parse.ParseDriver: Parse Completed
15/03/25 07:52:48 INFO log.PerfLogger: 
15/03/25 07:52:48 INFO log.PerfLogger: 
15/03/25 07:52:48 INFO ql.Driver: Concurrency mode is disabled, not
creating a lock manager
15/03/25 07:52:48 INFO log.PerfLogger: 
15/03/25 07:52:48 INFO log.PerfLogger: 
15/03/25 07:52:48 INFO parse.ParseDriver: Parsing command: CREATE TABLE IF
NOT EXISTS src_spark (key INT, value STRING)
15/03/25 07:52:48 INFO parse.ParseDriver: Parse Completed
15/03/25 07:52:48 INFO log.PerfLogger: 
15/03/25 07:52:48 INFO log.PerfLogger: 
15/03/25 07:52:48 INFO parse.SemanticAnalyzer: Starting Semantic Analysis
15/03/25 07:52:48 INFO parse.SemanticAnalyzer: Creating table src_spark
position=27
15/03/25 07:52:48 INFO metastore.HiveMetaStore: 0: get_table : db=default
tbl=src_spark
15/03/25 07:52:48 INFO HiveMetaStore.audit: ugi=dvasthimal
ip=unknown-ip-addr cmd=get_table : db=default tbl=src_spark
15/03/25 07:52:48 INFO metastore.HiveMetaStore: 0: get_database: default
15/03/25 07:52:48 INFO HiveMetaStore.audit: ugi=dvasthimal
ip=unknown-ip-addr cmd=get_database: default
15/03/25 07:52:48 INFO ql.Driver: Semantic Analysis Completed
15/03/25 07:52:48 INFO log.PerfLogger: 
15/03/25 07:52:48 INFO ql.Driver: Returning Hive schema:
Schema(fieldSchemas:null, properties:null)
15/03/25 07:52:48 INFO log.PerfLogger: 
15/03/25 07:52:48 INFO log.PerfLogger: 
15/03/25 07:52:48 INFO ql.Driver: Starting command: CREATE TABLE IF NOT
EXISTS src_spark (key INT, value STRING)
15/03/25 07:52:48 INFO log.PerfLogger: 
15/03/25 07:52:48 INFO log.PerfLogger: 
15/03/25 07:52:48 INFO log.PerfLogger: 
15/03/25 07:52:51 INFO exec.DDLTask: Default to LazySimpleSerDe for table
src_spark
15/03/25 07:52:52 INFO metastore.HiveMetaStore: 0: create_table:
Table(tableName:src_spark, dbName:default, owner:dvasthimal,
createTime:1427295171, lastAccessTime:0, retention:0,
sd:StorageDescriptor(cols:[FieldSchema(name:key, type:int, comment:null),
FieldSchema(name:value, type:string, comment:null)], location:null,
inputFormat:org.apache.hadoop.mapred.TextInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
parameters:{serialization.format=1}), bucketCols:[], sortCols:[],
parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[],
skewedColValueLocationMaps:{}), storedAsSubDirectories:false),
partitionKeys:[], parameters:{}, viewOriginalText:null,
viewExpandedText:null, tableType:MANAGED_TABLE)
15/03/25 07:52:52 INFO HiveMetaStore.audit: ugi=dvasthimal
ip=unknown-ip-addr cmd=create_table: Table(tableName:src_spark,
dbName:default, owner:dvasthimal, createTime:1427295171, lastAccessTime:0,
retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:key, type:int,
comment:null), FieldSchema(name:value, type:string, comment:null)],
location:null, inpu

Re: Total size of serialized results is bigger than spark.driver.maxResultSize

2015-03-25 Thread Denny Lee
As you noted, you can change the spark.driver.maxResultSize value in your
Spark configuration (https://spark.apache.org/docs/1.2.0/configuration.html).
Please see the Spark Properties section, which notes that you can modify
these properties via spark-defaults.conf or via SparkConf().
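
A minimal sketch of the programmatic route (the 2g value is only illustrative; anything larger than the reported 1029.1 MB works, and 0 means unlimited). The same setting can instead go into spark-defaults.conf as "spark.driver.maxResultSize 2g":

import org.apache.spark.{SparkConf, SparkContext}

// must be set before the SparkContext is created
val conf = new SparkConf().set("spark.driver.maxResultSize", "2g")
val sc = new SparkContext(conf)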

HTH!



On Wed, Mar 25, 2015 at 8:01 AM Wang, Ningjun (LNG-NPV) <
ningjun.w...@lexisnexis.com> wrote:

>  Hi
>
>
>
> I ran a spark job and got the following error. Can anybody tell me how to
> work around this problem? For example how can I increase
> spark.driver.maxResultSize? Thanks.
>
>  org.apache.spark.SparkException: Job aborted due to stage failure: Total
> size of serialized results
>
> of 128 tasks (1029.1 MB) is bigger than spark.driver.maxResultSize (1024.0
> MB)
>
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobA
>
> ndIndependentStages(DAGScheduler.scala:1214)
>
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:12
>
> 03)
>
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:12
>
> 02)
>
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>
> at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
>
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler
>
> .scala:696)
>
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler
>
> .scala:696)
>
> at scala.Option.foreach(Option.scala:236)
>
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
>
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(D
>
> AGScheduler.scala:1420)
>
> at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala
>
> :1375)
>
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>
>at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>
> at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>
> at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala
>
> :393)
>
> at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>
> at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>
> at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> 15/03/25 10:48:38 WARN TaskSetManager: Lost task 128.0 in stage 199.0 (TID
> 6324, INT1-CAS01.pcc.lexi
>
> snexis.com): TaskKilled (killed intentionally)
>
>
>
> Ningjun
>
>
>


python : Out of memory: Kill process

2015-03-25 Thread Eduardo Cusa
Hi Guys, I am running the following function with spark-submit and the OS is
killing my process:


  def getRdd(self,date,provider):
path='s3n://'+AWS_BUCKET+'/'+date+'/*.log.gz'
log2= self.sqlContext.jsonFile(path)
log2.registerTempTable('log_test')
log2.cache()
out=self.sqlContext.sql("SELECT user, tax from log_test where provider
= '"+provider+"'and country <> ''").map(lambda row: (row.user, row.tax))
print "out1"
return  map((lambda (x,y): (x, list(y))),
sorted(out.groupByKey(2000).collect()))



The input dataset has 57 zip files (2 GB)

The same process with a smaller dataset completed successfully

Any ideas to help debug this are welcome.

Regards
Eduardo


Recovered state for updateStateByKey and incremental streams processing

2015-03-25 Thread Ravi Reddy
I want to use "restore from checkpoint" to continue from the last accumulated
word counts and process new streams of data. This recovery keeps the
accumulated counter state (calculated by updateStateByKey) accurate after a
failure/recovery or a temporary shutdown/upgrade to new code.

However, the recommendation seems to be that you have to delete the
checkpoint data if you upgrade to new code. How would this work if I change
the word count accumulation logic and still want to continue from the last
remembered state? For example, the word counters could be weighted by logic
that only applies to streams arriving after a certain point in time.

This is one example, but there are quite a few scenarios where one needs to
continue from the previous state remembered by updateStateByKey and apply
new logic.

 code snippets below 

As in the "RecoverableNetworkWordCount" example, we should build
"setupInputStreamAndProcessWordCounts" only in the context of the retrieved
checkpoint. If "setupInputStreamAndProcessWordCounts" is called outside
"createContext", you will get the error "[Receiver-0-1427260249292] is not
unique!".

  def createContext(checkPointDir: String, host: String, port : Int) = {
// If you do not see this printed, that means the StreamingContext has
been loaded
// from the new checkpoint
println("Creating new context")
val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount")

// Create the context with a 1 second batch size
val ssc = new StreamingContext(sparkConf, Seconds(1))
ssc.checkpoint(checkPointDir)

setupInputStreamAndProcessWordCounts(ssc, host, port)

ssc
  }

and invoke in main as

  def main(args: Array[String]) {
val checkPointDir = "./saved_state"

if (args.length < 2) {
  System.err.println("Usage: SavedNetworkWordCount  ")
  System.exit(1)
}

val ssc = StreamingContext.getOrCreate(checkPointDir, () => {
createContext(checkPointDir,args(0), args(1).toInt)
  })

//setupInputStreamAndProcessWordCounts(ssc, args(0), args(1).toInt)

ssc.start()
ssc.awaitTermination()
  }
 }

  def setupInputStreamAndProcessWordCounts(ssc: StreamingContext, hostname:
String, port: Int) {
// InputDStream has to be created inside createContext, else you get an
error
val lines = ssc.socketTextStream(hostname, port)
val words = lines.flatMap(_.split(" "))
val wordDstream = words.map(x => (x, 1))

val updateFunc = (values: Seq[Int], state: Option[Int]) => {
  val currentCount = values.sum
  val previousCount = state.getOrElse(0)
  Some(currentCount + previousCount)
}

// Update and print the cumulative count using updateStateByKey
val countsDstream = wordDstream.updateStateByKey[Int](updateFunc)
countsDstream.print() // print or save to external system
  }
  



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Recovered-state-for-updateStateByKey-and-incremental-streams-processing-tp9.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



[no subject]

2015-03-25 Thread Himanish Kushary
Hi,

I have a RDD of pairs of strings like below :

(A,B)
(B,C)
(C,D)
(A,D)
(E,F)
(B,F)

I need to transform/filter this into an RDD of pairs in which no string is
repeated once it has been used. So something like,

(A,B)
(C,D)
(E,F)

(B,C) is out because B has already been used in (A,B); (A,D) is out because
A (and D) have already been used; etc.

I was thinking of the option of using a shared variable to keep track of what
has already been used, but that may only work within a single partition and
would not scale to a larger dataset.

Is there any other efficient way to accomplish this?
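
One possible approach, sketched under the assumption that the distinct pairs are few enough to collect to the driver: a single greedy pass that keeps a pair only if neither string has been seen before (so the result depends on the input order):

val pairs = sc.parallelize(Seq(("A","B"), ("B","C"), ("C","D"), ("A","D"), ("E","F"), ("B","F")))
val kept = pairs.collect().foldLeft((Set.empty[String], Vector.empty[(String, String)])) {
  case ((seen, acc), (a, b)) =>
    if (seen(a) || seen(b)) (seen, acc)        // one side already used: drop the pair
    else (seen + a + b, acc :+ ((a, b)))       // keep it and mark both strings as used
}._2
// kept == Vector((A,B), (C,D), (E,F)) for the input above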

-- 
Thanks & Regards
Himanish


Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Dean Wampler
For the Spark SQL parts, 1.3 breaks backwards compatibility, because before
1.3 Spark SQL was considered experimental and API changes were allowed.

So, H2O and ADA builds compatible with 1.2.x might not work with 1.3.
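
If the break is the SchemaRDD-to-DataFrame change, a sketch of the 1.3 idiom for the snippet quoted below would look roughly like this (an assumption about what the compiler is complaining about; the createSchemaRDD import from 1.2 is replaced by the SQLContext implicits plus an explicit toDF()):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._                        // replaces import sqlContext.createSchemaRDD
val ty = joinRDD.map { case (word, (file1Counts, file2Counts)) =>
  KmerIntesect(word, file1Counts, "xyz") }.toDF()    // case-class RDD -> DataFrame
ty.registerTempTable("KmerIntesect")
ty.saveAsParquetFile("hdfs://x/y/z/kmerIntersect.parquet")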

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
 (O'Reilly)
Typesafe 
@deanwampler 
http://polyglotprogramming.com

On Wed, Mar 25, 2015 at 9:39 AM, roni  wrote:

> Even if H2o and ADA are dependent on 1.2.1 , it should be backword
> compatible, right?
> So using 1.3 should not break them.
> And the code is not using the classes from those libs.
> I tried sbt clean compile .. same errror
> Thanks
> _R
>
> On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath 
> wrote:
>
>> What version of Spark do the other dependencies rely on (Adam and H2O?) -
>> that could be it
>>
>> Or try sbt clean compile
>>
>> —
>> Sent from Mailbox 
>>
>>
>> On Wed, Mar 25, 2015 at 5:58 PM, roni  wrote:
>>
>>> I have a EC2 cluster created using spark version 1.2.1.
>>> And I have a SBT project .
>>> Now I want to upgrade to spark 1.3 and use the new features.
>>> Below are issues .
>>> Sorry for the long post.
>>> Appreciate your help.
>>> Thanks
>>> -Roni
>>>
>>> Question - Do I have to create a new cluster using spark 1.3?
>>>
>>> Here is what I did -
>>>
>>> In my SBT file I  changed to -
>>> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
>>>
>>> But then I started getting compilation error. along with
>>> Here are some of the libraries that were evicted:
>>> [warn]  * org.apache.spark:spark-core_2.10:1.2.0 -> 1.3.0
>>> [warn]  * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) ->
>>> 2.6.0
>>> [warn] Run 'evicted' to see detailed eviction warnings
>>>
>>>  constructor cannot be instantiated to expected type;
>>> [error]  found   : (T1, T2)
>>> [error]  required: org.apache.spark.sql.catalyst.expressions.Row
>>> [error] val ty = joinRDD.map{case(word,
>>> (file1Counts, file2Counts)) => KmerIntesect(word, file1Counts,"xyz")}
>>> [error]  ^
>>>
>>> Here is my SBT and code --
>>> SBT -
>>>
>>> version := "1.0"
>>>
>>> scalaVersion := "2.10.4"
>>>
>>> resolvers += "Sonatype OSS Snapshots" at "
>>> https://oss.sonatype.org/content/repositories/snapshots";;
>>> resolvers += "Maven Repo1" at "https://repo1.maven.org/maven2";;
>>> resolvers += "Maven Repo" at "
>>> https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/";;
>>>
>>> /* Dependencies - %% appends Scala version to artifactId */
>>> libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.0"
>>> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
>>> libraryDependencies += "org.bdgenomics.adam" % "adam-core" % "0.16.0"
>>> libraryDependencies += "ai.h2o" % "sparkling-water-core_2.10" % "0.2.10"
>>>
>>>
>>> CODE --
>>> import org.apache.spark.{SparkConf, SparkContext}
>>> case class KmerIntesect(kmer: String, kCount: Int, fileName: String)
>>>
>>> object preDefKmerIntersection {
>>>   def main(args: Array[String]) {
>>>
>>>  val sparkConf = new SparkConf().setAppName("preDefKmer-intersect")
>>>  val sc = new SparkContext(sparkConf)
>>> import sqlContext.createSchemaRDD
>>> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>>> val bedFile = sc.textFile("s3n://a/b/c",40)
>>>  val hgfasta = sc.textFile("hdfs://a/b/c",40)
>>>  val hgPair = hgfasta.map(_.split (",")).map(a=> (a(0),
>>> a(1).trim().toInt))
>>>  val filtered = hgPair.filter(kv => kv._2 == 1)
>>>  val bedPair = bedFile.map(_.split (",")).map(a=> (a(0),
>>> a(1).trim().toInt))
>>>  val joinRDD = bedPair.join(filtered)
>>> val ty = joinRDD.map{case(word, (file1Counts,
>>> file2Counts)) => KmerIntesect(word, file1Counts,"xyz")}
>>> ty.registerTempTable("KmerIntesect")
>>>
>>> ty.saveAsParquetFile("hdfs://x/y/z/kmerIntersect.parquet")
>>>   }
>>> }
>>>
>>>
>>
>


RE: Date and decimal datatype not working

2015-03-25 Thread BASAK, ANANDA
Thanks. This library is only available with Spark 1.3. I am using version 
1.2.1. Before I upgrade to 1.3, I want to try what can be done in 1.2.1.

So I am using the following:
val MyDataset = sqlContext.sql("my select query")

MyDataset.map(t => 
t(0)+"|"+t(1)+"|"+t(2)+"|"+t(3)+"|"+t(4)+"|"+t(5)).saveAsTextFile("/my_destination_path")

But it is giving the following error:
15/03/24 17:05:51 ERROR Executor: Exception in task 1.0 in stage 13.0 (TID 106)
java.lang.NumberFormatException: For input string: ""
at 
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:453)
at java.lang.Long.parseLong(Long.java:483)
at 
scala.collection.immutable.StringLike$class.toLong(StringLike.scala:230)

Is there something wrong with the TSTAMP field, which is a Long datatype?
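
One guess at the cause, based on the load code quoted below: String.split takes a regex, and an unescaped "|" matches the empty string, so every character becomes its own field and the numeric conversions see empty or truncated values. A sketch with the delimiter escaped:

val TABLE_A = sc.textFile("/Myhome/SPARK/files/table_a_file.txt")
  .map(_.split("\\|"))                               // escape the pipe: "|" alone is a regex
  .map(p => ROW_A(p(0).trim.toLong, p(1), p(2).trim.toInt, p(3),
                  BigDecimal(p(4)), BigDecimal(p(5)), BigDecimal(p(6))))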

Thanks & Regards
---
Ananda Basak
Ph: 425-213-7092

From: Yin Huai [mailto:yh...@databricks.com]
Sent: Monday, March 23, 2015 8:55 PM
To: BASAK, ANANDA
Cc: user@spark.apache.org
Subject: Re: Date and decimal datatype not working

To store to csv file, you can use 
Spark-CSV library.

On Mon, Mar 23, 2015 at 5:35 PM, BASAK, ANANDA 
mailto:ab9...@att.com>> wrote:
Thanks. This worked well as per your suggestions. I had to run following:
val TABLE_A = 
sc.textFile("/Myhome/SPARK/files/table_a_file.txt").map(_.split("|")).map(p => 
ROW_A(p(0).trim.toLong, p(1), p(2).trim.toInt, p(3), BigDecimal(p(4)), 
BigDecimal(p(5)), BigDecimal(p(6

Now I am stuck at another step. I have run a SQL query, where I am Selecting 
from all the fields with some where clause , TSTAMP filtered with date range 
and order by TSTAMP clause. That is running fine.

Then I am trying to store the output in a CSV file. I am using 
saveAsTextFile(“filename”) function. But it is giving error. Can you please 
help me to write a proper syntax to store output in a CSV file?


Thanks & Regards
---
Ananda Basak
Ph: 425-213-7092

From: BASAK, ANANDA
Sent: Tuesday, March 17, 2015 3:08 PM
To: Yin Huai
Cc: user@spark.apache.org
Subject: RE: Date and decimal datatype not working

Ok, thanks for the suggestions. Let me try and will confirm all.

Regards
Ananda

From: Yin Huai [mailto:yh...@databricks.com]
Sent: Tuesday, March 17, 2015 3:04 PM
To: BASAK, ANANDA
Cc: user@spark.apache.org
Subject: Re: Date and decimal datatype not working

p(0) is a String. So, you need to explicitly convert it to a Long. e.g. 
p(0).trim.toLong. You also need to do it for p(2). For those BigDecimals value, 
you need to create BigDecimal objects from your String values.

On Tue, Mar 17, 2015 at 5:55 PM, BASAK, ANANDA 
mailto:ab9...@att.com>> wrote:
Hi All,
I am very new in Spark world. Just started some test coding from last week. I 
am using spark-1.2.1-bin-hadoop2.4 and scala coding.
I am having issues while using Date and decimal data types. Following is my 
code that I am simply running on scala prompt. I am trying to define a table 
and point that to my flat file containing raw data (pipe delimited format). 
Once that is done, I will run some SQL queries and put the output data in to 
another flat file with pipe delimited format.

***
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD


// Define row and table
case class ROW_A(
  TSTAMP:   Long,
  USIDAN: String,
  SECNT:Int,
  SECT:   String,
  BLOCK_NUM:BigDecimal,
  BLOCK_DEN:BigDecimal,
  BLOCK_PCT:BigDecimal)

val TABLE_A = 
sc.textFile("/Myhome/SPARK/files/table_a_file.txt").map(_.split("|")).map(p => 
ROW_A(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))

TABLE_A.registerTempTable("TABLE_A")

***

The second last command is giving error, like following:
:17: error: type mismatch;
found   : String
required: Long

Looks like the content from my flat file are considered as String always and 
not as Date or decimal. How can I make Spark to take them as Date or decimal 
types?

Regards
Ananda




Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Nick Pentreath
Ah I see now you are trying to use a spark 1.2 cluster - you will need to be 
running spark 1.3 on your EC2 cluster in order to run programs built against 
spark 1.3.




You will need to terminate and restart your cluster with spark 1.3 
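A rough sketch of the relaunch with the spark-ec2 script bundled with 1.3 (keypair, identity file, slave count and cluster name are placeholders; check ./spark-ec2 --help for the exact flags in your copy):

./spark-ec2 -k my-keypair -i my-key.pem destroy my-cluster
./spark-ec2 -k my-keypair -i my-key.pem -s 4 --spark-version=1.3.0 launch my-cluster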



—
Sent from Mailbox

On Wed, Mar 25, 2015 at 6:39 PM, roni  wrote:

> Even if H2o and ADA are dependent on 1.2.1 , it should be backword
> compatible, right?
> So using 1.3 should not break them.
> And the code is not using the classes from those libs.
> I tried sbt clean compile .. same errror
> Thanks
> _R
> On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath 
> wrote:
>> What version of Spark do the other dependencies rely on (Adam and H2O?) -
>> that could be it
>>
>> Or try sbt clean compile
>>
>> —
>> Sent from Mailbox 
>>
>>
>> On Wed, Mar 25, 2015 at 5:58 PM, roni  wrote:
>>
>>> I have a EC2 cluster created using spark version 1.2.1.
>>> And I have a SBT project .
>>> Now I want to upgrade to spark 1.3 and use the new features.
>>> Below are issues .
>>> Sorry for the long post.
>>> Appreciate your help.
>>> Thanks
>>> -Roni
>>>
>>> Question - Do I have to create a new cluster using spark 1.3?
>>>
>>> Here is what I did -
>>>
>>> In my SBT file I  changed to -
>>> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
>>>
>>> But then I started getting compilation error. along with
>>> Here are some of the libraries that were evicted:
>>> [warn]  * org.apache.spark:spark-core_2.10:1.2.0 -> 1.3.0
>>> [warn]  * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) -> 2.6.0
>>> [warn] Run 'evicted' to see detailed eviction warnings
>>>
>>>  constructor cannot be instantiated to expected type;
>>> [error]  found   : (T1, T2)
>>> [error]  required: org.apache.spark.sql.catalyst.expressions.Row
>>> [error] val ty = joinRDD.map{case(word,
>>> (file1Counts, file2Counts)) => KmerIntesect(word, file1Counts,"xyz")}
>>> [error]  ^
>>>
>>> Here is my SBT and code --
>>> SBT -
>>>
>>> version := "1.0"
>>>
>>> scalaVersion := "2.10.4"
>>>
>>> resolvers += "Sonatype OSS Snapshots" at "
>>> https://oss.sonatype.org/content/repositories/snapshots";;
>>> resolvers += "Maven Repo1" at "https://repo1.maven.org/maven2";;
>>> resolvers += "Maven Repo" at "
>>> https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/";;
>>>
>>> /* Dependencies - %% appends Scala version to artifactId */
>>> libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.0"
>>> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
>>> libraryDependencies += "org.bdgenomics.adam" % "adam-core" % "0.16.0"
>>> libraryDependencies += "ai.h2o" % "sparkling-water-core_2.10" % "0.2.10"
>>>
>>>
>>> CODE --
>>> import org.apache.spark.{SparkConf, SparkContext}
>>> case class KmerIntesect(kmer: String, kCount: Int, fileName: String)
>>>
>>> object preDefKmerIntersection {
>>>   def main(args: Array[String]) {
>>>
>>>  val sparkConf = new SparkConf().setAppName("preDefKmer-intersect")
>>>  val sc = new SparkContext(sparkConf)
>>> import sqlContext.createSchemaRDD
>>> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>>> val bedFile = sc.textFile("s3n://a/b/c",40)
>>>  val hgfasta = sc.textFile("hdfs://a/b/c",40)
>>>  val hgPair = hgfasta.map(_.split (",")).map(a=> (a(0),
>>> a(1).trim().toInt))
>>>  val filtered = hgPair.filter(kv => kv._2 == 1)
>>>  val bedPair = bedFile.map(_.split (",")).map(a=> (a(0),
>>> a(1).trim().toInt))
>>>  val joinRDD = bedPair.join(filtered)
>>> val ty = joinRDD.map{case(word, (file1Counts,
>>> file2Counts)) => KmerIntesect(word, file1Counts,"xyz")}
>>> ty.registerTempTable("KmerIntesect")
>>> ty.saveAsParquetFile("hdfs://x/y/z/kmerIntersect.parquet")
>>>   }
>>> }
>>>
>>>
>>

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
Even if H2O and ADAM are dependent on 1.2.1, it should be backward
compatible, right?
So using 1.3 should not break them.
And the code is not using the classes from those libs.
I tried sbt clean compile .. same error
Thanks
_R

On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath 
wrote:

> What version of Spark do the other dependencies rely on (Adam and H2O?) -
> that could be it
>
> Or try sbt clean compile
>
> —
> Sent from Mailbox 
>
>
> On Wed, Mar 25, 2015 at 5:58 PM, roni  wrote:
>
>> I have a EC2 cluster created using spark version 1.2.1.
>> And I have a SBT project .
>> Now I want to upgrade to spark 1.3 and use the new features.
>> Below are issues .
>> Sorry for the long post.
>> Appreciate your help.
>> Thanks
>> -Roni
>>
>> Question - Do I have to create a new cluster using spark 1.3?
>>
>> Here is what I did -
>>
>> In my SBT file I  changed to -
>> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
>>
>> But then I started getting compilation error. along with
>> Here are some of the libraries that were evicted:
>> [warn]  * org.apache.spark:spark-core_2.10:1.2.0 -> 1.3.0
>> [warn]  * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) -> 2.6.0
>> [warn] Run 'evicted' to see detailed eviction warnings
>>
>>  constructor cannot be instantiated to expected type;
>> [error]  found   : (T1, T2)
>> [error]  required: org.apache.spark.sql.catalyst.expressions.Row
>> [error] val ty = joinRDD.map{case(word,
>> (file1Counts, file2Counts)) => KmerIntesect(word, file1Counts,"xyz")}
>> [error]  ^
>>
>> Here is my SBT and code --
>> SBT -
>>
>> version := "1.0"
>>
>> scalaVersion := "2.10.4"
>>
>> resolvers += "Sonatype OSS Snapshots" at "
>> https://oss.sonatype.org/content/repositories/snapshots";;
>> resolvers += "Maven Repo1" at "https://repo1.maven.org/maven2";;
>> resolvers += "Maven Repo" at "
>> https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/";;
>>
>> /* Dependencies - %% appends Scala version to artifactId */
>> libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.0"
>> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
>> libraryDependencies += "org.bdgenomics.adam" % "adam-core" % "0.16.0"
>> libraryDependencies += "ai.h2o" % "sparkling-water-core_2.10" % "0.2.10"
>>
>>
>> CODE --
>> import org.apache.spark.{SparkConf, SparkContext}
>> case class KmerIntesect(kmer: String, kCount: Int, fileName: String)
>>
>> object preDefKmerIntersection {
>>   def main(args: Array[String]) {
>>
>>  val sparkConf = new SparkConf().setAppName("preDefKmer-intersect")
>>  val sc = new SparkContext(sparkConf)
>> import sqlContext.createSchemaRDD
>> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>> val bedFile = sc.textFile("s3n://a/b/c",40)
>>  val hgfasta = sc.textFile("hdfs://a/b/c",40)
>>  val hgPair = hgfasta.map(_.split (",")).map(a=> (a(0),
>> a(1).trim().toInt))
>>  val filtered = hgPair.filter(kv => kv._2 == 1)
>>  val bedPair = bedFile.map(_.split (",")).map(a=> (a(0),
>> a(1).trim().toInt))
>>  val joinRDD = bedPair.join(filtered)
>> val ty = joinRDD.map{case(word, (file1Counts,
>> file2Counts)) => KmerIntesect(word, file1Counts,"xyz")}
>> ty.registerTempTable("KmerIntesect")
>> ty.saveAsParquetFile("hdfs://x/y/z/kmerIntersect.parquet")
>>   }
>> }
>>
>>
>


Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Nick Pentreath
What version of Spark do the other dependencies rely on (Adam and H2O?) - that 
could be it




Or try sbt clean compile 
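
For example, a minimal build.sbt sketch (sbt 0.13) that keeps every Spark
artifact on one version; the spark-sql line and the "provided" scope are
assumptions (the code uses SQLContext/registerTempTable, and the jars are
expected to come from the cluster at runtime). Re-run `sbt evicted` afterwards
to confirm nothing still drags in 1.2.x:

val sparkVersion = "1.3.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided"
)

// force the spark-core pulled in transitively by other libraries onto the same version
dependencyOverrides += "org.apache.spark" %% "spark-core" % sparkVersion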



—
Sent from Mailbox

On Wed, Mar 25, 2015 at 5:58 PM, roni  wrote:

> I have a EC2 cluster created using spark version 1.2.1.
> And I have a SBT project .
> Now I want to upgrade to spark 1.3 and use the new features.
> Below are issues .
> Sorry for the long post.
> Appreciate your help.
> Thanks
> -Roni
> Question - Do I have to create a new cluster using spark 1.3?
> Here is what I did -
> In my SBT file I  changed to -
> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
> But then I started getting compilation error. along with
> Here are some of the libraries that were evicted:
> [warn]  * org.apache.spark:spark-core_2.10:1.2.0 -> 1.3.0
> [warn]  * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) -> 2.6.0
> [warn] Run 'evicted' to see detailed eviction warnings
>  constructor cannot be instantiated to expected type;
> [error]  found   : (T1, T2)
> [error]  required: org.apache.spark.sql.catalyst.expressions.Row
> [error] val ty = joinRDD.map{case(word,
> (file1Counts, file2Counts)) => KmerIntesect(word, file1Counts,"xyz")}
> [error]  ^
> Here is my SBT and code --
> SBT -
> version := "1.0"
> scalaVersion := "2.10.4"
> resolvers += "Sonatype OSS Snapshots" at "
> https://oss.sonatype.org/content/repositories/snapshots";;
> resolvers += "Maven Repo1" at "https://repo1.maven.org/maven2";;
> resolvers += "Maven Repo" at "
> https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/";;
> /* Dependencies - %% appends Scala version to artifactId */
> libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.0"
> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
> libraryDependencies += "org.bdgenomics.adam" % "adam-core" % "0.16.0"
> libraryDependencies += "ai.h2o" % "sparkling-water-core_2.10" % "0.2.10"
> CODE --
> import org.apache.spark.{SparkConf, SparkContext}
> case class KmerIntesect(kmer: String, kCount: Int, fileName: String)
> object preDefKmerIntersection {
>   def main(args: Array[String]) {
>  val sparkConf = new SparkConf().setAppName("preDefKmer-intersect")
>  val sc = new SparkContext(sparkConf)
> import sqlContext.createSchemaRDD
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> val bedFile = sc.textFile("s3n://a/b/c",40)
>  val hgfasta = sc.textFile("hdfs://a/b/c",40)
>  val hgPair = hgfasta.map(_.split (",")).map(a=> (a(0),
> a(1).trim().toInt))
>  val filtered = hgPair.filter(kv => kv._2 == 1)
>  val bedPair = bedFile.map(_.split (",")).map(a=> (a(0),
> a(1).trim().toInt))
>  val joinRDD = bedPair.join(filtered)
> val ty = joinRDD.map{case(word, (file1Counts, file2Counts))
> => KmerIntesect(word, file1Counts,"xyz")}
> ty.registerTempTable("KmerIntesect")
> ty.saveAsParquetFile("hdfs://x/y/z/kmerIntersect.parquet")
>   }
> }

upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread roni
I have an EC2 cluster created using Spark version 1.2.1,
and I have an SBT project.
Now I want to upgrade to Spark 1.3 and use the new features.
Below are the issues.
Sorry for the long post.
Appreciate your help.
Thanks
-Roni

Question - Do I have to create a new cluster using spark 1.3?

Here is what I did -

In my SBT file I  changed to -
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"

But then I started getting compilation errors, along with:
Here are some of the libraries that were evicted:
[warn]  * org.apache.spark:spark-core_2.10:1.2.0 -> 1.3.0
[warn]  * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) -> 2.6.0
[warn] Run 'evicted' to see detailed eviction warnings

 constructor cannot be instantiated to expected type;
[error]  found   : (T1, T2)
[error]  required: org.apache.spark.sql.catalyst.expressions.Row
[error] val ty = joinRDD.map{case(word,
(file1Counts, file2Counts)) => KmerIntesect(word, file1Counts,"xyz")}
[error]  ^

Here is my SBT and code --
SBT -

version := "1.0"

scalaVersion := "2.10.4"

resolvers += "Sonatype OSS Snapshots" at "
https://oss.sonatype.org/content/repositories/snapshots";;
resolvers += "Maven Repo1" at "https://repo1.maven.org/maven2";;
resolvers += "Maven Repo" at "
https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/";;

/* Dependencies - %% appends Scala version to artifactId */
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.0"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
libraryDependencies += "org.bdgenomics.adam" % "adam-core" % "0.16.0"
libraryDependencies += "ai.h2o" % "sparkling-water-core_2.10" % "0.2.10"


CODE --
import org.apache.spark.{SparkConf, SparkContext}
case class KmerIntesect(kmer: String, kCount: Int, fileName: String)

object preDefKmerIntersection {
  def main(args: Array[String]) {

 val sparkConf = new SparkConf().setAppName("preDefKmer-intersect")
 val sc = new SparkContext(sparkConf)
import sqlContext.createSchemaRDD
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val bedFile = sc.textFile("s3n://a/b/c",40)
 val hgfasta = sc.textFile("hdfs://a/b/c",40)
 val hgPair = hgfasta.map(_.split (",")).map(a=> (a(0),
a(1).trim().toInt))
 val filtered = hgPair.filter(kv => kv._2 == 1)
 val bedPair = bedFile.map(_.split (",")).map(a=> (a(0),
a(1).trim().toInt))
 val joinRDD = bedPair.join(filtered)
val ty = joinRDD.map{case(word, (file1Counts, file2Counts))
=> KmerIntesect(word, file1Counts,"xyz")}
ty.registerTempTable("KmerIntesect")
ty.saveAsParquetFile("hdfs://x/y/z/kmerIntersect.parquet")
  }
}


Re: Spark as a service

2015-03-25 Thread Irfan Ahmad
You're welcome. How did it go?


*Irfan Ahmad*
CTO | Co-Founder | *CloudPhysics* 
Best of VMworld Finalist
Best Cloud Management Award
NetworkWorld 10 Startups to Watch
EMA Most Notable Vendor

On Wed, Mar 25, 2015 at 7:53 AM, Ashish Mukherjee <
ashish.mukher...@gmail.com> wrote:

> Thank you
>
> On Tue, Mar 24, 2015 at 8:40 PM, Irfan Ahmad 
> wrote:
>
>> Also look at the spark-kernel and spark job server projects.
>>
>> Irfan
>> On Mar 24, 2015 5:03 AM, "Todd Nist"  wrote:
>>
>>> Perhaps this project, https://github.com/calrissian/spark-jetty-server,
>>> could help with your requirements.
>>>
>>> On Tue, Mar 24, 2015 at 7:12 AM, Jeffrey Jedele <
>>> jeffrey.jed...@gmail.com> wrote:
>>>
 I don't think there's are general approach to that - the usecases are
 just to different. If you really need it, you probably will have to
 implement yourself in the driver of your application.

 PS: Make sure to use the reply to all button so that the mailing list
 is included in your reply. Otherwise only I will get your mail.

 Regards,
 Jeff

 2015-03-24 12:01 GMT+01:00 Ashish Mukherjee >>> >:

> Hi Jeffrey,
>
> Thanks. Yes, this resolves the SQL problem. My bad - I was looking for
> something which would work for Spark Streaming and other Spark jobs too,
> not just SQL.
>
> Regards,
> Ashish
>
> On Tue, Mar 24, 2015 at 4:07 PM, Jeffrey Jedele <
> jeffrey.jed...@gmail.com> wrote:
>
>> Hi Ashish,
>> this might be what you're looking for:
>>
>>
>> https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server
>>
>> Regards,
>> Jeff
>>
>> 2015-03-24 11:28 GMT+01:00 Ashish Mukherjee <
>> ashish.mukher...@gmail.com>:
>>
>>> Hello,
>>>
>>> As of now, if I have to execute a Spark job, I need to create a jar
>>> and deploy it.  If I need to run a dynamically formed SQL from a Web
>>> application, is there any way of using SparkSQL in this manner? Perhaps,
>>> through a Web Service or something similar.
>>>
>>> Regards,
>>> Ashish
>>>
>>
>>
>

>>>
>


Write Parquet File with spark-streaming with Spark 1.3

2015-03-25 Thread richiesgr
Hi

I've succeeded in writing a Kafka stream to a Parquet file in Spark 1.2, but I can't
make it work with Spark 1.3.
In streaming I can't use saveAsParquetFile() because I can't add data to
an existing Parquet file.

I know that it's possible to stream data directly into Parquet;
could you help me by providing a little sample of which API I need to use?

Thanks
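
Not a definitive answer, but one pattern that works on 1.3 is writing each
micro-batch to its own sub-directory instead of appending to a single file.
A minimal sketch, where Event, stream and the paths are illustrative:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.Time

case class Event(key: String, value: String)      // illustrative schema

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// `stream` is assumed to be the DStream[Event] built from the Kafka input
stream.foreachRDD { (rdd: RDD[Event], time: Time) =>
  if (rdd.count() > 0) {
    // one directory per batch; read them back together with a path glob,
    // e.g. sqlContext.parquetFile("hdfs:///data/events/batch-*")
    rdd.toDF().saveAsParquetFile(s"hdfs:///data/events/batch-${time.milliseconds}")
  }
}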



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Write-Parquet-File-with-spark-streaming-with-Spark-1-3-tp8.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Streaming - Minimizing batch interval

2015-03-25 Thread Sean Owen
I don't think it's feasible to set a batch interval of 0.25ms. Even at
tens of ms the overhead of the framework is a large factor. Do you
mean 0.25s = 250ms?

Related thoughts, and I don't know if they apply to your case:

If you mean, can you just read off the source that quickly? yes.

Sometimes when people say "I need very low latency streaming because I
need to answer quickly" they are really trying to design a synchronous
API, and I don't think asynchronous streaming is the right
architecture.

Sometimes people really mean "I need to process 400 items per ms on
average", which is different and entirely possible.



On Wed, Mar 25, 2015 at 2:53 PM, RodrigoB  wrote:
> I've been given a feature requirement that means processing events on a
> latency lower than 0.25ms.
>
> Meaning I would have to make sure that Spark streaming gets new events from
> the messaging layer within that period of time. Would anyone have achieve
> such numbers using a Spark cluster? Or would this be even possible, even
> assuming we don't use the write ahead logs...
>
> tnks in advance!
>
> Rod
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Minimizing-batch-interval-tp7.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Total size of serialized results is bigger than spark.driver.maxResultSize

2015-03-25 Thread Wang, Ningjun (LNG-NPV)
Hi

I ran a Spark job and got the following error. Can anybody tell me how to work 
around this problem? For example, how can I increase spark.driver.maxResultSize? 
Thanks.
org.apache.spark.SparkException: Job aborted due to stage failure: Total size 
of serialized results
of 128 tasks (1029.1 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobA
ndIndependentStages(DAGScheduler.scala:1214)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:12
03)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:12
02)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler
.scala:696)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler
.scala:696)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(D
AGScheduler.scala:1420)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala
:1375)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
   at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala
:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
15/03/25 10:48:38 WARN TaskSetManager: Lost task 128.0 in stage 199.0 (TID 
6324, INT1-CAS01.pcc.lexi
snexis.com): TaskKilled (killed intentionally)

Ningjun
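
For reference, a minimal sketch of raising the limit when the SparkConf is
built (the value is illustrative; returning less data to the driver is usually
the better fix):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-job")
  .set("spark.driver.maxResultSize", "2g")   // default is 1g; "0" means unlimited
val sc = new SparkContext(conf)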



Spark Streaming - Minimizing batch interval

2015-03-25 Thread RodrigoB
I've been given a feature requirement that means processing events on a
latency lower than 0.25ms. 

Meaning I would have to make sure that Spark streaming gets new events from
the messaging layer within that period of time. Has anyone achieved
such numbers using a Spark cluster? Or would this even be possible, even
assuming we don't use the write-ahead logs...

tnks in advance!

Rod



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Minimizing-batch-interval-tp7.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How do you write Dataframes to elasticsearch

2015-03-25 Thread Nick Pentreath
Spark 1.3 is not supported by elasticsearch-hadoop yet but will be very soon: 
https://github.com/elastic/elasticsearch-hadoop/issues/400





However in the meantime you could use df.toRDD.saveToEs - though you may have 
to manipulate the Row object perhaps to extract fields, not sure if it will 
serialize directly to ES JSON...
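
A rough sketch of that interim approach (the index name and fields are
illustrative, and it assumes the elasticsearch-spark artifact is on the
classpath):

import org.elasticsearch.spark._

// pull the needed fields out of each Row and write them as documents
val docs = df.rdd.map(row => Map("id" -> row.get(0), "name" -> row.getString(1)))
docs.saveToEs("myindex/mytype")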



—
Sent from Mailbox

On Wed, Mar 25, 2015 at 2:07 PM, yamanoj  wrote:

> It seems that elasticsearch-spark_2.10 currently not supporting spart 1.3.
> Could you tell me if there is an alternative way to save Dataframes to
> elasticsearch? 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/How-do-you-write-Dataframes-to-elasticsearch-tp3.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org

What are the best options for quickly filtering a DataFrame on a single column?

2015-03-25 Thread Stuart Layton
I have a SparkSQL DataFrame with a few billion rows that I need to
quickly filter down to a few hundred thousand rows, using an operation like
(syntax may not be correct)

df = df[ df.filter(lambda x: x.key_col in approved_keys)]

I was thinking about serializing the data using parquet and saving it to
S3, however as I want to optimize for filtering speed I'm not sure this is
the best option.

-- 
Stuart Layton
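
One sketch of that filter in Scala, assuming the approved keys fit comfortably
in driver memory and that key_col is the first column (names and positions are
illustrative):

// broadcast the small key set and make a single pass over the big table
val approvedKeys: Set[String] = Set("k1", "k2", "k3")
val approvedBc = df.sqlContext.sparkContext.broadcast(approvedKeys)

val filteredRdd = df.rdd.filter(row => approvedBc.value.contains(row.getString(0)))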


Re: issue while submitting Spark Job as --master yarn-cluster

2015-03-25 Thread Sandy Ryza
Hi Sachin,

It appears that the application master is failing.  To figure out what's
wrong you need to get the logs for the application master.

-Sandy

On Wed, Mar 25, 2015 at 7:05 AM, Sachin Singh 
wrote:

> OS I am using Linux,
> when I will run simply as master yarn, its running fine,
>
> Regards
> Sachin
>
> On Wed, Mar 25, 2015 at 4:25 PM, Xi Shen  wrote:
>
>> What is your environment? I remember I had similar error when "running
>> spark-shell --master yarn-client" in Windows environment.
>>
>>
>> On Wed, Mar 25, 2015 at 9:07 PM sachin Singh 
>> wrote:
>>
>>> Hi ,
>>> when I am submitting spark job in cluster mode getting error as under in
>>> hadoop-yarn  log,
>>> someone has any idea,please suggest,
>>>
>>> 2015-03-25 23:35:22,467 INFO
>>> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
>>> application_1427124496008_0028 State change from FINAL_SAVING to FAILED
>>> 2015-03-25 23:35:22,467 WARN
>>> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hdfs
>>> OPERATION=Application Finished - Failed TARGET=RMAppManager
>>>  RESULT=FAILURE
>>> DESCRIPTION=App failed with state: FAILED   PERMISSIONS=Application
>>> application_1427124496008_0028 failed 2 times due to AM Container for
>>> appattempt_1427124496008_0028_02 exited with  exitCode: 13 due to:
>>> Exception from container-launch.
>>> Container id: container_1427124496008_0028_02_01
>>> Exit code: 13
>>> Stack trace: ExitCodeException exitCode=13:
>>> at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>>> at org.apache.hadoop.util.Shell.run(Shell.java:455)
>>> at
>>> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(
>>> Shell.java:702)
>>> at
>>> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.
>>> launchContainer(DefaultContainerExecutor.java:197)
>>> at
>>> org.apache.hadoop.yarn.server.nodemanager.containermanager.
>>> launcher.ContainerLaunch.call(ContainerLaunch.java:299)
>>> at
>>> org.apache.hadoop.yarn.server.nodemanager.containermanager.
>>> launcher.ContainerLaunch.call(ContainerLaunch.java:81)
>>> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(
>>> ThreadPoolExecutor.java:1145)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(
>>> ThreadPoolExecutor.java:615)
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>>
>>> Container exited with a non-zero exit code 13
>>> .Failing this attempt.. Failing the application.
>>> APPID=application_1427124496008_0028
>>>
>>>
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.
>>> 1001560.n3.nabble.com/issue-while-submitting-Spark-Job-as-
>>> master-yarn-cluster-tp0.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>


Re: JavaKinesisWordCountASLYARN Example not working on EMR

2015-03-25 Thread Arush Kharbanda
Did you build for Kinesis using the profile *-Pkinesis-asl*?

On Wed, Mar 25, 2015 at 7:18 PM, ankur.jain  wrote:

> Hi,
> I am trying to run a Spark on YARN program provided by Spark in the
> examples
> directory using Amazon Kinesis on EMR cluster :
> I am using Spark 1.3.0 and EMR AMI: 3.5.0
>
> I've setup the Credentials
> export AWS_ACCESS_KEY_ID=XX
> export AWS_SECRET_KEY=XXX
>
> *A) This is the Kinesis Word Count Producer which ran Successfully : *
> run-example org.apache.spark.examples.streaming.KinesisWordCountProducerASL
> mySparkStream https://kinesis.us-east-1.amazonaws.com 1 5
>
> *B) This one is the Normal Consumer using Spark Streaming which also ran
> Successfully: *
> run-example org.apache.spark.examples.streaming.JavaKinesisWordCountASL
> mySparkStream https://kinesis.us-east-1.amazonaws.com
>
> *C) And this is the YARN based program which is NOT WORKING: *
> run-example org.apache.spark.examples.streaming.JavaKinesisWordCountASLYARN
> mySparkStream https://kinesis.us-east-1.amazonaws.com\
> Spark assembly has been built with Hive, including Datanucleus jars on
> classpath
> 15/03/25 11:52:45 INFO spark.SparkContext: Running Spark version 1.3.0
> 15/03/25 11:52:45 WARN spark.SparkConf:
> SPARK_CLASSPATH was detected (set to
>
> '/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/:/home/hadoop/spark/classpath/emrfs/:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar').
> This is deprecated in Spark 1.0+.
> Please instead use:
> •   ./spark-submit with --driver-class-path to augment the driver
> classpath
> •   spark.executor.extraClassPath to augment the executor classpath
> 15/03/25 11:52:45 WARN spark.SparkConf: Setting
> 'spark.executor.extraClassPath' to
>
> '/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/:/home/hadoop/spark/classpath/emrfs/:/home/hadoop/share/hadoop/common/lib/:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar'
> as a work-around.
> 15/03/25 11:52:45 WARN spark.SparkConf: Setting
> 'spark.driver.extraClassPath' to
>
> '/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/:/home/hadoop/spark/classpath/emrfs/:/home/hadoop/share/hadoop/common/lib/:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar'
> as a work-around.
> 15/03/25 11:52:46 INFO spark.SecurityManager: Changing view acls to: hadoop
> 15/03/25 11:52:46 INFO spark.SecurityManager: Changing modify acls to:
> hadoop
> 15/03/25 11:52:46 INFO spark.SecurityManager: SecurityManager:
> authentication disabled; ui acls disabled; users with view permissions:
> Set(hadoop); users with modify permissions: Set(hadoop)
> 15/03/25 11:52:47 INFO slf4j.Slf4jLogger: Slf4jLogger started
> 15/03/25 11:52:48 INFO Remoting: Starting remoting
> 15/03/25 11:52:48 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://sparkDriver@ip-10-80-175-92.ec2.internal:59504]
> 15/03/25 11:52:48 INFO util.Utils: Successfully started service
> 'sparkDriver' on port 59504.
> 15/03/25 11:52:48 INFO spark.SparkEnv: Registering MapOutputTracker
> 15/03/25 11:52:48 INFO spark.SparkEnv: Registering BlockManagerMaster
> 15/03/25 11:52:48 INFO storage.DiskBlockManager: Created local directory at
>
> /mnt/spark/spark-120befbc-6dae-4751-b41f-dbf7b3d97616/blockmgr-d339d180-36f5-465f-bda3-cecccb23b1d3
> 15/03/25 11:52:48 INFO storage.MemoryStore: MemoryStore started with
> capacity 265.4 MB
> 15/03/25 11:52:48 INFO spark.HttpFileServer: HTTP File server directory is
>
> /mnt/spark/spark-85e88478-3dad-4fcf-a43a-efd15166bef3/httpd-6115870a-0d90-44df-aa7c-a6bd1a47e107
> 15/03/25 11:52:48 INFO spark.HttpServer: Starting HTTP Server
> 15/03/25 11:52:49 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 15/03/25 11:52:49 INFO server.AbstractConnector: Started
> SocketConnector@0.0.0.0:44879
> 15/03/25 11:52:49 INFO util.Utils: Successfully started service 'HTTP file
> server' on port 44879.
> 15/03/25 11:52:49 INFO spark.SparkEnv: Registering OutputCommitCoordinator
> 15/03/25 11:52:49 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 15/03/25 11:52:49 INFO server.AbstractConnector: Started
> SelectChannelConnector@0.0.0.0:4040
> 15/03/25 11:52:49 INFO util.Utils: Successfully started service 'SparkUI'
> on
> port 4040.
> 15/03/25 11:52:49 INFO ui.SparkUI: Started SparkUI at
> http://ip-10-80-175-92.ec2.internal:4040
> 15/03/25 11:52:50 INFO spark.SparkContext: Added JAR
> file:/home/hadoop/spark/lib/spark-examples-1.3.0-hadoop2.4.0.jar at
> http://10.80.175.92:44879/jars/spark-examples-1.3.0-hadoop2.4.0.jar with
> timestamp 1427284370358
> 15/03/25 11:52:50 INFO cluster.YarnClusterScheduler: Created
> YarnClusterScheduler
> 15/03/25 11:52:51 ERROR cluster.YarnClusterSchedulerBackend: Application ID
> is not set.
> 15/03/25 11:52:51 INFO netty.NettyBlockTransferService: Server created on
> 49982
> 15/03/25 11:52:51 INFO storage.BlockManagerMaster: Trying to register
> BlockManager
> 15/03/25 11:52:51 INFO storage.BlockManagerMasterAc

foreachRDD execution

2015-03-25 Thread Luis Ángel Vicente Sánchez
I have a simple and probably dumb question about foreachRDD.

We are using spark streaming + cassandra to compute concurrent users every
5min. Our batch size is 10secs and our block interval is 2.5secs.

At the end of the pipeline we are using foreachRDD to join the data in the RDD
with existing data in Cassandra, update the counters and then save them back
to Cassandra.

To the best of my understanding, in this scenario, spark streaming produces
one RDD every 10secs and foreachRDD executes them sequentially, that is,
foreachRDD would never run in parallel.

Am I right?

Regards,

Luis


Re: NetworkWordCount + Spark standalone

2015-03-25 Thread Akhil Das
You can open the Master UI running on port 8080 of your Ubuntu machine; after
submitting the job you can see from the UI how many cores are being used, etc.
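
For reference, a minimal sketch of pointing the driver at the standalone master
and capping the cores it asks for (the master URL is illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("NetworkWordCount")
  .setMaster("spark://ubuntu-host:7077")   // the URL shown at the top of the Master UI
  .set("spark.cores.max", "2")             // streaming needs at least 2: 1 receiver + 1 processing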

Thanks
Best Regards

On Wed, Mar 25, 2015 at 6:50 PM, James King  wrote:

> Thanks Akhil,
>
> Yes indeed this is why it works when using local[2] but I'm unclear of why
> it doesn't work when using standalone daemons?
>
> Is there way to check what cores are being seen when running against
> standalone daemons?
>
> I'm running the master and worker on same ubuntu host. The Driver program
> is running from a windows machine.
>
> On ubuntu host command cat /proc/cpuinfo | grep processor | wc -l
> is giving 2
>
> On Windows machine it is:
> NumberOfCores=2
> NumberOfLogicalProcessors=4
>
>
> On Wed, Mar 25, 2015 at 2:06 PM, Akhil Das 
> wrote:
>
>> Spark Streaming requires you to have minimum of 2 cores, 1 for receiving
>> your data and the other for processing. So when you say local[2] it
>> basically initialize 2 threads on your local machine, 1 for receiving data
>> from network and the other for your word count processing.
>>
>> Thanks
>> Best Regards
>>
>> On Wed, Mar 25, 2015 at 6:31 PM, James King 
>> wrote:
>>
>>> I'm trying to run the Java NetwrokWordCount example against a simple
>>> spark standalone runtime of one  master and one worker.
>>>
>>> But it doesn't seem to work, the text entered on the Netcat data server
>>> is not being picked up and printed to Eclispe console output.
>>>
>>> However if I use conf.setMaster("local[2]") it works, the correct text
>>> gets picked up and printed to Eclipse console.
>>>
>>> Any ideas why, any pointers?
>>>
>>
>>
>


JavaKinesisWordCountASLYARN Example not working on EMR

2015-03-25 Thread ankur.jain
Hi,
I am trying to run a Spark-on-YARN program provided by Spark in the examples
directory, using Amazon Kinesis on an EMR cluster:
I am using Spark 1.3.0 and EMR AMI 3.5.0

I've setup the Credentials 
export AWS_ACCESS_KEY_ID=XX
export AWS_SECRET_KEY=XXX

*A) This is the Kinesis Word Count Producer which ran Successfully : *
run-example org.apache.spark.examples.streaming.KinesisWordCountProducerASL
mySparkStream https://kinesis.us-east-1.amazonaws.com 1 5

*B) This one is the Normal Consumer using Spark Streaming which also ran
Successfully: *
run-example org.apache.spark.examples.streaming.JavaKinesisWordCountASL
mySparkStream https://kinesis.us-east-1.amazonaws.com

*C) And this is the YARN based program which is NOT WORKING: *
run-example org.apache.spark.examples.streaming.JavaKinesisWordCountASLYARN
mySparkStream https://kinesis.us-east-1.amazonaws.com\
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
15/03/25 11:52:45 INFO spark.SparkContext: Running Spark version 1.3.0
15/03/25 11:52:45 WARN spark.SparkConf: 
SPARK_CLASSPATH was detected (set to
'/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/:/home/hadoop/spark/classpath/emrfs/:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar').
This is deprecated in Spark 1.0+.
Please instead use:
•   ./spark-submit with --driver-class-path to augment the driver classpath
•   spark.executor.extraClassPath to augment the executor classpath
15/03/25 11:52:45 WARN spark.SparkConf: Setting
'spark.executor.extraClassPath' to
'/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/:/home/hadoop/spark/classpath/emrfs/:/home/hadoop/share/hadoop/common/lib/:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar'
as a work-around.
15/03/25 11:52:45 WARN spark.SparkConf: Setting
'spark.driver.extraClassPath' to
'/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/:/home/hadoop/spark/classpath/emrfs/:/home/hadoop/share/hadoop/common/lib/:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar'
as a work-around.
15/03/25 11:52:46 INFO spark.SecurityManager: Changing view acls to: hadoop
15/03/25 11:52:46 INFO spark.SecurityManager: Changing modify acls to:
hadoop
15/03/25 11:52:46 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(hadoop); users with modify permissions: Set(hadoop)
15/03/25 11:52:47 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/03/25 11:52:48 INFO Remoting: Starting remoting
15/03/25 11:52:48 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkDriver@ip-10-80-175-92.ec2.internal:59504]
15/03/25 11:52:48 INFO util.Utils: Successfully started service
'sparkDriver' on port 59504.
15/03/25 11:52:48 INFO spark.SparkEnv: Registering MapOutputTracker
15/03/25 11:52:48 INFO spark.SparkEnv: Registering BlockManagerMaster
15/03/25 11:52:48 INFO storage.DiskBlockManager: Created local directory at
/mnt/spark/spark-120befbc-6dae-4751-b41f-dbf7b3d97616/blockmgr-d339d180-36f5-465f-bda3-cecccb23b1d3
15/03/25 11:52:48 INFO storage.MemoryStore: MemoryStore started with
capacity 265.4 MB
15/03/25 11:52:48 INFO spark.HttpFileServer: HTTP File server directory is
/mnt/spark/spark-85e88478-3dad-4fcf-a43a-efd15166bef3/httpd-6115870a-0d90-44df-aa7c-a6bd1a47e107
15/03/25 11:52:48 INFO spark.HttpServer: Starting HTTP Server
15/03/25 11:52:49 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/03/25 11:52:49 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:44879
15/03/25 11:52:49 INFO util.Utils: Successfully started service 'HTTP file
server' on port 44879.
15/03/25 11:52:49 INFO spark.SparkEnv: Registering OutputCommitCoordinator
15/03/25 11:52:49 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/03/25 11:52:49 INFO server.AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040
15/03/25 11:52:49 INFO util.Utils: Successfully started service 'SparkUI' on
port 4040.
15/03/25 11:52:49 INFO ui.SparkUI: Started SparkUI at
http://ip-10-80-175-92.ec2.internal:4040
15/03/25 11:52:50 INFO spark.SparkContext: Added JAR
file:/home/hadoop/spark/lib/spark-examples-1.3.0-hadoop2.4.0.jar at
http://10.80.175.92:44879/jars/spark-examples-1.3.0-hadoop2.4.0.jar with
timestamp 1427284370358
15/03/25 11:52:50 INFO cluster.YarnClusterScheduler: Created
YarnClusterScheduler
15/03/25 11:52:51 ERROR cluster.YarnClusterSchedulerBackend: Application ID
is not set.
15/03/25 11:52:51 INFO netty.NettyBlockTransferService: Server created on
49982
15/03/25 11:52:51 INFO storage.BlockManagerMaster: Trying to register
BlockManager
15/03/25 11:52:51 INFO storage.BlockManagerMasterActor: Registering block
manager ip-10-80-175-92.ec2.internal:49982 with 265.4 MB RAM,
BlockManagerId(, ip-10-80-175-92.ec2.internal, 49982)
15/03/25 11:52:51 INFO storage.BlockManagerMaster: Registered BlockManager
*Exception in thread "main" java.lang.NullPointerException*
*at
org.apache.s

Re: Spark ML Pipeline inaccessible types

2015-03-25 Thread Peter Rudenko

Hi Martin, here are two possibilities to overcome this:

1) Put your logic into org.apache.spark package in your project - then 
everything would be accessible.

2) Dirty trick:

object SparkVector extends HashingTF {
  val VectorUDT: DataType = outputDataType
}

then you can do something like this:

StructType("vectorTypeColumn", SparkVector.VectorUDT, false))

Thanks,
Peter Rudenko

On 2015-03-25 13:14, zapletal-mar...@email.cz wrote:


Sean,

thanks for your response. I am familiar with NoSuchMethodException 
in general, but I think it is not the case this time. The code 
actually attempts to get the parameter by name using val m = 
this.getClass.getMethodName(paramName).


This may be a bug, but it is only a side effect caused by the real 
problem I am facing. My issue is that VectorUDT is not accessible by 
user code and therefore it is not possible to use custom ML pipeline 
with the existing Predictors (see the last two paragraphs in my first 
email).


Best Regards,
Martin

-- Original message --
From: Sean Owen 
To: zapletal-mar...@email.cz
Date: 25. 3. 2015 11:05:54
Subject: Re: Spark ML Pipeline inaccessible types


NoSuchMethodError in general means that your runtime and compile-time
environments are different. I think you need to first make sure you
don't have mismatching versions of Spark.

On Wed, Mar 25, 2015 at 11:00 AM,  wrote:
> Hi,
>
> I have started implementing a machine learning pipeline using
Spark 1.3.0
> and the new pipelining API and DataFrames. I got to a point
where I have my
> training data set prepared using a sequence of Transformers, but
I am
> struggling to actually train a model and use it for predictions.
>
> I am getting a java.lang.NoSuchMethodException:
>
org.apache.spark.ml.regression.LinearRegression.myFeaturesColumnName()
> exception thrown at checkInputColumn method in Params trait when
using a
> Predictor (LinearRegression in my case, but that should not
matter). This
> looks like a bug - the exception is thrown when executing
getParam(colName)
> when the require(actualDataType.equals(datatype), ...)
requirement is not
> met so the expected requirement failed exception is not thrown
and is hidden
> by the unexpected NoSuchMethodException instead. I can raise a
bug if this
> really is an issue and I am not using something incorrectly.
>
> The problem I am facing however is that the Predictor expects
features to
> have VectorUDT type as defined in Predictor class (protected def
> featuresDataType: DataType = new VectorUDT). But since this type is
> private[spark] my Transformer can not prepare features with this
type which
> then correctly results in the exception above when I use a
different type.
>
> Is there a way to define a custom Pipeline that would be able to
use the
> existing Predictors without having to bypass the access modifiers or
> reimplement something or is the pipelining API not yet expected
to be used
> in this way?
>
> Thanks,
> Martin
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




Re: NetworkWordCount + Spark standalone

2015-03-25 Thread James King
Thanks Akhil,

Yes, indeed this is why it works when using local[2], but I'm unclear why
it doesn't work when using the standalone daemons.

Is there a way to check which cores are being seen when running against
the standalone daemons?

I'm running the master and worker on same ubuntu host. The Driver program
is running from a windows machine.

On ubuntu host command cat /proc/cpuinfo | grep processor | wc -l
is giving 2

On Windows machine it is:
NumberOfCores=2
NumberOfLogicalProcessors=4


On Wed, Mar 25, 2015 at 2:06 PM, Akhil Das 
wrote:

> Spark Streaming requires you to have minimum of 2 cores, 1 for receiving
> your data and the other for processing. So when you say local[2] it
> basically initialize 2 threads on your local machine, 1 for receiving data
> from network and the other for your word count processing.
>
> Thanks
> Best Regards
>
> On Wed, Mar 25, 2015 at 6:31 PM, James King  wrote:
>
>> I'm trying to run the Java NetwrokWordCount example against a simple
>> spark standalone runtime of one  master and one worker.
>>
>> But it doesn't seem to work, the text entered on the Netcat data server
>> is not being picked up and printed to Eclispe console output.
>>
>> However if I use conf.setMaster("local[2]") it works, the correct text
>> gets picked up and printed to Eclipse console.
>>
>> Any ideas why, any pointers?
>>
>
>


Re: FAILED SelectChannelConnector@0.0.0.0:4040 java.net.BindException: Address already in use

2015-03-25 Thread , Roy
Yes, I do have other applications already running.

Thanks for your explanation.
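
For completeness, a hedged sketch of avoiding the port retries altogether by
giving each application its own UI port (the port value is illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("my-job")
  .set("spark.ui.port", "4050")   // default is 4040; when taken, Spark retries 4041, 4042, ...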



On Wed, Mar 25, 2015 at 2:49 AM, Akhil Das 
wrote:

> It means you are already having 4 applications running on 4040, 4041,
> 4042, 4043. And that's why it was able to run on 4044.
>
> You can do a *netstat -pnat | grep 404* *And see what all processes are
> running.
>
> Thanks
> Best Regards
>
> On Wed, Mar 25, 2015 at 1:13 AM, , Roy  wrote:
>
>> I get following message for each time I run spark job
>>
>>
>>1. 15/03/24 15:35:56 WARN AbstractLifeCycle: FAILED
>>SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address
>>already in use
>>
>>
>> full trace is here
>>
>> http://pastebin.com/xSvRN01f
>>
>> how do I fix this ?
>>
>> I am on CDH 5.3.1
>>
>> thanks
>> roy
>>
>>
>>
>


Re: spark worker on mesos slave | possible networking config issue

2015-03-25 Thread Akhil Das
Remove SPARK_LOCAL_IP then?

Thanks
Best Regards

On Wed, Mar 25, 2015 at 6:45 PM, Anirudha Jadhav  wrote:

> is there a way to have this dynamically pick the local IP.
>
> static assignment does not work cos the workers  are dynamically allocated
> on mesos
>
> On Wed, Mar 25, 2015 at 3:04 AM, Akhil Das 
> wrote:
>
>> It says:
>> ried to associate with unreachable remote address 
>> [akka.tcp://sparkDriver@localhost:51849].
>> Address is now gated for 5000 ms, all messages to this address will be
>> delivered to dead letters. Reason: Connection refused: localhost/
>> 127.0.0.1:51849
>>
>> I'd suggest you changing this property:
>> export SPARK_LOCAL_IP=127.0.0.1
>>
>> Point it to your network address like 192.168.1.10
>>
>> Thanks
>> Best Regards
>>
>> On Tue, Mar 24, 2015 at 11:18 PM, Anirudha Jadhav 
>> wrote:
>>
>>> is there some setting i am missing:
>>> this is my spark-env.sh>>>
>>>
>>> export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
>>> export SPARK_EXECUTOR_URI=http://100.125.5.93/sparkx.tgz
>>> export SPARK_LOCAL_IP=127.0.0.1
>>>
>>>
>>>
>>> here is what i see on the slave node.
>>> 
>>> less
>>> 20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/stderr
>>> >
>>>
>>> WARNING: Logging before InitGoogleLogging() is written to STDERR
>>> I0324 02:30:29.389225 27755 fetcher.cpp:76] Fetching URI '
>>> http://100.125.5.93/sparkx.tgz'
>>> I0324 02:30:29.389361 27755 fetcher.cpp:126] Downloading '
>>> http://100.125.5.93/sparkx.tgz' to
>>> '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/sparkx.tgz'
>>> I0324 02:30:35.353446 27755 fetcher.cpp:64] Extracted resource
>>> '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56/sparkx.tgz'
>>> into
>>> '/tmp/mesos/slaves/20150226-160708-78932-5050-8971-S0/frameworks/20150323-205508-78932-5050-29804-0012/executors/20150226-160708-78932-5050-8971-S0/runs/cceea834-c4d9-49d6-a579-8352f1889b56'
>>> Spark assembly has been built with Hive, including Datanucleus jars on
>>> classpath
>>> Using Spark's default log4j profile:
>>> org/apache/spark/log4j-defaults.properties
>>> 15/03/24 02:30:37 INFO MesosExecutorBackend: Registered signal handlers
>>> for [TERM, HUP, INT]
>>> I0324 02:30:37.071077 27863 exec.cpp:132] Version: 0.21.1
>>> I0324 02:30:37.080971 27885 exec.cpp:206] Executor registered on slave
>>> 20150226-160708-78932-5050-8971-S0
>>> 15/03/24 02:30:37 INFO MesosExecutorBackend: Registered with Mesos as
>>> executor ID 20150226-160708-78932-5050-8971-S0 with 1 cpus
>>> 15/03/24 02:30:37 INFO SecurityManager: Changing view acls to: ubuntu
>>> 15/03/24 02:30:37 INFO SecurityManager: Changing modify acls to: ubuntu
>>> 15/03/24 02:30:37 INFO SecurityManager: SecurityManager: authentication
>>> disabled; ui acls disabled; users with view permissions: Set(ubuntu); users
>>> with modify permissions: Set(ubuntu)
>>> 15/03/24 02:30:37 INFO Slf4jLogger: Slf4jLogger started
>>> 15/03/24 02:30:37 INFO Remoting: Starting remoting
>>> 15/03/24 02:30:38 INFO Remoting: Remoting started; listening on
>>> addresses :[akka.tcp://sparkExecutor@mesos-si2:50542]
>>> 15/03/24 02:30:38 INFO Utils: Successfully started service
>>> 'sparkExecutor' on port 50542.
>>> 15/03/24 02:30:38 INFO AkkaUtils: Connecting to MapOutputTracker:
>>> akka.tcp://sparkDriver@localhost:51849/user/MapOutputTracker
>>> 15/03/24 02:30:38 WARN Remoting: Tried to associate with unreachable
>>> remote address [akka.tcp://sparkDriver@localhost:51849]. Address is now
>>> gated for 5000 ms, all messages to this address will be delivered to dead
>>> letters. Reason: Connection refused: localhost/127.0.0.1:51849
>>> akka.actor.ActorNotFound: Actor not found for:
>>> ActorSelection[Anchor(akka.tcp://sparkDriver@localhost:51849/),
>>> Path(/user/MapOutputTracker)]
>>> at
>>> akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
>>> at
>>> akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)
>>> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>>> at
>>> akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
>>> at
>>> akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
>>> at
>>> akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
>>> at
>>> akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
>>> at
>>> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72
