Re: Pig on Spark
Hey Mayur, We use HiveColumnarLoader and XMLLoader. Are these working as well? Will try a few things regarding porting Java MR. Regards, Suman Bharadwaj S

On Thu, Apr 24, 2014 at 3:09 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: Right now UDF is not working. It's at the top of the list though; you should be able to use it soon :) Are there any other Pig features you use often, apart from the usual suspects? Existing Java MR jobs would be an easier move. Are these cascading jobs or single map-reduce jobs? If single, you should be able to write a Scala wrapper that calls your map-reduce functions and, with some glue, leaves your core code as-is (a rough sketch of this appears at the end of the thread). Would be interesting to see an actual example get it to work. Regards, Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi

On Thu, Apr 24, 2014 at 2:46 AM, suman bharadwaj suman@gmail.com wrote: We are currently in the process of converting Pig and Java map-reduce jobs to Spark jobs, and we have written a couple of Pig UDFs as well. Hence I was checking whether we can leverage Spork without converting to Spark jobs. And is there any way I can port my existing Java MR jobs to Spark? I know this thread has a different subject; let me know if I need to ask this question in a separate thread. Thanks in advance.

On Thu, Apr 24, 2014 at 2:13 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: UDF, Generate, and many more are not working :) Several of them do work: joins, filters, group by, etc. I am translating the ones we need, and would be happy to get help on the others. Will host a JIRA to track them if you are interested. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi

On Thu, Apr 24, 2014 at 2:10 AM, suman bharadwaj suman@gmail.com wrote: Are all the features available in Pig working in Spork? For example, UDFs? Thanks.

On Thu, Apr 24, 2014 at 1:54 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: There are two benefits I get as of now: 1. Most of the time a lot of customers don't want the full power; they want something dead simple with which they can use a DSL. They end up using Hive for a lot of ETL just because it's SQL and they understand it. Pig is close: it wraps a lot of framework-level semantics away from the user and lets him focus on the data flow. 2. Some already have codebases in Pig and are just looking to run them faster; I am yet to benchmark that on Pig on Spark. I agree that Pig on Spark cannot solve a lot of problems, but it can solve some without forcing the end customer to do anything even close to coding, and I believe there is quite some value in making Spark accessible to a larger audience. At the end of the day, to each his own :) Regards, Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi

On Thu, Apr 24, 2014 at 1:24 AM, Bharath Mundlapudi mundlap...@gmail.com wrote: This seems like an interesting question. I love Apache Pig. It is so natural and the language flows with nice syntax. While I was at Yahoo! in core Hadoop engineering, I used Pig a lot for analytics and provided feedback to the Pig team to add much more functionality when it was at version 0.7. Lots of new functionality is offered now. At the end of the day, Pig is a DSL for data flows. There will always be gaps and enhancements. I have often wondered: is a DSL the right way to solve data flow problems? Maybe not; we need complete language constructs. We may have found the answer: Scala.
With Scala's dynamic compilation, we can write much more powerful constructs than any DSL can provide. If I were a new organization beginning to choose, I would go with Scala. Here is the example:

#!/bin/sh
exec scala "$0" "$@"
!#
YOUR DSL GOES HERE, BUT IN SCALA!

You have DSL-like scripting, functional style, and complete language power! If we can improve the first 3 lines, there you go: you have the most powerful DSL to solve data problems. -Bharath

On Mon, Mar 10, 2014 at 11:00 PM, Xiangrui Meng men...@gmail.com wrote: Hi Sameer, Lin (cc'ed) could also give you some updates about Pig on Spark development on her side. Best, Xiangrui

On Mon, Mar 10, 2014 at 12:52 PM, Sameer Tilak ssti...@live.com wrote: Hi Mayur, We are planning to upgrade our distribution from MR1 to MR2 (YARN), and the goal is to get Spork set up next month. I will keep you posted. Can you please keep me informed about your progress as well? From: mayur.rust...@gmail.com Date: Mon, 10 Mar 2014 11:47:56 -0700 Subject: Re: Pig on Spark To: user@spark.apache.org Hi Sameer, Did you make any progress on this? My team is also trying it out and would love to know some details of your progress. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Thu, Mar 6, 2014 at 2:20 PM,
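As a footnote to Mayur's suggestion earlier in this thread about wrapping single map-reduce jobs: a minimal sketch of what that could look like, assuming the per-record logic of the old Mapper and Reducer has been extracted into plain functions. This is my illustration, not code from the thread; MRWrapper, mapLogic, and reduceLogic are hypothetical names.

--
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // brings reduceByKey into scope for pair RDDs

object MRWrapper {
  // Per-record logic pulled out of the old Mapper (hypothetical word count).
  def mapLogic(line: String): Seq[(String, Int)] =
    line.split("\\s+").map(word => (word, 1)).toSeq

  // Merge logic pulled out of the old Reducer.
  def reduceLogic(a: Int, b: Int): Int = a + b

  def main(args: Array[String]) {
    val sc = new SparkContext(args(0), "mr-wrapper")
    sc.textFile(args(1))
      .flatMap(mapLogic)          // plays the role of Mapper.map
      .reduceByKey(reduceLogic)   // plays the role of Reducer.reduce
      .saveAsTextFile(args(2))
    sc.stop()
  }
}
--

The point of the pattern is that the core code stays untouched; only the MR driver boilerplate is replaced by RDD plumbing.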
MultipleOutputs IdentityReducer
Hello, I am trying to write multiple files with Spark, but I cannot find a way to do it. Here is the idea:

val rddKeyValue: RDD[(String, String)] = rddLines.map(line => createKeyValue(line))

Now I would like to save this as keyname.txt, with all the values for that key inside the file. I tried to save after the map, but it would overwrite the file, so I would get only one value for each file. With groupByKey I get an OutOfMemoryError, so I wonder if there is a way to append the next line with the same key to the text file? On Hadoop we can use IdentityReducer and a KeyBasedOutput [1]. I tried this:

rddKeyValue.saveAsHadoopFile("hdfs://test-platform-analytics-master/tmp/dump/product",
  classOf[String], classOf[String], classOf[KeyBasedOutput[String, String]])

[1]
class KeyBasedOutput[T >: Null, V <: AnyRef] extends MultipleTextOutputFormat[T, V] {
  /**
   * Use the key as part of the path for the final output file.
   */
  override protected def generateFileNameForKeyValue(key: T, value: V, leaf: String) = {
    key.toString()
  }
  /**
   * When actually writing the data, discard the key since it is already in
   * the file path.
   */
  override protected def generateActualKey(key: T, value: V) = {
    null
  }
}

Thanks a lot
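One note from me, not the original poster: output files are created per task, so values sharing a key only land together in one file if they sit in the same partition; otherwise two tasks may try to write the same file name. A minimal sketch, assuming the rddKeyValue and KeyBasedOutput definitions above, that forces each key into a single partition first:

--
import org.apache.spark.HashPartitioner

// Route all records for a key to one partition, then let KeyBasedOutput
// write one file per key. 16 partitions is an arbitrary choice.
val partitioned = rddKeyValue.partitionBy(new HashPartitioner(16))
partitioned.saveAsHadoopFile("hdfs://test-platform-analytics-master/tmp/dump/product",
  classOf[String], classOf[String], classOf[KeyBasedOutput[String, String]])
--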
RE: JMX with Spark
Can you share your working metrics.properties? I want remote JMX to be enabled, so I need to use the JMXSink and monitor my Spark master and workers. But what are the parameters that need to be defined, like host and port? So your config would help. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/JMX-with-Spark-tp4309p4823.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
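For reference, a minimal sketch of such a setup (my assumption of what is wanted, not a config posted in the thread): the JMX sink itself takes no host/port, it only registers the metrics with the JVM's MBean server; remote access is then enabled with the standard JVM flags. The port below is an arbitrary choice.

--
# conf/metrics.properties: enable the JMX sink for all instances
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink

# conf/spark-env.sh: expose remote JMX on the master/worker daemon JVMs
export SPARK_DAEMON_JAVA_OPTS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=8090 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"
--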
read file from hdfs
I have just two questions. sc.textFile("hdfs://host:port/user/matei/whatever.txt") Is host the master node? What port should we use? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/read-file-from-hdfs-tp4824.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
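To illustrate (my note, not an answer from the thread): the host and port identify the HDFS namenode, i.e. they must match fs.default.name in core-site.xml; they are not the Spark master's. 9000 below is a common default, but yours may differ.

--
// host:port = the HDFS namenode address from core-site.xml (fs.default.name),
// not the Spark master URL.
val lines = sc.textFile("hdfs://namenode-host:9000/user/matei/whatever.txt")
--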
Re: Pig on Spark
I've only had a quick look at Pig, but it seems that a declarative layer on top of Spark couldn't be anything other than a big win, as it allows developers to declare *what* they want, permitting the compiler to determine how best to poke at the RDD API to implement it. In my brief time with Spark, I've often thought that it feels very unnatural to use imperative code to declare a pipeline.
FW: reduceByKeyAndWindow - spark internals
Any suggestions where I can find this in the documentation or elsewhere? Thanks From: Adrian Mocanu [mailto:amoc...@verticalscope.com] Sent: April-24-14 11:26 AM To: u...@spark.incubator.apache.org Subject: reduceByKeyAndWindow - spark internals If I have this code:

val stream1 = doublesInputStream.window(Seconds(10), Seconds(2))
val stream2 = stream1.reduceByKeyAndWindow(_ + _, Seconds(10), Seconds(10))

Does reduceByKeyAndWindow merge all RDDs from stream1 that came in the 10 second window? For example, in the first 10 secs stream1 will have 5 RDDs. Does reduceByKeyAndWindow merge these 5 RDDs into 1 RDD and remove duplicates? -Adrian
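As I understand it (my note, not a reply from the thread), reduceByKeyAndWindow does combine all of the window's RDDs into one, but per key it folds the values with the supplied function rather than removing duplicates, so repeated keys are summed here:

--
// Conceptually: union the window's RDDs, then reduce by key, so
// ("a", 1.0), ("a", 1.0), ("b", 2.0)  =>  ("a", 2.0), ("b", 2.0)
val summed = stream1.reduceByKeyAndWindow((a: Double, b: Double) => a + b,
  Seconds(10), Seconds(10))
--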
Re: Deploying a python code on a spark EC2 cluster
This is the error from stderr: Spark Executor Command: java -cp :/root/ephemeral-hdfs/conf:/root/ephemeral-hdfs/conf:/root/ephemeral-hdfs/conf:/root/spark/conf:/root/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop1.0.4.jar -Djava.library.path=/root/ephemeral-hdfs/lib/native/ -Dspark.local.dir=/mnt/spark -Dspark.local.dir=/mnt/spark -Dspark.local.dir=/mnt/spark -Dspark.local.dir=/mnt/spark -Xms2048M -Xmx2048M org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://spark@192.168.122.1:44577/user/CoarseGrainedScheduler 1 ip-10-84-7-178.eu-west-1.compute.internal 1 akka.tcp://sparkwor...@ip-10-84-7-178.eu-west-1.compute.internal:57839/user/Worker app-20140425133749- 14/04/25 13:39:37 INFO slf4j.Slf4jLogger: Slf4jLogger started 14/04/25 13:39:38 INFO Remoting: Starting remoting 14/04/25 13:39:38 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkexecu...@ip-10-84-7-178.eu-west-1.compute.internal:36800] 14/04/25 13:39:38 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkexecu...@ip-10-84-7-178.eu-west-1.compute.internal:36800] 14/04/25 13:39:38 INFO worker.WorkerWatcher: Connecting to worker akka.tcp://sparkwor...@ip-10-84-7-178.eu-west-1.compute.internal:57839/user/Worker 14/04/25 13:39:38 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://spark@192.168.122.1:44577/user/CoarseGrainedScheduler 14/04/25 13:39:39 INFO worker.WorkerWatcher: Successfully connected to akka.tcp://sparkwor...@ip-10-84-7-178.eu-west-1.compute.internal:57839/user/Worker 14/04/25 13:41:19 ERROR executor.CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkexecu...@ip-10-84-7-178.eu-west-1.compute.internal:36800] - [akka.tcp://spark@192.168.122.1:44577] disassociated! Shutting down. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Deploying-a-python-code-on-a-spark-EC2-cluster-tp4758p4828.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Pig on Spark
It depends; personally I have the opposite opinion. IMO expressing pipelines in a functional language feels natural; you just have to get used to the language (Scala). Testing Spark jobs is easy, whereas testing a Pig script is much harder and not natural. If you want a more high-level language that deals with RDDs for you, you can use Spark SQL http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html Of course you can express fewer things this way, but if you have some complex logic I think it would make sense to write a classic Spark job, which would be more robust in the long term. 2014-04-25 15:30 GMT+02:00 Mark Baker dist...@acm.org: I've only had a quick look at Pig, but it seems that a declarative layer on top of Spark couldn't be anything other than a big win, as it allows developers to declare *what* they want, permitting the compiler to determine how best to poke at the RDD API to implement it. In my brief time with Spark, I've often thought that it feels very unnatural to use imperative code to declare a pipeline.
Re: Deploying a python code on a spark EC2 cluster
In order to check if there is any issue with the Python API, I ran a Scala application provided in the examples. Still the same error: ./bin/run-example org.apache.spark.examples.SparkPi spark://[Master-URL]:7077 SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/mnt/work/spark-0.9.1/examples/target/scala-2.10/spark-examples-assembly-0.9.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/mnt/work/spark-0.9.1/assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop1.0.4.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 14/04/25 17:07:10 INFO Utils: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 14/04/25 17:07:10 WARN Utils: Your hostname, rd-hu resolves to a loopback address: 127.0.1.1; using 192.168.122.1 instead (on interface virbr0) 14/04/25 17:07:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 14/04/25 17:07:11 INFO Slf4jLogger: Slf4jLogger started 14/04/25 17:07:11 INFO Remoting: Starting remoting 14/04/25 17:07:11 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@192.168.122.1:26278] 14/04/25 17:07:11 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@192.168.122.1:26278] 14/04/25 17:07:11 INFO SparkEnv: Registering BlockManagerMaster 14/04/25 17:07:11 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140425170711-d1da 14/04/25 17:07:11 INFO MemoryStore: MemoryStore started with capacity 16.0 GB. 14/04/25 17:07:11 INFO ConnectionManager: Bound socket to port 9788 with id = ConnectionManagerId(192.168.122.1,9788) 14/04/25 17:07:11 INFO BlockManagerMaster: Trying to register BlockManager 14/04/25 17:07:11 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager 192.168.122.1:9788 with 16.0 GB RAM 14/04/25 17:07:11 INFO BlockManagerMaster: Registered BlockManager 14/04/25 17:07:11 INFO HttpServer: Starting HTTP Server 14/04/25 17:07:11 INFO HttpBroadcast: Broadcast server started at http://192.168.122.1:58091 14/04/25 17:07:11 INFO SparkEnv: Registering MapOutputTracker 14/04/25 17:07:11 INFO HttpFileServer: HTTP File server directory is /tmp/spark-599577a4-5732-4949-a2e8-f59eb679e843 14/04/25 17:07:11 INFO HttpServer: Starting HTTP Server 14/04/25 17:07:12 WARN AbstractLifeCycle: FAILED SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use java.net.BindException: Address already in use at sun.nio.ch.Net.bind0(Native Method) at sun.nio.ch.Net.bind(Net.java:444) at sun.nio.ch.Net.bind(Net.java:436) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) at org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187) at org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316) at org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265) at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64) at org.eclipse.jetty.server.Server.doStart(Server.java:286) at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64) at org.apache.spark.ui.JettyUtils$$anonfun$1.apply$mcV$sp(JettyUtils.scala:118) at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:118) at
org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:118) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.ui.JettyUtils$.connect$1(JettyUtils.scala:118) at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:129) at org.apache.spark.ui.SparkUI.bind(SparkUI.scala:57) at org.apache.spark.SparkContext.init(SparkContext.scala:159) at org.apache.spark.SparkContext.init(SparkContext.scala:100) at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31) at org.apache.spark.examples.SparkPi.main(SparkPi.scala) 14/04/25 17:07:12 WARN AbstractLifeCycle: FAILED org.eclipse.jetty.server.Server@74f4b96: java.net.BindException: Address already in use java.net.BindException: Address already in use at sun.nio.ch.Net.bind0(Native Method) at sun.nio.ch.Net.bind(Net.java:444) at sun.nio.ch.Net.bind(Net.java:436) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) at org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187) at org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316) at
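A side note from me (not part of the thread): the BindException on port 4040 above only means another SparkUI is already bound there, and may be unrelated to the driver-disassociation failure. In Spark 0.9 the UI port can be moved before the context is created, e.g.:

--
// Pick a free web-UI port before creating the SparkContext.
// masterUrl is a placeholder for your spark:// URL.
System.setProperty("spark.ui.port", "4041")
val sc = new SparkContext(masterUrl, "SparkPi")
--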
Re: what is the best way to do cartesian
You might want to try the built-in RDD.cartesian() method. On Thu, Apr 24, 2014 at 9:05 PM, Qin Wei wei@dewmobile.net wrote: Hi All, I have a problem with the item-based collaborative filtering recommendation algorithms in Spark. The basic flow is as below:

RDD1:
(Item1, (User1, Score1))
(Item2, (User2, Score2))
(Item1, (User2, Score3))
(Item2, (User1, Score4))

RDD1.groupByKey => RDD2:
(Item1, ((User1, Score1), (User2, Score3)))
(Item2, ((User1, Score4), (User2, Score2)))

The similarity of the vectors ((User1, Score1), (User2, Score3)) and ((User1, Score4), (User2, Score2)) is the similarity of Item1 and Item2. In my situation, RDD2 contains 20 million records, and my Spark program is extremely slow. The source code is as below:

val conf = new SparkConf().setMaster("spark://211.151.121.184:7077").setAppName("Score Calcu Total").set("spark.executor.memory", "20g").setJars(Seq("/home/deployer/score-calcu-assembly-1.0.jar"))
val sc = new SparkContext(conf)
val mongoRDD = sc.textFile(args(0).toString, 400)
val jsonRDD = mongoRDD.map(arg => new JSONObject(arg))
val newRDD = jsonRDD.map(arg => {
  var score = haha(arg.get("a").asInstanceOf[JSONObject])
  // set score to 0.5 for testing
  arg.put("score", 0.5)
  arg
})
val resourceScoresRDD = newRDD.map(arg => (arg.get("rid").toString.toLong, (arg.get("zid").toString, arg.get("score").asInstanceOf[Number].doubleValue))).groupByKey().cache()
val resourceScores = resourceScoresRDD.collect()
val bcResourceScores = sc.broadcast(resourceScores)
val simRDD = resourceScoresRDD.mapPartitions({ iter =>
  val m = bcResourceScores.value
  for {
    (r1, v1) <- iter
    (r2, v2) <- m
    if r1 < r2
  } yield (r1, r2, cosSimilarity(v1, v2))
}, true).filter(arg => arg._3 > 0.1)
println(simRDD.count)

And I saw this in the Spark Web UI: http://apache-spark-user-list.1001560.n3.nabble.com/file/n4807/QQ%E6%88%AA%E5%9B%BE20140424204018.png http://apache-spark-user-list.1001560.n3.nabble.com/file/n4807/QQ%E6%88%AA%E5%9B%BE20140424204001.png My standalone cluster has 3 worker nodes (16 cores and 32G RAM each), and the load on the machines in my cluster is heavy when the Spark program is running. Is there any better way to do the algorithm? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/what-is-the-best-way-to-do-cartesian-tp4807.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: what is the best way to do cartesian
Depending on the size of the RDD, you could also do a collect, broadcast it, and then compute the product in a map function over the other RDD. If this is the same RDD you might also want to cache it. This pattern worked quite well for me. On Apr 25, 2014 at 18:33, Alex Boisvert alex.boisv...@gmail.com wrote: You might want to try the built-in RDD.cartesian() method. On Thu, Apr 24, 2014 at 9:05 PM, Qin Wei wei@dewmobile.net wrote: Hi All, I have a problem with the item-based collaborative filtering recommendation algorithms in Spark. The basic flow is as below:

RDD1:
(Item1, (User1, Score1))
(Item2, (User2, Score2))
(Item1, (User2, Score3))
(Item2, (User1, Score4))

RDD1.groupByKey => RDD2:
(Item1, ((User1, Score1), (User2, Score3)))
(Item2, ((User1, Score4), (User2, Score2)))

The similarity of the vectors ((User1, Score1), (User2, Score3)) and ((User1, Score4), (User2, Score2)) is the similarity of Item1 and Item2. In my situation, RDD2 contains 20 million records, and my Spark program is extremely slow. The source code is as below:

val conf = new SparkConf().setMaster("spark://211.151.121.184:7077").setAppName("Score Calcu Total").set("spark.executor.memory", "20g").setJars(Seq("/home/deployer/score-calcu-assembly-1.0.jar"))
val sc = new SparkContext(conf)
val mongoRDD = sc.textFile(args(0).toString, 400)
val jsonRDD = mongoRDD.map(arg => new JSONObject(arg))
val newRDD = jsonRDD.map(arg => {
  var score = haha(arg.get("a").asInstanceOf[JSONObject])
  // set score to 0.5 for testing
  arg.put("score", 0.5)
  arg
})
val resourceScoresRDD = newRDD.map(arg => (arg.get("rid").toString.toLong, (arg.get("zid").toString, arg.get("score").asInstanceOf[Number].doubleValue))).groupByKey().cache()
val resourceScores = resourceScoresRDD.collect()
val bcResourceScores = sc.broadcast(resourceScores)
val simRDD = resourceScoresRDD.mapPartitions({ iter =>
  val m = bcResourceScores.value
  for {
    (r1, v1) <- iter
    (r2, v2) <- m
    if r1 < r2
  } yield (r1, r2, cosSimilarity(v1, v2))
}, true).filter(arg => arg._3 > 0.1)
println(simRDD.count)

And I saw this in the Spark Web UI: http://apache-spark-user-list.1001560.n3.nabble.com/file/n4807/QQ%E6%88%AA%E5%9B%BE20140424204018.png http://apache-spark-user-list.1001560.n3.nabble.com/file/n4807/QQ%E6%88%AA%E5%9B%BE20140424204001.png My standalone cluster has 3 worker nodes (16 cores and 32G RAM each), and the load on the machines in my cluster is heavy when the Spark program is running. Is there any better way to do the algorithm? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/what-is-the-best-way-to-do-cartesian-tp4807.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
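For completeness, here is a rough sketch (mine, not from the thread) of what the cartesian suggestion would look like on the poster's data, keeping each unordered pair once; resourceScoresRDD and cosSimilarity are the names from the original code:

--
// Pair every item vector with every other item vector, once per unordered pair.
val pairs = resourceScoresRDD.cartesian(resourceScoresRDD)
  .filter { case ((r1, _), (r2, _)) => r1 < r2 }
val simRDD = pairs.map { case ((r1, v1), (r2, v2)) =>
  (r1, r2, cosSimilarity(v1, v2))
}.filter(_._3 > 0.1)
--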
Spark Shark 0.9.1 on ec2 with Hadoop 2 error
I've run into a problem trying to launch a cluster using the provided EC2 Python script with --hadoop-major-version 2. The launch completes correctly, apart from an exception thrown for Tachyon (included at the end of the message, but that is not the focus and it seems unrelated to my issue). When I log in and try to run shark-withinfo, I get the following exception, and I'm not sure where to go from here. Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:278) at shark.SharkCliDriver$.main(SharkCliDriver.scala:128) at shark.SharkCliDriver.main(SharkCliDriver.scala) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException at org.apache.hadoop.hive.ql.metadata.HiveUtils.getAuthenticator(HiveUtils.java:368) at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:270) ... 2 more Caused by: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException at org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator.setConf(HadoopDefaultAuthenticator.java:53) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) at org.apache.hadoop.hive.ql.metadata.HiveUtils.getAuthenticator(HiveUtils.java:365) ... 3 more Caused by: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131) at org.apache.hadoop.security.Groups.init(Groups.java:64) at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240) at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255) at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:232) at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:718) at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:703) at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:605) at org.apache.hadoop.hive.shims.HadoopShimsSecure.getUGIForConf(HadoopShimsSecure.java:491) at org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator.setConf(HadoopDefaultAuthenticator.java:51) ... 6 more Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:129) ... 15 more Caused by: java.lang.UnsatisfiedLinkError: org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative()V at org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative(Native Method) at org.apache.hadoop.security.JniBasedUnixGroupsMapping.clinit(JniBasedUnixGroupsMapping.java:49) at org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback.init(JniBasedUnixGroupsMappingWithFallback.java:38) ...
20 more For completeness, the Tachyon exception during cluster launch: Exception in thread main java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 at tachyon.util.CommonUtils.runtimeException(CommonUtils.java:246) at tachyon.UnderFileSystemHdfs.init(UnderFileSystemHdfs.java:73) at tachyon.UnderFileSystemHdfs.getClient(UnderFileSystemHdfs.java:53) at tachyon.UnderFileSystem.get(UnderFileSystem.java:53) at tachyon.Format.main(Format.java:54) Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 at org.apache.hadoop.ipc.Client.call(Client.java:1070) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225) at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379) at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:238) at
Re: Securing Spark's Network
Hi Jacob, This post might give you a brief idea about the ports being used: https://groups.google.com/forum/#!topic/spark-users/PN0WoJiB0TA On Fri, Apr 25, 2014 at 8:53 PM, Jacob Eisinger jeis...@us.ibm.com wrote: Howdy, We tried running Spark 0.9.1 stand-alone inside Docker containers distributed over multiple hosts. This is complicated due to Spark opening up ephemeral / dynamic ports for the workers and the CLI. To ensure our Docker solution doesn't break Spark in unexpected ways and maintains a secure cluster, I am interested in understanding more about Spark's network architecture. I'd appreciate it if you could point us to any documentation! A couple of specific questions: 1. What are these ports being used for? Checking out the code / experiments, it looks like asynchronous communication for shuffling around results. Anything else? 2. How do you secure the network? Network administrators tend to secure and monitor the network at the port level. If these ports are dynamic and open randomly, firewalls are not easily configured and security alarms are raised. Is there a way to limit the range easily? (We did investigate setting the kernel parameter ip_local_reserved_ports, but this is broken [1] on some versions of Linux's cgroups.) Thanks, Jacob [1] https://github.com/lxc/lxc/issues/97 Jacob D. Eisinger IBM Emerging Technologies jeis...@us.ibm.com - (512) 286-6075
Strange lookup behavior. Possible bug?
Hi All, I'm running a lookup on a JavaPairRDD<String, Tuple2>. When running on the local machine, the lookup is successful. However, when running on a standalone cluster with the exact same dataset, one of the tasks never ends (constantly in RUNNING status). When viewing the worker log, it seems that the task has finished successfully: 14/04/25 13:40:38 INFO BlockManager: Found block rdd_2_0 locally 14/04/25 13:40:38 INFO Executor: Serialized size of result for 2 is 10896794 14/04/25 13:40:38 INFO Executor: Sending result for 2 directly to driver 14/04/25 13:40:38 INFO Executor: Finished task ID 2 But it seems the driver is not aware of this, and hangs indefinitely. If I execute a count prior to the lookup, I get the correct number, which suggests that the cluster is operating as expected. The exact same scenario works with a different type of key (Tuple2): JavaPairRDD<Tuple2, Tuple2>. Any ideas on how to debug this problem? Thanks, Yadid
help
I need someone's help, please. I am getting the following error: [error] 14/04/26 03:09:47 INFO cluster.SparkDeploySchedulerBackend: Executor app-20140426030946-0004/8 removed: class java.io.IOException: Cannot run program /home/exobrain/install/spark-0.9.1/bin/compute-classpath.sh (in directory .): error=13 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/help-tp4841.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Pig on Spark
I've only had a quick look at Pig, but it seems that a declarative layer on top of Spark couldn't be anything other than a big win, as it allows developers to declare *what* they want, permitting the compiler to determine how best to poke at the RDD API to implement it. The devil is in the details: allowing developers to declare *what* they want seems impractical in a declarative world, since we are bound by the DSL constructs. The workaround, or rather hack, is to use UDFs to get full language constructs. Some problems are hard, and you will have to twist your mind to solve them in a restrictive way; at that point we wish we had complete language power. Having been in the Big Data world for a short time (7 years), I have seen enough problems with Hive/Pig. All I am providing here is a thought to spark the Spark community to think beyond declarative constructs. I am sure there is a place for Pig and Hive. -Bharath

On Fri, Apr 25, 2014 at 10:21 AM, Michael Armbrust mich...@databricks.com wrote: On Fri, Apr 25, 2014 at 6:30 AM, Mark Baker dist...@acm.org wrote: I've only had a quick look at Pig, but it seems that a declarative layer on top of Spark couldn't be anything other than a big win, as it allows developers to declare *what* they want, permitting the compiler to determine how best to poke at the RDD API to implement it. Having Pig too would certainly be a win, but Spark SQL http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html is also a declarative layer on top of Spark. Since the optimization is lazy, you can chain multiple SQL statements in a row and still optimize them holistically (similar to a Pig job). An alpha version is coming soon to a Spark 1.0 release near you! Spark SQL also lets you drop back into functional Scala when that is more natural for a particular task.
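To make the Spark SQL point concrete, here is a minimal sketch in the style of the linked programming guide (the file name, table name, and schema are my own illustrations, and the API is the 1.0 alpha, so details may shift):

--
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)   // illustrative schema

val sqlContext = new SQLContext(sc)          // sc: an existing SparkContext
import sqlContext._                          // implicit RDD -> SchemaRDD conversion

// Turn an ordinary RDD of case classes into a queryable table...
val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerAsTable("people")

// ...query it declaratively, then drop back into functional Scala.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.map(row => "Name: " + row(0)).collect().foreach(println)
--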
Re: Spark and HBase
Phoenix generally presents itself as an endpoint using JDBC, which in my testing seems to play nicely with JdbcRDD. However, a few days ago a patch was made against Phoenix to implement support via Pig using a custom Hadoop InputFormat, which means now it has Spark support too. Here's a code snippet that sets up an RDD for a specific query:

--
val phoenixConf = new PhoenixPigConfiguration(new Configuration())
phoenixConf.setSelectStatement("SELECT EVENTTYPE, EVENTTIME FROM EVENTS WHERE EVENTTYPE = 'some_type'")
phoenixConf.setSelectColumns("EVENTTYPE,EVENTTIME")
phoenixConf.configure("servername", "EVENTS", 100L)
val phoenixRDD = sc.newAPIHadoopRDD(
  phoenixConf.getConfiguration(),
  classOf[PhoenixInputFormat],
  classOf[NullWritable],
  classOf[PhoenixRecord])
--

I'm still very new at Spark and even less experienced with Phoenix, but I'm hoping there's an advantage over the JdbcRDD in terms of partitioning. The JdbcRDD seems to implement partitioning based on a user-defined query predicate, but I think Phoenix's InputFormat is able to figure out the splits, which Spark is able to leverage. I don't really know how to verify whether this is the case though, so if anyone else is looking into this, I'd love to hear their thoughts. Josh

On Tue, Apr 8, 2014 at 1:00 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Just took a quick look at the overview here http://phoenix.incubator.apache.org/ and the quick start guide here http://phoenix.incubator.apache.org/Phoenix-in-15-minutes-or-less.html. It looks like Apache Phoenix aims to provide flexible SQL access to data, both for transactional and analytic purposes, and at interactive speeds. Nick

On Tue, Apr 8, 2014 at 12:38 PM, Bin Wang binwang...@gmail.com wrote: First, I have not tried it myself. However, from what I have heard it has some basic SQL features, so you can query your HBase table the way you query content on HDFS using Hive. So it is not just querying a simple column; I believe you can do joins and other SQL queries. Maybe you can wrap up an EMR cluster with HBase preconfigured and give it a try. Sorry I cannot provide a more detailed explanation or help.

On Tue, Apr 8, 2014 at 10:17 AM, Flavio Pompermaier pomperma...@okkam.it wrote: Thanks for the quick reply, Bin. Phoenix is something I'm going to try for sure, but it seems somehow useless if I can use Spark. Probably, as you said, since Phoenix uses a dedicated data structure within each HBase table, it has more effective memory usage, but if I need to deserialize data stored in an HBase cell I still have to read that object into memory, and thus I need Spark. From what I understood, Phoenix is good if I have to query a simple column of HBase, but things get really complicated if I have to add an index for each column in my table and I store complex objects within the cells. Is that correct? Best, Flavio

On Tue, Apr 8, 2014 at 6:05 PM, Bin Wang binwang...@gmail.com wrote: Hi Flavio, I happen to be attending the 2014 Apache Conf, where I heard about a project called Apache Phoenix, which fully leverages HBase and is supposed to be 1000x faster than Hive. And it is not memory bounded, which in Spark's case sets up a limit. It is still in the incubating group, and the stats functions Spark has already implemented are still on its roadmap. I am not sure whether it will be good, but it might be something interesting to check out.
On Tue, Apr 8, 2014 at 9:57 AM, Flavio Pompermaier pomperma...@okkam.it wrote: Hi everybody, lately I have looked a bit at the recent evolution of the big data stacks, and it seems that HBase is somehow fading away in favour of Spark+HDFS. Am I correct? Do you think that Spark and HBase should work together or not? Best regards, Flavio
Re: help
Hi, thank you for your reply, but I could not find it. It says that there is no such file or directory. http://apache-spark-user-list.1001560.n3.nabble.com/file/n4848/Capture.png -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/help-tp4841p4848.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Build times for Spark
I've cloned the github repo and I'm building Spark on a pretty beefy machine (24 CPUs, 78GB of RAM) and it takes a pretty long time. For instance, today I did a 'git pull' for the first time in a week or two, and then doing 'sbt/sbt assembly' took 43 minutes of wallclock time (88 minutes of CPU time). After that, I did 'SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly' and that took 25 minutes wallclock, 73 minutes CPU. Is that typical? Or does that indicate some setup problem in my environment? -- Ken Williams, Senior Research Scientist WindLogics http://windlogics.com CONFIDENTIALITY NOTICE: This e-mail message is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution of any kind is strictly prohibited. If you are not the intended recipient, please contact the sender via reply e-mail and destroy all copies of the original message. Thank you.
Scala Spark / Shark: How to access existing Hive tables in Hortonworks?
I am trying to find some docs / a description of the approach on this subject; please help. I have Hadoop 2.2.0 from Hortonworks installed, with some existing Hive tables I need to query. Hive SQL works extremely and unreasonably slowly, on a single node and on the cluster as well. I hope Shark will work faster. From the Spark/Shark docs I cannot figure out how to make Shark work with the existing Hive tables. Any ideas how to achieve this? Thanks!
Re: Scala Spark / Shark: How to access existing Hive tables in Hortonworks?
You have to configure Shark to access the Hortonworks Hive metastore (HCatalog?). You will then start seeing the tables in the Shark shell and can run queries as normal; Shark will leverage Spark for processing your queries. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Sat, Apr 26, 2014 at 2:00 AM, Darq Moth darqm...@gmail.com wrote: I am trying to find some docs / a description of the approach on this subject; please help. I have Hadoop 2.2.0 from Hortonworks installed, with some existing Hive tables I need to query. Hive SQL works extremely and unreasonably slowly, on a single node and on the cluster as well. I hope Shark will work faster. From the Spark/Shark docs I cannot figure out how to make Shark work with the existing Hive tables. Any ideas how to achieve this? Thanks!
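A minimal sketch of that configuration, as I understand the usual Shark setup (the paths and variable names are illustrative; check them against your Shark version's docs): point Shark at the Hive installation whose hive-site.xml names your existing metastore.

--
# conf/shark-env.sh: reuse the existing Hive configuration so Shark
# finds the metastore (hive.metastore.uris / javax.jdo.* settings).
export HIVE_CONF_DIR=/etc/hive/conf
export HADOOP_HOME=/usr/lib/hadoop
--

After that, SHOW TABLES in the Shark shell should list the existing Hive tables.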
Re: Build times for Spark
You can always increase the sbt memory by setting export JAVA_OPTS="-Xmx10g" Thanks Best Regards On Sat, Apr 26, 2014 at 2:17 AM, Williams, Ken ken.willi...@windlogics.com wrote: No, I haven't done any config for SBT. Is there somewhere you might be able to point me toward for how to do that? -Ken *From:* Josh Rosen [mailto:rosenvi...@gmail.com] *Sent:* Friday, April 25, 2014 3:27 PM *To:* user@spark.apache.org *Subject:* Re: Build times for Spark Did you configure SBT to use the extra memory? On Fri, Apr 25, 2014 at 12:53 PM, Williams, Ken ken.willi...@windlogics.com wrote: I've cloned the github repo and I'm building Spark on a pretty beefy machine (24 CPUs, 78GB of RAM) and it takes a pretty long time. For instance, today I did a 'git pull' for the first time in a week or two, and then doing 'sbt/sbt assembly' took 43 minutes of wallclock time (88 minutes of CPU time). After that, I did 'SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly' and that took 25 minutes wallclock, 73 minutes CPU. Is that typical? Or does that indicate some setup problem in my environment? -- Ken Williams, Senior Research Scientist *WindLogics* http://windlogics.com -- CONFIDENTIALITY NOTICE: This e-mail message is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution of any kind is strictly prohibited. If you are not the intended recipient, please contact the sender via reply e-mail and destroy all copies of the original message. Thank you.
Re: Securing Spark's Network
Howdy Akhil, Thanks - that did help! And it made me think about how the EC2 scripts work [1] to set up security. From my understanding of EC2 security groups [2], this just sets up external access, right? (This has no effect on internal communication between the instances, right?) I am still confused as to why I am seeing the workers open up new ports for each job. Jacob [1] https://github.com/apache/spark/blob/master/ec2/spark_ec2.py#L230 [2] http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html#default-security-group Jacob D. Eisinger IBM Emerging Technologies jeis...@us.ibm.com - (512) 286-6075 From: Akhil Das ak...@sigmoidanalytics.com To: user@spark.apache.org Date: 04/25/2014 12:51 PM Subject: Re: Securing Spark's Network Sent by: ak...@mobipulse.in Hi Jacob, This post might give you a brief idea about the ports being used: https://groups.google.com/forum/#!topic/spark-users/PN0WoJiB0TA On Fri, Apr 25, 2014 at 8:53 PM, Jacob Eisinger jeis...@us.ibm.com wrote: Howdy, We tried running Spark 0.9.1 stand-alone inside Docker containers distributed over multiple hosts. This is complicated due to Spark opening up ephemeral / dynamic ports for the workers and the CLI. To ensure our Docker solution doesn't break Spark in unexpected ways and maintains a secure cluster, I am interested in understanding more about Spark's network architecture. I'd appreciate it if you could point us to any documentation! A couple of specific questions: 1. What are these ports being used for? Checking out the code / experiments, it looks like asynchronous communication for shuffling around results. Anything else? 2. How do you secure the network? Network administrators tend to secure and monitor the network at the port level. If these ports are dynamic and open randomly, firewalls are not easily configured and security alarms are raised. Is there a way to limit the range easily? (We did investigate setting the kernel parameter ip_local_reserved_ports, but this is broken [1] on some versions of Linux's cgroups.) Thanks, Jacob [1] https://github.com/lxc/lxc/issues/97 Jacob D. Eisinger IBM Emerging Technologies jeis...@us.ibm.com - (512) 286-6075
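One concrete knob that does exist in 0.9, to my knowledge (the executor-side data ports were not yet configurable at that point), is pinning the driver's port so at least that side can be firewalled deterministically; the port number below is an arbitrary choice:

--
// Fix the port the driver's actor system binds to (default 0 = random).
// masterUrl is a placeholder for your spark:// URL.
System.setProperty("spark.driver.port", "7100")
val sc = new SparkContext(masterUrl, "app")
--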
Re: Build times for Spark
Are you by any chance building this on NFS? As far as I know, the build is severely bottlenecked by filesystem calls during assembly (each class file in each dependency gets an fstat call or something like that). That is partly why building from, say, a local ext4 filesystem or an SSD is much faster irrespective of memory / CPU. Thanks Shivaram On Fri, Apr 25, 2014 at 2:09 PM, Akhil Das ak...@sigmoidanalytics.com wrote: You can always increase the sbt memory by setting export JAVA_OPTS="-Xmx10g" Thanks Best Regards On Sat, Apr 26, 2014 at 2:17 AM, Williams, Ken ken.willi...@windlogics.com wrote: No, I haven't done any config for SBT. Is there somewhere you might be able to point me toward for how to do that? -Ken *From:* Josh Rosen [mailto:rosenvi...@gmail.com] *Sent:* Friday, April 25, 2014 3:27 PM *To:* user@spark.apache.org *Subject:* Re: Build times for Spark Did you configure SBT to use the extra memory? On Fri, Apr 25, 2014 at 12:53 PM, Williams, Ken ken.willi...@windlogics.com wrote: I've cloned the github repo and I'm building Spark on a pretty beefy machine (24 CPUs, 78GB of RAM) and it takes a pretty long time. For instance, today I did a 'git pull' for the first time in a week or two, and then doing 'sbt/sbt assembly' took 43 minutes of wallclock time (88 minutes of CPU time). After that, I did 'SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly' and that took 25 minutes wallclock, 73 minutes CPU. Is that typical? Or does that indicate some setup problem in my environment? -- Ken Williams, Senior Research Scientist *WindLogics* http://windlogics.com -- CONFIDENTIALITY NOTICE: This e-mail message is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution of any kind is strictly prohibited. If you are not the intended recipient, please contact the sender via reply e-mail and destroy all copies of the original message. Thank you.
Re: Build times for Spark
AFAIK the resolver does pick up things from your local ~/.m2 -- note that as ~/.m2 is on NFS, that adds to the amount of filesystem traffic. Shivaram On Fri, Apr 25, 2014 at 2:57 PM, Williams, Ken ken.willi...@windlogics.com wrote: I am indeed, but it's a pretty fast NFS. I don't have any SSD I can use, but I could try to use local disk to see what happens. For me, a large portion of the time seems to be spent on lines like "Resolving org.fusesource.jansi#jansi;1.4 ..." or similar. Is this going out to find Maven resources? Any way to tell it to just use my local ~/.m2 repository instead when the resource already exists there? Sometimes I even get sporadic errors like this: [info] Resolving org.apache.hadoop#hadoop-yarn;2.2.0 ... [error] SERVER ERROR: Bad Gateway url= http://repo.maven.apache.org/maven2/org/apache/hadoop/hadoop-yarn-server/2.2.0/hadoop-yarn-server-2.2.0.jar -Ken *From:* Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu] *Sent:* Friday, April 25, 2014 4:31 PM *To:* user@spark.apache.org *Subject:* Re: Build times for Spark Are you by any chance building this on NFS? As far as I know, the build is severely bottlenecked by filesystem calls during assembly (each class file in each dependency gets an fstat call or something like that). That is partly why building from, say, a local ext4 filesystem or an SSD is much faster irrespective of memory / CPU. Thanks Shivaram On Fri, Apr 25, 2014 at 2:09 PM, Akhil Das ak...@sigmoidanalytics.com wrote: You can always increase the sbt memory by setting export JAVA_OPTS="-Xmx10g" Thanks Best Regards On Sat, Apr 26, 2014 at 2:17 AM, Williams, Ken ken.willi...@windlogics.com wrote: No, I haven't done any config for SBT. Is there somewhere you might be able to point me toward for how to do that? -Ken *From:* Josh Rosen [mailto:rosenvi...@gmail.com] *Sent:* Friday, April 25, 2014 3:27 PM *To:* user@spark.apache.org *Subject:* Re: Build times for Spark Did you configure SBT to use the extra memory? On Fri, Apr 25, 2014 at 12:53 PM, Williams, Ken ken.willi...@windlogics.com wrote: I've cloned the github repo and I'm building Spark on a pretty beefy machine (24 CPUs, 78GB of RAM) and it takes a pretty long time. For instance, today I did a 'git pull' for the first time in a week or two, and then doing 'sbt/sbt assembly' took 43 minutes of wallclock time (88 minutes of CPU time). After that, I did 'SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly' and that took 25 minutes wallclock, 73 minutes CPU. Is that typical? Or does that indicate some setup problem in my environment? -- Ken Williams, Senior Research Scientist *WindLogics* http://windlogics.com -- CONFIDENTIALITY NOTICE: This e-mail message is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution of any kind is strictly prohibited. If you are not the intended recipient, please contact the sender via reply e-mail and destroy all copies of the original message. Thank you.
Re: Spark and HBase
Josh, is there a specific use pattern you think is served well by Phoenix + Spark? Just curious. On Fri, Apr 25, 2014 at 3:17 PM, Josh Mahonin jmaho...@filetrek.com wrote: Phoenix generally presents itself as an endpoint using JDBC, which in my testing seems to play nicely with JdbcRDD. However, a few days ago a patch was made against Phoenix to implement support via Pig using a custom Hadoop InputFormat, which means now it has Spark support too. Here's a code snippet that sets up an RDD for a specific query:

--
val phoenixConf = new PhoenixPigConfiguration(new Configuration())
phoenixConf.setSelectStatement("SELECT EVENTTYPE, EVENTTIME FROM EVENTS WHERE EVENTTYPE = 'some_type'")
phoenixConf.setSelectColumns("EVENTTYPE,EVENTTIME")
phoenixConf.configure("servername", "EVENTS", 100L)
val phoenixRDD = sc.newAPIHadoopRDD(
  phoenixConf.getConfiguration(),
  classOf[PhoenixInputFormat],
  classOf[NullWritable],
  classOf[PhoenixRecord])
--

I'm still very new at Spark and even less experienced with Phoenix, but I'm hoping there's an advantage over the JdbcRDD in terms of partitioning. The JdbcRDD seems to implement partitioning based on a user-defined query predicate, but I think Phoenix's InputFormat is able to figure out the splits, which Spark is able to leverage. I don't really know how to verify whether this is the case though, so if anyone else is looking into this, I'd love to hear their thoughts. Josh

On Tue, Apr 8, 2014 at 1:00 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Just took a quick look at the overview here http://phoenix.incubator.apache.org/ and the quick start guide here http://phoenix.incubator.apache.org/Phoenix-in-15-minutes-or-less.html. It looks like Apache Phoenix aims to provide flexible SQL access to data, both for transactional and analytic purposes, and at interactive speeds. Nick

On Tue, Apr 8, 2014 at 12:38 PM, Bin Wang binwang...@gmail.com wrote: First, I have not tried it myself. However, from what I have heard it has some basic SQL features, so you can query your HBase table the way you query content on HDFS using Hive. So it is not just querying a simple column; I believe you can do joins and other SQL queries. Maybe you can wrap up an EMR cluster with HBase preconfigured and give it a try. Sorry I cannot provide a more detailed explanation or help.

On Tue, Apr 8, 2014 at 10:17 AM, Flavio Pompermaier pomperma...@okkam.it wrote: Thanks for the quick reply, Bin. Phoenix is something I'm going to try for sure, but it seems somehow useless if I can use Spark. Probably, as you said, since Phoenix uses a dedicated data structure within each HBase table, it has more effective memory usage, but if I need to deserialize data stored in an HBase cell I still have to read that object into memory, and thus I need Spark. From what I understood, Phoenix is good if I have to query a simple column of HBase, but things get really complicated if I have to add an index for each column in my table and I store complex objects within the cells. Is that correct? Best, Flavio

On Tue, Apr 8, 2014 at 6:05 PM, Bin Wang binwang...@gmail.com wrote: Hi Flavio, I happen to be attending the 2014 Apache Conf, where I heard about a project called Apache Phoenix, which fully leverages HBase and is supposed to be 1000x faster than Hive. And it is not memory bounded, which in Spark's case sets up a limit. It is still in the incubating group, and the stats functions Spark has already implemented are still on its roadmap. I am not sure whether it will be good, but it might be something interesting to check out.
On Tue, Apr 8, 2014 at 9:57 AM, Flavio Pompermaier pomperma...@okkam.it wrote: Hi everybody, lately I have looked a bit at the recent evolution of the big data stacks, and it seems that HBase is somehow fading away in favour of Spark+HDFS. Am I correct? Do you think that Spark and HBase should work together or not? Best regards, Flavio
Re: Strange lookup behavior. Possible bug?
Some additional information - maybe this rings a bell with someone: I suspect this happens when the lookup returns more than one value. For 0 and 1 values, the function behaves as you would expect. Anyone? On 4/25/14, 1:55 PM, Yadid Ayzenberg wrote: Hi All, I'm running a lookup on a JavaPairRDD<String, Tuple2>. When running on the local machine, the lookup is successful. However, when running on a standalone cluster with the exact same dataset, one of the tasks never ends (constantly in RUNNING status). When viewing the worker log, it seems that the task has finished successfully: 14/04/25 13:40:38 INFO BlockManager: Found block rdd_2_0 locally 14/04/25 13:40:38 INFO Executor: Serialized size of result for 2 is 10896794 14/04/25 13:40:38 INFO Executor: Sending result for 2 directly to driver 14/04/25 13:40:38 INFO Executor: Finished task ID 2 But it seems the driver is not aware of this, and hangs indefinitely. If I execute a count prior to the lookup, I get the correct number, which suggests that the cluster is operating as expected. The exact same scenario works with a different type of key (Tuple2): JavaPairRDD<Tuple2, Tuple2>. Any ideas on how to debug this problem? Thanks, Yadid
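A debugging idea from me (not a reply from the thread): an equivalent filter + collect exercises a slightly different code path than lookup, so comparing the two can narrow down where the hang occurs. In Scala form (the poster is on the Java API, but the semantics match):

--
// Equivalent of rdd.lookup(key), expressed as filter + collect.
// If this returns while lookup(key) hangs, the problem is likely in
// lookup's result handling rather than in the data or the cluster.
val values = rdd.filter { case (k, _) => k == key }.map(_._2).collect()
--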
Re: help
Sorry, but I don't know where Cloudera puts the executor log files. Maybe their docs give the correct path? On Fri, Apr 25, 2014 at 12:32 PM, Joe L selme...@yahoo.com wrote: Hi, thank you for your reply, but I could not find it. It says that there is no such file or directory. http://apache-spark-user-list.1001560.n3.nabble.com/file/n4848/Capture.png -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/help-tp4841p4848.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Running out of memory Naive Bayes
I've been trying to use the Naive Bayes classifier. Each example in the dataset has about 2 million features, only about 20-50 of which are non-zero, so the vectors are very sparse. I keep running out of memory though, even for about 1000 examples on 30 GB of RAM, while the entire dataset is 4 million examples. And I would also like to note that I'm using the sparse vector class.
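For concreteness, this is the shape of input I mean (a minimal sketch against the MLlib API with sparse vectors, as in Spark 1.0; the indices, values, and label are made up):

--
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// One example: a 2M-dimensional vector with only a handful of non-zeros.
val numFeatures = 2000000
val example = LabeledPoint(1.0,
  Vectors.sparse(numFeatures, Array(3, 90210, 1999999), Array(1.0, 2.0, 1.0)))
val training = sc.parallelize(Seq(example))   // sc: an existing SparkContext
val model = NaiveBayes.train(training, lambda = 1.0)
--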
Re: parallelize for a large Seq is extreamly slow.
I've tried to set a larger buffer, but reduceByKey seems to have failed. Need help :) 14/04/26 12:31:12 INFO cluster.CoarseGrainedSchedulerBackend: Shutting down all executors 14/04/26 12:31:12 INFO cluster.CoarseGrainedSchedulerBackend: Asking each executor to shut down 14/04/26 12:31:12 INFO scheduler.DAGScheduler: Failed to run countByKey at filter_2.scala:35 14/04/26 12:31:12 INFO yarn.ApplicationMaster: finishApplicationMaster with FAILED Exception in thread Thread-3 org.apache.hadoop.yarn.exceptions.YarnException: Application doesn't exist in cache appattempt_1398305021882_0069_01 at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:45) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.finishApplicationMaster(ApplicationMasterService.java:294) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.finishApplicationMaster(ApplicationMasterProtocolPBServiceImpl.java:75) at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:97) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:525) at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.finishApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:94) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at $Proxy12.finishApplicationMaster(Unknown Source) at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.unregisterApplicationMaster(AMRMClientImpl.java:311) at org.apache.spark.deploy.yarn.ApplicationMaster.finishApplicationMaster(ApplicationMaster.scala:320) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:165) Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.YarnException): Application doesn't exist in cache appattempt_1398305021882_0069_01 at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:45) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.finishApplicationMaster(ApplicationMasterService.java:294) at
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.finishApplicationMaster(ApplicationMasterProtocolPBServiceImpl.java:75) at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:97) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042) at org.apache.hadoop.ipc.Client.call(Client.java:1347) at org.apache.hadoop.ipc.Client.call(Client.java:1300) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at $Proxy11.finishApplicationMaster(Unknown Source) at