Spark mapPartition output object size coming larger than expected
I am storing the output of mapPartitions in a ListBuffer and exposing its iterator as the output. The output is a list of Long tuples(Tuple2). When I check the size of the object using Spark's SizeEstimator.estimate method it comes out to 80 bytes per record/tuple object(calculating this by "size of ListBuffer object/# records"). This I think is too huge for a Tuple2 object of long type(two 8 byte longs + some object overhead memory). Any ideas why this is so and how to reduce the memory captured by output? I am sure I am missing something obvious. Also, these ListBuffer object are getting too huge for memory leading to memory and disk spills causing bad performance. Any ideas on how I can just simply write the output of mapPartitions without storing the whole output as an in-memory object. Each input record to mapPartitions can generate 0 or more output records, so I think I cannot use "rdd.map" function iterator. I am not sure even if that will help my cause. Here is the code code snippet. /var outputRDD = sortedRDD.mapPartitionsWithIndex((partitionNo,p) => { var outputList = ListBuffer[(Long,Long)]() var inputCnt: Long = 0; var outputCnt: Long = 0; while (p.hasNext) { inputCnt = inputCnt + 1; val tpl = p.next() var partitionKey = "" try{ partitionKey = tpl._1.split(keyDelimiter)(0) //Partition key }catch{ case aob : ArrayIndexOutOfBoundsException => { println("segmentKey:"+partitionKey); } } val value = tpl._2 var xs: Array[Any] = value.toSeq.toArray; //value.copyToArray(xs); val xs_string : Array[String] = new Array[String](value.size); for(i <- 0 to value.size-1){ xs_string(i) = xs(i) match { case None => "" case null => "" case _ => xs(i).toString() } } val outputTuples = windowObject.process(partitionKey, xs_string); if(outputTuples != null){ for (i <- 0 until outputTuples.size()) { val outputRecord = outputTuples.get(i) if (outputRecord != null) { outputList += ((outputRecord.getProfileID1 , outputRecord.getProfileID2)) outputCnt = outputCnt +1; } } } } if(debugFlag.equals("DEBUG")){ logger.info("partitionNo:"+ partitionNo + ", input #: "+ inputCnt +", output #: "+ outputCnt+", outputList object size:" + SizeEstimator.estimate(outputList)); } outputList.iterator }, false)/ -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-mapPartition-output-object-size-coming-larger-than-expected-tp28367.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe e-mail: user-unsubscr...@spark.apache.org
How to add all jars in a folder to executor classpath?
I need to add all the jars in hive/lib to my spark job executor classpath. I tried this spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hive/lib and spark.executor.extraClassPath=/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hive/lib/* but it does not add any jars to the classpath of the executor. How can I add all the jars in a folder to executor or driver class path and what if I have multiple folders? What is the syntax for that? I am using Spark 1.6.0 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-add-all-jars-in-a-folder-to-executor-classpath-tp27916.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Spark SQL(Hive query through HiveContext) always creating 31 partitions
I am running hive queries using HiveContext from my Spark code. No matter which query I run and how much data it is, it always generates 31 partitions. Anybody knows the reason? Is there a predefined/configurable setting for it? I essentially need more partitions. I using this code snippet to execute hive query: /var pairedRDD = hqlContext.sql(hql).rdd.map(...)/ Thanks, Nitin -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Hive-query-through-HiveContext-always-creating-31-partitions-tp26671.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Out of Memory error caused by output object in mapPartitions
My mapPartition code as given below outputs one record for each input record. So, the output object has equal number of records as input. I am loading the output data into a listbuffer object. This object is turning out to be too huge for memory leading to Out Of Memory exception. To be more clear my logic of partition is as below: *Iterator(Iter1) -> Processing -> ListBuffer(list1) iter1.size() = list1.size() list1 goes out of memory* *I cannot change the partition size.* My parition is based on input key and all the records corresponding to a key need to go into same partition. Is there a workaround to this? / tempRDD = iterateRDD.mapPartitions(p => { var outputList = ListBuffer[String]() var minVal = 0L while (p.hasNext) { val tpl = p.next() val key = tpl._1 val value = tpl._2 if(key != prevKey){ if(value < key){ minVal = value; outputList.add(minVal.toString() + "\t" +key.toString()) }else{ minVal = key; outputList.add(minVal.toString() + "\t" +value.toString()) } }else{ outputList.add(minVal.toString() + "\t" +value.toString()) } prevKey = key; } outputList.iterator })/ -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Out-of-Memory-error-caused-by-output-object-in-mapPartitions-tp26229.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Spark SQL incompatible with Apache Sentry(Cloudera bundle)
CDH version: 5.3 Spark Version: 1.2 I was trying to execute a Hive query from Spark code(using HiveContext class). It was working fine untill we installed Apache Sentry. Now its giving me read permission exception. /org.apache.hadoop.security.AccessControlException: Permission denied: user=kakn, access=READ_EXECUTE, inode=/user/hive/warehouse/rt_freewheel_mastering.db/digital_profile_cluster_in:hive:hive:drwxrwx--t/ I understand this exception since with Sentry you can read/write the Hive warehouse table directories only with hive user. It is Sentry which does the user translation from a user e.g kakn in below case to hive. The query has to go via HiveServer2 for the user translation. Spark code(HiveContext) seems to run the query on its own(Hive CLI) and bypasses HiveServer2. Is there any way to make the query execution go through HiveServer2? I am stuck here, any suggestions comments would be really appreciated. Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-incompatible-with-Apache-Sentry-Cloudera-bundle-tp23477.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Does HiveContext connect to HiveServer2?
Hey, I have exactly this question. Did you get an answer to it? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-HiveContext-connect-to-HiveServer2-tp22200p23431.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Hive query execution from Spark(through HiveContext) failing with Apache Sentry
I am trying to run a hive query from Spark code using HiveContext object. It was running fine earlier but since the Apache Sentry has been set installed the process is failing with this exception : /org.apache.hadoop.security.AccessControlException: Permission denied: user=kakn, access=READ_EXECUTE, inode=/user/hive/warehouse/rt_freewheel_mastering.db/digital_profile_cluster_in:hive:hive:drwxrwx--t/ hive-site.xml http://apache-spark-user-list.1001560.n3.nabble.com/file/n23381/hive-site.xml I have pasted the full stack trace at the end of this post. My username kakn is a registered user with Sentry. I know that Spark takes all the configurations from hive-site.xml to execute the hql, so I added a few Sentry specific properties but seem to have no effect.It seems that the HiveContext is not going through HiveServer2(which I understand talks to Sentry component for user translation/delegation). I have attached the hive-site.xml. /property namehive.security.authorization.task.factory/name valueorg.apache.sentry.binding.hive.SentryHiveAuthorizationTaskFactoryImpl/value /property property namehive.metastore.pre.event.listeners/name valueorg.apache.sentry.binding.metastore.MetastoreAuthzBinding/value descriptionlist of comma seperated listeners for metastore events./description /property property namehive.warehouse.subdir.inherit.perms/name valuetrue/value /property/ /org.apache.hadoop.security.AccessControlException: Permission denied: user=kakn, access=READ_EXECUTE, inode=/user/hive/warehouse/rt_freewheel_mastering.db/digital_profile_cluster_in:hive:hive:drwxrwx--t at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:257) at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:238) at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:151) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:138) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6287) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6269) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPathAccess(FSNamesystem.java:6194) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getListingInt(FSNamesystem.java:4793) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getListing(FSNamesystem.java:4755) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getListing(NameNodeRpcServer.java:800) at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getListing(AuthorizationProviderProxyClientProtocol.java:310) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getListing(ClientNamenodeProtocolServerSideTranslatorPB.java:606) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73) at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1895) at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1876) at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654) at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:104) at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:716) at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712) at
Possible to use hive-config.xml instead of hive-site.xml for HiveContext?
I am running hive queries from HiveContext, for which we need a hive-site.xml. Is it possible to replace it with hive-config.xml? I tried but does not work. Just want a conformation. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Possible-to-use-hive-config-xml-instead-of-hive-site-xml-for-HiveContext-tp22776.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Generating version agnostic jar path value for --jars clause
I have a list of cloudera jars which I need to provide in --jars clause, mainly for the HiveContext functionality I am using. However, many of these jars have version number as part of their names. This leads to an issue that the names might change when I do a Cloudera upgrade. Just a note here, there are many jars which cloudera exposes as a symlink which is the link to the latest version of that jar(e.g /opt/cloudera/parcels/CDH/lib/parquet/parquet-hadoop-bundle.jar - parquet-hadoop-bundle-1.5.0-cdh5.3.2.jar),in which case its good but there are many jars which aren't. Is there a flexible way to avoid this situation? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Generating-version-agnostic-jar-path-value-for-jars-clause-tp22734.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Does HiveContext connect to HiveServer2?
I am wondering if HiveContext connects to HiveServer2 or does it work though Hive CLI. The reason I am asking is because Cloudera has deprecated Hive CLI. If the connection is through HiverServer2, is there a way to specify user credentials? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-HiveContext-connect-to-HiveServer2-tp22200.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Weird exception in Spark job
Any Ideas on this? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Weird-exception-in-Spark-job-tp22195p22204.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Is yarn-standalone mode deprecated?
Is yarn-standalone mode deprecated in Spark now. The reason I am asking is because while I can find it in 0.9.0 documentation(https://spark.apache.org/docs/0.9.0/running-on-yarn.html). I am not able to find it in 1.2.0. I am using this mode to run the Spark jobs from Oozie as a java action. Removing this mode will prevent me from doing that. Are there any other ways of running a Spark job from Oozie other than Shell action? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-yarn-standalone-mode-deprecated-tp22188.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Weird exception in Spark job
I am trying to run a Hive query from Spark code through HiveContext. Anybody knows what these exceptions mean? I have no clue LogType: stderr LogLength: 3345 Log Contents: SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/jars/avro-tools-1.7.6-cdh5.3.2.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/jars/pig-0.12.0-cdh5.3.2.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/jars/slf4j-simple-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] Exception in thread main java.lang.reflect.UndeclaredThrowableException at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1655) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:161) at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) Caused by: java.util.concurrent.TimeoutException: Futures timed out after [1 milliseconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at akka.remote.Remoting.start(Remoting.scala:173) at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184) at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:579) at akka.actor.ActorSystemImpl._start(ActorSystem.scala:577) at akka.actor.ActorSystemImpl.start(ActorSystem.scala:588) at akka.actor.ActorSystem$.apply(ActorSystem.scala:111) at akka.actor.ActorSystem$.apply(ActorSystem.scala:104) at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121) at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54) at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53) at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1684) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1675) at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:56) at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:122) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642) ... 4 more LogType: stdout LogLength: 7460 Log Contents: 2015-03-23 18:54:55,571 INFO [main] executor.CoarseGrainedExecutorBackend (SignalLogger.scala:register(47)) - Registered signal handlers for [TERM, HUP, INT] 2015-03-23 18:54:56,898 INFO [main] spark.SecurityManager (Logging.scala:logInfo(59)) - Changing view acls to: yarn,kakn 2015-03-23 18:54:56,901 INFO [main] spark.SecurityManager (Logging.scala:logInfo(59)) - Changing modify acls to: yarn,kakn 2015-03-23 18:54:56,911 INFO [main] spark.SecurityManager (Logging.scala:logInfo(59)) - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, kakn); users with modify permissions: Set(yarn, kakn) 2015-03-23 18:54:57,725 INFO [driverPropsFetcher-akka.actor.default-dispatcher-3] slf4j.Slf4jLogger (Slf4jLogger.scala:applyOrElse(80)) - Slf4jLogger started 2015-03-23 18:54:57,810 INFO [driverPropsFetcher-akka.actor.default-dispatcher-3] Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Starting remoting 2015-03-23 18:54:57,871 ERROR [driverPropsFetcher-akka.actor.default-dispatcher-5] actor.ActorSystemImpl (Slf4jLogger.scala:apply$mcV$sp(66)) - Uncaught fatal error from thread [driverPropsFetcher-akka.actor.default-dispatcher-4] shutting down ActorSystem
HiveContext test, Spark Context did not initialize after waiting 10000ms
I am trying to run a Hive query from Spark using HiveContext. Here is the code / val conf = new SparkConf().setAppName(HiveSparkIntegrationTest) conf.set(spark.executor.extraClassPath, /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib); conf.set(spark.driver.extraClassPath, /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib); conf.set(spark.yarn.am.waitTime, 30L) val sc = new SparkContext(conf) val sqlContext = new HiveContext(sc) def inputRDD = sqlContext.sql(describe spark_poc.src_digital_profile_user); inputRDD.collect().foreach { println } println(inputRDD.schema.getClass.getName) / Getting this exception. Any clues? The weird part is if I try to do the same thing but in Java instead of Scala, it runs fine. /Exception in thread Driver java.lang.NullPointerException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) 15/03/06 17:39:32 ERROR yarn.ApplicationMaster: SparkContext did not initialize after waiting for 1 ms. Please check earlier log output for errors. Failing the application. Exception in thread main java.lang.NullPointerException at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkContextInitialized(ApplicationMaster.scala:218) at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:110) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:434) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52) at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:433) at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala) 15/03/06 17:39:32 INFO yarn.ApplicationMaster: AppMaster received a signal./ -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/HiveContext-test-Spark-Context-did-not-initialize-after-waiting-1ms-tp21953.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Running Spark jobs via oozie
I am also starting to work on this one. Did you get any solution to this issue? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-jobs-via-oozie-tp5187p21896.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Executing hive query from Spark code
I want to run Hive query inside Spark and use the RDDs generated from that inside Spark. I read in the documentation /Hive support is enabled by adding the -Phive and -Phive-thriftserver flags to Spark’s build. This command builds a new assembly jar that includes Hive. Note that this Hive assembly jar must also be present on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries (SerDes) in order to access data stored in Hive./ I just wanted to know what -Phive and -Phive-thriftserver flags really do and is there a way to have the hive support without updating the assembly. Does that flag add a hive support jar or something? The reason I am asking is that I will be using Cloudera version of Spark in future and I am not sure how to add the Hive support to that Spark distribution. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Executing-hive-query-from-Spark-code-tp21880.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Counters in Spark
I am trying to implement counters in Spark and I guess Accumulators are the way to do it. My motive is to update a counter in map function and access/reset it in the driver code. However the /println/ statement at the end still yields value 0(It should 9). Am I doing something wrong? def main(args : Array[String]){ val conf = new SparkConf().setAppName(SortedNeighbourhoodMatching) val sc = new SparkContext(conf) var counter = sc.accumulable(0, Counter) var inputFilePath = args(0) val inputRDD = sc.textFile(inputFilePath) inputRDD.map { x = { counter += 1 } } println(counter.value) } -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Counters-in-Spark-tp21646.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Where can I find logs set inside RDD processing functions?
I am trying to debug my mapPartitionsFunction. Here is the code. There are two ways I am trying to log using log.info() or println(). I am running in yarn-cluster mode. While I can see the logs from driver code, I am not able to see logs from map, mapPartition functions in the Application Tracking URL. Where can I find the logs? /var outputRDD = partitionedRDD.mapPartitions(p = { val outputList = new ArrayList[scala.Tuple3[Long, Long, Int]] p.map({ case(key, value) = { log.info(Inside map) println(Inside map); for(i - 0 until outputTuples.size()){ val outputRecord = outputTuples.get(i) if(outputRecord != null){ outputList.add(outputRecord.getCurrRecordProfileID(), outputRecord.getWindowRecordProfileID, outputRecord.getScore()) } } } }) outputList.iterator() })/ Here is my log4j.properties /log4j.rootCategory=INFO, console log4j.appender.console=org.apache.log4j.ConsoleAppender log4j.appender.console.target=System.err log4j.appender.console.layout=org.apache.log4j.PatternLayout log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n # Settings to quiet third party logs that are too verbose log4j.logger.org.eclipse.jetty=WARN log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO/ -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Where-can-I-find-logs-set-inside-RDD-processing-functions-tp21537.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Sort based shuffle not working properly?
I am trying to implement secondary sort in spark as we do in map-reduce. Here is my data(tab separated, without c1, c2, c2). c1c2 c3 1 2 4 1 3 6 2 4 7 2 6 8 3 5 5 3 1 8 3 2 0 To do secondary sort, I create paried RDD as /((c1 + ,+ c2), row)/ and then use a custom partitioner to partition only on c1. I have set /spark.shuffle.manager = SORT/ so the keys per partition are sorted. For the key 3 I am expecting to get (3, 1) (3, 2) (3, 5) but still getting the original order 3,5 3,1 3,2 Here is the custom partitioner code: /class StraightPartitioner(p: Int) extends org.apache.spark.Partitioner { def numPartitions = p def getPartition(key: Any) = { key.asInstanceOf[String].split(,)(0).toInt } }/ and driver code, please tell me what I am doing wrong /val conf = new SparkConf().setAppName(MapInheritanceExample) conf.set(spark.shuffle.manager, SORT); val sc = new SparkContext(conf) val pF = sc.textFile(inputFile) val log = LogFactory.getLog(MapFunctionTest) val partitionedRDD = pF.map { x = var arr = x.split(\t); (arr(0)+,+arr(1), null) }.partitionBy(new StraightPartitioner(10)) var outputRDD = partitionedRDD.mapPartitions(p = { p.map({ case(o, n) = { o } }) })/ -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Sort-based-shuffle-not-working-properly-tp21487.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Sort based shuffle not working properly?
Just to add, I am suing Spark 1.1.0 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Sort-based-shuffle-not-working-properly-tp21487p21488.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Window comparison matching using the sliding window functionality: feasibility
Mine was not really a moving average problem. It was more like partitioning on some keys and sorting(on different keys) and then running a sliding window through the partition. I reverted back to map-reduce for that(I needed secondary sort, which is not very mature in Spark right now). But, as far as I understand your problem, you should be able to handle it by converting your RDD to key-value RDDs which I think will be automatically partitioned on the key and then use *mapPartitions *to run your logic. On Mon, Feb 2, 2015 at 1:20 AM, ashu [via Apache Spark User List] ml-node+s1001560n21458...@n3.nabble.com wrote: Hi, I want to know about your moving avg implementation. I am also doing some time-series analysis about CPU performance. So I tried simple regression but result is not good. rmse is 10 but when I extrapolate it just shoot up linearly. I think I should first smoothed out the data then try regression to forecast. i am thinking of moving avg as an option,tried it out according to this http://stackoverflow.com/questions/23402303/apache-spark-moving-average but partitionBy is giving me error, I am building with Spark 1.2.0. Can you share your ARIMA implementation if it is open source, else can you give me hints about it Will really appreciate the help. Thanks -- If you reply to this email, your message will be added to the discussion below: http://apache-spark-user-list.1001560.n3.nabble.com/Window-comparison-matching-using-the-sliding-window-functionality-feasibility-tp15352p21458.html To unsubscribe from Window comparison matching using the sliding window functionality: feasibility, click here http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_codenode=15352code=bml0aW5rYWswMDFAZ21haWwuY29tfDE1MzUyfDEyMjcwMjA2NQ== . NAML http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewerid=instant_html%21nabble%3Aemail.namlbase=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespacebreadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Window-comparison-matching-using-the-sliding-window-functionality-feasibility-tp15352p21467.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Connected Components running for a long time and failing eventually
I am trying to run connected components on a graph generated by reading an edge file. Its running for a long time(3-4 hrs) and then eventually failing. Cant find any error in log file. The file I am testing it on has 27M rows(edges). Is there something obviously wrong with the code? I tested the same code with around 1000 rows input and it works just fine. object ConnectedComponentsTest { def main(args: Array[String]) { val inputFile = /user/hive/warehouse/spark_poc.db/window_compare_output_subset/00_0.snappy,/user/hive/warehouse/spark_poc.db/window_compare_output_subset/01_0.snappy // Should be some file on your system val conf = new SparkConf().setAppName(ConnectedComponentsTest) val sc = new SparkContext(conf) val graph = GraphLoader.edgeListFile(sc, inputFile, true); val cc = graph.connectedComponents().vertices; cc.saveAsTextFile(/user/kakn/output); } } -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Connected-Components-running-for-a-long-time-and-failing-eventually-tp19659.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Partition sorting by Spark framework
I need to sort my RDD partitions but the whole partition(s) might not fit into memory, so I cannot run the Collections Sort() method. Does Spark support partitions sorting by virtue of its framework? I am working on 1.1.0 version. I looked up similar unanswered question: /http://apache-spark-user-list.1001560.n3.nabble.com/sort-order-after-reduceByKey-groupByKey-td2959.html/ Thanks All!! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Partition-sorting-by-Spark-framework-tp18213.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Does JavaSchemaRDD inherit the Hive partitioning of data?
Any suggestions guys?? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-JavaSchemaRDD-inherit-the-Hive-partitioning-of-data-tp17410p17539.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Does JavaSchemaRDD inherit the Hive partitioning of data?
So, this means that I can create table and insert data in it with Dynbamic partitioning and those partitions would be inherited by RDDs. Is it in Spark 1.1.0? If not, is there a way to partition the data in a file based on some attributes of the rows in the data data(without hardcoding the number of partitions). -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-JavaSchemaRDD-inherit-the-Hive-partitioning-of-data-tp17410p17558.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Is Spark 1.1.0 incompatible with Hive?
I am working on running the following hive query from spark. /SELECT * FROM spark_poc.table_name DISTRIBUTE BY GEO_REGION, GEO_COUNTRY SORT BY IP_ADDRESS, COOKIE_ID/ Ran into /java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;/ (complete stack trace at the bottom). Found a few mentions of this issue in the user list. It seems(from the below thread link) that there is a Guava version incompatibility between Spark 1.1.0 and Hive which is probably fixed in 1.2.0. /http://apache-spark-user-list.1001560.n3.nabble.com/Hive-From-Spark-td10110.html#a12671/ *So, wanted to confirm, is Spark SQL 1.1.0 incompatible with Hive or is there a workaround to this?* /Exception in thread Driver java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162) Caused by: java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode; at org.apache.spark.util.collection.OpenHashSet.org$apache$spark$util$collection$OpenHashSet$$hashcode(OpenHashSet.scala:261) at org.apache.spark.util.collection.OpenHashSet$mcI$sp.getPos$mcI$sp(OpenHashSet.scala:165) at org.apache.spark.util.collection.OpenHashSet$mcI$sp.contains$mcI$sp(OpenHashSet.scala:102) at org.apache.spark.util.SizeEstimator$$anonfun$visitArray$2.apply$mcVI$sp(SizeEstimator.scala:214) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:210) at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:169) at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:161) at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:155) at org.apache.spark.util.collection.SizeTracker$class.takeSample(SizeTracker.scala:78) at org.apache.spark.util.collection.SizeTracker$class.afterUpdate(SizeTracker.scala:70) at org.apache.spark.util.collection.SizeTrackingVector.$plus$eq(SizeTrackingVector.scala:31) at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:236) at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:126) at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:104) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:750) at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:601) at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:872) at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:79) at org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:68) at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36) at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29) at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62) at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809) at org.apache.spark.sql.hive.HadoopTableReader.init(TableReader.scala:68) at org.apache.spark.sql.hive.execution.HiveTableScan.init(HiveTableScan.scala:68) at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188) at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188) at org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364) at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59) at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54) at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:292) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58) at
Does JavaSchemaRDD inherit the Hive partitioning of data?
Would the rdd resulting from the below query be partitioned on GEO_REGION, GEO_COUNTRY? I ran some tests(using mapPartitions on the resulting RDD) and seems that there are always 50 partitions generated while there should be around 1000. /SELECT * FROM spark_poc.table_nameDISTRIBUTE BY GEO_REGION, GEO_COUNTRY SORT BY IP_ADDRESS, COOKIE_ID/ If not, how can I partition the data based on an attribute/combination of attributes in data. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-JavaSchemaRDD-inherit-the-Hive-partitioning-of-data-tp17410.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
JavaHiveContext class not found error. Help!!
I am trying to run the below Hive query on Yarn. I am using Cloudera 5.1. What can I do to make this work? /SELECT * FROM table_name DISTRIBUTE BY GEO_REGION, GEO_COUNTRY SORT BY IP_ADDRESS, COOKIE_ID;/ Below is the stack trace: Exception in thread Thread-4 java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:187) Caused by: *java.lang.NoClassDefFoundError: org/apache/spark/sql/hive/api/java/JavaHiveContext* at HiveContextExample.main(HiveContextExample.java:57) ... 5 more Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.hive.api.java.JavaHiveContext at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ... 6 more Here is the code invoking it: /SparkConf conf = new SparkConf().setAppName(PartitionData); JavaSparkContext ctx = new JavaSparkContext(conf); JavaHiveContext hiveContext = new JavaHiveContext(ctx); String sql = SELECT * FROM table_name DISTRIBUTE BY GEO_REGION, GEO_COUNTRY SORT BY IP_ADDRESS, COOKIE_ID; JavaSchemaRDD partitionedRDD = hiveContext.sql(sql);/ -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/JavaHiveContext-class-not-found-error-Help-tp17149.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
com.esotericsoftware.kryo.KryoException: Buffer overflow.
I am running a simple rdd filter command. What does it mean? Here is the full stack trace(and code below it): com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 133 at com.esotericsoftware.kryo.io.Output.require(Output.java:138) at com.esotericsoftware.kryo.io.Output.writeString_slow(Output.java:420) at com.esotericsoftware.kryo.io.Output.writeString(Output.java:326) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$StringArraySerializer.write(DefaultArraySerializers.java:274) at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$StringArraySerializer.write(DefaultArraySerializers.java:262) at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568) at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:138) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) *Here is the code of the main function:* /String comparisonFieldIndexes = 16,18; String segmentFieldIndexes = 14,15; String comparisonFieldWeights = 50, 50; String delimiter = +'\001'; PartitionDataOnColumn parOnCol = new PartitionDataOnColumn(70, comparisonFieldIndexes, comparisonFieldWeights, segmentFieldIndexes, delimiter); JavaRDDString filtered_rdd = origRDD.filter(parOnCol.new FilterEmptyFields(parOnCol.fieldIndexes, parOnCol.DELIMITER) ); parOnCol.printRDD(filtered_rdd);/ *Here is the FilterEmptyFields class:* /public class FilterEmptyFields implements FunctionString, Boolean { final int[] nonEmptyFields; final String DELIMITER; public FilterEmptyFields(int[] nonEmptyFields, String delimiter){ this.nonEmptyFields = nonEmptyFields; this.DELIMITER = delimiter; } @Override public Boolean call(String s){ String[] fields = s.split(DELIMITER); for(int i=0; inonEmptyFields.length; i++){ if(fields[nonEmptyFields[i]] == null || fields[nonEmptyFields[i]].isEmpty()){ return false; } } return true; } }lt;/i Any suggestions guys? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/com-esotericsoftware-kryo-KryoException-Buffer-overflow-tp16947.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Window comparison matching using the sliding window functionality: feasibility
Thanks @category_theory, the post was of great help!! I had to learn a few thing before I could understand it completely. However, I am facing the issue of partitioning the data (using partitionBy) without providing a hardcoded value for number of partitions. The partitions need to be driven by data(segmentation key I am using) in my case. So my question is say if the number of partitions generated by my segmentation key = 1000 the number given to the partitioner = 2000 In this case, would there be 2000 partitions created(which will break the partition boundary of the segmentation key)? If so then sliding window will roll over multiple partitions and computation would generate wrong results. Thanks again for the response!! On Tue, Sep 30, 2014 at 11:51 AM, category_theory [via Apache Spark User List] ml-node+s1001560n1540...@n3.nabble.com wrote: Not sure if this is what you are after but its based on a moving average within spark... I was building an ARIMA model on top of spark and this helped me out a lot: http://stackoverflow.com/questions/23402303/apache-spark-moving-average ᐧ *JIMMY MCERLAIN* DATA SCIENTIST (NERD) *. . . . . . . . . . . . . . . . . .* *IF WE CAN’T DOUBLE YOUR SALES,* *ONE OF US IS IN THE WRONG BUSINESS.* *E*: [hidden email] http://user/SendEmail.jtp?type=nodenode=15407i=0 *M*: *510.303.7751 510.303.7751* On Tue, Sep 30, 2014 at 8:19 AM, nitinkak001 [hidden email] http://user/SendEmail.jtp?type=nodenode=15407i=1 wrote: Any ideas guys? Trying to find some information online. Not much luck so far. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Window-comparison-matching-using-the-sliding-window-functionality-feasibility-tp15352p15404.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: [hidden email] http://user/SendEmail.jtp?type=nodenode=15407i=2 For additional commands, e-mail: [hidden email] http://user/SendEmail.jtp?type=nodenode=15407i=3 -- If you reply to this email, your message will be added to the discussion below: http://apache-spark-user-list.1001560.n3.nabble.com/Window-comparison-matching-using-the-sliding-window-functionality-feasibility-tp15352p15407.html To unsubscribe from Window comparison matching using the sliding window functionality: feasibility, click here http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_codenode=15352code=bml0aW5rYWswMDFAZ21haWwuY29tfDE1MzUyfDEyMjcwMjA2NQ== . NAML http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewerid=instant_html%21nabble%3Aemail.namlbase=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespacebreadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Window-comparison-matching-using-the-sliding-window-functionality-feasibility-tp15352p16201.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Window comparison matching using the sliding window functionality: feasibility
Any ideas guys? Trying to find some information online. Not much luck so far. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Window-comparison-matching-using-the-sliding-window-functionality-feasibility-tp15352p15404.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Window comparison matching using the sliding window functionality: feasibility
Need to know the feasibility of the below task. I am thinking of this one to be a mapreduce-spark effort. I need to run distributed sliding Window Comparison for digital data matching on top of Hadoop. The data(Hive Table) will be partitioned, distributed across data node. Then the window comparison tool, multiple instance of it, would run on the individual partitions(locally to the data node). This window comparison tool will be a sliding window in which all the rows in a window interval will be compared based on different columns to each other and a score will be generated. I am more familiar with map-reduce and I think uptill the partitioning part we can do in it. For the distributed window comparison I am thinking of using spark. I know spark streaming has a sliding window functionality. Can I use that to accomplish above task? Any suggestions are appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Window-comparison-matching-using-the-sliding-window-functionality-feasibility-tp15352.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org