Spark mapPartition output object size coming larger than expected

2017-02-06 Thread nitinkak001
I am storing the output of mapPartitions in a ListBuffer and exposing its
iterator as the output. The output is a list of Long tuples (Tuple2). When I
check the size of the object using Spark's SizeEstimator.estimate method it
comes out to 80 bytes per record/tuple object (calculated as "size of the
ListBuffer object / # records"). That seems far too large for a Tuple2 of
longs (two 8-byte longs plus some object overhead). Any ideas why this is so
and how to reduce the memory taken by the output? I am sure I am missing
something obvious.

Also, these ListBuffer objects are getting too large for memory, leading to
memory and disk spills and bad performance. Any ideas on how I can simply
emit the output of mapPartitions without storing the whole output as an
in-memory object? Each input record to mapPartitions can generate 0 or more
output records, so I don't think I can use the "rdd.map" function; I am not
even sure that would help my cause.

Here is the code snippet.

var outputRDD = sortedRDD.mapPartitionsWithIndex((partitionNo, p) => {
  var outputList = ListBuffer[(Long, Long)]()
  var inputCnt: Long = 0
  var outputCnt: Long = 0
  while (p.hasNext) {
    inputCnt = inputCnt + 1
    val tpl = p.next()
    var partitionKey = ""
    try {
      partitionKey = tpl._1.split(keyDelimiter)(0) // Partition key
    } catch {
      case aob: ArrayIndexOutOfBoundsException =>
        println("segmentKey:" + partitionKey)
    }
    val value = tpl._2
    var xs: Array[Any] = value.toSeq.toArray
    //value.copyToArray(xs)

    val xs_string: Array[String] = new Array[String](value.size)
    for (i <- 0 to value.size - 1) {
      xs_string(i) = xs(i) match {
        case None => ""
        case null => ""
        case _    => xs(i).toString()
      }
    }

    val outputTuples = windowObject.process(partitionKey, xs_string)

    if (outputTuples != null) {
      for (i <- 0 until outputTuples.size()) {
        val outputRecord = outputTuples.get(i)
        if (outputRecord != null) {
          outputList += ((outputRecord.getProfileID1, outputRecord.getProfileID2))
          outputCnt = outputCnt + 1
        }
      }
    }
  }

  if (debugFlag.equals("DEBUG")) {
    logger.info("partitionNo:" + partitionNo + ", input #: " + inputCnt +
      ", output #: " + outputCnt + ", outputList object size:" +
      SizeEstimator.estimate(outputList))
  }

  outputList.iterator

}, false)
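
One direction I am considering, to avoid materializing the whole partition,
is to build the output lazily from the input iterator instead of
accumulating it in a ListBuffer. A minimal sketch of the idea (windowObject,
keyDelimiter and the record accessors are the same names as above;
everything else is simplified and untested):

// Sketch only: emit output records lazily, one input record at a time,
// instead of buffering the whole partition in a ListBuffer.
val outputRDD = sortedRDD.mapPartitions(p =>
  p.flatMap { case (key, value) =>
    val partitionKey = key.split(keyDelimiter)(0)
    val fields = value.toSeq.toArray.map {
      case None => ""
      case null => ""
      case x    => x.toString
    }
    val outputTuples = windowObject.process(partitionKey, fields)
    if (outputTuples == null) Iterator.empty
    else (0 until outputTuples.size()).iterator
      .map(outputTuples.get)
      .filter(_ != null)
      .map(r => (r.getProfileID1, r.getProfileID2))
  }
)

Since flatMap on the partition iterator is lazy, nothing forces the whole
partition's output to sit in memory at once; the SizeEstimator logging would
have to go, though, since measuring the output is exactly what forces it to
materialize.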






How to add all jars in a folder to executor classpath?

2016-10-18 Thread nitinkak001
I need to add all the jars in hive/lib to my Spark job's executor classpath.
I tried this

spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hive/lib
and 
spark.executor.extraClassPath=/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hive/lib/*

but it does not add any jars to the executor's classpath. How can I add all
the jars in a folder to the executor or driver classpath, and what if I have
multiple folders? What is the syntax for that?

I am using Spark 1.6.0
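
For now I am working around it by expanding the folder myself into the
comma-separated list that --jars (or spark.jars) expects. A rough sketch,
where the helper name and the example path are my own:

import java.io.File

// Expand every *.jar under the given folders into a comma-separated string.
def jarsUnder(dirs: String*): String =
  dirs.flatMap { d =>
    Option(new File(d).listFiles()).getOrElse(Array.empty[File])
      .filter(_.getName.endsWith(".jar"))
      .map(_.getAbsolutePath)
  }.mkString(",")

// e.g. pass this to spark-submit --jars, possibly from a small launcher
val hiveJars = jarsUnder("/opt/cloudera/parcels/CDH/lib/hive/lib")

That at least keeps the folder list in one place, even if it is not as clean
as a classpath wildcard.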






Spark SQL(Hive query through HiveContext) always creating 31 partitions

2016-04-04 Thread nitinkak001
I am running Hive queries using HiveContext from my Spark code. No matter
which query I run and how much data there is, it always generates 31
partitions. Does anybody know the reason? Is there a predefined/configurable
setting for it? I essentially need more partitions.

I am using this code snippet to execute the Hive query:

var pairedRDD = hqlContext.sql(hql).rdd.map(...)
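
If I cannot find such a setting, the workaround I am considering is an
explicit repartition right after the query. A minimal sketch (the target of
200 and the key extraction are just examples):

// Sketch: force higher parallelism after the Hive query, at the cost of a shuffle.
var pairedRDD = hqlContext.sql(hql)
  .rdd
  .repartition(200) // example target; pick based on data volume and cluster size
  .map(row => (row.getString(0), row)) // example mapping

repartition() triggers a full shuffle, so it only helps if the extra
parallelism outweighs the shuffle cost.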

Thanks,
Nitin






Out of Memory error caused by output object in mapPartitions

2016-02-15 Thread nitinkak001
My mapPartitions code below outputs one record for each input record, so the
output has the same number of records as the input. I am loading the output
into a ListBuffer, and this object is turning out to be too large for memory,
leading to an Out Of Memory exception.

To be clearer, my per-partition logic is:

Iterator (iter1) -> processing -> ListBuffer (list1)

iter1.size() == list1.size(), and list1 goes out of memory.

I cannot change the partition size: my partitioning is based on the input
key, and all the records corresponding to a key need to go into the same
partition. Is there a workaround for this?

tempRDD = iterateRDD.mapPartitions(p => {
  var outputList = ListBuffer[String]()
  var minVal = 0L
  while (p.hasNext) {
    val tpl = p.next()
    val key = tpl._1
    val value = tpl._2
    if (key != prevKey) {
      if (value < key) {
        minVal = value
        outputList.add(minVal.toString() + "\t" + key.toString())
      } else {
        minVal = key
        outputList.add(minVal.toString() + "\t" + value.toString())
      }
    } else {
      outputList.add(minVal.toString() + "\t" + value.toString())
    }
    prevKey = key
  }
  outputList.iterator
})
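
Since the output here is one record per input record, the idea I want to try
next is to transform the partition iterator lazily with p.map instead of
buffering into a ListBuffer. A rough sketch under that assumption (prevKey
and minVal kept as per-partition vars; I am assuming the keys are Longs and
-1 is not a real key):

// Sketch: 1-to-1 transformation done lazily on the partition iterator,
// so the partition's output is never materialized as a whole.
val tempRDD = iterateRDD.mapPartitions(p => {
  var prevKey = -1L // assumed sentinel, not a real key
  var minVal = 0L
  p.map { case (key, value) =>
    val out =
      if (key != prevKey) {
        if (value < key) { minVal = value; minVal.toString + "\t" + key }
        else             { minVal = key;   minVal.toString + "\t" + value }
      } else {
        minVal.toString + "\t" + value
      }
    prevKey = key
    out
  }
})

Because p.map is lazy, each output line is produced only when the downstream
consumer asks for it, which should avoid holding list1 in memory at all.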









Spark SQL incompatible with Apache Sentry(Cloudera bundle)

2015-06-24 Thread nitinkak001
CDH version: 5.3
Spark Version: 1.2 

I was trying to execute a Hive query from Spark code (using the HiveContext
class). It was working fine until we installed Apache Sentry. Now it is
giving me a read permission exception.

org.apache.hadoop.security.AccessControlException: Permission denied:
user=kakn, access=READ_EXECUTE,
inode=/user/hive/warehouse/rt_freewheel_mastering.db/digital_profile_cluster_in:hive:hive:drwxrwx--t

I understand this exception: with Sentry, the Hive warehouse table
directories can only be read/written by the hive user, and it is Sentry that
does the user translation from a user (e.g. kakn in this case) to hive. The
query has to go via HiveServer2 for that user translation, but the Spark code
(HiveContext) seems to run the query on its own (like the Hive CLI) and
bypasses HiveServer2. Is there any way to make the query execution go through
HiveServer2?

I am stuck here; any suggestions or comments would be really appreciated.
Thanks.
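
One workaround I am considering is to submit the query to HiveServer2 over
JDBC instead of through HiveContext, so that Sentry sees it the usual way. A
minimal sketch, assuming the standard Hive JDBC driver is on the classpath,
with a placeholder host and a query that only returns a small result:

import java.sql.DriverManager

// Sketch only: run the statement through HiveServer2 so Sentry's
// authorization and user translation apply as usual.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection(
  "jdbc:hive2://hiveserver2-host:10000/rt_freewheel_mastering", // placeholder host/port
  "kakn", "") // or put the Kerberos principal in the URL instead
try {
  val rs = conn.createStatement().executeQuery(
    "SELECT COUNT(*) FROM digital_profile_cluster_in")
  while (rs.next()) println(rs.getLong(1))
} finally {
  conn.close()
}

The obvious downside is that the result comes back through JDBC rather than
as an RDD, so it only really works for queries whose results are small or
that write their output back into a table.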








Re: Does HiveContext connect to HiveServer2?

2015-06-22 Thread nitinkak001
Hey, I have exactly this question. Did you get an answer to it?






Hive query execution from Spark(through HiveContext) failing with Apache Sentry

2015-06-17 Thread nitinkak001
I am trying to run a Hive query from Spark code using a HiveContext object.
It was running fine earlier, but since Apache Sentry has been installed the
process is failing with this exception:

org.apache.hadoop.security.AccessControlException: Permission denied:
user=kakn, access=READ_EXECUTE,
inode=/user/hive/warehouse/rt_freewheel_mastering.db/digital_profile_cluster_in:hive:hive:drwxrwx--t
hive-site.xml
http://apache-spark-user-list.1001560.n3.nabble.com/file/n23381/hive-site.xml 
 
I have pasted the full stack trace at the end of this post. My username kakn
is a registered user with Sentry. I know that Spark takes all the
configuration from hive-site.xml to execute the HQL, so I added a few
Sentry-specific properties, but they seem to have no effect. It seems that
the HiveContext is not going through HiveServer2 (which, I understand, talks
to the Sentry component for user translation/delegation). I have attached
the hive-site.xml.

<property>
  <name>hive.security.authorization.task.factory</name>
  <value>org.apache.sentry.binding.hive.SentryHiveAuthorizationTaskFactoryImpl</value>
</property>
<property>
  <name>hive.metastore.pre.event.listeners</name>
  <value>org.apache.sentry.binding.metastore.MetastoreAuthzBinding</value>
  <description>list of comma separated listeners for metastore events.</description>
</property>
<property>
  <name>hive.warehouse.subdir.inherit.perms</name>
  <value>true</value>
</property>




org.apache.hadoop.security.AccessControlException: Permission denied:
user=kakn, access=READ_EXECUTE,
inode=/user/hive/warehouse/rt_freewheel_mastering.db/digital_profile_cluster_in:hive:hive:drwxrwx--t
at
org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:257)
at
org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:238)
at
org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:151)
at
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:138)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6287)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6269)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPathAccess(FSNamesystem.java:6194)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getListingInt(FSNamesystem.java:4793)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getListing(FSNamesystem.java:4755)
at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getListing(NameNodeRpcServer.java:800)
at
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getListing(AuthorizationProviderProxyClientProtocol.java:310)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getListing(ClientNamenodeProtocolServerSideTranslatorPB.java:606)
at
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at
org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
at
org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1895)
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1876)
at
org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
at
org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:104)
at
org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:716)
at
org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
at

Possible to use hive-config.xml instead of hive-site.xml for HiveContext?

2015-05-05 Thread nitinkak001
I am running hive queries from HiveContext, for which we need a
hive-site.xml.

Is it possible to replace it with hive-config.xml? I tried, but it does not
work. I just want a confirmation.






Generating version agnostic jar path value for --jars clause

2015-05-02 Thread nitinkak001
I have a list of Cloudera jars which I need to provide in the --jars clause,
mainly for the HiveContext functionality I am using. However, many of these
jars have the version number as part of their names, so the names might
change when I do a Cloudera upgrade.

Just a note here: Cloudera exposes many jars as a symlink pointing to the
latest version of that jar (e.g.
/opt/cloudera/parcels/CDH/lib/parquet/parquet-hadoop-bundle.jar ->
parquet-hadoop-bundle-1.5.0-cdh5.3.2.jar), in which case this is fine, but
there are many jars which aren't symlinked.

Is there a flexible way to avoid this situation?






Does HiveContext connect to HiveServer2?

2015-03-24 Thread nitinkak001
I am wondering whether HiveContext connects to HiveServer2 or works through
the Hive CLI. The reason I am asking is that Cloudera has deprecated the
Hive CLI.

If the connection is through HiveServer2, is there a way to specify user
credentials?






Re: Weird exception in Spark job

2015-03-24 Thread nitinkak001
Any Ideas on this?






Is yarn-standalone mode deprecated?

2015-03-23 Thread nitinkak001
Is yarn-standalone mode deprecated in Spark now? The reason I am asking is
that while I can find it in the 0.9.0 documentation
(https://spark.apache.org/docs/0.9.0/running-on-yarn.html), I am not able to
find it in 1.2.0.

I am using this mode to run Spark jobs from Oozie as a Java action. Removing
this mode would prevent me from doing that. Are there any other ways of
running a Spark job from Oozie, other than a shell action?






Weird exception in Spark job

2015-03-23 Thread nitinkak001
I am trying to run a Hive query from Spark code through HiveContext. Does
anybody know what these exceptions mean? I have no clue.

LogType: stderr
LogLength: 3345
Log Contents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/jars/avro-tools-1.7.6-cdh5.3.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/jars/pig-0.12.0-cdh5.3.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/jars/slf4j-simple-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1655)
at
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
at
org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115)
at
org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:161)
at
org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after
[1 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at akka.remote.Remoting.start(Remoting.scala:173)
at
akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:579)
at akka.actor.ActorSystemImpl._start(ActorSystem.scala:577)
at akka.actor.ActorSystemImpl.start(ActorSystem.scala:588)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:111)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:104)
at
org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
at
org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1684)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1675)
at 
org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:56)
at
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:122)
at
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
at
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
... 4 more

LogType: stdout
LogLength: 7460
Log Contents:
2015-03-23 18:54:55,571 INFO  [main] executor.CoarseGrainedExecutorBackend
(SignalLogger.scala:register(47)) - Registered signal handlers for [TERM,
HUP, INT]
2015-03-23 18:54:56,898 INFO  [main] spark.SecurityManager
(Logging.scala:logInfo(59)) - Changing view acls to: yarn,kakn
2015-03-23 18:54:56,901 INFO  [main] spark.SecurityManager
(Logging.scala:logInfo(59)) - Changing modify acls to: yarn,kakn
2015-03-23 18:54:56,911 INFO  [main] spark.SecurityManager
(Logging.scala:logInfo(59)) - SecurityManager: authentication disabled; ui
acls disabled; users with view permissions: Set(yarn, kakn); users with
modify permissions: Set(yarn, kakn)
2015-03-23 18:54:57,725 INFO 
[driverPropsFetcher-akka.actor.default-dispatcher-3] slf4j.Slf4jLogger
(Slf4jLogger.scala:applyOrElse(80)) - Slf4jLogger started
2015-03-23 18:54:57,810 INFO 
[driverPropsFetcher-akka.actor.default-dispatcher-3] Remoting
(Slf4jLogger.scala:apply$mcV$sp(74)) - Starting remoting
2015-03-23 18:54:57,871 ERROR
[driverPropsFetcher-akka.actor.default-dispatcher-5] actor.ActorSystemImpl
(Slf4jLogger.scala:apply$mcV$sp(66)) - Uncaught fatal error from thread
[driverPropsFetcher-akka.actor.default-dispatcher-4] shutting down
ActorSystem 

HiveContext test, Spark Context did not initialize after waiting 10000ms

2015-03-06 Thread nitinkak001
I am trying to run a Hive query from Spark using HiveContext. Here is the
code

val conf = new SparkConf().setAppName("HiveSparkIntegrationTest")

conf.set("spark.executor.extraClassPath",
  "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib")
conf.set("spark.driver.extraClassPath",
  "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib")
conf.set("spark.yarn.am.waitTime", "30L")

val sc = new SparkContext(conf)

val sqlContext = new HiveContext(sc)

def inputRDD = sqlContext.sql("describe spark_poc.src_digital_profile_user")

inputRDD.collect().foreach { println }

println(inputRDD.schema.getClass.getName)

Getting this exception. Any clues? The weird part is that if I try to do the
same thing in Java instead of Scala, it runs fine.

Exception in thread "Driver" java.lang.NullPointerException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
15/03/06 17:39:32 ERROR yarn.ApplicationMaster: SparkContext did not
initialize after waiting for 10000 ms. Please check earlier log output for
errors. Failing the application.
Exception in thread "main" java.lang.NullPointerException
at
org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkContextInitialized(ApplicationMaster.scala:218)
at
org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:110)
at
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:434)
at
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53)
at
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
at
org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:433)
at
org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
15/03/06 17:39:32 INFO yarn.ApplicationMaster: AppMaster received a signal.






Re: Running Spark jobs via oozie

2015-03-03 Thread nitinkak001
I am also starting to work on this one. Did you get any solution to this
issue?






Executing hive query from Spark code

2015-03-02 Thread nitinkak001
I want to run a Hive query inside Spark and use the RDDs generated from it
inside Spark. I read in the documentation:

"Hive support is enabled by adding the -Phive and -Phive-thriftserver flags
to Spark's build. This command builds a new assembly jar that includes Hive.
Note that this Hive assembly jar must also be present on all of the worker
nodes, as they will need access to the Hive serialization and
deserialization libraries (SerDes) in order to access data stored in Hive."

I just wanted to know what the -Phive and -Phive-thriftserver flags really do
and whether there is a way to get Hive support without rebuilding the
assembly. Does the flag add a Hive support jar or something?

The reason I am asking is that I will be using the Cloudera version of Spark
in the future and I am not sure how to add Hive support to that Spark
distribution.









Counters in Spark

2015-02-13 Thread nitinkak001
I am trying to implement counters in Spark and I guess Accumulators are the
way to do it.

My motive is to update a counter in a map function and access/reset it in
the driver code. However, the println statement at the end still yields the
value 0 (it should be 9). Am I doing something wrong?

def main(args: Array[String]) {

  val conf = new SparkConf().setAppName("SortedNeighbourhoodMatching")
  val sc = new SparkContext(conf)
  var counter = sc.accumulable(0, "Counter")
  var inputFilePath = args(0)
  val inputRDD = sc.textFile(inputFilePath)

  inputRDD.map { x => {
    counter += 1
  } }
  println(counter.value)
}
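
From what I have read since posting, transformations such as map are lazy
and never run until an action is invoked, so the counter never gets updated.
A sketch of what I now plan to try instead (same idea, but the update
happens inside an action):

// Sketch: update the accumulator inside an action so the job actually runs
// before the driver reads the value.
val counter = sc.accumulator(0, "Counter")
inputRDD.foreach(_ => counter += 1)
println(counter.value) // should now reflect the number of records

The original map never executes because nothing consumes inputRDD, which, as
far as I can tell, is why the value stays 0.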







Where can I find logs set inside RDD processing functions?

2015-02-06 Thread nitinkak001
I am trying to debug my mapPartitions function. Here is the code. There are
two ways I am trying to log: using log.info() or println(). I am running in
yarn-cluster mode. While I can see the logs from the driver code, I am not
able to see the logs from the map/mapPartitions functions in the Application
Tracking URL. Where can I find those logs?

var outputRDD = partitionedRDD.mapPartitions(p => {
  val outputList = new ArrayList[scala.Tuple3[Long, Long, Int]]
  p.map({ case (key, value) => {
    log.info("Inside map")
    println("Inside map")
    for (i <- 0 until outputTuples.size()) {
      val outputRecord = outputTuples.get(i)
      if (outputRecord != null) {
        outputList.add(outputRecord.getCurrRecordProfileID(),
          outputRecord.getWindowRecordProfileID, outputRecord.getScore())
      }
    }
  }
  })
  outputList.iterator()
})

Here is my log4j.properties

log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO







Sort based shuffle not working properly?

2015-02-03 Thread nitinkak001
I am trying to implement secondary sort in Spark, as we do in MapReduce.

Here is my data (tab separated; the c1/c2/c3 header is just for reference
and is not part of the data):

c1  c2  c3
1   2   4
1   3   6
2   4   7
2   6   8
3   5   5
3   1   8
3   2   0

To do the secondary sort, I create a paired RDD as

((c1 + "," + c2), row)

and then use a custom partitioner to partition only on c1. I have set
spark.shuffle.manager = SORT so that the keys within each partition are
sorted. For key 3 I am expecting to get

(3, 1)
(3, 2)
(3, 5)

but I still get the original order:

3,5
3,1
3,2

Here is the custom partitioner code:

class StraightPartitioner(p: Int) extends org.apache.spark.Partitioner {
  def numPartitions = p
  def getPartition(key: Any) = {
    key.asInstanceOf[String].split(",")(0).toInt
  }
}

and here is the driver code; please tell me what I am doing wrong:

val conf = new SparkConf().setAppName("MapInheritanceExample")
conf.set("spark.shuffle.manager", "SORT")
val sc = new SparkContext(conf)
val pF = sc.textFile(inputFile)

val log = LogFactory.getLog("MapFunctionTest")
val partitionedRDD = pF.map { x =>

  var arr = x.split("\t")
  (arr(0) + "," + arr(1), null)

}.partitionBy(new StraightPartitioner(10))

var outputRDD = partitionedRDD.mapPartitions(p => {
  p.map({ case (o, n) => {
    o
  }
  })
})
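
For reference, the approach I may end up using if I can move off 1.1.0:
repartitionAndSortWithinPartitions (available from Spark 1.2, as far as I
know) partitions by one key while sorting each partition by the full
composite key. A rough sketch under that assumption:

import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Partition only on c1, but keep a composite (c1, c2) key so the shuffle
// can sort each partition by both columns.
class FirstColumnPartitioner(partitions: Int) extends Partitioner {
  def numPartitions: Int = partitions
  def getPartition(key: Any): Int = {
    val (c1, _) = key.asInstanceOf[(Int, Int)]
    math.abs(c1.hashCode) % numPartitions
  }
}

val paired = pF.map { line =>
  val arr = line.split("\t")
  ((arr(0).toInt, arr(1).toInt), line)
}

// Requires Spark 1.2+: sorts each partition by the (c1, c2) tuple.
val sorted = paired.repartitionAndSortWithinPartitions(new FirstColumnPartitioner(10))

On 1.1.0, my understanding is that spark.shuffle.manager=SORT only changes
how the shuffle files are written; it does not guarantee that the iterator
handed to mapPartitions is sorted by key, which would explain why I get the
original order back.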






Re: Sort based shuffle not working properly?

2015-02-03 Thread nitinkak001
Just to add, I am using Spark 1.1.0.






Re: Window comparison matching using the sliding window functionality: feasibility

2015-02-02 Thread nitinkak001
Mine was not really a moving average problem. It was more like partitioning
on some keys, sorting (on different keys), and then running a sliding window
through each partition. I reverted back to MapReduce for that (I needed a
secondary sort, which is not very mature in Spark right now).

But, as far as I understand your problem, you should be able to handle it by
converting your RDD to a key-value RDD, which I think will be partitioned on
the key, and then using mapPartitions to run your logic; a rough sketch is
below.
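
Something along these lines; every name here (extractKey, extractSortField,
score, windowSize, numPartitions, rows) is a placeholder for your own code:

import org.apache.spark.HashPartitioner

// Rough sketch: group rows by key into partitions, then slide a window
// over each partition after sorting it.
val keyed = rows.map(r => (extractKey(r), r))
val partitioned = keyed.partitionBy(new HashPartitioner(numPartitions))

val scored = partitioned.mapPartitions { it =>
  it.toSeq
    .sortBy { case (_, r) => extractSortField(r) }   // sort within the partition
    .sliding(windowSize)                             // compare rows inside each window
    .map(window => score(window.map(_._2)))          // your comparison/scoring logic
}

Note that toSeq materializes the partition in memory, so this only works if
a single partition fits; it is just meant to show the shape of the
computation.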

On Mon, Feb 2, 2015 at 1:20 AM, ashu [via Apache Spark User List] 
ml-node+s1001560n21458...@n3.nabble.com wrote:

 Hi,
 I want to know about your moving avg implementation. I am also doing some
 time-series analysis about CPU performance. So I tried simple regression
 but result is not good. rmse is 10 but when I extrapolate it just shoot up
 linearly. I think I should first smoothed out the data then try regression
 to forecast.
 i am thinking of moving avg as an option,tried it out according to this
 http://stackoverflow.com/questions/23402303/apache-spark-moving-average

 but partitionBy is giving me error, I am building with Spark 1.2.0.
 Can you share your ARIMA implementation if it is open source, else can you
 give me hints about it

 Will really appreciate the help.
 Thanks







Connected Components running for a long time and failing eventually

2014-11-24 Thread nitinkak001
I am trying to run connected components on a graph generated by reading an
edge file. It runs for a long time (3-4 hrs) and then eventually fails, and
I can't find any error in the log file. The file I am testing it on has 27M
rows (edges). Is there something obviously wrong with the code?

I tested the same code on an input of around 1000 rows and it works just fine.

object ConnectedComponentsTest {
  def main(args: Array[String]) {
    // Should be some file on your system
    val inputFile =
      "/user/hive/warehouse/spark_poc.db/window_compare_output_subset/00_0.snappy,/user/hive/warehouse/spark_poc.db/window_compare_output_subset/01_0.snappy"
    val conf = new SparkConf().setAppName("ConnectedComponentsTest")
    val sc = new SparkContext(conf)
    val graph = GraphLoader.edgeListFile(sc, inputFile, true)
    val cc = graph.connectedComponents().vertices
    cc.saveAsTextFile("/user/kakn/output")
  }
}






Partition sorting by Spark framework

2014-11-05 Thread nitinkak001
I need to sort my RDD partitions, but a whole partition might not fit into
memory, so I cannot just use Collections.sort(). Does Spark support sorting
within partitions by virtue of its framework? I am working on version 1.1.0.

I looked up a similar, unanswered question:

http://apache-spark-user-list.1001560.n3.nabble.com/sort-order-after-reduceByKey-groupByKey-td2959.html

Thanks all!!






Re: Does JavaSchemaRDD inherit the Hive partitioning of data?

2014-10-28 Thread nitinkak001
Any suggestions guys??






Re: Does JavaSchemaRDD inherit the Hive partitioning of data?

2014-10-28 Thread nitinkak001
So this means that I can create a table and insert data into it with dynamic
partitioning, and those partitions would be inherited by the RDDs. Is that
in Spark 1.1.0?

If not, is there a way to partition the data in a file based on some
attributes of the rows in the data (without hardcoding the number of
partitions)?








Is Spark 1.1.0 incompatible with Hive?

2014-10-27 Thread nitinkak001
I am working on running the following Hive query from Spark:

SELECT * FROM spark_poc.table_name DISTRIBUTE BY GEO_REGION, GEO_COUNTRY
SORT BY IP_ADDRESS, COOKIE_ID

Ran into java.lang.NoSuchMethodError:
com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
(complete stack trace at the bottom). Found a few mentions of this issue in
the user list. It seems (from the thread linked below) that there is a Guava
version incompatibility between Spark 1.1.0 and Hive which is probably fixed
in 1.2.0.

http://apache-spark-user-list.1001560.n3.nabble.com/Hive-From-Spark-td10110.html#a12671

So, I wanted to confirm: is Spark SQL 1.1.0 incompatible with Hive, or is
there a workaround for this?



Exception in thread "Driver" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
Caused by: java.lang.NoSuchMethodError:
com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
at
org.apache.spark.util.collection.OpenHashSet.org$apache$spark$util$collection$OpenHashSet$$hashcode(OpenHashSet.scala:261)
at
org.apache.spark.util.collection.OpenHashSet$mcI$sp.getPos$mcI$sp(OpenHashSet.scala:165)
at
org.apache.spark.util.collection.OpenHashSet$mcI$sp.contains$mcI$sp(OpenHashSet.scala:102)
at
org.apache.spark.util.SizeEstimator$$anonfun$visitArray$2.apply$mcVI$sp(SizeEstimator.scala:214)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at
org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:210)
at
org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:169)
at
org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:161)
at
org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:155)
at
org.apache.spark.util.collection.SizeTracker$class.takeSample(SizeTracker.scala:78)
at
org.apache.spark.util.collection.SizeTracker$class.afterUpdate(SizeTracker.scala:70)
at
org.apache.spark.util.collection.SizeTrackingVector.$plus$eq(SizeTrackingVector.scala:31)
at
org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:236)
at
org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:126)
at
org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:104)
at
org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:750)
at
org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:601)
at
org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:872)
at
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:79)
at
org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:68)
at
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36)
at
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
at
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809)
at
org.apache.spark.sql.hive.HadoopTableReader.init(TableReader.scala:68)
at
org.apache.spark.sql.hive.execution.HiveTableScan.init(HiveTableScan.scala:68)
at
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
at
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
at
org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364)
at
org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at
org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:292)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at 

Does JavaSchemaRDD inherit the Hive partitioning of data?

2014-10-27 Thread nitinkak001
Would the RDD resulting from the query below be partitioned on GEO_REGION,
GEO_COUNTRY? I ran some tests (using mapPartitions on the resulting RDD) and
it seems that 50 partitions are always generated, while there should be
around 1000.

SELECT * FROM spark_poc.table_name DISTRIBUTE BY GEO_REGION, GEO_COUNTRY
SORT BY IP_ADDRESS, COOKIE_ID

If not, how can I partition the data based on an attribute or combination of
attributes in the data?






JavaHiveContext class not found error. Help!!

2014-10-23 Thread nitinkak001
I am trying to run the Hive query below on YARN. I am using Cloudera 5.1.
What can I do to make this work?

SELECT * FROM table_name DISTRIBUTE BY GEO_REGION, GEO_COUNTRY SORT BY
IP_ADDRESS, COOKIE_ID;

Below is the stack trace:

Exception in thread "Thread-4" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:187)
Caused by: java.lang.NoClassDefFoundError:
org/apache/spark/sql/hive/api/java/JavaHiveContext
at HiveContextExample.main(HiveContextExample.java:57)
... 5 more
Caused by: java.lang.ClassNotFoundException:
org.apache.spark.sql.hive.api.java.JavaHiveContext
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 6 more


Here is the code invoking it:

SparkConf conf = new SparkConf().setAppName("PartitionData");

JavaSparkContext ctx = new JavaSparkContext(conf);

JavaHiveContext hiveContext = new JavaHiveContext(ctx);

String sql = "SELECT * FROM table_name DISTRIBUTE BY GEO_REGION, GEO_COUNTRY SORT BY IP_ADDRESS, COOKIE_ID";

JavaSchemaRDD partitionedRDD = hiveContext.sql(sql);






com.esotericsoftware.kryo.KryoException: Buffer overflow.

2014-10-21 Thread nitinkak001
I am running a simple RDD filter command. What does this exception mean?
Here is the full stack trace (and the code below it):

com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0,
required: 133
at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
at
com.esotericsoftware.kryo.io.Output.writeString_slow(Output.java:420)
at com.esotericsoftware.kryo.io.Output.writeString(Output.java:326)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$StringArraySerializer.write(DefaultArraySerializers.java:274)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$StringArraySerializer.write(DefaultArraySerializers.java:262)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at
org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:138)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

Here is the code of the main function:

String comparisonFieldIndexes = "16,18";
String segmentFieldIndexes = "14,15";
String comparisonFieldWeights = "50, 50";
String delimiter = "" + '\001';

PartitionDataOnColumn parOnCol = new PartitionDataOnColumn(70,
    comparisonFieldIndexes, comparisonFieldWeights, segmentFieldIndexes,
    delimiter);

JavaRDD<String> filtered_rdd = origRDD.filter(parOnCol.new
    FilterEmptyFields(parOnCol.fieldIndexes, parOnCol.DELIMITER));

parOnCol.printRDD(filtered_rdd);


Here is the FilterEmptyFields class:

public class FilterEmptyFields implements Function<String, Boolean> {

    final int[] nonEmptyFields;
    final String DELIMITER;

    public FilterEmptyFields(int[] nonEmptyFields, String delimiter) {
        this.nonEmptyFields = nonEmptyFields;
        this.DELIMITER = delimiter;
    }

    @Override
    public Boolean call(String s) {

        String[] fields = s.split(DELIMITER);

        for (int i = 0; i < nonEmptyFields.length; i++) {
            if (fields[nonEmptyFields[i]] == null ||
                    fields[nonEmptyFields[i]].isEmpty()) {
                return false;
            }
        }

        return true;
    }
}

Any suggestions guys?
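
One thing I plan to try, since the exception says Kryo's output buffer ran
out of room, is raising the Kryo buffer limits on the SparkConf. A hedged
sketch in Scala (the property names and values are my best understanding for
Spark 1.x, not something I have verified):

import org.apache.spark.SparkConf

// Sketch: enlarge Kryo's serialization buffers (sizes in MB on Spark 1.x).
val conf = new SparkConf()
  .setAppName("PartitionData")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", "64")      // initial buffer per task
  .set("spark.kryoserializer.buffer.max.mb", "512") // upper limit before this error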






Re: Window comparison matching using the sliding window functionality: feasibility

2014-10-10 Thread nitinkak001
Thanks @category_theory, the post was of great help!!

I had to learn a few things before I could understand it completely.
However, I am facing the issue of partitioning the data (using partitionBy)
without providing a hardcoded value for the number of partitions. The
partitions need to be driven by the data (the segmentation key I am using)
in my case.

So my question is, say:

the number of partitions generated by my segmentation key = 1000
the number given to the partitioner = 2000

In this case, would 2000 partitions be created (which would break the
partition boundary of the segmentation key)? If so, the sliding window would
roll over multiple partitions and the computation would generate wrong
results.

Thanks again for the response!!
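
The direction I am exploring for now is deriving the partitioner from the
data itself, so that every distinct segmentation key gets its own partition.
A rough sketch (keyedRDD is a placeholder for my (segmentKey, row) pair RDD):

import org.apache.spark.Partitioner

// Sketch: one partition per distinct segmentation key, with the mapping
// computed from the data instead of hardcoded.
val keyToPartition: Map[String, Int] =
  keyedRDD.keys.distinct().collect().zipWithIndex.toMap

class SegmentKeyPartitioner(mapping: Map[String, Int]) extends Partitioner {
  def numPartitions: Int = mapping.size
  def getPartition(key: Any): Int = mapping(key.asInstanceOf[String])
}

val partitioned = keyedRDD.partitionBy(new SegmentKeyPartitioner(keyToPartition))

This guarantees that a window never spans two keys, at the cost of
collecting the distinct keys to the driver first (which should be fine with
only about 1000 of them).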

On Tue, Sep 30, 2014 at 11:51 AM, category_theory [via Apache Spark User
List] ml-node+s1001560n1540...@n3.nabble.com wrote:

 Not sure if this is what you are after but its based on a moving average
 within spark...  I was building an ARIMA model on top of spark and this
 helped me out a lot:

 http://stackoverflow.com/questions/23402303/apache-spark-moving-average

 On Tue, Sep 30, 2014 at 8:19 AM, nitinkak001 [hidden email] wrote:

 Any ideas guys?

 Trying to find some information online. Not much luck so far.







 --
  If you reply to this email, your message will be added to the discussion
 below:

 http://apache-spark-user-list.1001560.n3.nabble.com/Window-comparison-matching-using-the-sliding-window-functionality-feasibility-tp15352p15407.html
  To unsubscribe from Window comparison matching using the sliding window
 functionality: feasibility, click here
 http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_codenode=15352code=bml0aW5rYWswMDFAZ21haWwuY29tfDE1MzUyfDEyMjcwMjA2NQ==
 .
 NAML
 http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewerid=instant_html%21nabble%3Aemail.namlbase=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespacebreadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml






Re: Window comparison matching using the sliding window functionality: feasibility

2014-09-30 Thread nitinkak001
Any ideas guys?

Trying to find some information online. Not much luck so far.






Window comparison matching using the sliding window functionality: feasibility

2014-09-29 Thread nitinkak001
I need to know the feasibility of the task below. I am thinking of it as a
combined MapReduce and Spark effort.

I need to run a distributed sliding-window comparison for digital data
matching on top of Hadoop. The data (a Hive table) will be partitioned and
distributed across the data nodes. Then the window comparison tool, multiple
instances of it, would run on the individual partitions (local to each data
node).

This window comparison tool is a sliding window in which all the rows within
a window interval are compared to each other based on different columns, and
a score is generated.

I am more familiar with MapReduce, and I think we can do everything up to
the partitioning part in it. For the distributed window comparison I am
thinking of using Spark. I know Spark Streaming has sliding window
functionality. Can I use that to accomplish the above task?

Any suggestions are appreciated.


