Spark mapPartitions output object size larger than expected

2017-02-06 Thread nitinkak001
I am storing the output of mapPartitions in a ListBuffer and exposing its iterator as the output. The output is a list of Long tuples (Tuple2). When I check the size of the object using Spark's SizeEstimator.estimate method, it comes out to 80 bytes per record/tuple object (calculating this by "size
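For reference, a minimal sketch of the measurement described (the record value is made up): 80 bytes per tuple is plausible once JVM object headers and boxing are counted on top of the 16 bytes of raw payload.

    import org.apache.spark.util.SizeEstimator

    // One (Long, Long) record: the Tuple2 carries an object header plus two
    // references, each pointing at a boxed java.lang.Long with its own header
    // and 8-byte value, so the estimate far exceeds the raw 16-byte payload.
    val record: (Long, Long) = (1L, 2L)
    println(SizeEstimator.estimate(record))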

How to add all jars in a folder to executor classpath?

2016-10-18 Thread nitinkak001
I need to add all the jars in hive/lib to my Spark job's executor classpath. I tried spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hive/lib and spark.executor.extraClassPath=/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hive/lib/*, but it does not
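A minimal sketch of the classpath half, assuming the CDH path from the question: spark.executor.extraClassPath is placed on the executor JVM's classpath, where a trailing /* should expand to every jar directly in that directory (standard JVM wildcard expansion, not recursive). Note that spark.executor.extraLibraryPath affects the native library path (java.library.path), not the Java classpath.

    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.extraClassPath",
           "/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hive/lib/*")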

Spark SQL (Hive query through HiveContext) always creating 31 partitions

2016-04-04 Thread nitinkak001
I am running Hive queries using HiveContext from my Spark code. No matter which query I run and how much data it is, it always generates 31 partitions. Does anybody know the reason? Is there a predefined/configurable setting for it? I essentially need more partitions. I am using this code snippet to
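A minimal sketch, not a confirmed explanation: two knobs that commonly control the partition count of a HiveContext query result on the DataFrame API (the table name below is hypothetical).

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc) // assumes an existing SparkContext `sc`
    hiveContext.setConf("spark.sql.shuffle.partitions", "200") // partitions after a shuffle
    val df = hiveContext.sql("SELECT * FROM some_db.some_table") // hypothetical table
    val wider = df.repartition(200) // forces a partition count regardless of input splits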

Out of Memory error caused by output object in mapPartitions

2016-02-15 Thread nitinkak001
My mapPartitions code, given below, outputs one record for each input record, so the output has the same number of records as the input. I am loading the output data into a ListBuffer, and this object is turning out to be too large for memory, leading to an Out Of Memory exception. To be more
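A minimal sketch of the usual remedy (the per-record transformation is hypothetical): return a lazily evaluated iterator from mapPartitions instead of materializing the whole output in a ListBuffer, so only one output record needs to be resident at a time.

    import org.apache.spark.rdd.RDD

    def process(input: RDD[(Long, Long)]): RDD[(Long, Long)] =
      input.mapPartitions { iter =>
        // iter.map is lazy: each output record is produced only when the
        // downstream consumer pulls it, so nothing is buffered per partition.
        iter.map { case (k, v) => (k, v + 1) } // hypothetical transformation
      }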

Spark SQL incompatible with Apache Sentry (Cloudera bundle)

2015-06-24 Thread nitinkak001
CDH version: 5.3. Spark version: 1.2. I was trying to execute a Hive query from Spark code (using the HiveContext class). It was working fine until we installed Apache Sentry; now it's giving me a read-permission exception: org.apache.hadoop.security.AccessControlException: Permission denied:

Re: Does HiveContext connect to HiveServer2?

2015-06-22 Thread nitinkak001
Hey, I have exactly this question. Did you get an answer to it?

Hive query execution from Spark(through HiveContext) failing with Apache Sentry

2015-06-17 Thread nitinkak001
I am trying to run a Hive query from Spark code using a HiveContext object. It was running fine earlier, but since Apache Sentry was installed the process has been failing with this exception: org.apache.hadoop.security.AccessControlException: Permission denied: user=kakn,

Possible to use hive-config.xml instead of hive-site.xml for HiveContext?

2015-05-05 Thread nitinkak001
I am running Hive queries from HiveContext, for which we need a hive-site.xml. Is it possible to replace it with hive-config.xml? I tried, but it does not work. I just want a confirmation.

Generating version agnostic jar path value for --jars clause

2015-05-02 Thread nitinkak001
I have a list of Cloudera jars which I need to provide in the --jars clause, mainly for the HiveContext functionality I am using. However, many of these jars have the version number as part of their names, which means the names might change when I do a Cloudera upgrade. Just a note here,
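A minimal sketch of one workaround, assuming all the jars sit in a single directory and that the versionless /opt/cloudera/parcels/CDH symlink points at the active parcel: build the comma-separated --jars value at submit time instead of hardcoding versioned names.

    import java.io.File

    val hiveLib = new File("/opt/cloudera/parcels/CDH/lib/hive/lib")
    val jars = hiveLib.listFiles()
      .filter(_.getName.endsWith(".jar"))
      .map(_.getAbsolutePath)
      .mkString(",")
    println(jars) // paste into --jars, or use it to assemble the spark-submit command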

Does HiveContext connect to HiveServer2?

2015-03-24 Thread nitinkak001
I am wondering whether HiveContext connects to HiveServer2 or works through the Hive CLI. The reason I am asking is that Cloudera has deprecated the Hive CLI. If the connection is through HiveServer2, is there a way to specify user credentials?

Re: Weird exception in Spark job

2015-03-24 Thread nitinkak001
Any ideas on this?

Is yarn-standalone mode deprecated?

2015-03-23 Thread nitinkak001
Is yarn-standalone mode deprecated in Spark now? The reason I am asking is that while I can find it in the 0.9.0 documentation (https://spark.apache.org/docs/0.9.0/running-on-yarn.html), I am not able to find it in 1.2.0. I am using this mode to run Spark jobs from Oozie as a java action.

Weird exception in Spark job

2015-03-23 Thread nitinkak001
I am trying to run a Hive query from Spark code through HiveContext. Does anybody know what these exceptions mean? I have no clue. LogType: stderr LogLength: 3345 Log Contents: SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in

HiveContext test, Spark Context did not initialize after waiting 10000ms

2015-03-06 Thread nitinkak001
I am trying to run a Hive query from Spark using HiveContext. Here is the code: val conf = new SparkConf().setAppName("HiveSparkIntegrationTest") conf.set("spark.executor.extraClassPath", "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib");
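A minimal completion of the truncated snippet, with the string literals quoted; the HiveContext line is an assumption based on the thread title.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val conf = new SparkConf().setAppName("HiveSparkIntegrationTest")
    conf.set("spark.executor.extraClassPath",
      "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)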

Re: Running Spark jobs via oozie

2015-03-03 Thread nitinkak001
I am also starting to work on this one. Did you get any solution to this issue?

Executing hive query from Spark code

2015-03-02 Thread nitinkak001
I want to run a Hive query inside Spark and use the RDDs generated from it inside Spark. I read in the documentation: "Hive support is enabled by adding the -Phive and -Phive-thriftserver flags to Spark's build. This command builds a new assembly jar that includes Hive. Note that this Hive
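A minimal sketch of the pattern being asked about, assuming a Hive-enabled Spark build and an existing SparkContext sc (the table is hypothetical).

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // In Spark 1.0-1.2, sql() returns a SchemaRDD, which is itself an
    // RDD[Row] and can feed further Spark transformations directly.
    val result = hiveContext.sql("SELECT key, value FROM some_table") // hypothetical table
    result.take(5).foreach(println)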

Counters in Spark

2015-02-13 Thread nitinkak001
I am trying to implement counters in Spark, and I guess accumulators are the way to do it. My motive is to update a counter in a map function and access/reset it in the driver code. However, the println statement at the end still yields the value 0 (it should be 9). Am I doing something wrong? def
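A minimal sketch of the usual explanation for the 0 (a guess, since the code is cut off): transformations are lazy, so the accumulator only holds its final value after an action has forced the map to run.

    val counter = sc.accumulator(0) // Spark 1.x accumulator API
    val data = sc.parallelize(1 to 9)
    val mapped = data.map { x => counter += 1; x }
    // Reading counter.value here would still give 0: map() has not run yet.
    mapped.count() // the action actually executes the map
    println(counter.value) // now prints 9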

Where can I find logs set inside RDD processing functions?

2015-02-06 Thread nitinkak001
I am trying to debug my mapPartitions function. Here is the code. There are two ways I am trying to log, using log.info() or println(). I am running in yarn-cluster mode. While I can see the logs from the driver code, I am not able to see the logs from the map and mapPartitions functions in the Application Tracking
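A minimal sketch of where such output actually lands: code inside map/mapPartitions runs on the executors, so in yarn-cluster mode it shows up in the executors' stdout/stderr (Spark UI Executors tab, or yarn logs -applicationId <appId> after the job ends), not in the driver's log.

    val rdd = sc.parallelize(1 to 100, 4)
    val tagged = rdd.mapPartitions { iter =>
      // Runs on an executor, not the driver: look for this line in the
      // executor's stdout rather than in the driver/application-master log.
      println(s"partition on host ${java.net.InetAddress.getLocalHost.getHostName}")
      iter
    }
    tagged.count() // force evaluation so the partitions actually run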

Sort based shuffle not working properly?

2015-02-03 Thread nitinkak001
I am trying to implement secondary sort in Spark, as we do in MapReduce. Here is my data (tab separated, without the c1, c2, c3 header):

    c1  c2  c3
    1   2   4
    1   3   6
    2   4   7
    2   6   8
    3   5   5
    3   1   8
    3   2   0

To do secondary sort, I
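A minimal sketch of one way to do this natively (repartitionAndSortWithinPartitions arrived in Spark 1.2, so it would not help on 1.1.0): partition on c1 alone, and let the implicit ordering on the composite (c1, c2) key sort each partition.

    import org.apache.spark.Partitioner
    import org.apache.spark.SparkContext._ // ordered-RDD implicits on older 1.x
    import org.apache.spark.rdd.RDD

    val rows: RDD[(Int, Int, Int)] =
      sc.parallelize(Seq((1, 2, 4), (1, 3, 6), (2, 4, 7), (2, 6, 8),
                         (3, 5, 5), (3, 1, 8), (3, 2, 0)))

    // Route rows by c1 only, so all rows sharing c1 land in one partition.
    class C1Partitioner(override val numPartitions: Int) extends Partitioner {
      def getPartition(key: Any): Int = key match {
        case (c1: Int, _) =>
          val mod = c1 % numPartitions
          if (mod < 0) mod + numPartitions else mod
      }
    }

    // The implicit Ordering on (Int, Int) sorts by c1 first, then c2.
    val sorted = rows
      .map { case (c1, c2, c3) => ((c1, c2), c3) }
      .repartitionAndSortWithinPartitions(new C1Partitioner(2))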

Re: Sort based shuffle not working properly?

2015-02-03 Thread nitinkak001
Just to add, I am using Spark 1.1.0.

Re: Window comparison matching using the sliding window functionality: feasibility

2015-02-02 Thread nitinkak001
Mine was not really a moving-average problem. It was more like partitioning on some keys, sorting (on different keys), and then running a sliding window through each partition. I reverted back to MapReduce for that (I needed secondary sort, which is not very mature in Spark right now). But, as

Connected Components running for a long time and failing eventually

2014-11-24 Thread nitinkak001
I am trying to run connected components on a graph generated by reading an edge file. It runs for a long time (3-4 hrs) and then eventually fails, and I can't find any error in the log file. The file I am testing it on has 27M rows (edges). Is there something obviously wrong with the code? I tested the
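For reference, a minimal GraphX sketch of the operation described, assuming a whitespace-separated "srcId dstId" edge file (the path is hypothetical).

    import org.apache.spark.graphx.GraphLoader

    val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edges.txt") // hypothetical path
    // Each vertex is paired with the lowest vertex id in its component.
    val cc = graph.connectedComponents().vertices
    cc.take(10).foreach(println)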

Partition sorting by Spark framework

2014-11-05 Thread nitinkak001
I need to sort my RDD partitions, but a whole partition might not fit into memory, so I cannot use the Collections.sort() method. Does Spark support partition sorting by virtue of its framework? I am working on version 1.1.0. I looked up a similar unanswered question:

Re: Does JavaSchemaRDD inherit the Hive partitioning of data?

2014-10-28 Thread nitinkak001
Any suggestions, guys?

Re: Does JavaSchemaRDD inherit the Hive partitioning of data?

2014-10-28 Thread nitinkak001
So, this means that I can create a table and insert data into it with dynamic partitioning, and those partitions will be inherited by the RDDs. Is this in Spark 1.1.0? If not, is there a way to partition the data in a file based on some attributes of the rows (without hardcoding the number

Is Spark 1.1.0 incompatible with Hive?

2014-10-27 Thread nitinkak001
I am working on running the following Hive query from Spark: SELECT * FROM spark_poc.table_name DISTRIBUTE BY GEO_REGION, GEO_COUNTRY SORT BY IP_ADDRESS, COOKIE_ID. Ran into java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode; (complete

Does JavaSchemaRDD inherit the Hive partitioning of data?

2014-10-27 Thread nitinkak001
Would the RDD resulting from the query below be partitioned on GEO_REGION, GEO_COUNTRY? I ran some tests (using mapPartitions on the resulting RDD) and it seems that there are always 50 partitions generated, while there should be around 1000. SELECT * FROM spark_poc.table_name DISTRIBUTE BY
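A minimal sketch of how to check this directly (Spark 1.1-era API, where sql() returns a SchemaRDD, itself an RDD[Row]).

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc) // assumes an existing SparkContext `sc`
    val result = hiveContext.sql(
      "SELECT * FROM spark_poc.table_name DISTRIBUTE BY GEO_REGION, GEO_COUNTRY")
    // A constant 50 regardless of key cardinality suggests a fixed shuffle
    // setting is deciding the count, not the number of distinct key values.
    println(result.partitions.length)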

JavaHiveContext class not found error. Help!!

2014-10-23 Thread nitinkak001
I am trying to run the Hive query below on YARN. I am using Cloudera 5.1. What can I do to make this work? SELECT * FROM table_name DISTRIBUTE BY GEO_REGION, GEO_COUNTRY SORT BY IP_ADDRESS, COOKIE_ID; Below is the stack trace: Exception in thread Thread-4

com.esotericsoftware.kryo.KryoException: Buffer overflow.

2014-10-21 Thread nitinkak001
I am running a simple RDD filter command and it fails with this exception. What does it mean? Here is the full stack trace (and code below it): com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 133 at com.esotericsoftware.kryo.io.Output.require(Output.java:138) at
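"Buffer overflow. Available: 0, required: 133" generally means Kryo's serialization buffer is too small for an object being written. A minimal sketch of the usual adjustment, hedged on version: the 1.x knobs were spark.kryoserializer.buffer.mb and spark.kryoserializer.buffer.max.mb; from 1.4 they were renamed spark.kryoserializer.buffer and spark.kryoserializer.buffer.max.

    val conf = new org.apache.spark.SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Raise the Kryo buffer ceiling (pre-1.4 property name shown).
      .set("spark.kryoserializer.buffer.max.mb", "512")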

Re: Window comparison matching using the sliding window functionality: feasibility

2014-10-10 Thread nitinkak001

Re: Window comparison matching using the sliding window functionality: feasibility

2014-09-30 Thread nitinkak001
Any ideas, guys? Trying to find some information online; not much luck so far.

Window comparison matching using the sliding window functionality: feasibility

2014-09-29 Thread nitinkak001
I need to know the feasibility of the task below; I am thinking of it as a MapReduce/Spark effort. I need to run a distributed sliding-window comparison for digital data matching on top of Hadoop. The data (a Hive table) will be partitioned and distributed across the data nodes. Then the window
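A minimal sketch of the per-partition windowing piece, assuming each partition's records already arrive sorted; Scala's Iterator.sliding does the windowing, and the comparison itself is left hypothetical.

    val rdd = sc.parallelize(Seq(1, 3, 4, 7, 8, 9), 2) // hypothetical pre-sorted data
    val windows = rdd.mapPartitions { iter =>
      iter.sliding(3)                // 3 consecutive records per window
          .map(window => window.sum) // hypothetical per-window comparison/score
    }
    windows.collect().foreach(println)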