Spark mapPartitions output object size larger than expected

2017-02-06 Thread nitinkak001
I am storing the output of mapPartitions in a ListBuffer and exposing its iterator as the output. The output is a list of Long tuples (Tuple2). When I check the size of the object using Spark's SizeEstimator.estimate method, it comes out to 80 bytes per record/tuple object (calculating this by "size
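For reference, a minimal sketch of the measurement described (the record value is made up): 80 bytes per tuple is plausible once JVM object headers and boxing are counted on top of the 16 bytes of raw payload.

    import org.apache.spark.util.SizeEstimator

    // One (Long, Long) record: the Tuple2 carries an object header plus two
    // references, each pointing at a boxed java.lang.Long with its own header
    // and 8-byte value, so the estimate far exceeds the raw 16-byte payload.
    val record: (Long, Long) = (1L, 2L)
    println(SizeEstimator.estimate(record))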

How to add all jars in a folder to executor classpath?

2016-10-18 Thread nitinkak001
I need to add all the jars in hive/lib to my Spark job's executor classpath. I tried spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hive/lib and spark.executor.extraClassPath=/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hive/lib/*, but it does not
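A minimal sketch of the classpath half, assuming the CDH path from the question: spark.executor.extraClassPath is placed on the executor JVM's classpath, where a trailing /* should expand to every jar directly in that directory (standard JVM wildcard expansion, not recursive). Note that spark.executor.extraLibraryPath affects the native library path (java.library.path), not the Java classpath.

    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.extraClassPath",
           "/opt/cloudera/parcels/CDH-5.8.2-1.cdh5.8.2.p0.3/lib/hive/lib/*")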

Spark SQL (Hive query through HiveContext) always creating 31 partitions

2016-04-04 Thread nitinkak001
I am running Hive queries using HiveContext from my Spark code. No matter which query I run and how much data it is, it always generates 31 partitions. Does anybody know the reason? Is there a predefined/configurable setting for it? I essentially need more partitions. I am using this code snippet to
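A minimal sketch, not a confirmed explanation: two knobs that commonly control the partition count of a HiveContext query result on the DataFrame API (the table name below is hypothetical).

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc) // assumes an existing SparkContext `sc`
    hiveContext.setConf("spark.sql.shuffle.partitions", "200") // partitions after a shuffle
    val df = hiveContext.sql("SELECT * FROM some_db.some_table") // hypothetical table
    val wider = df.repartition(200) // forces a partition count regardless of input splits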

Out of Memory error caused by output object in mapPartitions

2016-02-15 Thread nitinkak001
My mapPartitions code, given below, outputs one record for each input record, so the output has the same number of records as the input. I am loading the output data into a ListBuffer, and this object is turning out to be too large for memory, leading to an Out Of Memory exception. To be more
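A minimal sketch of the usual remedy (the per-record transformation is hypothetical): return a lazily evaluated iterator from mapPartitions instead of materializing the whole output in a ListBuffer, so only one output record needs to be resident at a time.

    import org.apache.spark.rdd.RDD

    def process(input: RDD[(Long, Long)]): RDD[(Long, Long)] =
      input.mapPartitions { iter =>
        // iter.map is lazy: each output record is produced only when the
        // downstream consumer pulls it, so nothing is buffered per partition.
        iter.map { case (k, v) => (k, v + 1) } // hypothetical transformation
      }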

Spark SQL incompatible with Apache Sentry (Cloudera bundle)

2015-06-24 Thread nitinkak001
CDH version: 5.3. Spark version: 1.2. I was trying to execute a Hive query from Spark code (using the HiveContext class). It was working fine until we installed Apache Sentry; now it's giving me a read-permission exception: org.apache.hadoop.security.AccessControlException: Permission denied:

Re: Does HiveContext connect to HiveServer2?

2015-06-22 Thread nitinkak001
Hey, I have exactly this question. Did you get an answer to it?

Hive query execution from Spark(through HiveContext) failing with Apache Sentry

2015-06-17 Thread nitinkak001
I am trying to run a Hive query from Spark code using a HiveContext object. It was running fine earlier, but since Apache Sentry was installed the process has been failing with this exception: org.apache.hadoop.security.AccessControlException: Permission denied: user=kakn,

Possible to use hive-config.xml instead of hive-site.xml for HiveContext?

2015-05-05 Thread nitinkak001
I am running Hive queries from HiveContext, for which we need a hive-site.xml. Is it possible to replace it with hive-config.xml? I tried, but it does not work. I just want a confirmation.

Generating version agnostic jar path value for --jars clause

2015-05-02 Thread nitinkak001
I have a list of Cloudera jars which I need to provide in the --jars clause, mainly for the HiveContext functionality I am using. However, many of these jars have the version number as part of their names, which means the names might change when I do a Cloudera upgrade. Just a note here,
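A minimal sketch of one workaround, assuming all the jars sit in a single directory and that the versionless /opt/cloudera/parcels/CDH symlink points at the active parcel: build the comma-separated --jars value at submit time instead of hardcoding versioned names.

    import java.io.File

    val hiveLib = new File("/opt/cloudera/parcels/CDH/lib/hive/lib")
    val jars = hiveLib.listFiles()
      .filter(_.getName.endsWith(".jar"))
      .map(_.getAbsolutePath)
      .mkString(",")
    println(jars) // paste into --jars, or use it to assemble the spark-submit command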

Does HiveContext connect to HiveServer2?

2015-03-24 Thread nitinkak001
I am wondering whether HiveContext connects to HiveServer2 or works through the Hive CLI. The reason I am asking is that Cloudera has deprecated the Hive CLI. If the connection is through HiveServer2, is there a way to specify user credentials?

Re: Weird exception in Spark job

2015-03-24 Thread nitinkak001
Any ideas on this?

Is yarn-standalone mode deprecated?

2015-03-23 Thread nitinkak001
Is yarn-standalone mode deprecated in Spark now? The reason I am asking is that while I can find it in the 0.9.0 documentation (https://spark.apache.org/docs/0.9.0/running-on-yarn.html), I am not able to find it in 1.2.0. I am using this mode to run Spark jobs from Oozie as a java action.

Weird exception in Spark job

2015-03-23 Thread nitinkak001
I am trying to run a Hive query from Spark code through HiveContext. Does anybody know what these exceptions mean? I have no clue. LogType: stderr LogLength: 3345 Log Contents: SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in

HiveContext test, Spark Context did not initialize after waiting 10000ms

2015-03-06 Thread nitinkak001
I am trying to run a Hive query from Spark using HiveContext. Here is the code: val conf = new SparkConf().setAppName("HiveSparkIntegrationTest") conf.set("spark.executor.extraClassPath", "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib");
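A minimal completion of the truncated snippet, with the string literals quoted; the HiveContext line is an assumption based on the thread title.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val conf = new SparkConf().setAppName("HiveSparkIntegrationTest")
    conf.set("spark.executor.extraClassPath",
      "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)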

Re: Running Spark jobs via oozie

2015-03-03 Thread nitinkak001
I am also starting to work on this one. Did you get any solution to this issue?

Executing hive query from Spark code

2015-03-02 Thread nitinkak001
I want to run a Hive query inside Spark and use the RDDs generated from it inside Spark. I read in the documentation: "Hive support is enabled by adding the -Phive and -Phive-thriftserver flags to Spark's build. This command builds a new assembly jar that includes Hive. Note that this Hive
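A minimal sketch of the pattern being asked about, assuming a Hive-enabled Spark build and an existing SparkContext sc (the table is hypothetical).

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // In Spark 1.0-1.2, sql() returns a SchemaRDD, which is itself an
    // RDD[Row] and can feed further Spark transformations directly.
    val result = hiveContext.sql("SELECT key, value FROM some_table") // hypothetical table
    result.take(5).foreach(println)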

Counters in Spark

2015-02-13 Thread nitinkak001
I am trying to implement counters in Spark, and I guess accumulators are the way to do it. My motive is to update a counter in a map function and access/reset it in the driver code. However, the println statement at the end still yields the value 0 (it should be 9). Am I doing something wrong? def
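A minimal sketch of the usual explanation for the 0 (a guess, since the code is cut off): transformations are lazy, so the accumulator only holds its final value after an action has forced the map to run.

    val counter = sc.accumulator(0) // Spark 1.x accumulator API
    val data = sc.parallelize(1 to 9)
    val mapped = data.map { x => counter += 1; x }
    // Reading counter.value here would still give 0: map() has not run yet.
    mapped.count() // the action actually executes the map
    println(counter.value) // now prints 9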

Where can I find logs set inside RDD processing functions?

2015-02-06 Thread nitinkak001
I am trying to debug my mapPartitions function. Here is the code. There are two ways I am trying to log, using log.info() or println(). I am running in yarn-cluster mode. While I can see the logs from the driver code, I am not able to see the logs from the map and mapPartitions functions in the Application Tracking
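A minimal sketch of where such output actually lands: code inside map/mapPartitions runs on the executors, so in yarn-cluster mode it shows up in the executors' stdout/stderr (Spark UI Executors tab, or yarn logs -applicationId <appId> after the job ends), not in the driver's log.

    val rdd = sc.parallelize(1 to 100, 4)
    val tagged = rdd.mapPartitions { iter =>
      // Runs on an executor, not the driver: look for this line in the
      // executor's stdout rather than in the driver/application-master log.
      println(s"partition on host ${java.net.InetAddress.getLocalHost.getHostName}")
      iter
    }
    tagged.count() // force evaluation so the partitions actually run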

Sort based shuffle not working properly?

2015-02-03 Thread nitinkak001
I am trying to implement secondary sort in Spark, as we do in MapReduce. Here is my data (tab separated, without the c1, c2, c3 header):

    c1  c2  c3
    1   2   4
    1   3   6
    2   4   7
    2   6   8
    3   5   5
    3   1   8
    3   2   0

To do secondary sort, I
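A minimal sketch of one way to do this natively (repartitionAndSortWithinPartitions arrived in Spark 1.2, so it would not help on 1.1.0): partition on c1 alone, and let the implicit ordering on the composite (c1, c2) key sort each partition.

    import org.apache.spark.Partitioner
    import org.apache.spark.SparkContext._ // ordered-RDD implicits on older 1.x
    import org.apache.spark.rdd.RDD

    val rows: RDD[(Int, Int, Int)] =
      sc.parallelize(Seq((1, 2, 4), (1, 3, 6), (2, 4, 7), (2, 6, 8),
                         (3, 5, 5), (3, 1, 8), (3, 2, 0)))

    // Route rows by c1 only, so all rows sharing c1 land in one partition.
    class C1Partitioner(override val numPartitions: Int) extends Partitioner {
      def getPartition(key: Any): Int = key match {
        case (c1: Int, _) =>
          val mod = c1 % numPartitions
          if (mod < 0) mod + numPartitions else mod
      }
    }

    // The implicit Ordering on (Int, Int) sorts by c1 first, then c2.
    val sorted = rows
      .map { case (c1, c2, c3) => ((c1, c2), c3) }
      .repartitionAndSortWithinPartitions(new C1Partitioner(2))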

Re: Sort based shuffle not working properly?

2015-02-03 Thread nitinkak001
Just to add, I am using Spark 1.1.0.

Re: Window comparison matching using the sliding window functionality: feasibility

2015-02-02 Thread nitinkak001
Mine was not really a moving-average problem. It was more like partitioning on some keys, sorting (on different keys), and then running a sliding window through each partition. I reverted back to MapReduce for that (I needed secondary sort, which is not very mature in Spark right now). But, as

Connected Components running for a long time and failing eventually

2014-11-24 Thread nitinkak001
I am trying to run connected components on a graph generated by reading an edge file. It runs for a long time (3-4 hrs) and then eventually fails, and I can't find any error in the log file. The file I am testing it on has 27M rows (edges). Is there something obviously wrong with the code? I tested the
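For reference, a minimal GraphX sketch of the operation described, assuming a whitespace-separated "srcId dstId" edge file (the path is hypothetical).

    import org.apache.spark.graphx.GraphLoader

    val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edges.txt") // hypothetical path
    // Each vertex is paired with the lowest vertex id in its component.
    val cc = graph.connectedComponents().vertices
    cc.take(10).foreach(println)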

Partition sorting by Spark framework

2014-11-05 Thread nitinkak001
I need to sort my RDD partitions, but a whole partition might not fit into memory, so I cannot use the Collections.sort() method. Does Spark support partition sorting by virtue of its framework? I am working on version 1.1.0. I looked up a similar unanswered question:

Re: Does JavaSchemaRDD inherit the Hive partitioning of data?

2014-10-28 Thread nitinkak001
Any suggestions, guys?

Re: Does JavaSchemaRDD inherit the Hive partitioning of data?

2014-10-28 Thread nitinkak001
So, this means that I can create a table and insert data into it with dynamic partitioning, and those partitions will be inherited by the RDDs. Is this in Spark 1.1.0? If not, is there a way to partition the data in a file based on some attributes of the rows (without hardcoding the number

Is Spark 1.1.0 incompatible with Hive?

2014-10-27 Thread nitinkak001
I am working on running the following Hive query from Spark: SELECT * FROM spark_poc.table_name DISTRIBUTE BY GEO_REGION, GEO_COUNTRY SORT BY IP_ADDRESS, COOKIE_ID. Ran into java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode; (complete

Does JavaSchemaRDD inherit the Hive partitioning of data?

2014-10-27 Thread nitinkak001
Would the RDD resulting from the query below be partitioned on GEO_REGION, GEO_COUNTRY? I ran some tests (using mapPartitions on the resulting RDD) and it seems that there are always 50 partitions generated, while there should be around 1000. SELECT * FROM spark_poc.table_name DISTRIBUTE BY
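A minimal sketch of how to check this directly (Spark 1.1-era API, where sql() returns a SchemaRDD, itself an RDD[Row]).

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc) // assumes an existing SparkContext `sc`
    val result = hiveContext.sql(
      "SELECT * FROM spark_poc.table_name DISTRIBUTE BY GEO_REGION, GEO_COUNTRY")
    // A constant 50 regardless of key cardinality suggests a fixed shuffle
    // setting is deciding the count, not the number of distinct key values.
    println(result.partitions.length)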

JavaHiveContext class not found error. Help!!

2014-10-23 Thread nitinkak001
I am trying to run the Hive query below on YARN. I am using Cloudera 5.1. What can I do to make this work? SELECT * FROM table_name DISTRIBUTE BY GEO_REGION, GEO_COUNTRY SORT BY IP_ADDRESS, COOKIE_ID; Below is the stack trace: Exception in thread Thread-4

com.esotericsoftware.kryo.KryoException: Buffer overflow.

2014-10-21 Thread nitinkak001
I am running a simple RDD filter command and it fails with this exception. What does it mean? Here is the full stack trace (and code below it): com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 133 at com.esotericsoftware.kryo.io.Output.require(Output.java:138) at
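"Buffer overflow. Available: 0, required: 133" generally means Kryo's serialization buffer is too small for an object being written. A minimal sketch of the usual adjustment, hedged on version: the 1.x knobs were spark.kryoserializer.buffer.mb and spark.kryoserializer.buffer.max.mb; from 1.4 they were renamed spark.kryoserializer.buffer and spark.kryoserializer.buffer.max.

    val conf = new org.apache.spark.SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Raise the Kryo buffer ceiling (pre-1.4 property name shown).
      .set("spark.kryoserializer.buffer.max.mb", "512")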

Re: Window comparison matching using the sliding window functionality: feasibility

2014-10-10 Thread nitinkak001

Re: Window comparison matching using the sliding window functionality: feasibility

2014-09-30 Thread nitinkak001
Any ideas, guys? Trying to find some information online; not much luck so far.

Window comparison matching using the sliding window functionality: feasibility

2014-09-29 Thread nitinkak001
I need to know the feasibility of the task below; I am thinking of it as a MapReduce/Spark effort. I need to run a distributed sliding-window comparison for digital data matching on top of Hadoop. The data (a Hive table) will be partitioned and distributed across the data nodes. Then the window
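A minimal sketch of the per-partition windowing piece, assuming each partition's records already arrive sorted; Scala's Iterator.sliding does the windowing, and the comparison itself is left hypothetical.

    val rdd = sc.parallelize(Seq(1, 3, 4, 7, 8, 9), 2) // hypothetical pre-sorted data
    val windows = rdd.mapPartitions { iter =>
      iter.sliding(3)                // 3 consecutive records per window
          .map(window => window.sum) // hypothetical per-window comparison/score
    }
    windows.collect().foreach(println)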