Re: [jira] [Commented] (SPARK-34648) Reading Parquet Files in Spark Extremely Slow for Large Number of Files?

2021-03-09 Thread Pankaj Bhootra
Hi, Could someone please respond to this? Thanks Pankaj Bhootra On Sun, 7 Mar 2021, 01:22 Pankaj Bhootra, wrote: > Hello Team > > I am new to Spark and this question may be a duplicate of the > issue highlighted here: https://issues.apache.org/jira/browse/SPARK-9347

Fwd: [jira] [Commented] (SPARK-34648) Reading Parquet Files in Spark Extremely Slow for Large Number of Files?

2021-03-06 Thread Pankaj Bhootra
using csv files to parquet, but from my hands-on experience so far, it seems that Parquet's read time is slower than CSV's. This seems contradictory to the popular opinion that Parquet performs better in terms of both computation and storage. Thanks Pankaj Bhootra -- Forwarded message

Structured Streaming to Kafka Topic

2019-03-06 Thread Pankaj Wahane
ic. I have temporarily used a UDF that accepts all these columns as parameters and creates a JSON string for adding a column "value" for writing to Kafka. Is there an easier and cleaner way to do the same? Thanks, Pankaj
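
A cleaner alternative, as a minimal sketch (assuming Spark 2.1+, where to_json is available in org.apache.spark.sql.functions, and assuming df is the streaming DataFrame; the broker, topic, and checkpoint path are placeholders): pack the columns into a struct and let the built-in to_json produce the "value" column, with no custom UDF:

    import org.apache.spark.sql.functions.{col, struct, to_json}

    // Serialize all columns into one JSON string column named "value",
    // which is the shape the Kafka sink expects.
    val toKafka = df.select(to_json(struct(df.columns.map(col): _*)).alias("value"))

    toKafka.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder broker
      .option("topic", "events")                         // hypothetical topic
      .option("checkpointLocation", "/tmp/kafka-ckpt")   // placeholder path
      .start()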

Re: [E] How to do stop streaming before the application got killed

2017-12-22 Thread Rastogi, Pankaj
def run() = { println("In shutdown hook"); /* stop gracefully */ ssCtx.stop(true, true) } }) } } Pankaj On Fri, Dec 22, 2017 at 9:56 AM, Toy <noppani...@gmail.com> wrote: > I'm trying to write a deployment job for a Spark application. Basically
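
Reconstructed as a runnable sketch (assuming ssCtx is the application's StreamingContext, per the snippet above):

    import org.apache.spark.streaming.StreamingContext

    // Register a JVM shutdown hook so a kill signal stops the streaming
    // context gracefully, letting in-flight batches complete first.
    def registerStopHook(ssCtx: StreamingContext): Unit = {
      Runtime.getRuntime.addShutdownHook(new Thread() {
        override def run(): Unit = {
          println("In shutdown hook")
          // stopSparkContext = true, stopGracefully = true
          ssCtx.stop(true, true)
        }
      })
    }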

Re: [E] Re: Spark Job is stuck at SUBMITTED when set Driver Memory > Executor Memory

2017-06-12 Thread Rastogi, Pankaj
Please make sure that you have enough memory available on the driver node. If there is not enough free memory on the driver node, then your application won't start. Pankaj From: vaquar khan <vaquar.k...@gmail.com> Date: Saturday, June 10, 201
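
For reference, a minimal sketch of fixing the driver memory at submit time (all values are placeholders); in cluster mode the driver is placed on a worker node, which must have at least this much free memory:

    # placeholder master URL, class, and jar throughout
    spark-submit \
      --master spark://master:7077 \
      --deploy-mode cluster \
      --driver-memory 4g \
      --executor-memory 2g \
      --class com.example.MyApp \
      myapp.jar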

SPARK-19547

2017-06-07 Thread Rastogi, Pankaj
(EventLoop.scala:48) I see that there is a Spark ticket open for the same issue (https://issues.apache.org/jira/browse/SPARK-19547) but it has been marked as INVALID. Can someone explain why this ticket was marked INVALID? Thanks, Pankaj

Re: how to add column to dataframe

2016-12-06 Thread Pankaj Wahane
You may want to try using df2.na.fill(…) From: lk_spark Date: Tuesday, 6 December 2016 at 3:05 PM To: "user.spark" Subject: how to add column to dataframe hi, all: my spark version is 2.0 I have a parquet file with one column named url, type is
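
A minimal sketch (hypothetical column names, Spark 2.0 as stated in the thread): withColumn adds a new column, and na.fill replaces nulls in existing ones:

    import org.apache.spark.sql.functions.lit

    // Add a new column with a constant default value...
    val df2 = df.withColumn("domain", lit("unknown"))
    // ...and replace nulls in the "url" column with a default string.
    val df3 = df2.na.fill("n/a", Seq("url"))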

Execution error during ALS execution in spark

2016-03-31 Thread Pankaj Rawat
amount of time taken during execution is fine, but the process should not fail. 4. What exactly is meant by an Akka timeout error during ALS job execution? Regards, Pankaj Rawat

Re: Spark Streaming: java.lang.NoClassDefFoundError: org/apache/kafka/common/message/KafkaLZ4BlockOutputStream

2016-03-11 Thread Pankaj Wahane
The next thing you may want to check is whether the jar has been provided to all the executors in your cluster. Most of the class-not-found errors got resolved for me after making the required jars available in the SparkContext. Thanks. From: Ted Yu Date:
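
A minimal sketch (placeholder path): one way to make a dependency jar available to every executor is to register it on the SparkContext; alternatively, pass it to spark-submit via --jars, or bundle everything into one uber jar:

    // Ship the jar to all executors at runtime (placeholder path).
    sc.addJar("/path/to/kafka-clients.jar")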

serializable error in apache spark job

2015-12-17 Thread Pankaj Narang
I am encountering the below error. Can somebody guide me? Something similar is on this link: https://github.com/elastic/elasticsearch-hadoop/issues/298 actor.MentionCrawlActor java.io.NotSerializableException: actor.MentionCrawlActor at
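
The usual fix, as a hedged sketch (MentionCrawlClient and rdd are hypothetical names): keep the non-serializable object out of the closure Spark ships to executors, creating any such helper per partition instead:

    // Create the non-serializable helper on the executor, per partition,
    // instead of capturing it in the closure from the driver.
    rdd.foreachPartition { records =>
      val client = new MentionCrawlClient() // hypothetical helper class
      records.foreach(record => client.process(record))
    }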

Re: Question on take function - Spark Java API

2015-08-26 Thread Pankaj Wahane
Technologies http://www.nubetech.co/ Check out Reifier at Spark Summit 2015 https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/ http://in.linkedin.com/in/sonalgoyal On Wed, Aug 26, 2015 at 8:25 AM, Pankaj Wahane pankaj.wah...@qiotec.com

Question on take function - Spark Java API

2015-08-25 Thread Pankaj Wahane
. Best Regards, Pankaj -- QIO Technologies Limited is a limited company registered in England & Wales at 1 Curzon Street, London, England, W1J 5HD, with registered number 09368431 This message and the information contained within it is intended solely for the addressee and may contain

Spark Streaming Restart at scheduled intervals

2015-08-10 Thread Pankaj Narang
but could not achieve the same. Does anybody have an idea how to do that? Regards Pankaj -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Restart-at-scheduled-intervals-tp24192.html Sent from the Apache Spark User List mailing list archive

Out of memory with twitter spark streaming

2015-08-06 Thread Pankaj Narang
directory to recover from failures println("tweets for the last stream are saved which can be processed later") val checkpointDir = "f:/svn1/checkpoint/" ssc.checkpoint(checkpointDir) ssc.start() ssc.awaitTermination() regards Pankaj -- View this message in context: http://apache-spark-user-list

AvroFiles

2015-05-05 Thread Pankaj Deshpande
Hi I am using Spark 1.3.1 to read an avro file stored on HDFS. The avro file was created using Avro 1.7.7. Similar to the example mentioned in http://www.infoobjects.com/spark-with-avro/ I am getting a NullPointerException on Schema read. It could be an Avro version mismatch. Has anybody had a

Re: AvroFiles

2015-05-05 Thread Pankaj Deshpande
(classOf[Event], new AvroSerializer[Event]())) } I encountered a similar error since several of the Avro core classes are not marked Serializable. HTH. Todd On Tue, May 5, 2015 at 7:09 PM, Pankaj Deshpande ppa...@gmail.com wrote: Hi I am using Spark 1.3.1 to read an avro file stored

Issue with deploying Driver in cluster mode

2015-02-26 Thread pankaj
Hi, I have a 3-node Spark cluster: node1, node2 and node3. I am running the below command on node1 for deploying the driver: /usr/local/spark-1.2.1-bin-hadoop2.4/bin/spark-submit --class com.fst.firststep.aggregator.FirstStepMessageProcessor --master spark://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:7077

Loading JSON dataset with Spark Mllib

2015-02-15 Thread pankaj channe
or loading data into hive tables. Thanks, Pankaj

Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Pankaj
http://spark.apache.org/docs/latest/ Follow this. It's easy to get started. Use a prebuilt version of Spark as of now :D On Thu, Jan 22, 2015 at 5:06 PM, Sudipta Banerjee asudipta.baner...@gmail.com wrote: Hi Apache-Spark team, What are the system requirements for installing Hadoop and Apache

Re: reading a csv dynamically

2015-01-21 Thread Pankaj Narang
…map(line => (line.split(",").length, line)) val groupedData = dataLengthRDD.groupByKey() Now you can process groupedData, as it will have arrays of length x in one RDD. groupByKey([numTasks]): when called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. I hope this helps. Regards Pankaj

Re: Finding most occurrences in a JSON Nested Array

2015-01-21 Thread Pankaj Narang
Send me the current code here. I will fix it and send it back to you. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Finding-most-occurrences-in-a-JSON-Nested-Array-tp20971p21295.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Finding most occurrences in a JSON Nested Array

2015-01-19 Thread Pankaj Narang
I just checked the post. Do you still need help? I think getAs[Seq[String]] should help. If you are still stuck, let me know. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Finding-most-occurrences-in-a-JSON-Nested-Array-tp20971p21252.html Sent from

Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-19 Thread Pankaj Narang
Instead of counted.saveAsTextFile("/path/to/save/dir"), what happens if you call counted.collect? If you still face the same issue, please paste the stack trace here. -- View this message in context:

Re: NoSuchMethodError: com.typesafe.config.Config.getDuration with akka-http/akka-stream

2015-01-06 Thread Pankaj Narang
Good luck. Let me know if I can assist you further. Regards -Pankaj Linkedin https://www.linkedin.com/profile/view?id=171566646 Skype pankaj.narang -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-com-typesafe-config-Config-getDuration

Re: Set EXTRA_JAR environment variable for spark-jobserver

2015-01-06 Thread Pankaj Narang
I suggest creating an uber jar instead. Check my thread for the same: http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-com-typesafe-config-Config-getDuration-with-akka-http-akka-stream-td20926.html Regards -Pankaj Linkedin https://www.linkedin.com/profile/view?id=171566646

Re: Finding most occurrences in a JSON Nested Array

2015-01-06 Thread Pankaj Narang
That's great. I did not have access to the developer machine, so I sent you the pseudo code only. Happy to see it's working. If you need any more help related to Spark, let me know anytime. -- View this message in context:

Re: Spark SQL implementation error

2015-01-06 Thread Pankaj Narang
As per our telephone call, see how we can fetch the count: val tweetsCount = sql("SELECT COUNT(*) FROM tweets") println(f"\n\n\nThere are ${tweetsCount.collect.head.getLong(0)} Tweets on this Dataset\n\n") -- View this message in context:

Re: Finding most occurrences in a JSON Nested Array

2015-01-05 Thread Pankaj Narang
If you need more help let me know -Pankaj Linkedin https://www.linkedin.com/profile/view?id=171566646 Skype pankaj.narang -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Finding-most-occurrences-in-a-JSON-Nested-Array-tp20971p20976.html Sent from

Re: Finding most occurrences in a JSON Nested Array

2015-01-05 Thread Pankaj Narang
}, {hiking,1} Now hbmap.map { case (hobby, count) => (count, hobby) }.sortByKey(ascending = false).collect will give you the hobbies sorted in descending order by their count. This is pseudo code and should help you. Regards Pankaj -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com

Re: Finding most occurrences in a JSON Nested Array

2015-01-05 Thread Pankaj Narang
= popularHashTags.flatMap(x => x.getAs[Seq[String]](0)) If you want, I can even take remote control of your machine to fix that. Regards Pankaj Linkedin https://www.linkedin.com/profile/view?id=171566646 Skype pankaj.narang -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com

Re: saveAsTextFile

2015-01-03 Thread Pankaj Narang
If you can paste the code here, I can certainly help. Also confirm the version of Spark you are using. Regards Pankaj Infoshore Software India -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-tp20951p20953.html Sent from the Apache Spark

Re: NoClassDefFoundError when trying to run spark application

2015-01-02 Thread Pankaj Narang
Do you assemble the uber jar? You can use sbt assembly to build the jar and then run it. That should fix the issue. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NoClassDefFoundError-when-trying-to-run-spark-application-tp20707p20944.html Sent from the Apache
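
A minimal sbt-assembly sketch (plugin and Spark versions are indicative; check compatibility with your sbt release), marking the Spark dependency as provided so the cluster's own copy is used rather than bundling it:

    // project/plugins.sbt
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

    // build.sbt: keep Spark out of the uber jar; the cluster provides it
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"

Running sbt assembly then produces a single jar under target/scala-*/ containing your code and the remaining dependencies.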

(send this email to subscribe)

2015-01-02 Thread Pankaj

Re: NoSuchMethodError: com.typesafe.config.Config.getDuration with akka-http/akka-stream

2015-01-02 Thread Pankaj Narang
(FlowMaterializer.scala:256) I think there is a version mismatch in the jars you use at runtime. If you need more help, add me on Skype: pankaj.narang ---Pankaj -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-com-typesafe-config-Config

Re: Publishing streaming results to web interface

2015-01-02 Thread Pankaj Narang
on RDD are saveAsObjectFile and saveAsTextFile. * Now you can read these files to show them on a web interface in any language of your choice. Regards Pankaj -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Publishing-streaming-results-to-web-interface
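
A minimal sketch (placeholder path; resultStream is a hypothetical DStream): persist each micro-batch under a time-stamped directory that a separate web application can read:

    // Write each micro-batch to its own directory, keyed by batch time,
    // so a web app can poll the storage layer for new results.
    resultStream.foreachRDD { (rdd, time) =>
      rdd.saveAsTextFile(s"/data/results/batch-${time.milliseconds}")
    }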

Re: Reading nested JSON data with Spark SQL

2015-01-01 Thread Pankaj Narang
. Pankaj -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Reading-nested-JSON-data-with-Spark-SQL-tp19310p20933.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: Reading nested JSON data with Spark SQL

2015-01-01 Thread Pankaj Narang
Also, it looks like when I store the Strings in Parquet and try to fetch them using Spark code, I get a ClassCastException. Below is how my array of strings is saved: each character's ASCII value is present in an array of ints. res25: Array[Seq[String]] = Array(ArrayBuffer(Array(104, 116, 116, 112,

Re: Reading nested JSON data with Spark SQL

2015-01-01 Thread Pankaj Narang
Oops: sqlContext.setConf("spark.sql.parquet.binaryAsString", "true") solved the issue. Important for everyone. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Reading-nested-JSON-data-with-Spark-SQL-tp19310p20936.html Sent from the Apache Spark User

Time based aggregation in Real time Spark Streaming

2014-12-01 Thread pankaj
Hi, My incoming message has a timestamp as one field, and I have to perform aggregation over a 3-minute time slice. Message sample:

    Item ID | Item Type | timeStamp
    1       | X         | 1-12-2014:12:01
    1       | X         | 1-12-2014:12:02
    1       | X

Re: Time based aggregation in Real time Spark Streaming

2014-12-01 Thread pankaj
Hi, suppose I keep a batch size of 3 minutes. In one batch there can be incoming records with any timestamp, so it is difficult to keep track of when the 3-minute interval starts and ends. I am doing the output operation on worker nodes in foreachPartition, not in the driver (foreachRDD), so I cannot use
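
One workaround, as a hedged sketch (Item and stream are hypothetical names): derive the 3-minute bucket from each record's own timestamp, so the interval boundaries are deterministic regardless of which batch a record lands in:

    case class Item(id: String, itemType: String, timestampMs: Long)

    val bucketMs = 3 * 60 * 1000L  // 3-minute slice
    val counts = stream
      // Floor the event timestamp to its 3-minute boundary and key by it.
      .map(item => ((item.id, item.timestampMs / bucketMs * bucketMs), 1L))
      // Aggregate within each (item, slice) key across the batch.
      .reduceByKey(_ + _)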

Re: Spark streaming job failing after some time.

2014-11-24 Thread pankaj channe
. Now, I can't figure out why it should run successfully during this time even if it could not find the SparkContext. I am sure there is a good reason behind this behavior. Does anyone have any idea on this? Thanks, Pankaj Channe On Saturday, November 22, 2014, pankaj channe pankajc...@gmail.com

Re: Spark streaming job failing after some time.

2014-11-22 Thread pankaj channe
, Nov 22, 2014 at 8:39 AM, pankaj channe pankajc...@gmail.com wrote: I have seen similar posts on this issue but could not find a solution. Apologies if this has been discussed here before. I am running a Spark streaming job with YARN on a 5-node cluster. I am using the following command to submit my

Spark streaming job failing after some time.

2014-11-21 Thread pankaj channe
:169) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311) at org.apache.spark.storage.DiskBlockManager$$anon$1.run(DiskBlockManager.scala:169) Note: I am building my jar locally with the Spark dependency added in pom.xml and running it on a cluster that runs Spark. -Pankaj