Re: [Spark Internals]: Is sort order preserved after partitioned write?

2022-09-26 Thread Swetha Baskaran
Hi Enrico, Using Spark version 3.1.3 and turning AQE off seems to fix the sorting. Looking into why, do you have thoughts? Thanks, Swetha On Sat, Sep 17, 2022 at 1:58 PM Enrico Minack wrote: > Hi, > > from a quick glance over your transformations, sortCol should be sorted. > >
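
For reference, a minimal sketch of the AQE toggle discussed here, assuming Spark 3.x (the config key is standard; the app name is illustrative):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("sorted-write").getOrCreate()
  // Disable Adaptive Query Execution before the sorted, partitioned write.
  spark.conf.set("spark.sql.adaptive.enabled", "false")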

Re: [Spark Internals]: Is sort order preserved after partitioned write?

2022-09-16 Thread Swetha Baskaran
|1207876|581|4990757154529|4990796737202|
|     10|581|4990806212169|4990751997961|
|1207876|581|4990803020856|4990796737203|
+-------+---+-------------+-------------+

[Spark Internals]: Is sort order preserved after partitioned write?

2022-09-15 Thread Swetha Baskaran
Hi! We expected the order of sorted partitions to be preserved after a dataframe write. We use the following code to write out one file per partition, with the rows sorted by a column. *df.repartition($"col1").sortWithinPartitions("col1", "col2") .write.partitionBy("col1")
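
A runnable sketch of the pattern quoted above, with placeholder paths; the intent is one sorted output file per col1 value:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("sorted-partitioned-write").getOrCreate()
  import spark.implicits._

  val df = spark.read.parquet("/data/in")       // placeholder input
  df.repartition($"col1")                       // gather each col1 value into a single partition
    .sortWithinPartitions("col1", "col2")       // sort rows inside each partition
    .write
    .partitionBy("col1")                        // one output directory per col1 value
    .parquet("/data/out")                       // placeholder output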

Read text file row by row and apply conditions

2019-09-29 Thread swetha kadiyala
type=U then i have to store all rows data into a separate table called Table3. Can anyone help me how to read row by row and split the columns and apply the condition based on indicator type and store columns data into respective tables. Thanks, Swetha

Re: Spark CSV Quote only NOT NULL

2019-07-13 Thread Swetha Ramaiah
Glad to help! On Sat, Jul 13, 2019 at 12:17 PM Gourav Sengupta wrote: > Hi Swetha, > I always look into the source code a lot but it never occured to me to > look into the test suite, thank a ton for the tip. Does definitely give > quite a few ideas - thanks a ton. > >

Re: Spark CSV Quote only NOT NULL

2019-07-11 Thread Swetha Ramaiah
. https://github.com/apache/spark/blob/v2.4.3/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala <https://github.com/apache/spark/blob/v2.4.3/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala> Regards Swetha > On Jul

Re: Spark CSV Quote only NOT NULL

2019-07-11 Thread Swetha Ramaiah
If you are using Spark 2.4.0, I think you can try something like this: .option("quote", "\u") .option("emptyValue", "") .option("nullValue", null) Regards Swetha > On Jul 11, 2019, at 1:45 PM, Anil Kulkarni wrote: > > Hi Spark users, >

Re: Lag and queued up batches info in Structured Streaming UI

2018-06-27 Thread swetha kasireddy
Thanks TD, but the sql plan does not seem to provide any information on which stage is taking longer time or to identify any bottlenecks about various stages. Spark kafka Direct used to provide information about various stages in a micro batch and the time taken by each stage. Is there a way to

Re: Slower performance while running Spark Kafka Direct Streaming with Kafka 10 cluster

2017-08-28 Thread swetha kasireddy
-> "test1" ) val hubbleStream = KafkaUtils.createDirectStream[String, String]( ssc, LocationStrategies.PreferConsistent, ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams) ) val kafkaStreamRdd = kafkaStream.transform { rdd =>

Re: Slower performance while running Spark Kafka Direct Streaming with Kafka 10 cluster

2017-08-28 Thread swetha kasireddy
There is no difference in performance even with Cache being enabled. On Mon, Aug 28, 2017 at 11:27 AM, swetha kasireddy < swethakasire...@gmail.com> wrote: > There is no difference in performance even with Cache being disabled. > > On Mon, Aug 28, 2017 at 7:43 AM, Co

Re: Slower performance while running Spark Kafka Direct Streaming with Kafka 10 cluster

2017-08-28 Thread swetha kasireddy
hose settings seem reasonable enough. If preferred locations is > behaving correctly you shouldn't need cached consumers for all 96 > partitions on any one executor, so that maxCapacity setting is > probably unnecessary. > > On Fri, Aug 25, 2017 at 7:04 PM, swetha kasireddy &g

Re: Slower performance while running Spark Kafka Direct Streaming with Kafka 10 cluster

2017-08-25 Thread swetha kasireddy
er.poll.ms" -> Integer.*valueOf*(1024), http://markmail.org/message/n4cdxwurlhf44q5x https://issues.apache.org/jira/browse/SPARK-19185 Also, I have a batch of 60 seconds. What do you suggest the following to be? session.timeout.ms, heartbeat.interval.ms On Fri, Aug 25, 2017 at

Re: Slower performance while running Spark Kafka Direct Streaming with Kafka 10 cluster

2017-08-25 Thread swetha kasireddy
Because I saw some posts that say that consumer cache enabled will have concurrentModification exception with reduceByKeyAndWIndow. I see those errors as well after running for sometime with cache being enabled. So, I had to disable it. Please see the tickets below. We have 96 partitions. So if

Re: How to force Spark Kafka Direct to start from the latest offset when the lag is huge in kafka 10?

2017-08-21 Thread swetha kasireddy
Hi Cody, I think the Assign is used if we want it to start from a specified offset. What if we want it to start it from the latest offset with something like returned by "auto.offset.reset" -> "latest",. Thanks! On Mon, Aug 21, 2017 at 9:06 AM, Cody Koeninger wrote: >

Fwd: Issues when trying to recover a textFileStream from checkpoint in Spark streaming

2017-08-11 Thread swetha kasireddy
beyond 2 minutes when trying to recover from checkpoint. Any suggestions on this would be of great help. sparkConf.set("spark.streaming.minRememberDuration","120s") sparkConf.set("spark.streaming.fileStream.minRememberDuration","120s") Thanks,

Re: Does mapWithState need checkpointing to be specified in Spark Streaming?

2017-07-13 Thread swetha kasireddy
saying. :) > > On Thu, Jul 13, 2017 at 1:01 PM, SRK <swethakasire...@gmail.com> wrote: > >> Hi, >> >> Do we need to specify checkpointing for mapWithState just like we do for >> updateStateByKey? >> >> Thanks, >> Swetha >> >&
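
A minimal sketch of the point under discussion: stateful operators such as mapWithState need a checkpoint directory, just as updateStateByKey does (paths, batch interval and the update function are illustrative):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

  val ssc = new StreamingContext(new SparkConf().setAppName("map-with-state"), Seconds(10))
  ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")   // required for mapWithState

  def trackCount(key: String, value: Option[Int], state: State[Long]): (String, Long) = {
    val total = state.getOption.getOrElse(0L) + value.getOrElse(0)
    state.update(total)
    (key, total)
  }

  val pairStream = ssc.socketTextStream("localhost", 9999).map(word => (word, 1))
  val counts = pairStream.mapWithState(StateSpec.function(trackCount _))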

Re: How do I find the time taken by each step in a stage in a Spark Job

2017-07-09 Thread swetha kasireddy
Yes, the Spark UI has some information but, it's not that helpful to find out which particular stage is taking time. On Wed, Jun 28, 2017 at 12:51 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote: > You can find the information from the spark UI > > ---Original--- > *From:* "SRK" >

Re: Exception which using ReduceByKeyAndWindow in Spark Streaming.

2017-06-22 Thread swetha kasireddy
at 8:43 AM, swetha kasireddy <swethakasire...@gmail.com> wrote: > I changed the datastructure to scala.collection.immutable.Set and I still > see the same issue. My key is a String. I do the following in my reduce > and invReduce. > > visitorSet1 ++visitorSet2.toTraversa

Re: Is Structured streaming ready for production usage

2017-06-08 Thread swetha kasireddy
OK. Can we use Spark Kafka Direct with Structured Streaming? On Thu, Jun 8, 2017 at 4:46 PM, swetha kasireddy <swethakasire...@gmail.com> wrote: > OK. Can we use Spark Kafka Direct as part of Structured Streaming? > > On Thu, Jun 8, 2017 at 3:35 PM, Tathagata Das <tathagata

Re: Is Structured streaming ready for production usage

2017-06-08 Thread swetha kasireddy
lion records. > > https://databricks.com/blog/2017/06/06/simple-super-fast- > streaming-engine-apache-spark.html > > On Thu, Jun 8, 2017 at 3:03 PM, SRK <swethakasire...@gmail.com> wrote: > >> Hi, >> >> Is structured streaming ready for production usage in Spa

Re: Exception which using ReduceByKeyAndWindow in Spark Streaming.

2017-06-07 Thread swetha kasireddy
<tathagata.das1...@gmail.com> wrote: > Yes, and in general any mutable data structure. You have to immutable data > structures whose hashcode and equals is consistent enough for being put in > a set. > > On Jun 6, 2017 4:50 PM, "swetha kasireddy" <swethakasire..

Re: Exception which using ReduceByKeyAndWindow in Spark Streaming.

2017-06-06 Thread swetha kasireddy
Are you suggesting against the usage of HashSet? On Tue, Jun 6, 2017 at 3:36 PM, Tathagata Das wrote: > This may be because of HashSet is a mutable data structure, and it seems > you are actually mutating it in "set1 ++set2". I suggest creating a new > HashMap in

Re: How to set hive configs in Spark 2.1?

2017-02-27 Thread swetha kasireddy
Even the hive configurations like the following would work with this? sqlContext.setConf("hive.default.fileformat", "Orc") sqlContext.setConf("hive.exec.orc.memory.pool", "1.0") sqlContext.setConf("hive.optimize.sort.dynamic.partition", "true")
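
For comparison, a sketch of the same settings on the Spark 2.x entry point (SparkSession with Hive support); whether each Hive property is honored depends on the Hive/Spark combination in use:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("hive-conf")
    .enableHiveSupport()
    .getOrCreate()

  spark.conf.set("hive.default.fileformat", "Orc")
  spark.conf.set("hive.exec.orc.memory.pool", "1.0")
  spark.conf.set("hive.optimize.sort.dynamic.partition", "true")
  // equivalently: spark.sql("SET hive.optimize.sort.dynamic.partition=true")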

Re: How to spin up Kafka using docker and use for Spark Streaming Integration tests

2016-07-06 Thread swetha kasireddy
gt; Data engineering consultant > www.mapflat.com > +46 70 7687109 > Calendar: https://goo.gl/6FBtlS > > > > On Thu, Jun 30, 2016 at 8:19 PM, SRK <swethakasire...@gmail.com> wrote: > > Hi, > > > > I need to do integration tests using Spark Streaming

Re: How to spin up Kafka using docker and use for Spark Streaming Integration tests

2016-07-06 Thread swetha kasireddy
.@gmail.com> wrote: > > Hi, > > > > I need to do integration tests using Spark Streaming. My idea is to spin > up > > kafka using docker locally and use it to feed the stream to my Streaming > > Job. Any suggestions on how to do this would be of great help. > > &g

Re: Kryo ClassCastException during Serialization/deserialization in Spark Streaming

2016-06-23 Thread swetha kasireddy
sampleMap is populated from inside a method that is getting called from updateStateByKey On Thu, Jun 23, 2016 at 1:13 PM, Ted Yu wrote: > Can you illustrate how sampleMap is populated ? > > Thanks > > On Thu, Jun 23, 2016 at 12:34 PM, SRK wrote:

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-15 Thread swetha kasireddy
Hi Mich, No I have not tried that. My requirement is to insert that from an hourly Spark Batch job. How is it different by trying to insert with Hive CLI or beeline? Thanks, Swetha On Tue, Jun 14, 2016 at 10:44 AM, Mich Talebzadeh <mich.talebza...@gmail.com > wrote: > Hi Swetha, >

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-14 Thread swetha kasireddy
Hi Bijay, This approach might not work for me as I have to do partial inserts/overwrites in a given table and data_frame.write.partitionBy will overwrite the entire table. Thanks, Swetha On Mon, Jun 13, 2016 at 9:25 PM, Bijay Pathak <bijay.pat...@cloudwick.com> wrote: > Hi Swetha

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-13 Thread swetha kasireddy
erRecord, ps.idPartitioner, ps.dtPartitioner CLUSTER BY idPartitioner, dtPartitioner """.stripMargin) On Mon, Jun 13, 2016 at 10:57 AM, swetha kasireddy < swethakasire...@gmail.com> wrote: > Hi Bijay, > > If I am hitting this issue, > https://issues.apache.or

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-13 Thread swetha kasireddy
Hi Bijay, If I am hitting this issue, https://issues.apache.org/jira/browse/HIVE-11940. What needs to be done? Incrementing to higher version of hive is the only solution? Thanks! On Mon, Jun 13, 2016 at 10:47 AM, swetha kasireddy < swethakasire...@gmail.com> wrote: > Hi, >

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-13 Thread swetha kasireddy
e. >> >> >> HTH >> >> Dr Mich Talebzadeh >> >> >> >> LinkedIn * >> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >> >>

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-09 Thread swetha kasireddy
No, I am reading the data from hdfs, transforming it, registering the data in a temp table using registerTempTable and then doing insert overwrite using Spark SQL's hiveContext. On Thu, Jun 9, 2016 at 3:40 PM, Mich Talebzadeh wrote: > how are you doing the insert?
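
A sketch of the workflow described above (HiveContext-era API; sc is an assumed existing SparkContext, and paths, table and column names are placeholders), including the dynamic-partition settings such an insert usually needs:

  import org.apache.spark.sql.hive.HiveContext

  val hiveContext = new HiveContext(sc)
  val df = hiveContext.read.parquet("hdfs:///data/in")   // placeholder source
  df.registerTempTable("recordsTempTable")

  hiveContext.sql("SET hive.exec.dynamic.partition=true")
  hiveContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
  hiveContext.sql(
    """INSERT OVERWRITE TABLE records PARTITION (datePartition, idPartition)
      |SELECT id, record, datePartition, idPartition FROM recordsTempTable""".stripMargin)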

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-09 Thread swetha kasireddy
400 cores are assigned to this job. On Thu, Jun 9, 2016 at 1:16 PM, Stephen Boesch wrote: > How many workers (/cpu cores) are assigned to this job? > > 2016-06-09 13:01 GMT-07:00 SRK : > >> Hi, >> >> How to insert data into 2000

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread swetha kasireddy
ABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > > On 22 May 2016 at 20:38, swetha kasireddy <swethakasire...@gmail.com> > wrote: > >> Around 14000 partitions need to be load

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread swetha kasireddy
nkedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > > On 22 May 2016 at 20:24, swetha

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread swetha kasireddy
The data is not very big. Say 1MB-10 MB at the max per partition. What is the best way to insert this 14k partitions with decent performance? On Sun, May 22, 2016 at 12:18 PM, Mich Talebzadeh wrote: > the acid question is how many rows are you going to insert in a

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread swetha kasireddy
gt; > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > > On 22 May 2016 at 19:43, swetha k

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread swetha kasireddy
Wh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > > On 22 May 2016 at 19:11, swetha kasireddy <swethakasire...@gmail.com> > wrote: > >> I am looking at ORC. I insert the data using the following query. >> >

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread swetha kasireddy
I am looking at ORC. I insert the data using the following query. sqlContext.sql(" CREATE EXTERNAL TABLE IF NOT EXISTS records (id STRING, record STRING) PARTITIONED BY (datePartition STRING, idPartition STRING) stored as ORC LOCATION '/user/users' ") sqlContext.sql(" orc.compress=

Re: Memory issues when trying to insert data in the form of ORC using Spark SQL

2016-05-20 Thread swetha kasireddy
Also, the Spark SQL insert seems to take only two tasks per stage. That might be the reason why it does not have sufficient memory. Is there a way to increase the number of tasks when doing the sql insert? (Spark UI stage-table columns: Stage Id | Description | Submitted | Duration | Tasks: Succeeded/Total | Input | Output | Shuffle Read | Shuffle
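
Two common knobs for getting more than a couple of tasks into the insert stage, sketched with assumed values and reusing the hiveContext/df names from the sketch above:

  // More tasks after any shuffle feeding the insert:
  hiveContext.setConf("spark.sql.shuffle.partitions", "400")

  // Or force more partitions explicitly before registering the temp table:
  val widened = df.repartition(400)
  widened.registerTempTable("recordsTempTable")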

Re: Unit testing framework for Spark Jobs?

2016-05-18 Thread swetha kasireddy
Hi Lars, Do you have any examples for the methods that you described for Spark batch and Streaming? Thanks! On Wed, Mar 30, 2016 at 2:41 AM, Lars Albertsson wrote: > Thanks! > > It is on my backlog to write a couple of blog posts on the topic, and > eventually some example

Re: What is the default value of rebalance.backoff.ms in Spark Kafka Direct?

2016-04-29 Thread swetha kasireddy
irely up to Kafka, which > is why I'm saying if you're losing leaders, you should look at Kafka. > > On Fri, Apr 29, 2016 at 11:21 AM, swetha kasireddy > <swethakasire...@gmail.com> wrote: > > OK. So the Kafka side they use rebalance.backoff.ms of 2000 which is a > > defa

Re: What is the default value of rebalance.backoff.ms in Spark Kafka Direct?

2016-04-29 Thread swetha kasireddy
refresh.leader.backoff.ms and then retry again depending on the number of retries? On Fri, Apr 29, 2016 at 8:14 AM, swetha kasireddy <swethakasire...@gmail.com > wrote: > OK. So the Kafka side they use rebalance.backoff.ms of 2000 which is a > default for rebalancing

Re: What is the default value of rebalance.backoff.ms in Spark Kafka Direct?

2016-04-29 Thread swetha kasireddy
OK. So the Kafka side they use rebalance.backoff.ms of 2000 which is a default for rebalancing and they say that refresh.leader.backoff.ms of 200 to refresh leader is very aggressive and suggested us to increase it to 2000. Even after increasing to 2500 I still get Leader Lost Errors. Is

Re: How to add an accumulator for a Set in Spark

2016-03-19 Thread swetha kasireddy
OK. I did take a look at them. So once I have an accumulater for a HashSet, how can I check if a particular key is already present in the HashSet accumulator? I don't see any .contains method there. My requirement is that I need to keep accumulating the keys in the HashSet across all the tasks in
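
A sketch of a set-valued accumulator (AccumulatorV2, Spark 2.0+; the 1.x AccumulableParam API differs). The merged value is only reliably readable on the driver, so a .contains check belongs there, not inside tasks:

  import org.apache.spark.util.AccumulatorV2
  import scala.collection.mutable

  class SetAccumulator[T] extends AccumulatorV2[T, Set[T]] {
    private val underlying = mutable.Set.empty[T]
    def isZero: Boolean = underlying.isEmpty
    def copy(): SetAccumulator[T] = { val acc = new SetAccumulator[T]; acc.underlying ++= underlying; acc }
    def reset(): Unit = underlying.clear()
    def add(v: T): Unit = underlying += v
    def merge(other: AccumulatorV2[T, Set[T]]): Unit = underlying ++= other.value
    def value: Set[T] = underlying.toSet
  }

  // usage (illustrative; sc is an existing SparkContext):
  val keys = new SetAccumulator[String]
  sc.register(keys, "collectedKeys")
  sc.parallelize(Seq("a", "b", "c")).foreach(k => keys.add(k))
  val alreadySeen = keys.value.contains("a")   // read the merged value on the driver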

Re: How to control the number of parquet files getting created under a partition ?

2016-03-02 Thread swetha kasireddy
Thanks. I tried this yesterday and it seems to be working. On Wed, Mar 2, 2016 at 1:49 AM, James Hammerton wrote: > Hi, > > Based on the behaviour I've seen using parquet, the number of partitions > in the DataFrame will determine the number of files in each parquet > partition.

Re: How to join multiple tables and use subqueries in Spark SQL using sqlContext?

2016-02-23 Thread swetha kasireddy
It seems to be failing when I do something like following in both sqlContext and hiveContext sqlContext.sql("SELECT ssd.savedDate from saveSessionDatesRecs ssd where ssd.partitioner in (SELECT sr1.partitioner from sparkSessionRecords1 sr1))") On Tue, Feb 23, 2016 at 5:57 PM, swetha
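
WHERE ... IN (subquery) only became supported around Spark 2.0, so on the 1.x parsers the usual workaround is a semi-join rewrite; a sketch using the table names from the snippet above (HiveContext syntax):

  sqlContext.sql(
    """SELECT ssd.savedDate
      |FROM saveSessionDatesRecs ssd
      |LEFT SEMI JOIN sparkSessionRecords1 sr1
      |  ON ssd.partitioner = sr1.partitioner""".stripMargin)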

Re: How to join multiple tables and use subqueries in Spark SQL using sqlContext?

2016-02-23 Thread swetha kasireddy
These tables are stored in hdfs as parquet. Can sqlContext be applied for the subQueries? On Tue, Feb 23, 2016 at 5:31 PM, Mich Talebzadeh < mich.talebza...@cloudtechnologypartners.co.uk> wrote: > Assuming these are all in Hive, you can either use spark-sql or > spark-shell. > > HiveContext has

Re: How to use a custom partitioner in a dataframe in Spark

2016-02-17 Thread swetha kasireddy
of a number of small files and also to be able to scan faster. Something like ...df.write.format("parquet").partitionBy( "userIdHash" , "userId").mode(SaveMode.Append).save("userRecords"); On Wed, Feb 17, 2016 at 2:16 PM, swetha kasireddy <swethakasire
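
A sketch of the hash-bucket column being described, assuming Spark 2.0+ for the built-in hash function; df is an assumed DataFrame with a userId column, and the bucket count, column names and path are placeholders:

  import org.apache.spark.sql.SaveMode
  import org.apache.spark.sql.functions.{abs, col, hash}

  val bucketed = df.withColumn("userIdHash", abs(hash(col("userId"))) % 100)
  bucketed.write
    .format("parquet")
    .partitionBy("userIdHash", "userId")
    .mode(SaveMode.Append)
    .save("userRecords")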

Re: How to use a custom partitioner in a dataframe in Spark

2016-02-17 Thread swetha kasireddy
a saveAsTable in a dataframe. >> >> >> Thanks, >> Swetha >> >> >> >> -- >> View this message in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-a-custom-partitioner-in-a-d

Re: How to join an RDD with a hive table?

2016-02-16 Thread swetha kasireddy
How to use a customPartttioner hashed by userId inside saveAsTable using a dataframe? On Mon, Feb 15, 2016 at 11:24 AM, swetha kasireddy < swethakasire...@gmail.com> wrote: > How about saving the dataframe as a table partitioned by userId? My User > records have userId, number of ses

Re: How to join an RDD with a hive table?

2016-02-15 Thread swetha kasireddy
ENDAR_MONTH_DESC, t_c.CHANNEL_DESC > > ) rs > > LIMIT 10 > > > > [1998-01,Direct Sales,1823005210] > > [1998-01,Internet,248172522] > > [1998-01,Partners,474646900] > > [1998-02,Direct Sales,1819659036] > > [1998-02,Internet,298586496] > > [1998

Re: How to join an RDD with a hive table?

2016-02-15 Thread swetha kasireddy
OK. would it only query for the records that I want in hive as per filter or just load the entire table? My user table will have millions of records and I do not want to cause OOM errors by loading the entire table in memory. On Mon, Feb 15, 2016 at 12:51 AM, Mich Talebzadeh

Re: Driver not able to restart the job automatically after the application of Streaming with Kafka Direct went down

2016-02-05 Thread swetha kasireddy
uding > the --supervise option? > > > Thanks, > Swetha > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Driver-not-able-to-restart-the-job-automatically-after-the-application-of-Streaming-with-Kafka-Direcn-tp26155.h

Help needed in deleting a message posted in Spark User List

2016-02-05 Thread swetha kasireddy
Hi, I want to edit/delete a message posted in Spark User List. How do I do that? Thanks!

Re: How to load partial data from HDFS using Spark SQL

2016-01-02 Thread swetha kasireddy
OK. What should the table be? Suppose I have a bunch of parquet files, do I just specify the directory as the table? On Fri, Jan 1, 2016 at 11:32 PM, UMESH CHAUDHARY wrote: > Ok, so whats wrong in using : > > var df=HiveContext.sql("Select * from table where id = ") >
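
A sketch of the approach in question (1.x API; sqlContext is an assumed existing SQLContext, path and predicate are placeholders): point the reader at the parquet directory, register it, and filter, letting parquet push the predicate down rather than loading the whole dataset:

  val df = sqlContext.read.parquet("hdfs:///data/records")   // the directory of parquet files
  df.registerTempTable("records")
  val partial = sqlContext.sql("SELECT * FROM records WHERE id = '123'")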

Re: Spark batch getting hung up

2015-12-20 Thread swetha kasireddy
next stage/exits. Basically it happens when it has >> mapPartition/foreachPartition in a stage. Any idea as to why this is >> happening? >> >> Thanks, >> Swetha >> >> >> >> -- >> View this message in context: >> http://apache-spark-user-lis

Re: Spark metrics for ganglia

2015-12-08 Thread swetha kasireddy
Hi, How to verify whether the GangliaSink directory got created? Thanks, Swetha On Mon, Dec 15, 2014 at 11:29 AM, danilopds <danilob...@gmail.com> wrote: > Thanks tsingfu, > > I used this configuration based in your post: (with ganglia unicast mode) > # Enable GangliaSink

Re: How to build Spark with Ganglia to enable monitoring using Ganglia

2015-12-07 Thread swetha kasireddy
g Ganglia? What is the > command for the same? > > Thanks, > Swetha > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/How-to-build-Spark-with-Ganglia-to-enable-monitoring-using-Ganglia-tp25625.html > Sent from the Apache Sp

Re: Effective ways monitor and identify that a Streaming job has been failing for the last 5 minutes

2015-12-06 Thread swetha kasireddy
Any documentation/sample code on how to use Ganglia with Spark? On Sat, Dec 5, 2015 at 10:29 PM, manasdebashiskar wrote: > spark has capability to report to ganglia, graphite or jmx. > If none of that works for you you can register your own spark extra > listener > that

Re: Recovery for Spark Streaming Kafka Direct in case of issues with Kafka

2015-12-01 Thread swetha kasireddy
Hi Cody, How to look at Option 2(see the following)? Which portion of the code in Spark Kafka Direct to look at to handle this issue specific to our requirements. 2.Catch that exception and somehow force things to "reset" for that partition And how would it handle the offsets already calculated

Re: Recovery for Spark Streaming Kafka Direct in case of issues with Kafka

2015-12-01 Thread swetha kasireddy
at 3:40 PM, Cody Koeninger <c...@koeninger.org> wrote: > KafkaRDD.scala , handleFetchErr > > On Tue, Dec 1, 2015 at 3:39 PM, swetha kasireddy < > swethakasire...@gmail.com> wrote: > >> Hi Cody, >> >> How to look at Option 2(see the following)? Which

Re: Recovery for Spark Streaming Kafka Direct in case of issues with Kafka

2015-12-01 Thread swetha kasireddy
Following is the Option 2 that I was talking about: 2.Catch that exception and somehow force things to "reset" for that partition And how would it handle the offsets already calculated in the backlog (if there is one)? On Tue, Dec 1, 2015 at 1:39 PM, swetha kasireddy <swethakasire

Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-30 Thread swetha kasireddy
s? Can >> someone manually delete folders from the checkpoint folder to help the job >> recover? E.g. Go 2 steps back, hoping that kafka has those offsets. >> >> -adrian >> >> From: swetha kasireddy >> Date: Monday, November 9, 2015 at 10:40 PM >> To: Cody Koeni

Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-30 Thread swetha kasireddy
your situation. > > The KafkaRDD will use the value of refresh.leader.backoff.ms, so you can > try adjusting that to get a longer sleep before retrying the task. > > On Mon, Nov 30, 2015 at 1:50 PM, swetha kasireddy < > swethakasire...@gmail.com> wrote: > >> Hi Cody, &g

Re: Automatic driver restart does not seem to be working in Spark Standalone

2015-11-28 Thread swetha kasireddy
code. > > > > > On Wed, Nov 25, 2015 at 12:57 PM, swetha kasireddy < > swethakasire...@gmail.com> wrote: > >> I am killing my Streaming job using UI. What error code does UI provide >> if the job is killed from there? >> >> On Wed, Nov 25, 2015 at 11:

Re: Automatic driver restart does not seem to be working in Spark Standalone

2015-11-25 Thread swetha kasireddy
; I am submitting my Spark job with supervise option as shown below. When I >> kill the driver and the app from UI, the driver does not restart >> automatically. This is in a cluster mode. Any suggestion on how to make >> Automatic Driver Restart work would be of great help. >

Re: Spark Kafka Direct Error

2015-11-24 Thread swetha kasireddy
:31 AM, Cody Koeninger <c...@koeninger.org> wrote: > No, the direct stream only communicates with Kafka brokers, not Zookeeper > directly. It asks the leader for each topicpartition what the highest > available offsets are, using the Kafka offset api. > > On Mon, Nov 23, 201

Re: Spark Kafka Direct Error

2015-11-24 Thread swetha kasireddy
re the partitions it's failing for all on the same leader? > Have there been any leader rebalances? > Do you have enough log retention? > If you log the offset for each message as it's processed, when do you see > the problem? > > On Tue, Nov 24, 2015 at 10:28 AM, swetha kasired

Re: How to have Kafka Direct Consumers show up in Kafka Consumer reporting?

2015-11-23 Thread swetha kasireddy
intln(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}") } ... } On Mon, Nov 23, 2015 at 6:31 PM, swetha kasireddy <swethakasire...@gmail.com > wrote: > Also, does Kafka direct query the offsets from the zookeeper directly? > From where does it get the offset

Re: Spark Kafka Direct Error

2015-11-23 Thread swetha kasireddy
led, the kafka leader > reported the ending offset was 221572238, but during processing, kafka > stopped returning messages before reaching that ending offset. > > That probably means something got screwed up with Kafka - e.g. you lost a > leader and lost messages in the proces

How to have Kafka Direct Consumers show up in Kafka Consumer reporting?

2015-11-23 Thread swetha
Hi, We see a bunch of issues like the following in Our Spark Kafka Direct. Any idea as to how make Kafka Direct Consumers show up in Kafka Consumer reporting to debug this issue? Job aborted due to stage failure: Task 47 in stage 336.0 failed 4 times, most recent failure: Lost task 47.3 in

Re: How to have Kafka Direct Consumers show up in Kafka Consumer reporting?

2015-11-23 Thread swetha kasireddy
you mean by kafka consumer reporting? > > I'd log the offsets in your spark job and try running > > kafka-simple-consumer-shell.sh --partition $yourbadpartition > --print-offsets > > at the same time your spark job is running > > On Mon, Nov 23, 2015 at 7:37 PM, swet

Re: How to have Kafka Direct Consumers show up in Kafka Consumer reporting?

2015-11-23 Thread swetha kasireddy
Also, does Kafka direct query the offsets from the zookeeper directly? From where does it get the offsets? There is data in those offsets, but somehow Kafka Direct does not seem to pick it up? On Mon, Nov 23, 2015 at 6:18 PM, swetha kasireddy <swethakasire...@gmail.com > wrote: > I mea

Spark Kafka Direct Error

2015-11-23 Thread swetha
): java.lang.AssertionError: assertion failed: Ran out of messages before reaching ending offset 221572238 for topic hubble_stream partition 88 start 221563725. This should not happen, and indicates that messages may have been lost Thanks, Swetha -- View this message in context: http://apache

Re: Streaming Job gives error after changing to version 1.5.2

2015-11-19 Thread swetha kasireddy
That was actually an issue with our Mesos. On Wed, Nov 18, 2015 at 5:29 PM, Tathagata Das <t...@databricks.com> wrote: > If possible, could you give us the root cause and solution for future > readers of this thread. > > On Wed, Nov 18, 2015 at 6:37 AM, swetha kasiredd

Re: How to clear the temp files that gets created by shuffle in Spark Streaming

2015-11-19 Thread swetha kasireddy
TD's comment at the end. > > Cheers > > On Wed, Nov 18, 2015 at 7:28 PM, swetha <swethakasire...@gmail.com> wrote: > >> Hi, >> >> We have a lot of temp files that gets created due to shuffles caused by >> group by. How to clear the files tha

FastUtil DataStructures in Spark

2015-11-19 Thread swetha
Hi, Has anybody used FastUtil equivalent to HashSet for Strings in Spark? Any example would be of great help. Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/FastUtil-DataStructures-in-Spark-tp25429.html Sent from the Apache Spark User List

Re: How to return a pair RDD from an RDD that has foreachPartition applied?

2015-11-18 Thread swetha kasireddy
Looks like I can use mapPartitions but can it be done using forEachPartition? On Tue, Nov 17, 2015 at 11:51 PM, swetha <swethakasire...@gmail.com> wrote: > Hi, > > How to return an RDD of key/value pairs from an RDD that has > foreachPartition applied. I have my code something
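
A sketch of the mapPartitions alternative: foreachPartition returns Unit, so to save per partition and still get a pair RDD back, mapPartitions with a returned iterator is the usual route (saveToExternalStore is a hypothetical stand-in for the real side effect):

  import org.apache.spark.rdd.RDD

  def saveToExternalStore(batch: Seq[String]): Unit = ()    // placeholder for the real save

  def saveAndPair(rdd: RDD[String]): RDD[(String, Int)] =
    rdd.mapPartitions { iter =>
      val batch = iter.toList       // materialize the partition once
      saveToExternalStore(batch)    // side effect, as foreachPartition would do
      batch.map(s => (s, s.length)).iterator
    }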

Re: Streaming Job gives error after changing to version 1.5.2

2015-11-18 Thread swetha kasireddy
It works fine after some changes. -Thanks, Swetha On Tue, Nov 17, 2015 at 10:22 PM, Tathagata Das <t...@databricks.com> wrote: > Can you verify that the cluster is running the correct version of Spark. > 1.5.2. > > On Tue, Nov 17, 2015 at 7:23 PM, swetha kasireddy < > s

How to clear the temp files that gets created by shuffle in Spark Streaming

2015-11-18 Thread swetha
Hi, We have a lot of temp files that gets created due to shuffles caused by group by. How to clear the files that gets created due to intermediate operations in group by? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-clear

Streaming Job gives error after changing to version 1.5.2

2015-11-17 Thread swetha
Hi, I see java.lang.NoClassDefFoundError after changing the Streaming job version to 1.5.2. Any idea as to why this is happening? Following are my dependencies and the error that I get. org.apache.spark spark-core_2.10 ${sparkVersion}

Re: Streaming Job gives error after changing to version 1.5.2

2015-11-17 Thread swetha kasireddy
This error I see locally. On Tue, Nov 17, 2015 at 5:44 PM, Tathagata Das <t...@databricks.com> wrote: > Are you running 1.5.2-compiled jar on a Spark 1.5.2 cluster? > > On Tue, Nov 17, 2015 at 5:34 PM, swetha <swethakasire...@gmail.com> wrote: >

How to return a pair RDD from an RDD that has foreachPartition applied?

2015-11-17 Thread swetha
Hi, How to return an RDD of key/value pairs from an RDD that has foreachPartition applied? I have my code something like the following. It looks like an RDD that has foreachPartition can have only the return type as Unit. How do I apply foreachPartition and do a save and at the same time return a pair

Re: parquet.io.ParquetEncodingException Warning when trying to save parquet file in Spark

2015-11-09 Thread swetha kasireddy
I am using the following: com.twitter parquet-avro 1.6.0 On Mon, Nov 9, 2015 at 1:00 AM, Fengdong Yu <fengdo...@everstring.com> wrote: > Which Spark version used? > > It was fixed in Parquet-1.7x, so Spark-1.5.x will be work. > > > > > > O

Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-09 Thread swetha kasireddy
heckpoint directory is a good way to restart > the streaming job, you should stop the spark context or at the very least > kill the driver process, then restart. > > On Mon, Nov 9, 2015 at 2:03 PM, swetha kasireddy < > swethakasire...@gmail.com> wrote: > >> Hi Cody, &g

Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-09 Thread swetha kasireddy
t;c...@koeninger.org> wrote: > Without knowing more about what's being stored in your checkpoint > directory / what the log output is, it's hard to say. But either way, just > deleting the checkpoint directory probably isn't sufficient to restart the > job... > > On Mon, Nov 9,

Spark IndexedRDD dependency in Maven

2015-11-09 Thread swetha
Hi , What is the appropriate dependency to include for Spark Indexed RDD? I get compilation error if I include 0.3 as the version as shown below: amplab spark-indexedrdd 0.3 Thanks, Swetha -- View this message in context: http://apache

Unwanted SysOuts in Spark Parquet

2015-11-08 Thread swetha
Hi, I see a lot of unwanted SysOuts when I try to save an RDD as parquet file. Following is the code and SysOuts. Any idea as to how to avoid the unwanted SysOuts? ParquetOutputFormat.setWriteSupportClass(job, classOf[AvroWriteSupport]) AvroParquetOutputFormat.setSchema(job,

parquet.io.ParquetEncodingException Warning when trying to save parquet file in Spark

2015-11-08 Thread swetha
Hi, I see unwanted Warning when I try to save a Parquet file in hdfs in Spark. Please find below the code and the Warning message. Any idea as to how to avoid the unwanted Warning message? activeSessionsToBeSaved.saveAsNewAPIHadoopFile("test", classOf[Void], classOf[ActiveSession],

What is the efficient way to Join two RDDs?

2015-11-06 Thread swetha
Hi, What is the efficient way to join two RDDs? Would converting both the RDDs to IndexedRDDs be of any help? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/What-is-the-efficient-way-to-Join-two-RDDs-tp25310.html Sent from the Apache Spark

Re: What is the efficient way to Join two RDDs?

2015-11-06 Thread swetha kasireddy
> On Fri, Nov 6, 2015 at 3:21 PM, swetha <swethakasire...@gmail.com> wrote: > >> Hi, >> >> What is the efficient way to join two RDDs? Would converting both the RDDs >> to IndexedRDDs be of any help? >> >> Thanks, >> Swetha >> >> >

Re: creating a distributed index

2015-11-06 Thread swetha kasireddy
an IndexedRDD for a certain set of data and then get those keys that are present in the IndexedRDD but not present in some other RDD. How would an IndexedRDD support such an usecase in an efficient manner? Thanks, Swetha On Wed, Jul 15, 2015 at 2:46 AM, Jem Tucker <jem.tuc...@gmail.com>

Re: How to unpersist a DStream in Spark Streaming

2015-11-05 Thread swetha kasireddy
Its just in the same thread for a particular RDD, I need to uncache it every 2 minutes to clear out the data that is present in a Map inside that. On Wed, Nov 4, 2015 at 11:54 PM, Saisai Shao <sai.sai.s...@gmail.com> wrote: > Hi Swetha, > > Would you mind elaborating your

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread swetha kasireddy
test/sql-programming-guide.html#parquet-files> > . > > On Thu, Nov 5, 2015 at 12:09 AM, swetha <swethakasire...@gmail.com> wrote: > >> Hi, >> >> What is the efficient approach to save an RDD as a file in HDFS and >> retrieve >> it back? I was thinkin
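
A sketch of the parquet round trip being pointed to (1.x API; sc and sqlContext are assumed existing contexts, the path is a placeholder):

  import sqlContext.implicits._

  val sessionsRdd = sc.parallelize(Seq(("s1", 3L), ("s2", 7L)))   // stand-in data
  val df = sessionsRdd.toDF("id", "count")
  df.write.parquet("hdfs:///data/sessions.parquet")

  val restored = sqlContext.read.parquet("hdfs:///data/sessions.parquet").rdd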

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread swetha kasireddy
't be a problem imho. > in general hdfs is pretty fast, s3 is less so > the issue with storing data is that you will loose your partitioner(even > though rdd has it) at loading moment. There is PR that tries to solve this. > > > On 5 November 2015 at 01:09, swetha <sweth

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread swetha kasireddy
t[T]]) sc.newAPIHadoopFile( parquetFile, classOf[ParquetInputFormat[T]], classOf[Void], tag.runtimeClass.asInstanceOf[Class[T]], jobConf) .map(_._2.asInstanceOf[T]) } On Thu, Nov 5, 2015 at 2:14 PM, swetha kasireddy <swethakasire...@gmail.com> wrote: > No scala. Sup

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread swetha kasireddy
ng from java - toJavaRDD > <https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrame.html#toJavaRDD()>* > () > > On 5 November 2015 at 21:13, swetha kasireddy <swethakasire...@gmail.com> > wrote: > >> How to convert a parquet file that is saved

Re: How to lookup by a key in an RDD

2015-11-05 Thread swetha kasireddy
I read about the IndexedRDD. Is the IndexedRDD join with another RDD that is not an IndexedRDD efficient? On Mon, Nov 2, 2015 at 9:56 PM, Deenar Toraskar <deenar.toras...@gmail.com> wrote: > Swetha > > Currently IndexedRDD is an external library and not part of Spark Core

How to unpersist a DStream in Spark Streaming

2015-11-04 Thread swetha
Hi, How to unpersist a DStream in Spark Streaming? I know that we can persist using dStream.persist() or dStream.cache. But, I don't see any method to unPersist. Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-unpersist-a-DStream
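
A sketch of what is and isn't available: DStream exposes persist()/cache() but no public unpersist(); cached batch RDDs are cleaned up automatically when spark.streaming.unpersist is true (the default), or can be dropped explicitly per batch (stream is an assumed DStream):

  stream.persist()
  stream.foreachRDD { rdd =>
    // ... use rdd ...
    rdd.unpersist(blocking = false)   // explicitly release this batch's cached blocks
  }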
