Regarding structured streaming windows on older data

2019-12-23 Thread Hemant Bhanawat
For demonstration purposes, I was using data that had older timestamps with structured streaming. The data was for the year 2018, the window was 24 hours, and the watermark 0 seconds. A few things that I saw and could not explain: 1. The initial batch of streaming had around 60 windows. It processed
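
A minimal sketch of the setup described above, assuming a SparkSession named spark; the "rate" source is a stand-in for the real 2018-timestamped data and "eventTime" is an assumed column name:

    import org.apache.spark.sql.functions._

    // Stand-in streaming source; the real input carried 2018 timestamps.
    val events = spark.readStream.format("rate").load()
      .withColumnRenamed("timestamp", "eventTime")

    val counts = events
      .withWatermark("eventTime", "0 seconds")          // watermark of 0 seconds
      .groupBy(window(col("eventTime"), "24 hours"))    // 24-hour tumbling windows
      .count()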

Re: mllib + SQL

2018-09-01 Thread Hemant Bhanawat
its > distributed execution environment. SQL-only analysts would struggle to be > effective with SQL-only access to Spark. > > On Fri, Aug 31, 2018 at 5:05 AM Hemant Bhanawat > wrote: > >> We allow our users to interact with spark cluster using SQL queries only. >> Tha

Re: mllib + SQL

2018-08-31 Thread Hemant Bhanawat
BTW, I can contribute if there is already an effort going on somewhere. On Fri, Aug 31, 2018 at 3:35 PM Hemant Bhanawat wrote: > We allow our users to interact with spark cluster using SQL queries only. > That's easy for them. MLLib does not have SQL extensions and we cannot > ex

Re: mllib + SQL

2018-08-31 Thread Hemant Bhanawat
ng, this is certainly the best place to start. > > See here: https://spark.apache.org/docs/latest/ml-guide.html > > > best, > wb > > > > On Thu, Aug 30, 2018 at 1:45 AM Hemant Bhanawat > wrote: > >> Is there a plan to support SQL extensions for mllib?

mllib + SQL

2018-08-29 Thread Hemant Bhanawat
Is there a plan to support SQL extensions for mllib? Or is there an effort already underway? Any information is appreciated. Thanks in advance. Hemant

Re: Sorting on a streaming dataframe

2018-05-01 Thread Hemant Bhanawat
! > > On Fri, Apr 27, 2018 at 3:59 AM Hemant Bhanawat > wrote: > >> I see. >> >> monotonically_increasing_id on streaming dataFrames will be really >> helpful to me and I believe to many more users. Adding this functionality >> in Spark would b

[jira] [Updated] (SPARK-24144) monotonically_increasing_id on streaming dataFrames

2018-05-01 Thread Hemant Bhanawat (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-24144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hemant Bhanawat updated SPARK-24144: Priority: Major (was: Minor) > monotonically_increasing_id on streaming dataFra

[jira] [Created] (SPARK-24144) monotonically_increasing_id on streaming dataFrames

2018-05-01 Thread Hemant Bhanawat (JIRA)
Hemant Bhanawat created SPARK-24144: --- Summary: monotonically_increasing_id on streaming dataFrames Key: SPARK-24144 URL: https://issues.apache.org/jira/browse/SPARK-24144 Project: Spark

Re: Sorting on a streaming dataframe

2018-04-27 Thread Hemant Bhanawat
spec. However, from my experience with Spark, there are many good reasons >> why this requirement is not supported ;) >> >> Best, >> >> Chayapan (A) >> >> >> On Apr 24, 2018, at 2:18 PM, Hemant Bhanawat >> wrote: >> >> Thanks Chris. The

Re: Sorting on a streaming dataframe

2018-04-24 Thread Hemant Bhanawat
> all. For example, if you are using kafka, a proper partitioning scheme and > message offsets may be “good enough”. > ------ > From: Hemant Bhanawat > Sent: Thursday, April 12, 2018 11:42:59 PM > To: Reynold Xin > Cc: dev > Subject: Re: Sorti

Re: Sorting on a streaming dataframe

2018-04-12 Thread Hemant Bhanawat
the dataframe so that the records always get the same snapshot id. On Fri, Apr 13, 2018 at 11:43 AM, Reynold Xin wrote: > Can you describe your use case more? > > On Thu, Apr 12, 2018 at 11:12 PM Hemant Bhanawat > wrote: > >> Hi Guys, >> >> Why is sorting on s

Sorting on a streaming dataframe

2018-04-12 Thread Hemant Bhanawat
Hi Guys, Why is sorting on streaming dataframes not supported (unless it is in complete mode)? My downstream needs me to sort the streaming dataframe. Hemant

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-10 Thread Hemant Bhanawat
BTW, aggregate push-down support is desirable and should be considered as an enhancement going forward. Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811> www.snappydata.io On Sun, Sep 10, 2017 at 8:45 PM, vaquar khan wrote: > +1 > > Regards, > Vaquar khan >

Re: How to specify file

2016-09-23 Thread Hemant Bhanawat
Check out the README on the following page. This is the csv connector that you are using. I think you need to specify the delimiter option. https://github.com/databricks/spark-csv Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811> www.snappydata.io On Fri, Sep 23, 2016
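
For reference, a minimal sketch of setting the delimiter option with the databricks spark-csv connector (Spark 1.x API; the path and delimiter are hypothetical):

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("delimiter", "|")    // assumed pipe-delimited input
      .load("/path/to/input.csv")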

Re: CSV Reader with row numbers

2016-09-22 Thread Hemant Bhanawat
API documentation. https://spark.apache.org/docs/1.6.1/api/scala/index.html#org.apache.spark.rdd.RDD Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811> www.snappydata.io On Thu, Sep 15, 2016 at 4:28 AM, Akshay Sachdeva wrote: > Environment: > Apache Spark 1.6.2 &

Memory usage by Spark jobs

2016-09-21 Thread Hemant Bhanawat
processing a specific data size of, let's say, parquet data? Also, has someone investigated memory usage for the individual SQL operators like Filter, group by, order by, Exchange etc.? Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811> www.snappydata.io

[jira] [Reopened] (SPARK-13693) Flaky test: o.a.s.streaming.MapWithStateSuite

2016-04-24 Thread Hemant Bhanawat (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hemant Bhanawat reopened SPARK-13693: - > Flaky test: o.a.s.streaming.MapWithStateSu

[jira] [Commented] (SPARK-13693) Flaky test: o.a.s.streaming.MapWithStateSuite

2016-04-24 Thread Hemant Bhanawat (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15255881#comment-15255881 ] Hemant Bhanawat commented on SPARK-13693: - Latest Jenkins builds are fai

[jira] [Created] (SPARK-14729) Implement an existing cluster manager with New ExternalClusterManager interface

2016-04-19 Thread Hemant Bhanawat (JIRA)
Hemant Bhanawat created SPARK-14729: --- Summary: Implement an existing cluster manager with New ExternalClusterManager interface Key: SPARK-14729 URL: https://issues.apache.org/jira/browse/SPARK-14729

[jira] [Commented] (SPARK-14729) Implement an existing cluster manager with New ExternalClusterManager interface

2016-04-19 Thread Hemant Bhanawat (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247543#comment-15247543 ] Hemant Bhanawat commented on SPARK-14729: - I am looking into this. > Im

[jira] [Commented] (SPARK-13904) Add support for pluggable cluster manager

2016-04-18 Thread Hemant Bhanawat (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15247096#comment-15247096 ] Hemant Bhanawat commented on SPARK-13904: - [~kiszk] Since the builds are pas

[jira] [Commented] (SPARK-13904) Add support for pluggable cluster manager

2016-04-18 Thread Hemant Bhanawat (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245581#comment-15245581 ] Hemant Bhanawat commented on SPARK-13904: - I ran the following command o

[jira] [Commented] (SPARK-13904) Add support for pluggable cluster manager

2016-04-17 Thread Hemant Bhanawat (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245118#comment-15245118 ] Hemant Bhanawat commented on SPARK-13904: - [~kiszk] I am looking into

Re: How to process one partition at a time?

2016-04-06 Thread Hemant Bhanawat
Apparently, there is another way to do it. You can try creating a PartitionPruningRDD and pass a partition filter function to it. This RDD will do the same thing that I suggested in my mail and you will not have to create a new RDD. Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhana
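
A sketch of the suggested approach; note that PartitionPruningRDD is a developer API, and the RDD and partition index here are stand-ins:

    import org.apache.spark.rdd.PartitionPruningRDD

    val rdd = sc.parallelize(1 to 100, 4)   // stand-in for the real RDD
    // Compute only partition 0; the other partitions are never evaluated.
    val onePartition = PartitionPruningRDD.create(rdd, partitionIndex => partitionIndex == 0)
    onePartition.count()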

Re: Executor shutdown hooks?

2016-04-06 Thread Hemant Bhanawat
exit thread will wait for a certain period of time before the executor jvm exits, to allow proper cleanup of the tasks. Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811> www.snappydata.io On Thu, Apr 7, 2016 at 6:08 AM, Reynold Xin wrote: > > On Wed, Apr 6, 201

Re: How to process one partition at a time?

2016-04-06 Thread Hemant Bhanawat
hear if you take some other approach. Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811> www.snappydata.io On Wed, Apr 6, 2016 at 3:49 PM, Andrei wrote: > I'm writing a kind of sampler which in most cases will require only 1 > partition, sometimes 2 and very r

Re: how about a custom coalesce() policy?

2016-04-02 Thread Hemant Bhanawat
correcting email id for Nezih Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811> www.snappydata.io On Sun, Apr 3, 2016 at 11:09 AM, Hemant Bhanawat wrote: > Hi Nezih, > > Can you share JIRA and PR numbers? > > This partial de-coupling of data partitioni

Re: how about a custom coalesce() policy?

2016-04-02 Thread Hemant Bhanawat
Hi Nezih, Can you share JIRA and PR numbers? This partial de-coupling of data partitioning strategy and spark parallelism would be a useful feature for any data store. Hemant Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811> www.snappydata.io On Fri, Apr 1, 2016 at

Re: SPARK-13900 - Join with simple OR conditions take too long

2016-04-01 Thread Hemant Bhanawat
e will be very little. I had mentioned that NL time will vary little as the number of conditions grows. What I meant was that with 15 conditions instead of 3, the NL loop would still take 13-15 mins while the hash join would take more than that. Hemant Hemant Bhanawat

[jira] [Commented] (SPARK-13900) Spark SQL queries with OR condition is not optimized properly

2016-03-31 Thread Hemant Bhanawat (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15221145#comment-15221145 ] Hemant Bhanawat commented on SPARK-13900: - As I understand, on table A and

Re: SPARK-13900 - Join with simple OR conditions take too long

2016-03-31 Thread Hemant Bhanawat
would increase linearly with the number of conditions. So, when there are too many conditions, a nested loop join would be faster than the solution that you suggest. Now the question is, how should Spark decide when to do what? Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat

[jira] [Created] (SPARK-13904) Add support for pluggable cluster manager

2016-03-15 Thread Hemant Bhanawat (JIRA)
Hemant Bhanawat created SPARK-13904: --- Summary: Add support for pluggable cluster manager Key: SPARK-13904 URL: https://issues.apache.org/jira/browse/SPARK-13904 Project: Spark Issue Type

Re: Can we use spark inside a web service?

2016-03-11 Thread Hemant Bhanawat
to make Spark more suitable for such scenarios but it never made it to the Spark codebase. If Spark has to become a highly concurrent solution, scheduling has to be distributed. Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811> www.snappydata.io On Fri, Mar 11, 2016 at 7

Re: S3 Zip File Loading Advice

2016-03-08 Thread Hemant Bhanawat
) and running a new job over the new folders created in an interval. This will have to be automated using an external script. Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811> www.snappydata.io On Wed, Mar 9, 2016 at 10:33 AM, Benjamin Kim wrote: > I am wondering

Re: Spark Streaming - graceful shutdown when stream has no more data

2016-02-23 Thread Hemant Bhanawat
A guess - parseRecord is returning None in some cases (probably empty lines), and then entry.get is throwing the exception. You may want to filter the None values from accessLogDStream before you run the map function over it. Hemant Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhana
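
A sketch of the suggested filter; parseRecord and accessLogDStream come from the thread and are assumed to look roughly like this:

    // parseRecord: String => Option[Record], returning None for unparseable lines.
    val parsed = accessLogDStream.map(parseRecord)
    val valid  = parsed.filter(_.isDefined).map(_.get)   // drop the Nones first

    // equivalently, in one step:
    val valid2 = accessLogDStream.flatMap(line => parseRecord(line))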

Re: Specify number of executors in standalone cluster mode

2016-02-21 Thread Hemant Bhanawat
further processing. Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811> www.snappydata.io On Sun, Feb 21, 2016 at 11:01 PM, Saiph Kappa wrote: > Hi, > > I'm running a spark streaming application onto a spark cluster that spans > 6 machines/workers.

Re: Behind the scene of RDD to DataFrame

2016-02-20 Thread Hemant Bhanawat
toDF internally calls sqlcontext.createDataFrame which transforms the RDD to RDD[InternalRow]. This RDD[InternalRow] is then mapped to a dataframe. Type conversions (from scala types to catalyst types) are involved but no shuffling. Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhana
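
A small sketch of the point being made — toDF is a type conversion, not a shuffle (the case class and data are illustrative, assuming a SQLContext in scope):

    import sqlContext.implicits._

    case class Person(name: String, age: Int)
    val rdd = sc.parallelize(Seq(Person("a", 30), Person("b", 25)))
    val df  = rdd.toDF()          // scala types -> catalyst types; no shuffle
    df.rdd.partitions.length      // same partitioning as the source rdd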

Re: spark stages in parallel

2016-02-20 Thread Hemant Bhanawat
Not possible as of today. See https://issues.apache.org/jira/browse/SPARK-2387 Hemant Bhanawat https://www.linkedin.com/in/hemant-bhanawat-92a3811 www.snappydata.io On Thu, Feb 18, 2016 at 1:19 PM, Shushant Arora wrote: > can two stages of single job run in parallel in spark? > > e.g

Re: Optimal way to re-partition from a single partition

2016-02-09 Thread Hemant Bhanawat
result partitions is lower than `spark.sql.shuffle.partitions`. > > On Tue, Feb 9, 2016 at 9:35 PM, Hemant Bhanawat > wrote: >> For sql shuffle operations like groupby, the number of output partitions >> is controlled by spark.sql.shuffle.partitions. But, it seem

Re: Optimal way to re-partition from a single partition

2016-02-09 Thread Hemant Bhanawat
For sql shuffle operations like groupby, the number of output partitions is controlled by spark.sql.shuffle.partitions. But, it seems orderBy does not honour this. In my small test, I could see that the number of partitions in DF returned by orderBy was equal to the total number of distinct keys.
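
A way to observe this yourself — a sketch with an illustrative DataFrame and partition count, assuming a SQLContext in scope:

    import sqlContext.implicits._

    val df = sc.parallelize(1 to 1000).map(i => (i % 10, i)).toDF("key", "value")
    sqlContext.setConf("spark.sql.shuffle.partitions", "8")

    df.groupBy("key").count().rdd.partitions.length   // 8, as configured
    df.orderBy("key").rdd.partitions.length           // check what orderBy produces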

Re: Spark Streaming with Druid?

2016-02-08 Thread Hemant Bhanawat
Hemant On Mon, Feb 8, 2016 at 10:04 PM, Umesh Kacha wrote: > Hi Hemant, thanks much can we use SnappyData on YARN. My Spark jobs run > using yarn client mode. Please guide. > > On Mon, Feb 8, 2016 at 9:46 AM, Hemant Bhanawat > wrote: >> You may want to have a look at s

Re: Spark Streaming with Druid?

2016-02-07 Thread Hemant Bhanawat
You may want to have a look at the spark druid project already in progress: https://github.com/SparklineData/spark-druid-olap You can also have a look at SnappyData, which is a low latency store tightly integrated with Spark, Spark SQL and Spark Streaming.

Re: DataFrame First method is resulting different results in each iteration

2016-02-03 Thread Hemant Bhanawat
highest Sal. For getting the EmpId with the highest Sal, you will have to change your query to add filters or add subqueries. See the following thread: http://stackoverflow.com/questions/6841605/get-top-1-row-of-each-group Hemant Bhanawat SnappyData (http://snappydata.io/) On Wed, Feb 3, 2016 at 4:

Re: DataFrame First method is resulting different results in each iteration

2016-02-03 Thread Hemant Bhanawat
Missing order by? Hemant Bhanawat SnappyData (http://snappydata.io/) On Wed, Feb 3, 2016 at 3:45 PM, satish chandra j wrote: > HI All, > I have data in a emp_df (DataFrame) as mentioned below: > > EmpId Sal DeptNo > 001 100 10 > 002 120 20 > 003
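
That is, a sketch of making first() deterministic (emp_df and its columns are taken from the thread; assuming a SQLContext in scope):

    import sqlContext.implicits._

    // Without an explicit ordering, first() returns an arbitrary row.
    val top = emp_df.orderBy($"Sal".desc).first()   // deterministic: the highest-Sal row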

Re: GenericMutableRow and Row mismatch on Spark 1.5?

2015-10-07 Thread Hemant Bhanawat
Please find attached. On Wed, Oct 7, 2015 at 7:36 PM, Ted Yu wrote: > Hemant: > Can you post the code snippet to the mailing list - other people would be > interested. > > On Wed, Oct 7, 2015 at 5:50 AM, Hemant Bhanawat > wrote: > >> Will send you the code on your

Re: GenericMutableRow and Row mismatch on Spark 1.5?

2015-10-07 Thread Hemant Bhanawat
Will send you the code on your email id. On Wed, Oct 7, 2015 at 4:37 PM, Ophir Cohen wrote: > Thanks! > Can you check if you can provide example of the conversion? > > > On Wed, Oct 7, 2015 at 2:05 PM, Hemant Bhanawat > wrote: > >> Oh, this is an internal class of ou

Re: GenericMutableRow and Row mismatch on Spark 1.5?

2015-10-07 Thread Hemant Bhanawat
> The problem is that I get type mismatches... > > On Wed, Oct 7, 2015 at 8:03 AM, Hemant Bhanawat > wrote: > >> An approach can be to wrap your MutableRow in WrappedInternalRow which is >> a child class of Row. >> >> Hemant >> www.snappydata.io >

Re: GenericMutableRow and Row mismatch on Spark 1.5?

2015-10-06 Thread Hemant Bhanawat
An approach can be to wrap your MutableRow in WrappedInternalRow which is a child class of Row. Hemant www.snappydata.io linkedin.com/company/snappydata On Tue, Oct 6, 2015 at 3:21 PM, Ophir Cohen wrote: > Hi Guys, > I'm upgrading to Spark 1.5. > > In our previous version (Spark 1.3 but it was

Re: [cache eviction] partition recomputation in big lineage RDDs

2015-10-01 Thread Hemant Bhanawat
As I understand, you don't need a merge of your historical data RDD with your RDD_inc; what you need is a merge of the computation results of your historical RDD with RDD_inc, and so on. IMO, you should consider having an external row store to hold your computations. I say this because you need to

Re: flatmap() and spark performance

2015-09-28 Thread Hemant Bhanawat
You can use spark.executor.memory to specify the memory of the executors, which will hold these intermediate results. You may want to look at the section "Understanding Memory Management in Spark" of this link: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applica

Re: caching DataFrames

2015-09-23 Thread Hemant Bhanawat
this will run on the cached data of A. dfA1.count() On Thu, Sep 24, 2015 at 10:20 AM, Hemant Bhanawat wrote: > Two dataframes do not share cache storage in Spark. Hence it's immaterial > that how two dataFrames are related to each other. Both of them are going > to consume me
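
A rough sketch of the caching behavior described in this thread (dfA/dfA1 are the thread's names; the path is hypothetical, assuming a SQLContext in scope):

    import sqlContext.implicits._

    val dfA  = sqlContext.read.parquet("/data/A").cache()
    dfA.count()                        // materializes A's cache
    val dfA1 = dfA.filter($"x" > 0)    // derived from A, but not itself cached
    dfA1.count()                       // scans A's cached data; adds no new cache entry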

Re: caching DataFrames

2015-09-23 Thread Hemant Bhanawat
Two dataframes do not share cache storage in Spark. Hence it's immaterial how the two dataFrames are related to each other. Both of them are going to consume memory based on the data that they have. So for your A1 and B1 you would need extra memory that would be equivalent to half the memory of A

Re: Why are executors on slave never used?

2015-09-21 Thread Hemant Bhanawat
When you specify master as local[2], it starts the spark components in a single jvm. You need to specify the master correctly. I have a default AWS EMR cluster (1 master, 1 slave) with Spark. When I run a Spark process, it works fine -- but only on the master, as if it were standalone. The web-UI

Re: DataGenerator for streaming application

2015-09-21 Thread Hemant Bhanawat
Why are you using rawSocketStream to read the data? I believe rawSocketStream waits for a big chunk of data before it can start processing it. I think what you are writing is a String and you should use socketTextStream, which reads the data on a per-line basis. On Sun, Sep 20, 2015 at 9:56 AM, Sa
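
A sketch of the suggested switch; the host, port, and batch interval are hypothetical:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(2))
    // Line-oriented text input; each line becomes one record.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.print()
    ssc.start()
    ssc.awaitTermination()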

Re: A way to timeout and terminate a laggard 'Stage' ?

2015-09-17 Thread Hemant Bhanawat
Having the driver time out laggards seems like a reasonable way of handling them. Are there any challenges because of which the driver does not do it today? Is there a JIRA for this? I couldn't find one. On Tue, Sep 15, 2015 at 12:07 PM, Akhil Das wrote: > As of now i think its a no. Not sure if its

Re: Difference between sparkDriver and "executor ID driver"

2015-09-15 Thread Hemant Bhanawat
1. When you call new SparkContext(), the spark driver is started, which internally creates an Akka ActorSystem that registers on this port. 2. Since you are running in local mode, the starting of the executor is short-circuited and an Executor object is created in the same process (see LocalEndpoint). This Execut

Re: How to Serialize and Reconstruct JavaRDD later?

2015-09-02 Thread Hemant Bhanawat
You want to persist the state between the execution of two rdds. So, I believe what you need is serialization of your model and not JavaRDD. If you can serialize your model, you can persist that in HDFS or some other datastore to be used by the next RDDs. If you are using Spark Streaming, doing th

Re: taking an n number of rows from and RDD starting from an index

2015-09-01 Thread Hemant Bhanawat
I think rdd.toLocalIterator is what you want. But it will keep one partition's data in memory. On Wed, Sep 2, 2015 at 10:05 AM, Niranda Perera wrote: > Hi all, > > I have a large set of data which would not fit into the memory. So, I wan > to take n number of data from the RDD given a particular
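
A sketch of taking n rows starting from an index with toLocalIterator; the RDD, index, and count are stand-ins:

    val rdd = sc.parallelize(1 to 100000)   // stand-in for the large RDD
    val startIndex = 1000
    val n = 100
    // toLocalIterator pulls one partition at a time to the driver,
    // instead of collect()ing the whole RDD.
    val slice = rdd.toLocalIterator.drop(startIndex).take(n).toArray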

Re: Performance issue with Spark join

2015-08-26 Thread Hemant Bhanawat
Spark joins are different from traditional database joins because of the lack of index support. Spark has to shuffle data between various nodes to perform joins. Hence joins are bound to be much slower than count, which is just a parallel scan of the data. Still, to ensure that nothing is wro

Re: How to set environment of worker applications

2015-08-25 Thread Hemant Bhanawat
> can use spark.executor.extraJavaOptions to pass system properties and > spark-env.sh to pass environment variables. > > -raghav > > On Mon, Aug 24, 2015 at 1:00 PM, Hemant Bhanawat > wrote: > >> That's surprising. Passing the environment variables using >> spark.executor.extraJ

Re: Exception throws when running spark pi in Intellij Idea that scala.collection.Seq is not found

2015-08-25 Thread Hemant Bhanawat
Go to the module settings of the project and, in the dependencies section, check the scope of the Scala jars. It will be either Test or Provided. Change it to Compile and it should work. Check the following link to understand more about the scope of modules: https://maven.apache.org/guides/introduction/int

Re: Joining using mulitimap or array

2015-08-24 Thread Hemant Bhanawat
In your example, a.attributes.name is a list, not a string. Run this to find out: a.select($"a.attributes.name").show() On Mon, Aug 24, 2015 at 2:51 PM, Ilya Karpov wrote: > Hi, guys > I'm confused about joining columns in SparkSQL and need your advice. > I want to join 2 datasets o
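
If the intent is to join on the array's elements, one way (not confirmed by the thread) is to explode the array first; a sketch using the thread's schema names, assuming a SQLContext in scope:

    import sqlContext.implicits._
    import org.apache.spark.sql.functions._

    // Each element of attributes.name becomes its own row, joinable as a string.
    val flattened = a.select(explode($"attributes.name").as("name"))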

Re: How to set environment of worker applications

2015-08-24 Thread Hemant Bhanawat
to pass on environment variables to worker node is >> to write it in spark-env.sh file on each worker node. >> >> On Sun, Aug 23, 2015 at 8:16 PM, Hemant Bhanawat >> wrote: >> >>> Check for spark.driver.extraJavaOptions and >>> spark

Re: How to set environment of worker applications

2015-08-23 Thread Hemant Bhanawat
Check for spark.driver.extraJavaOptions and spark.executor.extraJavaOptions in the following article. I think you can use -D to pass system vars: spark.apache.org/docs/latest/configuration.html#runtime-environment Hi, I am starting a spark streaming job in standalone mode with spark-submit. Is t
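
For example, a sketch of passing a system property this way (the property and class names are hypothetical); the job reads it back with System.getProperty("my.prop"):

    spark-submit \
      --conf "spark.driver.extraJavaOptions=-Dmy.prop=value" \
      --conf "spark.executor.extraJavaOptions=-Dmy.prop=value" \
      --class com.example.StreamingApp app.jar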

Re: PySpark concurrent jobs using single SparkContext

2015-08-21 Thread Hemant Bhanawat
It seems like you want simultaneous processing of multiple jobs but at the same time serialization of a few tasks within those jobs. I don't know how to achieve that in Spark. But why would you bother about the interleaved processing when the data that is being aggregated in different jobs is per

Re: spark.sql.shuffle.partitions=1 seems to be working fine but creates timeout for large skewed data

2015-08-20 Thread Hemant Bhanawat
Sorry, I misread your mail. Thanks for pointing that out. BTW, are the 8 files shuffle intermediate output and not the final output? I assume yes. I didn't know that you can keep intermediate output on HDFS and I don't think that is recommended. On Thu, Aug 20, 2015 at 2:43

Re: spark.sql.shuffle.partitions=1 seems to be working fine but creates timeout for large skewed data

2015-08-20 Thread Hemant Bhanawat
Looks like you are using hash based shuffling and not sort based shuffling which creates a single file per maptask. On Thu, Aug 20, 2015 at 12:43 AM, unk1102 wrote: > Hi I have a Spark job which deals with large skewed dataset. I have around > 1000 Hive partitions to process in four different ta

Re: How to overwrite partition when writing Parquet?

2015-08-20 Thread Hemant Bhanawat
How can I overwrite only a given partition or manually remove a partition before writing? I don't know if (and I don't think) there is a way to do that using a mode. But doesn't manually deleting the directory of a particular partition help? For directory structure, check this out... http://spar
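
A sketch of the manual-delete approach; the paths, partition column, and DataFrame are hypothetical, assuming a SQLContext in scope:

    import sqlContext.implicits._
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)
    // Drop just the one partition directory, then rewrite it.
    fs.delete(new Path("/data/events/date=2015-08-01"), true)
    df.filter($"date" === "2015-08-01")
      .write.parquet("/data/events/date=2015-08-01")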

Re: persist for DStream

2015-08-20 Thread Hemant Bhanawat
Are you asking for something more than this? http://spark.apache.org/docs/latest/streaming-programming-guide.html#caching--persistence On Thu, Aug 20, 2015 at 2:09 PM, Deepesh Maheshwari < deepesh.maheshwar...@gmail.com> wrote: > Hi, > > there are function available tp cache() or persist() RDD

Re: global variable in spark streaming with no dependency on key

2015-08-18 Thread Hemant Bhanawat
See if SparkContext.accumulator helps. On Tue, Aug 18, 2015 at 2:27 PM, Joanne Contact wrote: > Hi Gurus, > > Please help. > > But please don't tell me to use updateStateByKey because I need a > global variable (something like the clock time) across the micro > batches but not depending on key.
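
A sketch, assuming a simple global counter is enough (Spark 1.x API; the DStream name is hypothetical):

    // Workers only add to the accumulator; the driver reads the value.
    val processed = sc.accumulator(0L, "records processed")
    stream.foreachRDD { rdd =>
      rdd.foreach(_ => processed += 1L)
      println(s"total so far: ${processed.value}")   // read on the driver
    }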

Re: Regarding rdd.collect()

2015-08-18 Thread Hemant Bhanawat
matter of sharing an rdd between jobs you can have a look at Spark > Job Server(spark-jobserver <https://github.com/ooyala/spark-jobserver>) > or some In-Memory storages: Tachyon(http://tachyon-project.org/) or > Ignite(https://ignite.incubator.apache.org/) > > 2015-08-18 9:37 GM

Re: Regarding rdd.collect()

2015-08-18 Thread Hemant Bhanawat
It is still in memory for future rdd transformations and actions. This is interesting. You mean Spark holds the data in memory between two job executions. How does the second job get a handle on the data in memory? I am interested in knowing more about it. Can you forward me a spark article or

Re: registering an empty RDD as a temp table in a PySpark SQL context

2015-08-18 Thread Hemant Bhanawat
It is definitely not the case for Spark SQL. A temporary table (much like a dataFrame) is just a logical plan with a name, and it is not iterated unless a query is fired on it. I am not sure if using rdd.take in py code to verify the schema is the right approach as it creates a spark job. BTW, why w
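
A sketch illustrating the laziness described (Spark 1.x API; the schema is hypothetical):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    val schema = StructType(Seq(StructField("id", IntegerType)))
    val empty  = sqlContext.createDataFrame(sc.emptyRDD[Row], schema)
    empty.registerTempTable("empty_t")    // just names a logical plan; no job runs
    sqlContext.sql("SELECT COUNT(*) FROM empty_t").show()   // the job runs here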

Re: Apache Spark - Parallel Processing of messages from Kafka - Java

2015-08-16 Thread Hemant Bhanawat
In spark, every action (foreach, collect etc.) gets converted into a spark job and jobs are executed sequentially. You may want to refactor your code in calculateUseCase? to just run transformations (map, flatmap) and call a single action at the end. On Sun, Aug 16, 2015 at 3:19 PM, mohanaugust
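
A sketch of the refactoring being suggested; the DStream and function names are hypothetical:

    // Transformations are lazy and fuse into one stage;
    // each action launches a separate job, run one after another.
    val parsed = messages.map(parse)            // lazy
    val flat   = parsed.flatMap(expand)         // lazy
    flat.foreachRDD(rdd => rdd.foreach(sink))   // the single output action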

Re: Streaming on Exponential Data

2015-08-13 Thread Hemant Bhanawat
What does exponential data mean? Does this mean that the amount of data being received from the stream in a batch interval is increasing exponentially as time progresses? Does your process have enough memory to handle the data for a batch interval? You may want to share Spark task

Re: What is the Effect of Serialization within Stages?

2015-08-13 Thread Hemant Bhanawat
A chain of map and flatmap does not cause any serialization-deserialization. On Wed, Aug 12, 2015 at 4:02 PM, Mark Heimann wrote: > Hello everyone, > > I am wondering what the effect of serialization is within a stage. > > My understanding of Spark as an execution engine is that the data flow

Re: grouping by a partitioned key

2015-08-12 Thread Hemant Bhanawat
referenced in that link. > > Have you tried to use DataFrame API instead of SQL? I mean smth like > dataFrame.select(key).agg(count).distinct().agg(sum). > Could you print explain for this way and for SQL you tried? I’m just > curious of the difference. > > On Tue, Aug 1

Re: grouping by a partitioned key

2015-08-11 Thread Hemant Bhanawat
As far as I know, Spark SQL cannot process data on a per-partition basis. DataFrame.foreachPartition is the way. I haven't tried it, but the following looks like a not-so-sophisticated way of making spark sql partition aware. http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-d
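
A sketch of the per-partition route; the DataFrame and the work done per row are hypothetical:

    df.foreachPartition { rows =>
      // Set up any per-partition resource once here (e.g. a connection),
      // rather than once per row.
      rows.foreach { row =>
        // process(row)   // hypothetical per-row work
      }
    }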

Re: How to minimize shuffling on Spark dataframe Join?

2015-08-11 Thread Hemant Bhanawat
Is the source of your dataframe partitioned on key? As per your mail, it looks like it is not. If that is the case, you will have to shuffle the data anyway to partition it. Another part of your question is - how to co-group data from two dataframes based on a key? I think for RDD's co

Re: Partitioning in spark streaming

2015-08-11 Thread Hemant Bhanawat
Posting a comment from my previous mail post: When data is received from a stream source, the receiver creates blocks of data. A new block of data is generated every blockInterval milliseconds. N blocks of data are created during the batchInterval where N = batchInterval/blockInterval. An RDD is creat
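
Worked example of the formula above: with batchInterval = 2 s and blockInterval = 200 ms, N = 2000 ms / 200 ms = 10 blocks per batch, so each batch's RDD has 10 partitions and hence up to 10 parallel tasks per stage.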

Re: [VOTE] Grandfathering forgotten Geode contributors

2015-06-11 Thread Hemant Bhanawat
+1 On Fri, Jun 12, 2015 at 11:15 AM, Rajesh Kumar wrote: > +1 > > On Fri, Jun 12, 2015 at 10:43 AM, Suranjan Kumar > > wrote: > > > +1 > > > > On Fri, Jun 12, 2015 at 10:42 AM, Shirish Deshmukh > > > wrote: > > > > > +1 > > > > > > On Fri, Jun 12, 2015 at 4:45 AM, Roman Shaposhnik > > wrote:

Re: [VOTE] Grandfathering forgotten Geode contributors

2015-06-11 Thread Hemant Bhanawat
+1 On Thu, Jun 11, 2015 at 3:37 AM, William A Rowe Jr wrote: > On Wed, Jun 10, 2015 at 5:03 PM, William A Rowe Jr > wrote: > > > > > If you hold a public vote to make them committers, they are not on the > > PPMC. > > If you hold a private vote, likewise. If you hold a vote to make them > > co

Re: Spark Streaming - Design considerations/Knobs

2015-05-21 Thread Hemant Bhanawat
o remember about Spark Streaming. > > > On Wed, May 20, 2015 at 3:40 AM, Hemant Bhanawat > wrote: > >> Hi, >> >> I have compiled a list (from online sources) of knobs/design >> considerations that need to be taken care of by applications running on >> spark

Spark Streaming - Design considerations/Knobs

2015-05-20 Thread Hemant Bhanawat
Hi, I have compiled a list (from online sources) of knobs/design considerations that need to be taken care of by applications running on spark streaming. Is my understanding correct? Any other important design consideration that I should take care of? - A DStream is associated with a single

Re: Our JIRA is fully functional now

2015-05-05 Thread Hemant Bhanawat
hbhanawat Thanks, Hemant On Wed, May 6, 2015 at 10:32 AM, Edin Zulich wrote: > ezulich > > Thanks, > Edin > > > > On May 5, 2015, at 4:35 PM, Roman Shaposhnik wrote: > > > > https://issues.apache.org/jira/browse/GEODE-1 > > > > Please make sure to register accounts there > > and reply back to

Re: Regarding hsync

2013-07-11 Thread Hemant Bhanawat
Hi, Any help? Thanks in advance, Hemant - Original Message - From: "Hemant Bhanawat" To: hdfs-dev@hadoop.apache.org Sent: Tuesday, July 9, 2013 12:55:23 PM Subject: Regarding hsync Hi, I am currently working on hadoop version 2.0.*. Currently, hsync does not

Regarding hsync

2013-07-09 Thread Hemant Bhanawat
Hi, I am currently working on hadoop version 2.0.*. Currently, hsync does not update the file size on namenode. So, if my process dies after calling hsync but before calling file close, the file is left with an inconsistent file size. I would like to fix this file size. Is there a way to do

Partially written SequenceFile

2013-07-04 Thread Hemant Bhanawat
n = fs.append(path); fs.close(); break; } catch (RecoveryInProgressException e) { } catch (AlreadyBeingCreatedException e) { } catch (Exception e) { break; } } Would it be possible for you to let me know if this approach has any shortcomings or if there are any other better alternatives available? Thanks, Hemant Bhanawat

HBase and MapReduce

2012-05-23 Thread Hemant Bhanawat
I have a couple of questions related to MapReduce over HBase. 1. HBase guarantees data locality between store files and the RegionServer only if the server stays up for long. If there are too many region movements or the server has been recycled recently, there is a high probability that store file blocks are not