Regarding structured streaming windows on older data

2019-12-23 Thread Hemant Bhanawat
For demonstration purposes, I was using data that had older timestamps with structured streaming. The data was for the year 2018, the window was 24 hours, and the watermark was 0 seconds. A few things that I saw and could not explain are: 1. The initial batch of streaming had around 60 windows. It processed
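
A minimal sketch of the setup described above, assuming a streaming DataFrame named events with an event-time column named ts (both names hypothetical):

    import org.apache.spark.sql.functions.{window, col}

    val windowed = events
      .withWatermark("ts", "0 seconds")        // watermark of 0 seconds, as in the experiment
      .groupBy(window(col("ts"), "24 hours"))  // 24-hour windows over the 2018 timestamps
      .count()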

Re: mllib + SQL

2018-09-01 Thread Hemant Bhanawat
ironment. SQL-only analysts would struggle to be > effective with SQL-only access to Spark. > > On Fri, Aug 31, 2018 at 5:05 AM Hemant Bhanawat > wrote: > >> We allow our users to interact with spark cluster using SQL queries only. >> That's easy for them. MLLib does not have S

Re: mllib + SQL

2018-08-31 Thread Hemant Bhanawat
BTW, I can contribute if there is already an effort going on somewhere. On Fri, Aug 31, 2018 at 3:35 PM Hemant Bhanawat wrote: > We allow our users to interact with spark cluster using SQL queries only. > That's easy for them. MLLib does not have SQL extensions and we cannot > expose

Re: mllib + SQL

2018-08-31 Thread Hemant Bhanawat
is is certainly the best place to start. > > See here: https://spark.apache.org/docs/latest/ml-guide.html > > > best, > wb > > > > On Thu, Aug 30, 2018 at 1:45 AM Hemant Bhanawat > wrote: > >> Is there a plan to support SQL extensions for mllib? Or is t

mllib + SQL

2018-08-30 Thread Hemant Bhanawat
Is there a plan to support SQL extensions for mllib? Or is there an effort already underway? Any information is appreciated. Thanks in advance. Hemant

Re: Sorting on a streaming dataframe

2018-05-01 Thread Hemant Bhanawat
> Please open a JIRA then! > > On Fri, Apr 27, 2018 at 3:59 AM Hemant Bhanawat <hemant9...@gmail.com> > wrote: > >> I see. >> >> monotonically_increasing_id on streaming dataFrames will be really >> helpful to me and I believe to many more users. Addin

[jira] [Updated] (SPARK-24144) monotonically_increasing_id on streaming dataFrames

2018-05-01 Thread Hemant Bhanawat (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-24144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hemant Bhanawat updated SPARK-24144: Priority: Major (was: Minor) > monotonically_increasing_id on streaming dataFra

[jira] [Created] (SPARK-24144) monotonically_increasing_id on streaming dataFrames

2018-05-01 Thread Hemant Bhanawat (JIRA)
Hemant Bhanawat created SPARK-24144: --- Summary: monotonically_increasing_id on streaming dataFrames Key: SPARK-24144 URL: https://issues.apache.org/jira/browse/SPARK-24144 Project: Spark

Re: Sorting on a streaming dataframe

2018-04-27 Thread Hemant Bhanawat
to the architecture design and algorithm >> spec. However, from my experience with Spark, there are many good reasons >> why this requirement is not supported ;) >> >> Best, >> >> Chayapan (A) >> >> >> On Apr 24, 2018, at 2:18 PM, Hemant Bhanawat <

Re: Sorting on a streaming dataframe

2018-04-24 Thread Hemant Bhanawat
m may not need to be solved at > all. For example, if you are using kafka, a proper partitioning scheme and > message offsets may be “good enough”. > ------ > *From:* Hemant Bhanawat <hemant9...@gmail.com> > *Sent:* Thursday, April 12, 2018 11:42:59 PM

Re: Sorting on a streaming dataframe

2018-04-13 Thread Hemant Bhanawat
tid>. So we want to sort the dataframe so that the records always get the same snapshot id. On Fri, Apr 13, 2018 at 11:43 AM, Reynold Xin <r...@databricks.com> wrote: > Can you describe your use case more? > > On Thu, Apr 12, 2018 at 11:12 PM Hemant Bhanawat <hemant9...@gma

Sorting on a streaming dataframe

2018-04-13 Thread Hemant Bhanawat
Hi Guys, Why is sorting on streaming dataframes not supported (unless it is in complete mode)? My downstream needs me to sort the streaming dataframe. Hemant
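
A hedged illustration of the restriction being asked about: sorting is accepted on a streaming Dataset only after an aggregation and with the query in complete output mode (streamingDf and the key column are hypothetical):

    import org.apache.spark.sql.functions.col

    // allowed: orderBy on an aggregated result, written out in complete mode
    val counts = streamingDf.groupBy("key").count().orderBy(col("count").desc)
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
    // streamingDf.orderBy(col("key")) alone would be rejected at analysis time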

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-10 Thread Hemant Bhanawat
, aggregate push-down support is desirable and should be considered an enhancement going forward. Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811> www.snappydata.io On Sun, Sep 10, 2017 at 8:45 PM, vaquar khan <vaquar.k...@gmail.com> wrote: > +1 > > Regards, >

Re: How to specify file

2016-09-23 Thread Hemant Bhanawat
Check out the README on the following page. This is the csv connector that you are using. I think you need to specify the delimiter option. https://github.com/databricks/spark-csv Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811> www.snappydata.io On Fri, Sep 23, 2016
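
A minimal sketch of that option against the databricks/spark-csv connector (the "|" delimiter and path are illustrative):

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")      // first line holds column names
      .option("delimiter", "|")      // the delimiter option mentioned above
      .load("path/to/file.csv")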

Re: CSV Reader with row numbers

2016-09-22 Thread Hemant Bhanawat
in the API documentation. https://spark.apache.org/docs/1.6.1/api/scala/index.html#org.apache.spark.rdd.RDD Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811> www.snappydata.io On Thu, Sep 15, 2016 at 4:28 AM, Akshay Sachdeva <akshay.sachd...@gmail.com> wrote: >
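
The snippet only links the RDD scaladoc; assuming the method being pointed at is RDD.zipWithIndex, a sketch for attaching row numbers could look like:

    // pairs each record with a stable, 0-based row number
    val withRowNumbers = sc.textFile("data.csv")
      .zipWithIndex()
      .map { case (line, rowNumber) => (rowNumber, line) }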

Memory usage by Spark jobs

2016-09-22 Thread Hemant Bhanawat
for processing a specific data size of let's say parquet data? Also, has someone investigated memory usage for the individual SQL operators like Filter, group by, order by, Exchange etc.? Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811> www.snappydata.io

[jira] [Reopened] (SPARK-13693) Flaky test: o.a.s.streaming.MapWithStateSuite

2016-04-24 Thread Hemant Bhanawat (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hemant Bhanawat reopened SPARK-13693: - > Flaky test: o.a.s.streaming.MapWithStateSu

[jira] [Commented] (SPARK-13693) Flaky test: o.a.s.streaming.MapWithStateSuite

2016-04-24 Thread Hemant Bhanawat (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255881#comment-15255881 ] Hemant Bhanawat commented on SPARK-13693: - Latest Jenkins builds are failing with this issue. See

[jira] [Created] (SPARK-14729) Implement an existing cluster manager with New ExternalClusterManager interface

2016-04-19 Thread Hemant Bhanawat (JIRA)
Hemant Bhanawat created SPARK-14729: --- Summary: Implement an existing cluster manager with New ExternalClusterManager interface Key: SPARK-14729 URL: https://issues.apache.org/jira/browse/SPARK-14729

[jira] [Commented] (SPARK-14729) Implement an existing cluster manager with New ExternalClusterManager interface

2016-04-19 Thread Hemant Bhanawat (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-14729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15247543#comment-15247543 ] Hemant Bhanawat commented on SPARK-14729: - I am looking into this. > Implement an exist

[jira] [Commented] (SPARK-13904) Add support for pluggable cluster manager

2016-04-18 Thread Hemant Bhanawat (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15247096#comment-15247096 ] Hemant Bhanawat commented on SPARK-13904: - [~kiszk] Since the builds are passing now, can I

[jira] [Commented] (SPARK-13904) Add support for pluggable cluster manager

2016-04-18 Thread Hemant Bhanawat (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245581#comment-15245581 ] Hemant Bhanawat commented on SPARK-13904: - I ran the following command on my machine: build/sbt

[jira] [Commented] (SPARK-13904) Add support for pluggable cluster manager

2016-04-17 Thread Hemant Bhanawat (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-13904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245118#comment-15245118 ] Hemant Bhanawat commented on SPARK-13904: - [~kiszk] I am looking into this. > Add supp

Re: How to process one partition at a time?

2016-04-06 Thread Hemant Bhanawat
Apparently, there is another way to do it. You can try creating a PartitionPruningRDD and pass a partition filter function to it. This RDD will do the same thing that I suggested in my mail and you will not have to create a new RDD. Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhana
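
A small sketch of that approach (rdd and the partition index to keep are assumptions):

    import org.apache.spark.rdd.PartitionPruningRDD

    // wraps the parent RDD and schedules tasks only for partitions that
    // pass the filter function -- here, just partition 0
    val onePartition = PartitionPruningRDD.create(rdd, partitionIndex => partitionIndex == 0)
    onePartition.foreach(println)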

Re: Executor shutdown hooks?

2016-04-06 Thread Hemant Bhanawat
. The exit thread will wait for a certain period of time before the executor jvm exits to allow proper cleanups of the tasks. Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811> www.snappydata.io On Thu, Apr 7, 2016 at 6:08 AM, Reynold Xin <r...@databricks.com> wrote: >

Re: How to process one partition at a time?

2016-04-06 Thread Hemant Bhanawat
to hear if you take some other approach. Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811> www.snappydata.io On Wed, Apr 6, 2016 at 3:49 PM, Andrei <faithlessfri...@gmail.com> wrote: > I'm writing a kind of sampler which in most cases will require only 1 > part

Re: how about a custom coalesce() policy?

2016-04-02 Thread Hemant Bhanawat
correcting email id for Nezih Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811> www.snappydata.io On Sun, Apr 3, 2016 at 11:09 AM, Hemant Bhanawat <hemant9...@gmail.com> wrote: > Hi Nezih, > > Can you share JIRA and PR numbers? > > This

Re: how about a custom coalesce() policy?

2016-04-02 Thread Hemant Bhanawat
Hi Nezih, Can you share JIRA and PR numbers? This partial de-coupling of data partitioning strategy and spark parallelism would be a useful feature for any data store. Hemant Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811> www.snappydata.io On Fri, Apr 1, 2016 at

Re: SPARK-13900 - Join with simple OR conditions take too long

2016-04-01 Thread Hemant Bhanawat
p time will be very little. I had mentioned that NL time will *vary* little as the number of conditions grows. What I meant was that with 15 conditions instead of 3, the NL loop would still take 13-15 mins while the hash join would take more than that. Hemant Hemant Bhanawat

Re: SPARK-13900 - Join with simple OR conditions take too long

2016-03-31 Thread Hemant Bhanawat
increase linearly with the number of conditions. So, when there are too many conditions, a nested loop join would be faster than the solution that you suggest. Now the question is: how should Spark decide when to do what? Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3
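
A hedged sketch of the trade-off being discussed, with hypothetical dataframes a and b and key columns k1/k2: an OR of equality conditions forces a nested-loop join, while splitting it into single-condition joins lets each leg use a hash join.

    // single join with OR => nested loop over the conditions
    val nestedLoop = a.join(b, a("k1") === b("k1") || a("k2") === b("k2"))

    // one hash join per condition, then a union
    val unioned = a.join(b, a("k1") === b("k1"))
      .unionAll(a.join(b, a("k2") === b("k2")))
    // caution: rows satisfying both conditions appear twice and need de-duplication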

[jira] [Created] (SPARK-13904) Add support for pluggable cluster manager

2016-03-15 Thread Hemant Bhanawat (JIRA)
Hemant Bhanawat created SPARK-13904: --- Summary: Add support for pluggable cluster manager Key: SPARK-13904 URL: https://issues.apache.org/jira/browse/SPARK-13904 Project: Spark Issue Type

Re: Can we use spark inside a web service?

2016-03-11 Thread Hemant Bhanawat
to make Spark more suitable for such scenarios but it never made it to the Spark codebase. If Spark has to become a highly concurrent solution, scheduling has to be distributed. Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811> www.snappydata.io On Fri, Mar 11, 2016 at 7

Re: S3 Zip File Loading Advice

2016-03-08 Thread Hemant Bhanawat
that) and running a new job over the new folders created in an interval. This will have to be automated using an external script. Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811> www.snappydata.io On Wed, Mar 9, 2016 at 10:33 AM, Benjamin Kim <bbuil...@gmail.com> wr

Re: Spark Streaming - graceful shutdown when stream has no more data

2016-02-23 Thread Hemant Bhanawat
A guess - parseRecord is returning None in some case (probably empty lines). And then entry.get is throwing the exception. You may want to filter the None values from accessLogDStream before you run the map function over it. Hemant Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhana
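
A minimal sketch of that filtering, assuming parseRecord: String => Option[Entry] (names hypothetical):

    val entries = accessLogDStream
      .map(parseRecord)
      .filter(_.isDefined)  // drop the None results before dereferencing
      .map(_.get)

    // or in a single pass:
    val entries2 = accessLogDStream.flatMap(line => parseRecord(line).toSeq)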

Re: Specify number of executors in standalone cluster mode

2016-02-21 Thread Hemant Bhanawat
further processing. Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811> www.snappydata.io On Sun, Feb 21, 2016 at 11:01 PM, Saiph Kappa <saiph.ka...@gmail.com> wrote: > Hi, > > I'm running a spark streaming application onto a spark cluster that spans > 6 mac

Re: Behind the scene of RDD to DataFrame

2016-02-20 Thread Hemant Bhanawat
toDF internally calls sqlContext.createDataFrame, which transforms the RDD to RDD[InternalRow]. This RDD[InternalRow] is then mapped to a dataframe. Type conversions (from Scala types to Catalyst types) are involved but no shuffling. Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhana
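
A small example of the conversion being described (Spark 1.x style; names illustrative):

    case class Person(name: String, age: Int)

    import sqlContext.implicits._
    val rdd = sc.parallelize(Seq(Person("a", 1), Person("b", 2)))
    val df  = rdd.toDF()  // builds a plan over RDD[InternalRow]; no shuffle is triggered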

Re: spark stages in parallel

2016-02-20 Thread Hemant Bhanawat
Not possible as of today. See https://issues.apache.org/jira/browse/SPARK-2387 Hemant Bhanawat https://www.linkedin.com/in/hemant-bhanawat-92a3811 www.snappydata.io On Thu, Feb 18, 2016 at 1:19 PM, Shushant Arora <shushantaror...@gmail.com> wrote: > can two stages of single job run in

Re: Optimal way to re-partition from a single partition

2016-02-09 Thread Hemant Bhanawat
For sql shuffle operations like groupby, the number of output partitions is controlled by spark.sql.shuffle.partitions. But, it seems orderBy does not honour this. In my small test, I could see that the number of partitions in DF returned by orderBy was equal to the total number of distinct
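
A quick way to observe both behaviours (df and its column k are hypothetical):

    sqlContext.setConf("spark.sql.shuffle.partitions", "8")

    df.groupBy("k").count().rdd.partitions.length  // 8: groupBy honours the setting
    df.orderBy("k").rdd.partitions.length          // can differ, as discussed in this thread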

Re: Optimal way to re-partition from a single partition

2016-02-09 Thread Hemant Bhanawat
ing empty > (that is, # of result partitions is lower than `spark.sql.shuffle.partitions` > . > > > On Tue, Feb 9, 2016 at 9:35 PM, Hemant Bhanawat <hemant9...@gmail.com> > wrote: > >> For sql shuffle operations like groupby, the number of output partitions >> i

Re: Spark Streaming with Druid?

2016-02-08 Thread Hemant Bhanawat
On Mon, Feb 8, 2016 at 10:04 PM, Umesh Kacha <umesh.ka...@gmail.com> wrote: > Hi Hemant, thanks much can we use SnappyData on YARN. My Spark jobs run > using yarn client mode. Please guide. > > On Mon, Feb 8, 2016 at 9:46 AM, Hemant Bhanawat <hemant9...@gmail.com> > wrote

Re: Spark Streaming with Druid?

2016-02-07 Thread Hemant Bhanawat
You may want to have a look at the spark druid project already in progress: https://github.com/SparklineData/spark-druid-olap You can also have a look at SnappyData, which is a low latency store tightly integrated with Spark, Spark SQL and Spark

Re: DataFrame First method is resulting different results in each iteration

2016-02-03 Thread Hemant Bhanawat
Missing order by? Hemant Bhanawat SnappyData (http://snappydata.io/) On Wed, Feb 3, 2016 at 3:45 PM, satish chandra j <jsatishchan...@gmail.com> wrote: > HI All, > I have data in a emp_df (DataFrame) as mentioned below: > > EmpId Sal DeptNo > 001 100 10 >

Re: DataFrame First method is resulting different results in each iteration

2016-02-03 Thread Hemant Bhanawat
For getting EmpId with highest Sal, you will have to change your query to add filters or add subqueries. See the following thread: http://stackoverflow.com/questions/6841605/get-top-1-row-of-each-group Hemant Bhanawat SnappyData (http://snappydata.io/) On Wed, Feb 3, 2016 at 4:33 PM, satish
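
A sketch of the top-1-row-per-group pattern from the linked thread, applied to the emp_df example with a window function:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{row_number, desc, col}

    // highest Sal per DeptNo: rank rows within each department, keep rank 1
    val w = Window.partitionBy("DeptNo").orderBy(desc("Sal"))
    val top = emp_df
      .withColumn("rn", row_number().over(w))
      .where(col("rn") === 1)
      .drop("rn")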

Re: GenericMutableRow and Row mismatch on Spark 1.5?

2015-10-07 Thread Hemant Bhanawat
Please find attached. On Wed, Oct 7, 2015 at 7:36 PM, Ted Yu <yuzhih...@gmail.com> wrote: > Hemant: > Can you post the code snippet to the mailing list - other people would be > interested. > > On Wed, Oct 7, 2015 at 5:50 AM, Hemant Bhanawat <hemant9...@gmail.com>

Re: GenericMutableRow and Row mismatch on Spark 1.5?

2015-10-07 Thread Hemant Bhanawat
problem is that I get types mismatches... > > On Wed, Oct 7, 2015 at 8:03 AM, Hemant Bhanawat <hemant9...@gmail.com> > wrote: > >> An approach can be to wrap your MutableRow in WrappedInternalRow which is >> a child class of Row. >> >> Hemant >> www.sn

Re: GenericMutableRow and Row mismatch on Spark 1.5?

2015-10-07 Thread Hemant Bhanawat
Will send you the code on your email id. On Wed, Oct 7, 2015 at 4:37 PM, Ophir Cohen <oph...@gmail.com> wrote: > Thanks! > Can you check if you can provide example of the conversion? > > > On Wed, Oct 7, 2015 at 2:05 PM, Hemant Bhanawat <hemant9...@gma

Re: GenericMutableRow and Row mismatch on Spark 1.5?

2015-10-06 Thread Hemant Bhanawat
An approach can be to wrap your MutableRow in WrappedInternalRow which is a child class of Row. Hemant www.snappydata.io linkedin.com/company/snappydata On Tue, Oct 6, 2015 at 3:21 PM, Ophir Cohen wrote: > Hi Guys, > I'm upgrading to Spark 1.5. > > In our previous version

Re: [cache eviction] partition recomputation in big lineage RDDs

2015-10-01 Thread Hemant Bhanawat
As I understand, you don't need a merge of your historical data RDD with your RDD_inc; what you need is a merge of the computation results of your historical RDD with RDD_inc and so on. IMO, you should consider having an external row store to hold your computations. I say this because you need

Re: flatmap() and spark performance

2015-09-28 Thread Hemant Bhanawat
You can use spark.executor.memory to specify the memory of the executors which will hold this intermediate results. You may want to look at the section "Understanding Memory Management in Spark" of this link:
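
For completeness, a minimal way to set that property (the 4g value is illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf().set("spark.executor.memory", "4g")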

Re: caching DataFrames

2015-09-23 Thread Hemant Bhanawat
Two dataframes do not share cache storage in Spark. Hence it is immaterial how the two dataFrames are related to each other. Both of them are going to consume memory based on the data that they have. So for your A1 and B1 you would need extra memory that would be equivalent to half the memory of

Re: DataGenerator for streaming application

2015-09-21 Thread Hemant Bhanawat
Why are you using rawSocketStream to read the data? I believe rawSocketStream waits for a big chunk of data before it can start processing it. I think what you are writing is a String and you should use socketTextStream which reads the data on a per line basis. On Sun, Sep 20, 2015 at 9:56 AM,
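
A minimal sketch of the suggested alternative (host and port are illustrative):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(conf, Seconds(1))
    // reads newline-delimited text, one record per line
    val lines = ssc.socketTextStream("localhost", 9999)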

Re: Why are executors on slave never used?

2015-09-21 Thread Hemant Bhanawat
When you specify master as local[2], it starts the spark components in a single jvm. You need to specify the master correctly. I have a default AWS EMR cluster (1 master, 1 slave) with Spark. When I run a Spark process, it works fine -- but only on the master, as if it were standalone. The web-UI
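
A hedged sketch of the fix: point the application at the cluster's master URL instead of local[2], which runs everything inside one JVM (the host below is a placeholder):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("my-app")
      .setMaster("spark://<master-host>:7077")  // standalone master URL, not local[2]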

Re: A way to timeout and terminate a laggard 'Stage' ?

2015-09-17 Thread Hemant Bhanawat
Having the driver time out laggards seems like a reasonable way of handling them. Are there any challenges because of which the driver does not do it today? Is there a JIRA for this? I couldn't find one. On Tue, Sep 15, 2015 at 12:07 PM, Akhil Das wrote: > As of now i

Re: Difference between sparkDriver and "executor ID driver"

2015-09-16 Thread Hemant Bhanawat
1. When you call new SparkContext(), the spark driver is started, which internally creates an Akka ActorSystem that registers on this port. 2. Since you are running in local mode, starting of the executor is short-circuited and an Executor object is created in the same process (see LocalEndpoint). This

Re: taking an n number of rows from and RDD starting from an index

2015-09-02 Thread Hemant Bhanawat
I think rdd.toLocalIterator is what you want. But it will keep one partition's data in memory. On Wed, Sep 2, 2015 at 10:05 AM, Niranda Perera wrote: > Hi all, > > I have a large set of data which would not fit into the memory. So, I wan > to take n number of data from
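
A small sketch of taking n rows starting from an index with that iterator (rdd, start and n are assumptions):

    // pulls partitions to the driver one at a time, holding only one in memory
    val slice = rdd.toLocalIterator.slice(start, start + n).toSeq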

Re: How to Serialize and Reconstruct JavaRDD later?

2015-09-02 Thread Hemant Bhanawat
You want to persist the state between the execution of two rdds. So, I believe what you need is serialization of your model and not JavaRDD. If you can serialize your model, you can persist that in HDFS or some other datastore to be used by the next RDDs. If you are using Spark Streaming, doing

Re: Performance issue with Spark join

2015-08-26 Thread Hemant Bhanawat
Spark joins are different from traditional database joins because of the lack of support for indexes. Spark has to shuffle data between various nodes to perform joins. Hence joins are bound to be much slower than count, which is just a parallel scan of the data. Still, to ensure that nothing is

Re: Exception throws when running spark pi in Intellij Idea that scala.collection.Seq is not found

2015-08-25 Thread Hemant Bhanawat
Go to the module settings of the project and in the dependencies section check the scope of scala jars. It would be either Test or Provided. Change it to compile and it should work. Check the following link to understand more about scope of modules:

Re: How to set environment of worker applications

2015-08-25 Thread Hemant Bhanawat
spark.executor.extraJavaOptions to pass system properties and spark-env.sh to pass environment variables. -raghav On Mon, Aug 24, 2015 at 1:00 PM, Hemant Bhanawat hemant9...@gmail.com wrote: That's surprising. Passing the environment variables using spark.executor.extraJavaOptions=-Dmyenvvar=xxx to the executor
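
A minimal sketch combining the two, using the -Dmyenvvar=xxx example from this thread:

    import org.apache.spark.SparkConf

    // system property visible to executor JVMs via System.getProperty("myenvvar")
    val conf = new SparkConf().set("spark.executor.extraJavaOptions", "-Dmyenvvar=xxx")
    // environment variables, by contrast, go into conf/spark-env.sh on the workers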

Re: Joining using mulitimap or array

2015-08-24 Thread Hemant Bhanawat
In your example, a.attributes.name is a list, not a string. Run this to find out: a.select($a.attributes.name).show() On Mon, Aug 24, 2015 at 2:51 PM, Ilya Karpov i.kar...@cleverdata.ru wrote: Hi, guys I'm confused about joining columns in SparkSQL and need your advice. I want to

Re: How to set environment of worker applications

2015-08-24 Thread Hemant Bhanawat
it in spark-env.sh file on each worker node. On Sun, Aug 23, 2015 at 8:16 PM, Hemant Bhanawat hemant9...@gmail.com wrote: Check for spark.driver.extraJavaOptions and spark.executor.extraJavaOptions in the following article. I think you can use -D to pass system vars: spark.apache.org/docs/latest

Re: How to set environment of worker applications

2015-08-23 Thread Hemant Bhanawat
Check for spark.driver.extraJavaOptions and spark.executor.extraJavaOptions in the following article. I think you can use -D to pass system vars: spark.apache.org/docs/latest/configuration.html#runtime-environment Hi, I am starting a spark streaming job in standalone mode with spark-submit. Is

Re: PySpark concurrent jobs using single SparkContext

2015-08-21 Thread Hemant Bhanawat
It seems like you want simultaneous processing of multiple jobs but, at the same time, serialization of a few tasks within those jobs. I don't know how to achieve that in Spark. But why would you bother about the interleaved processing when the data that is being aggregated in different jobs is per

Re: persist for DStream

2015-08-20 Thread Hemant Bhanawat
Are you asking for something more than this? http://spark.apache.org/docs/latest/streaming-programming-guide.html#caching--persistence On Thu, Aug 20, 2015 at 2:09 PM, Deepesh Maheshwari deepesh.maheshwar...@gmail.com wrote: Hi, there are function available tp cache() or persist() RDD in
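
Beyond the default caching described in that guide, a DStream's storage level can also be set explicitly; a minimal sketch (mappedStream is hypothetical):

    import org.apache.spark.storage.StorageLevel

    mappedStream.persist(StorageLevel.MEMORY_AND_DISK_SER)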

Re: spark.sql.shuffle.partitions=1 seems to be working fine but creates timeout for large skewed data

2015-08-20 Thread Hemant Bhanawat
Sorry, I misread your mail. Thanks for pointing that out. BTW, are the 8 files shuffle intermediate output and not the final output? I assume yes. I didn't know that you can keep intermediate output on HDFS and I don't think that is recommended. On Thu, Aug 20, 2015 at 2:43 PM, Hemant

Re: Regarding rdd.collect()

2015-08-18 Thread Hemant Bhanawat
between jobs you can have a look at Spark Job Server(spark-jobserver https://github.com/ooyala/spark-jobserver) or some In-Memory storages: Tachyon(http://tachyon-project.org/) or Ignite(https://ignite.incubator.apache.org/) 2015-08-18 9:37 GMT+02:00 Hemant Bhanawat hemant9...@gmail.com

Re: global variable in spark streaming with no dependency on key

2015-08-18 Thread Hemant Bhanawat
See if SparkContext.accumulator helps. On Tue, Aug 18, 2015 at 2:27 PM, Joanne Contact joannenetw...@gmail.com wrote: Hi Gurus, Please help. But please don't tell me to use updateStateByKey because I need a global variable (something like the clock time) across the micro batches but not
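
A minimal sketch of that suggestion with the Spark 1.x accumulator API (names hypothetical):

    // a driver-visible counter shared across micro-batches
    val clock = ssc.sparkContext.accumulator(0L, "globalClock")
    stream.foreachRDD { rdd => clock += rdd.count() }  // updated once per batch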

Re: Regarding rdd.collect()

2015-08-18 Thread Hemant Bhanawat
It is still in memory for future rdd transformations and actions. This is interesting. You mean Spark holds the data in memory between two job executions. How does the second job get the handle of the data in memory? I am interested in knowing more about it. Can you forward me a spark article or

Re: registering an empty RDD as a temp table in a PySpark SQL context

2015-08-18 Thread Hemant Bhanawat
It is definitely not the case for Spark SQL. A temporary table (much like a dataFrame) is just a logical plan with a name, and it is not iterated unless a query is fired on it. I am not sure if using rdd.take in py code to verify the schema is a right approach as it creates a spark job. BTW, why

Re: Apache Spark - Parallel Processing of messages from Kafka - Java

2015-08-16 Thread Hemant Bhanawat
In spark, every action (foreach, collect etc.) gets converted into a spark job and jobs are executed sequentially. You may want to refactor your code in calculateUseCase? to just run transformations (map, flatmap) and call a single action in the end. On Sun, Aug 16, 2015 at 3:19 PM, mohanaugust
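
A sketch of that refactoring (parse, isRelevant and writeOut are hypothetical helpers):

    // transformations only -- no jobs are triggered here
    val parsed   = messages.map(parse)
    val filtered = parsed.filter(isRelevant)

    // a single action at the end => one job per batch
    filtered.foreachRDD(rdd => rdd.foreachPartition(records => records.foreach(writeOut)))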

Re: Streaming on Exponential Data

2015-08-14 Thread Hemant Bhanawat
What does exponential data mean? Does this mean that the amount of data that is being received from the stream in a batch interval is increasing exponentially as time progresses? Does your process have enough memory to handle the data for a batch interval? You may want to share Spark

Re: What is the Effect of Serialization within Stages?

2015-08-13 Thread Hemant Bhanawat
A chain of map and flatmap does not cause any serialization-deserialization. On Wed, Aug 12, 2015 at 4:02 PM, Mark Heimann mark.heim...@kard.info wrote: Hello everyone, I am wondering what the effect of serialization is within a stage. My understanding of Spark as an execution engine is
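
For instance, both steps below run pipelined inside one stage, with objects handed from function to function in memory:

    val tokens = rdd.map(_.trim).flatMap(_.split("\\s+"))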

Re: grouping by a partitioned key

2015-08-12 Thread Hemant Bhanawat
, Hemant Bhanawat hemant9...@gmail.com wrote: As far as I know, Spark SQL cannot process data on a per-partition-basis. DataFrame.foreachPartition is the way. What do you mean by “cannot process on per-partition-basis”? DataFrame is an RDD on steroids. I meant that Spark SQL cannot process data

Re: grouping by a partitioned key

2015-08-12 Thread Hemant Bhanawat
As far as I know, Spark SQL cannot process data on a per-partition-basis. DataFrame.foreachPartition is the way. I haven't tried it, but, following looks like a not-so-sophisticated way of making spark sql partition aware.
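
A minimal sketch of the per-partition route (df and the processing function are assumptions):

    import org.apache.spark.sql.Row

    df.foreachPartition { rows: Iterator[Row] =>
      rows.foreach(row => process(row))  // process is a hypothetical per-row sink
    }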

Re: How to minimize shuffling on Spark dataframe Join?

2015-08-11 Thread Hemant Bhanawat
Is the source of your dataframe partitioned on key? As per your mail, it looks like it is not. If that is the case, for partitioning the data, you will have to shuffle the data anyway. Another part of your question is - how to co-group data from two dataframes based on a key? I think for RDD's
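
For the RDD route, a hedged sketch of co-partitioning both sides once so the join itself causes no further shuffle (names and partition count are illustrative):

    import org.apache.spark.HashPartitioner

    val p = new HashPartitioner(32)
    val left   = leftRdd.partitionBy(p).cache()   // one up-front shuffle each
    val right  = rightRdd.partitionBy(p).cache()
    val joined = left.join(right)                 // co-partitioned: no extra shuffle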

Re: Partitioning in spark streaming

2015-08-11 Thread Hemant Bhanawat
Posting a comment from my previous post: When data is received from a stream source, the receiver creates blocks of data. A new block of data is generated every blockInterval milliseconds. N blocks of data are created during the batchInterval where N = batchInterval/blockInterval. A RDD is
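
Worked through with illustrative numbers: a 2-second batchInterval with a 200 ms blockInterval gives N = 2000 / 200 = 10 blocks, i.e. 10 partitions per batch RDD.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().set("spark.streaming.blockInterval", "200ms")
    val ssc  = new StreamingContext(conf, Seconds(2))  // => N = 2000/200 = 10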

Re: [VOTE] Grandfathering forgotten Geode contributors

2015-06-11 Thread Hemant Bhanawat
+1 On Thu, Jun 11, 2015 at 3:37 AM, William A Rowe Jr wr...@rowe-clan.net wrote: On Wed, Jun 10, 2015 at 5:03 PM, William A Rowe Jr wr...@rowe-clan.net wrote: If you hold a public vote to make them committers, they are not on the PPMC. If you hold a private vote, likewise. If you

Re: [VOTE] Grandfathering forgotten Geode contributors

2015-06-11 Thread Hemant Bhanawat
+1 On Fri, Jun 12, 2015 at 11:15 AM, Rajesh Kumar rku...@pivotal.io wrote: +1 On Fri, Jun 12, 2015 at 10:43 AM, Suranjan Kumar suranjan.ku...@gmail.com wrote: +1 On Fri, Jun 12, 2015 at 10:42 AM, Shirish Deshmukh sdeshm...@pivotal.io wrote: +1 On Fri, Jun 12, 2015 at

Re: Spark Streaming - Design considerations/Knobs

2015-05-21 Thread Hemant Bhanawat
to remember about Spark Streaming. On Wed, May 20, 2015 at 3:40 AM, Hemant Bhanawat hemant9...@gmail.com wrote: Hi, I have compiled a list (from online sources) of knobs/design considerations that need to be taken care of by applications running on spark streaming. Is my understanding correct

Spark Streaming - Design considerations/Knobs

2015-05-20 Thread Hemant Bhanawat
Hi, I have compiled a list (from online sources) of knobs/design considerations that need to be taken care of by applications running on spark streaming. Is my understanding correct? Any other important design consideration that I should take care of? - A DStream is associated with a single

Re: Regarding hsync

2013-07-11 Thread Hemant Bhanawat
Hi, Any help? Thanks in advance, Hemant - Original Message - From: Hemant Bhanawat hema...@vmware.com To: hdfs-dev@hadoop.apache.org Sent: Tuesday, July 9, 2013 12:55:23 PM Subject: Regarding hsync Hi, I am currently working on hadoop version 2.0.*. Currently, hsync does

Regarding hsync

2013-07-09 Thread Hemant Bhanawat
Hi, I am currently working on hadoop version 2.0.*. Currently, hsync does not update the file size on namenode. So, if my process dies after calling hsync but before calling file close, the file is left with an inconsistent file size. I would like to fix this file size. Is there a way to do
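
A hedged sketch of one way to get the length updated, via the HDFS client API's UPDATE_LENGTH flag (Scala over the Java API; availability depends on the exact 2.x build, and fs/path are assumed in scope):

    import java.util.EnumSet
    import org.apache.hadoop.hdfs.client.HdfsDataOutputStream
    import org.apache.hadoop.hdfs.client.HdfsDataOutputStream.SyncFlag

    val out = fs.create(path)
    out match {
      // asks the namenode to persist the current length as part of the sync
      case h: HdfsDataOutputStream => h.hsync(EnumSet.of(SyncFlag.UPDATE_LENGTH))
      case other                   => other.hsync()  // plain hsync: length not updated
    }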

Partially written SequenceFile

2013-07-04 Thread Hemant Bhanawat
InProgressException) { } catch (AlreadyBeingCreatedException) { } catch (Exception) { break; } } Would it be possible for you to let me know if this approach has any shortcomings or if there are any other better alternatives available? Thanks, Hemant Bhanawat

HBase and MapReduce

2012-05-23 Thread Hemant Bhanawat
I have a couple of questions related to MapReduce over HBase. 1. HBase guarantees data locality between store files and the RegionServer only if it stays up for long. If there are too many region movements or the server has been recycled recently, there is a high probability that store file blocks are not