Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-09 Thread Nitin Siwach
== Analyzed Logical Plan ==
index: string, 0: string
Relation [index#50,0#51] csv

== Optimized Logical Plan ==
Relation [index#50,0#51] csv

== Physical Plan ==
FileScan csv [index#50,0#51] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/home/nitin/work/df1.cs
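
A minimal PySpark sketch that would reproduce this kind of plan output (the path and header option are hypothetical stand-ins for the truncated ones above):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("plan-demo").getOrCreate()

    # A plain CSV read with no filters optimizes down to a single
    # FileScan node, as in the plans above.
    df = spark.read.option("header", "true").csv("/tmp/df1.csv")  # hypothetical path
    df.explain(True)  # prints analyzed, optimized, and physical plans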

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Nitin Siwach

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Nitin Siwach

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Nitin Siwach
ain in the second run. You can > also confirm it in other metrics from the Spark UI. > > That is my personal understanding based on what I have read and seen on my > job runs. If there is any mistake, feel free to correct me. > > Thank You & Best Regards > Winston Lai > --
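
A sketch of the behaviour this thread converges on, assuming a hypothetical local CSV: without caching, each action re-scans the source file; caching materializes the scan once.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("double-read-demo").getOrCreate()
    df = spark.read.csv("/tmp/df1.csv", header=True)  # hypothetical path

    # Two actions on an uncached DataFrame each trigger their own FileScan,
    # so the file is read twice; compare "Input Size" in the Spark UI.
    df.count()
    df.select("index").distinct().count()

    # With cache(), the first action populates the cache and later actions
    # are served from it, so the file is read only once.
    df.cache()
    df.count()
    df.select("index").distinct().count()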

What is DataFilters and while joining why is the filter isnotnull[joinKey] applied twice

2023-01-31 Thread Nitin Siwach
/dfr_key_int], PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct -- Regards, Nitin
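
For context, a sketch (hypothetical paths, join key "a") of where these IsNotNull filters come from: for an inner equi-join the optimizer infers that the join keys cannot be null and pushes an isnotnull predicate down to each side's scan, so the filter shows up once per input rather than twice on one input.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("join-filter-demo").getOrCreate()

    left = spark.read.parquet("/tmp/dfl")    # hypothetical paths
    right = spark.read.parquet("/tmp/dfr")

    # The optimized plan shows PushedFilters: [IsNotNull(a)] under the
    # FileScan of *both* inputs.
    left.join(right, "a").explain(True)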

bucketBy in pyspark not retaining partition information

2022-01-31 Thread Nitin Siwach
can parquet [a#24,b#25,c#26] Batched: true, DataFilters: [isnotnull(a#24)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/home/nitin/pymonsoon/bucket_test_parquet1], PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct +- Sort [a#33 ASC NULLS FIRST], false
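
A sketch of the bucketBy round trip this thread is about (table name hypothetical). Bucket metadata lives in the metastore, so it survives only when the table is read back with spark.table(), not through a plain path read:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bucket-demo").getOrCreate()

    df = spark.range(100).selectExpr("id AS a", "id % 10 AS b", "id % 3 AS c")

    # Write 8 buckets on column "a"; bucketBy requires saveAsTable.
    df.write.bucketBy(8, "a").sortBy("a").mode("overwrite").saveAsTable("bucket_test")

    # Reading through the catalog retains the bucketing, so an aggregate
    # on "a" can skip the Exchange; explain() shows whether it did.
    spark.table("bucket_test").groupBy("a").count().explain(True)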

understanding iterator of series to iterator of series pandasUDF

2022-01-04 Thread Nitin Siwach
I understand pandas UDFs as follows: 1. There are multiple partitions per worker 2. Each partition is converted into multiple Arrow batches 3. These are sent to the Python process 4. In the case of Series to Series, the pandas UDF is applied to each Arrow batch one after the other? **(So, is it that (a) - The
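
A runnable sketch of the Iterator-of-Series variant being asked about: the function is invoked once per partition, and the iterator yields one pandas Series per Arrow batch, so any per-partition setup can be done once before the loop.

    from typing import Iterator
    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()

    @pandas_udf("long")
    def plus_one(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
        # One-time per-partition setup would go here.
        for batch in batches:          # one pandas Series per Arrow batch
            yield batch + 1

    spark.range(10).select(plus_one("id")).show()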

Re: Read hdfs files in spark streaming

2019-06-11 Thread nitin jain
Hi Deepak, Please let us know how you managed it. Thanks, NJ On Mon, Jun 10, 2019 at 4:42 PM Deepak Sharma wrote: > Thanks All. > I managed to get this working. > Marking this thread as closed. > > On Mon, Jun 10, 2019 at 4:14 PM Deepak Sharma > wrote: > >> This is the project requirement

Re: MLib : Non Linear Optimization

2016-09-09 Thread Nitin Sareen
away from SAS due to the cost, it would be really good to have these algorithms in Spark ML. Let me know if you need any more info; I can share some snippets if required. Thanks, Nitin On Thu, Sep 8, 2016 at 2:08 PM, Robin East <robin.e...@xense.co.uk> wrote: > Do you have any particul

Re: Populating tables using hive and spark

2016-08-22 Thread Nitin Kumar
others can concur we can go ahead and report it as a bug. Regards, Nitin On Mon, Aug 22, 2016 at 4:15 PM, Furcy Pin <furcy@flaminem.com> wrote: > Hi Nitin, > > I confirm that there is something odd here. > > I did the following test: > > create table test_orc (id

Populating tables using hive and spark

2016-08-22 Thread Nitin Kumar
rmats as well (textFile etc.) Is this because of the different naming conventions used by hive and spark to write records to hdfs? Or maybe it is not a recommended practice to write tables using different services? Your thoughts and comments on this matter would be highly appreciated! Thanks! Nitin

Re: Apache Spark : spark.eventLog.dir on Windows Environment

2015-07-21 Thread Nitin Kalra
Hi Akhil, I don't have HADOOP_HOME or HADOOP_CONF_DIR set, nor winutils.exe. What configuration is required for this? From where can I get winutils.exe? Thanks and Regards, Nitin Kalra On Tue, Jul 21, 2015 at 1:30 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Do you have HADOOP_HOME

Re: Does HiveContext connect to HiveServer2?

2015-06-24 Thread Nitin kak
Hi Marcelo, The issue does not happen while connecting to the Hive metastore; that works fine. It seems that HiveContext only uses the Hive CLI to execute the queries, and HiveServer2 is not supported. I don't think you can specify any configuration in hive-site.xml which can make it connect to
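
For what it's worth, a sketch of the direct-metastore setup being described, using the modern SparkSession equivalent of HiveContext (the thrift URI is hypothetical): Spark talks to the metastore service itself and runs queries with its own engine; HiveServer2 is not involved.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-metastore-demo")
             .config("hive.metastore.uris", "thrift://metastore-host:9083")  # hypothetical
             .enableHiveSupport()
             .getOrCreate())

    spark.sql("SHOW TABLES").show()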

Re: Hive query execution from Spark(through HiveContext) failing with Apache Sentry

2015-06-22 Thread Nitin kak
Any response to this, guys? On Fri, Jun 19, 2015 at 2:34 PM, Nitin kak nitinkak...@gmail.com wrote: Any other suggestions, guys? On Wed, Jun 17, 2015 at 7:54 PM, Nitin kak nitinkak...@gmail.com wrote: With Sentry, only the hive user has permission for read/write/execute on the subdirectories

Re: Hive query execution from Spark(through HiveContext) failing with Apache Sentry

2015-06-19 Thread Nitin kak
Any other suggestions, guys? On Wed, Jun 17, 2015 at 7:54 PM, Nitin kak nitinkak...@gmail.com wrote: With Sentry, only the hive user has permission for read/write/execute on the subdirectories of the warehouse. All the users get translated to hive when interacting with hiveserver2. But I think

Hive query execution from Spark(through HiveContext) failing with Apache Sentry

2015-06-17 Thread Nitin kak
I am trying to run a hive query from Spark code using a HiveContext object. It was running fine earlier, but since Apache Sentry has been installed the process is failing with this exception: *org.apache.hadoop.security.AccessControlException: Permission denied: user=kakn,

Re: Hive query execution from Spark(through HiveContext) failing with Apache Sentry

2015-06-17 Thread Nitin kak
: Try to grant read execute access through sentry. On 18 Jun 2015 05:47, Nitin kak nitinkak...@gmail.com wrote: I am trying to run a hive query from Spark code using HiveContext object. It was running fine earlier but since the Apache Sentry

Re: HiveContext test, Spark Context did not initialize after waiting 10000ms

2015-05-26 Thread Nitin kak
That is a much better solution than how I resolved it. I got around it by placing comma-separated jar paths for all the hive-related jars in the --jars clause. I will try your solution. Thanks for sharing it. On Tue, May 26, 2015 at 4:14 AM, Mohammad Islam misla...@yahoo.com wrote: I got a similar

Re: how to clean shuffle write each iteration

2015-03-03 Thread nitin
Shuffle write will be cleaned if it is not referenced by any object directly or indirectly. Spark has a context cleaner that periodically checks for weak references to RDDs/shuffle writes/broadcasts and deletes the ones that have been garbage collected.
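
A sketch of that behaviour: dropping the last driver-side reference to an RDD is what lets the cleaner reclaim its shuffle files between iterations.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cleanup-demo").getOrCreate()
    sc = spark.sparkContext

    for i in range(10):
        rdd = sc.parallelize(range(1000)).map(lambda x: (x % 10, x))
        totals = rdd.reduceByKey(lambda a, b: a + b)  # writes shuffle files
        totals.count()
        # Rebinding `rdd`/`totals` on the next iteration drops the
        # references; once the driver garbage-collects them, the context
        # cleaner deletes the corresponding shuffle files on the executors.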

Re: SQLContext.applySchema strictness

2015-02-14 Thread nitin
AFAIK, this is the expected behavior. You have to make sure that the schema matches the rows. It won't give any error when you apply the schema, as it doesn't validate the data against it.
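
A sketch of that behaviour using the modern createDataFrame with verification switched off (to mimic the old applySchema): the mismatched row is accepted here and only causes trouble when an action touches it.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType

    spark = SparkSession.builder.appName("schema-demo").getOrCreate()

    schema = StructType([StructField("id", IntegerType()),
                         StructField("name", StringType())])

    # The second row doesn't match the schema, but no error is raised here...
    df = spark.createDataFrame([(1, "a"), ("oops", 2)], schema, verifySchema=False)

    # ...it surfaces only at execution time (an error or corrupt values,
    # depending on the mismatch).
    # df.collect()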

Re: Spark SQL - Point lookup optimisation in SchemaRDD?

2015-02-11 Thread nitin
I was able to resolve this use case (thanks Cheng Lian), where I wanted to launch executors on just the specific partition while also getting the batch-pruning optimisations of Spark SQL, by doing the following: val query = sql("SELECT * FROM cachedTable WHERE key = 1") val plannedRDD =

Spark SQL - Point lookup optimisation in SchemaRDD?

2015-02-09 Thread nitin
cached twice. My question is: can we create a PartitionPruningCachedSchemaRDD-like class which can prune the partitions of InMemoryColumnarTableScan's RDD[CachedBatch] and launch executors on just the selected partition(s)? Thanks -Nitin

Re: Spark Driver Host under Yarn

2015-02-09 Thread nitin
Are you running in yarn-cluster or yarn-client mode?

Re: Spark (yarn-client mode) Hangs in final stages of Collect or Reduce

2015-02-09 Thread nitin
Have you checked the corresponding executor logs as well? I think the information you have provided here is too little to actually understand your issue.

Re: Where can I find logs set inside RDD processing functions?

2015-02-06 Thread Nitin kak
The yarn log aggregation is enabled, and the logs which I get through yarn logs -applicationId your_application_id are no different from what I get through the logs in the Yarn application tracking URL. They still don't have the above logs. On Fri, Feb 6, 2015 at 3:36 PM, Petar Zecevic

Re: Where can I find logs set inside RDD processing functions?

2015-02-06 Thread Nitin kak
yarn.nodemanager.remote-app-log-dir is set to /tmp/logs On Fri, Feb 6, 2015 at 4:14 PM, Ted Yu yuzhih...@gmail.com wrote: To add to What Petar said, when YARN log aggregation is enabled, consider specifying yarn.nodemanager.remote-app-log-dir which is where aggregated logs are saved.

Re: Sort based shuffle not working properly?

2015-02-03 Thread Nitin kak
bother to also sort them within each partition On Tue, Feb 3, 2015 at 5:41 PM, Nitin kak nitinkak...@gmail.com wrote: I thought that's what sort-based shuffle did: sort the keys going to the same partition. I have tried (c1, c2) as an (Int, Int) tuple as well. I don't think that the ordering of c2

Re: Sort based shuffle not working properly?

2015-02-03 Thread Nitin kak
I thought that's what sort-based shuffle did: sort the keys going to the same partition. I have tried (c1, c2) as an (Int, Int) tuple as well. I don't think the ordering of the c2 type is the problem here. On Tue, Feb 3, 2015 at 5:21 PM, Sean Owen so...@cloudera.com wrote: Hm, I don't think the sort
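
The usual way to get keys sorted within each partition (which I believe is where this thread ends up) is repartitionAndSortWithinPartitions; a sketch with a hypothetical (c1, c2) composite key:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("secondary-sort-demo").getOrCreate()
    sc = spark.sparkContext

    pairs = sc.parallelize([((1, 5), "a"), ((1, 2), "b"),
                            ((2, 9), "c"), ((2, 1), "d")])

    # Partition by c1 only, but sort each partition by the full (c1, c2)
    # key; a plain sort-based shuffle routes keys to partitions without
    # guaranteeing they arrive sorted.
    result = pairs.repartitionAndSortWithinPartitions(
        numPartitions=2,
        partitionFunc=lambda key: key[0] % 2)

    print(result.glom().collect())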

Re: Running beyond memory limits in ConnectedComponents

2015-01-15 Thread Nitin kak
memory asked by Spark to approximately 22G. On Thu, Jan 15, 2015 at 12:54 PM, Nitin kak nitinkak...@gmail.com wrote: Is this overhead memory allocation used for any specific purpose? For example, will it be any different if I do *--executor-memory 22G* with overhead set to 0% (hypothetically) vs

Re: Running beyond memory limits in ConnectedComponents

2015-01-15 Thread Nitin kak
I am sorry for the formatting error, the value for *yarn.scheduler.maximum-allocation-mb = 28G* On Thu, Jan 15, 2015 at 11:31 AM, Nitin kak nitinkak...@gmail.com wrote: Thanks for sticking to this thread. I am guessing what memory my app requests and what Yarn requests on my part should

Re: Running beyond memory limits in ConnectedComponents

2015-01-15 Thread Nitin kak
20G, or about 1.4G. You might set this higher to 2G to give more overhead. See the --conf property=value syntax documented in http://spark.apache.org/docs/latest/submitting-applications.html On Thu, Jan 15, 2015 at 3:47 AM, Nitin kak nitinkak...@gmail.com wrote: Thanks Sean. I guess
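
The arithmetic being referenced, sketched under the assumption that this is the Spark 1.x default of max(384 MB, 7% of executor memory) for spark.yarn.executor.memoryOverhead:

    # Assumed Spark 1.x default overhead rule (later versions use 10%).
    def default_overhead_mb(executor_memory_mb: int) -> int:
        return max(384, int(executor_memory_mb * 0.07))

    print(default_overhead_mb(20 * 1024))  # 20G executor -> ~1433 MB (~1.4G)
    # The YARN container request is executor memory + overhead, so a 20G
    # executor with a 2G overhead asks for ~22G, which must stay under
    # yarn.scheduler.maximum-allocation-mb (28G in this thread).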

Re: Running beyond memory limits in ConnectedComponents

2015-01-14 Thread Nitin kak
Thanks Sean. I guess Cloudera Manager has the parameters executor_total_max_heapsize and worker_max_heapsize, which point to the parameters you mentioned above. How much should that cushion between the JVM heap size and the YARN memory limit be? I tried setting JVM memory to 20g and YARN to 24g, but it

Re: Spark 1.2 Release Date

2014-12-18 Thread nitin
Soon enough :) http://apache-spark-developers-list.1001551.n3.nabble.com/RESULT-VOTE-Release-Apache-Spark-1-2-0-RC2-td9815.html

Re: spark-sql with join terribly slow.

2014-12-17 Thread nitin
and could prevent the shuffle by passing the partition information to in-memory caching. See - https://issues.apache.org/jira/browse/SPARK-4849 Thanks -Nitin

Re: SchemaRDD partition on specific column values?

2014-12-15 Thread Nitin Goyal
Hi Michael, I have opened following JIRA for the same :- https://issues.apache.org/jira/browse/SPARK-4849 I am having a look at the code to see what can be done and then we can have a discussion over the approach. Let me know if you have any comments/suggestions. Thanks -Nitin On Sun, Dec 14

Re: SchemaRDD partition on specific column values?

2014-12-11 Thread nitin
Can we take this up as a performance improvement task in Spark 1.2.1? I can help contribute to this.

Re: registerTempTable: Table not found

2014-12-09 Thread nitin
Looks like this issue has been fixed very recently and should be available in the next RC: http://apache-spark-developers-list.1001551.n3.nabble.com/CREATE-TABLE-AS-SELECT-does-not-work-with-temp-tables-in-1-2-0-td9662.html

Re: PhysicalRDD problem?

2014-12-09 Thread Nitin Goyal
constructor argument as configurable/parameterized (also written as TODO). Do we have a plan to do this in 1.2 release? Or I can take this up as a task for myself if you want (since this is very crucial for our release). Thanks -Nitin On Wed, Dec 10, 2014 at 1:06 AM, Michael Armbrust mich

Re: PhysicalRDD problem?

2014-12-09 Thread Nitin Goyal
I see that somebody had already raised a PR for this but it hasn't been merged. https://issues.apache.org/jira/browse/SPARK-4339 Can we merge this in next 1.2 RC? Thanks -Nitin On Wed, Dec 10, 2014 at 11:50 AM, Nitin Goyal nitin2go...@gmail.com wrote: Hi Michael, I think I have found

PhysicalRDD problem?

2014-12-08 Thread nitin
RDD by applying the schema again and using the existing schema RDD further (in case of simple queries), but then for complex queries I get TreeNodeException (Unresolved Attributes), as I mentioned. Let me know if you need any more info around my problem. Thanks in advance -Nitin

SchemaRDD partition on specific column values?

2014-12-04 Thread nitin
Hi All, I want to hash-partition (and then cache) a schema RDD in such a way that partitions are based on the hash of the values of a column (the ID column in my case). E.g. if my table has an ID column with values 1,2,3,4,5,6,7,8,9 and spark.sql.shuffle.partitions is configured as 3, then there should be 3
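
On later APIs the closest built-in is DataFrame.repartition with a column argument, which hash-partitions rows by that column's value; a sketch using the thread's example numbers:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import spark_partition_id

    spark = SparkSession.builder.appName("hash-partition-demo").getOrCreate()

    df = spark.createDataFrame([(i,) for i in range(1, 10)], ["ID"])

    # Hash-partition into 3 partitions on ID; equal IDs always land in
    # the same partition, so later ID-keyed work can reuse the layout.
    parts = df.repartition(3, "ID").cache()
    parts.withColumn("pid", spark_partition_id()).show()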

Re: SchemaRDD partition on specific column values?

2014-12-04 Thread nitin
With some quick googling, I learnt that we can provide distribute by column_name in Hive QL to distribute data based on a column's values. My question now is: if I use distribute by id, will there be any performance improvements? Will I be able to avoid data movement in the shuffle (Exchange before
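
A sketch of the SQL-side mechanics (hypothetical table name): DISTRIBUTE BY inserts a hash Exchange on the named column, and CLUSTER BY additionally sorts within each partition.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("distribute-by-demo").getOrCreate()

    spark.range(100).selectExpr("id % 10 AS id", "id AS value") \
         .createOrReplaceTempView("t")

    # explain() shows Exchange hashpartitioning(id, ...) in the plan.
    spark.sql("SELECT id, value FROM t DISTRIBUTE BY id").explain(True)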

Re: Is Spark 1.1.0 incompatible with Hive?

2014-10-27 Thread Nitin kak
Yes, I added all the Hive jars present in the Cloudera distribution of Hadoop. I added them because I was getting ClassNotFoundException for many required classes (one example stack trace below). So, someone in the community suggested including the hive jars: *Exception in thread main

Re: Is Spark 1.1.0 incompatible with Hive?

2014-10-27 Thread Nitin kak
is to deploy the plain Apache version of Spark on CDH Yarn. On Mon, Oct 27, 2014 at 11:10 AM, Nitin kak nitinkak...@gmail.com wrote: Yes, I added all the Hive jars present in the Cloudera distribution of Hadoop. I added them because I was getting ClassNotFoundException for many required classes (one

Re: Is Spark 1.1.0 incompatible with Hive?

2014-10-27 Thread Nitin kak
It somehow worked by placing all the jars (except guava) in hive lib after --jars. I had initially tried to place the jars under another temporary folder and point the executor and driver extraClassPath to that directory, but it didn't work. On Mon, Oct 27, 2014 at 2:21 PM, Nitin kak nitinkak