== Analyzed Logical Plan ==
index: string, 0: string
Relation [index#50,0#51] csv
== Optimized Logical Plan ==
Relation [index#50,0#51] csv
== Physical Plan ==
FileScan csv [index#50,0#51] Batched: false, DataFilters: [], Format:
CSV, Location: InMemoryFileIndex(1
paths)[file:/home/nitin/work/df1.cs
bodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On Sun, 7 May 2023
ain in the second run. You can
> also confirm it in other metrics from Spark UI.
>
> That is my personal understanding based on what I have read and seen on my
> job runs. If there is any mistake, feel free to correct me.
>
> Thank You & Best Regards
> Winston Lai
> --
/dfr_key_int],
PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema:
struct
--
Regards,
Nitin
FileScan parquet [a#24,b#25,c#26] Batched: true,
DataFilters: [isnotnull(a#24)], Format: Parquet, Location:
InMemoryFileIndex(1
paths)[file:/home/nitin/pymonsoon/bucket_test_parquet1],
PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema:
struct
+- Sort [a#33 ASC NULLS FIRST], false
I understand pandas UDFs as follows:
1. There are multiple partitions per worker.
2. Multiple Arrow batches are converted per partition.
3. These are sent to the Python process.
4. In the case of Series to Series, the pandas UDF is applied to each Arrow
batch one after the other? **(So, is it that (a) - The
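The per-batch application described in step 4 can be sketched in plain pandas (no Spark involved; plus_one and the batch values are made-up examples). A Series-to-Series function is called once per Arrow batch, and the results are stitched back together:

```python
import pandas as pd

# Hypothetical Series-to-Series function, standing in for a pandas UDF body
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

# Simulate one partition that arrived as two Arrow batches
batches = [pd.Series([1, 2, 3]), pd.Series([4, 5])]

# The function is applied to each batch in turn; outputs are concatenated
result = pd.concat([plus_one(b) for b in batches], ignore_index=True)
print(result.tolist())  # [2, 3, 4, 5, 6]
```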
Hi Deepak,
Please let us know how you managed it.
Thanks,
NJ
On Mon, Jun 10, 2019 at 4:42 PM Deepak Sharma wrote:
> Thanks All.
> I managed to get this working.
> Marking this thread as closed.
>
> On Mon, Jun 10, 2019 at 4:14 PM Deepak Sharma
> wrote:
>
>> This is the project requirement
away from SAS
due to the cost, it would be really good to have these algorithms in Spark
ML.
Let me know if you need any more info; I can share some snippets if
required.
Thanks,
Nitin
On Thu, Sep 8, 2016 at 2:08 PM, Robin East <robin.e...@xense.co.uk> wrote:
> Do you have any particul
others can concur we can go ahead and report it as a bug.
Regards,
Nitin
On Mon, Aug 22, 2016 at 4:15 PM, Furcy Pin <furcy@flaminem.com> wrote:
> Hi Nitin,
>
> I confirm that there is something odd here.
>
> I did the following test :
>
> create table test_orc (id
rmats as
well (textFile etc.)
Is this because of the different naming conventions used by Hive and Spark
to write records to HDFS? Or maybe it is not a recommended practice to
write tables using different services?
Your thoughts and comments on this matter would be highly appreciated!
Thanks!
Nitin
Hi Akhil,
I don't have HADOOP_HOME or HADOOP_CONF_DIR set, or even winutils.exe. What
configuration is required for this, and from where can I get winutils.exe?
Thanks and Regards,
Nitin Kalra
On Tue, Jul 21, 2015 at 1:30 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Do you have HADOOP_HOME
Hi Marcelo,
The issue does not happen while connecting to the Hive metastore; that works
fine. It seems that HiveContext only uses the Hive CLI to execute the
queries, while HiveServer2 does not support it. I don't think you can
specify any configuration in hive-site.xml which can make it connect to
Any response to this guys?
On Fri, Jun 19, 2015 at 2:34 PM, Nitin kak nitinkak...@gmail.com wrote:
Any other suggestions guys?
On Wed, Jun 17, 2015 at 7:54 PM, Nitin kak nitinkak...@gmail.com wrote:
With Sentry, only hive user has the permission for read/write/execute on
the subdirectories of warehouse. All the users get translated to hive
when interacting with hiveserver2. But i think
I am trying to run a hive query from Spark code using HiveContext object.
It was running fine earlier, but since Apache Sentry has been installed the
process is failing with this exception:
*org.apache.hadoop.security.AccessControlException: Permission denied:
user=kakn,
:
Try to grant read execute access through sentry.
On 18 Jun 2015 05:47, Nitin kak nitinkak...@gmail.com wrote:
I am trying to run a hive query from Spark code using HiveContext object.
It was running fine earlier but since the Apache Sentry
That is a much better solution than how I resolved it. I got around it by
placing comma-separated jar paths for all the Hive-related jars in the
--jars clause.
I will try your solution. Thanks for sharing it.
On Tue, May 26, 2015 at 4:14 AM, Mohammad Islam misla...@yahoo.com wrote:
I got a similar
Shuffle write will be cleaned if it is not referenced by any object directly
or indirectly. There is a garbage collector inside Spark which periodically
checks for weak references to RDDs, shuffle writes, and broadcasts, and
deletes them.
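That weak-reference-driven cleanup can be illustrated in plain Python. This is a conceptual sketch, not Spark's actual ContextCleaner; RDDHandle and the "shuffle_0" label are invented for the example:

```python
import gc
import weakref

class RDDHandle:  # hypothetical stand-in for a driver-side RDD reference
    pass

cleaned = []
rdd = RDDHandle()
# Register a cleanup action that runs once the handle is unreachable
weakref.finalize(rdd, cleaned.append, "shuffle_0")
del rdd       # drop the last strong reference
gc.collect()  # in Spark, a periodic GC-backed check plays this role
print(cleaned)  # ['shuffle_0']
```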
AFAIK, this is the expected behavior. You have to make sure that the schema
matches the row. It won't give any error when you apply the schema, as it
doesn't validate the nature of the data.
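A minimal plain-Python illustration of that point (not Spark's API; the schema and rows are invented): "applying" a schema does no checking, and the mismatch only surfaces when the data is actually processed:

```python
# Schema as (column name, expected type); second row violates it
schema = [("id", int), ("name", str)]
rows = [(1, "a"), ("oops", "b")]

# "Applying" the schema: nothing is checked yet, so no error here
table = (schema, rows)

# Validation only happens when the rows are actually consumed
errors = [r for r in rows
          if not all(isinstance(v, t) for v, (_, t) in zip(r, schema))]
print(errors)  # [('oops', 'b')]
```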
I was able to resolve this use case (thanks, Cheng Lian) where I wanted to
launch executors on just the specific partition while also getting the batch
pruning optimisations of Spark SQL, by doing the following:
val query = sql("SELECT * FROM cachedTable WHERE key = 1")
val plannedRDD =
cached twice.
My question is: can we create a PartitionPruningCachedSchemaRDD-like class
which can prune the partitions of InMemoryColumnarTableScan's
RDD[CachedBatch] and launch executors on just the selected partition(s)?
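The pruning idea can be sketched conceptually in plain Python (this is not Spark's internal API; the partition layout and the prune helper are invented for illustration). Given cached partitions laid out by hash of the key, pruning keeps only the partition that could contain the requested key:

```python
num_partitions = 3
# partition index -> keys cached in it, consistent with hash partitioning
cached = {0: [3, 6, 9], 1: [1, 4, 7], 2: [2, 5, 8]}

def prune(batches, wanted_key):
    # Only one partition can hold the key under hash partitioning
    target = hash(wanted_key) % num_partitions
    return {p: keys for p, keys in batches.items() if p == target}

# Work would then be launched on this single partition, not all three
print(prune(cached, 1))  # {1: [1, 4, 7]}
```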
Thanks
-Nitin
Are you running in yarn-cluster or yarn-client mode?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Driver-Host-under-Yarn-tp21536p21556.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Have you checked the corresponding executor logs as well? I think the
information you have provided here is too little to actually understand your
issue.
The YARN log aggregation is enabled, and the logs which I get through yarn
logs -applicationId your_application_id
are no different from what I get through the logs in the YARN Application
tracking URL. They still don't have the above logs.
On Fri, Feb 6, 2015 at 3:36 PM, Petar Zecevic
yarn.nodemanager.remote-app-log-dir is set to /tmp/logs
On Fri, Feb 6, 2015 at 4:14 PM, Ted Yu yuzhih...@gmail.com wrote:
To add to what Petar said, when YARN log aggregation is enabled, consider
specifying yarn.nodemanager.remote-app-log-dir which is where aggregated
logs are saved.
bother to also sort them within each
partition
On Tue, Feb 3, 2015 at 5:41 PM, Nitin kak nitinkak...@gmail.com wrote:
I thought that's what sort-based shuffle did: sort the keys going to the
same partition.
I have tried (c1, c2) as an (Int, Int) tuple as well. I don't think that the
ordering of the c2 type is the problem here.
On Tue, Feb 3, 2015 at 5:21 PM, Sean Owen so...@cloudera.com wrote:
Hm, I don't think the sort
memory asked by Spark to approximately 22G.
On Thu, Jan 15, 2015 at 12:54 PM, Nitin kak nitinkak...@gmail.com wrote:
Is this overhead memory allocation used for any specific purpose?
For example, will it be any different if I do *--executor-memory 22G* with
overhead set to 0% (hypothetically) vs
I am sorry for the formatting error, the value for
*yarn.scheduler.maximum-allocation-mb
= 28G*
On Thu, Jan 15, 2015 at 11:31 AM, Nitin kak nitinkak...@gmail.com wrote:
Thanks for sticking to this thread.
I am guessing what memory my app requests and what Yarn requests on my
part should
20G or
about 1.4G. You might set this higher to 2G to give more overhead.
See the --conf property=value syntax documented in
http://spark.apache.org/docs/latest/submitting-applications.html
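The figures quoted above ("20G or about 1.4G") match the old Spark-on-YARN overhead heuristic of roughly 7% of executor memory with a 384 MB floor; treat the exact factor and floor as assumptions here, not a definitive formula. A quick arithmetic sketch:

```python
# Hedged sketch of the assumed YARN executor memory-overhead heuristic:
# max(384 MB, 7% of executor memory), matching the "20G -> ~1.4G" figure.
def overhead_mb(executor_mem_mb, factor=0.07, floor_mb=384):
    return max(floor_mb, int(executor_mem_mb * factor))

print(overhead_mb(20 * 1024))  # 1433 (about 1.4G for a 20G executor)
print(overhead_mb(1024))       # 384  (small executors hit the floor)
```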
On Thu, Jan 15, 2015 at 3:47 AM, Nitin kak nitinkak...@gmail.com wrote:
Thanks Sean.
I guess
Thanks Sean.
I guess Cloudera Manager has parameters executor_total_max_heapsize
and worker_max_heapsize
which point to the parameters you mentioned above.
How much should that cushion between the JVM heap size and the YARN memory
limit be?
I tried setting the JVM memory to 20g and YARN to 24g, but it
Soon enough :)
http://apache-spark-developers-list.1001551.n3.nabble.com/RESULT-VOTE-Release-Apache-Spark-1-2-0-RC2-td9815.html
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-2-Release-Date-tp20765p20766.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
and could prevent the shuffle by passing
the partition information to in-memory caching.
See - https://issues.apache.org/jira/browse/SPARK-4849
Thanks
-Nitin
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-with-join-terribly-slow-tp20751p20756.html
Hi Michael,
I have opened following JIRA for the same :-
https://issues.apache.org/jira/browse/SPARK-4849
I am having a look at the code to see what can be done and then we can have
a discussion over the approach.
Let me know if you have any comments/suggestions.
Thanks
-Nitin
On Sun, Dec 14
Can we take this up as a performance improvement task in Spark 1.2.1? I can
help contribute to this.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-tp20350p20623.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Looks like this issue has been fixed very recently and should be available in
next RC :-
http://apache-spark-developers-list.1001551.n3.nabble.com/CREATE-TABLE-AS-SELECT-does-not-work-with-temp-tables-in-1-2-0-td9662.html
constructor argument as
configurable/parameterized (also marked as a TODO). Do we have a plan to do
this in the 1.2 release? Or I can take this up as a task myself if you want
(since this is very crucial for our release).
Thanks
-Nitin
On Wed, Dec 10, 2014 at 1:06 AM, Michael Armbrust mich
I see that somebody had already raised a PR for this but it hasn't been
merged.
https://issues.apache.org/jira/browse/SPARK-4339
Can we merge this in next 1.2 RC?
Thanks
-Nitin
On Wed, Dec 10, 2014 at 11:50 AM, Nitin Goyal nitin2go...@gmail.com wrote:
Hi Michael,
I think I have found
RDD by
applying the schema again and using the existing schema RDD further (in the
case of simple queries), but then for complex queries I get a
TreeNodeException (Unresolved Attributes) as I mentioned.
Let me know if you need any more info around my problem.
Thanks in Advance
-Nitin
Hi All,
I want to hash partition (and then cache) a schema RDD in such a way that
partitions are based on the hash of the values of a column (the ID column in
my case).
E.g. if my table has an ID column with values 1,2,3,4,5,6,7,8,9 and
spark.sql.shuffle.partitions is configured as 3, then there should be 3
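The intended layout can be sketched in plain Python. This is a toy model, not Spark's actual partitioner (and CPython's hash of a small int is the identity, which keeps the example readable):

```python
# Hash-partition ID values 1..9 into 3 partitions, mimicking what
# spark.sql.shuffle.partitions = 3 would produce for a hash on the ID column
num_partitions = 3
partitions = {p: [] for p in range(num_partitions)}
for id_value in range(1, 10):
    partitions[hash(id_value) % num_partitions].append(id_value)

print(partitions)  # {0: [3, 6, 9], 1: [1, 4, 7], 2: [2, 5, 8]}
```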
With some quick googling, I learnt that we can provide distribute by
column_name in Hive QL to distribute data based on a column's values. My
question now is: if I use distribute by id, will there be any performance
improvements? Will I be able to avoid data movement in the shuffle (Exchange
before
Yes, I added all the Hive jars present in Cloudera distribution of Hadoop.
I added them because I was getting ClassNotFoundException for many required
classes (one example stack trace below). So, someone in the community
suggested including the Hive jars:
*Exception in thread main
is
to deploy the plain Apache version of Spark on CDH Yarn.
On Mon, Oct 27, 2014 at 11:10 AM, Nitin kak nitinkak...@gmail.com wrote:
Yes, I added all the Hive jars present in Cloudera distribution of
Hadoop. I added them because I was getting ClassNotFoundException for many
required classes(one
Somehow it worked by placing all the jars (except guava) in the Hive lib
after --jars. I had initially tried to place the jars under another
temporary folder and pointing the executor and driver extraClassPath to that
directory, but it didn't work.
On Mon, Oct 27, 2014 at 2:21 PM, Nitin kak nitinkak