Spark Converting dataframe to Rdd reduces partitions

2017-01-02 Thread manish jaiswal
Hi,

I am running into an issue when converting a DataFrame to an RDD: the conversion reduces the number of partitions.


In our code, the DataFrame was created as:

DataFrame DF = hiveContext.sql("select * from table_instance");

When I convert the DataFrame to an RDD and check its number of partitions:

RDD<Row> newRDD = DF.rdd();
System.out.println(newRDD.getNumPartitions());

it reduces the number of partitions to 1 (1 is printed in the console).
Originally my DataFrame had 102 partitions. Is there a reason for this? Please
let me know if my understanding is wrong here.
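
A minimal sketch, assuming the Spark 1.6 Java API, of how one might inspect the
partition count of the converted RDD and repartition the DataFrame if more
parallelism is needed; the table name and the value 200 are illustrative, and
hiveContext is assumed to already exist as in the snippet above.

import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

// Read the table and print the partition count reported by the underlying RDD.
DataFrame df = hiveContext.sql("select * from table_instance");
RDD<Row> converted = df.rdd();
System.out.println("partitions after conversion: " + converted.getNumPartitions());

// If the partition count is lower than expected, repartitioning the DataFrame
// restores parallelism; 200 is an arbitrary illustrative value.
DataFrame repartitioned = df.repartition(200);
System.out.println("partitions after repartition: " + repartitioned.rdd().getNumPartitions());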


Re: SPARK SQL READING FROM HIVE

2016-08-08 Thread manish jaiswal
Correct, it is creating delta files in HDFS, but after compaction it merges all
the data and creates an extra directory where all the bucketed data is present.
(I am able to read the data from Hive but not from Spark SQL.)


Re: SPARK SQL READING FROM HIVE

2016-08-08 Thread manish jaiswal
I am using Spark 1.6.0 and Hive 1.2.1.

Is reading from a Hive transactional table not supported yet by Spark SQL?

On Tue, Aug 9, 2016 at 12:18 AM, manish jaiswal 
wrote:

> Hi,
>
> I am not able to read data from a Hive transactional table using Spark SQL.
> (I don't want to read it via Hive JDBC.)
>
>
>
> Please help.
>


SPARK SQL READING FROM HIVE

2016-08-08 Thread manish jaiswal
Hi,

I am not able to read data from a Hive transactional table using Spark SQL.
(I don't want to read it via Hive JDBC.)



Please help.
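
For reference, a minimal sketch, assuming Spark 1.6 with a Hive metastore
configured via hive-site.xml, of the kind of read being attempted; the table
name is illustrative and sc is an existing SparkContext.

import org.apache.spark.SparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

// The HiveContext picks up the metastore configuration from hive-site.xml
// on the classpath.
HiveContext hiveContext = new HiveContext(sc);

// Attempt to read the transactional (ACID) table directly through Spark SQL.
DataFrame df = hiveContext.sql("select * from acid_table");
df.show();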


Spark Job trigger in production

2016-07-18 Thread manish jaiswal
Hi,


What is the best approach to trigger a Spark job in a production cluster?


HiveContext

2016-07-01 Thread manish jaiswal
Hi,

Using the Spark HiveContext, we read all rows where age was between 0 and 100,
even though we requested only rows where age was less than 15. Such a full
table scan is an expensive operation.

ORC avoids this type of overhead by using predicate push-down with three
levels of built-in indexes within each file: file level, stripe level, and
row level:

   - File and stripe level statistics are in the file footer, making it easy
     to determine if the rest of the file needs to be read.
   - Row level indexes include column statistics for each row group and
     position, for seeking to the start of the row group.

ORC utilizes these indexes to move the filter operation to the data loading
phase, by reading only data that potentially includes required rows.


My doubt is: when we submit a query against an ORC table through hiveContext in
Spark, with

sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

how will it perform?

1. Will it fetch from the ORC files only those records that match the query, or

2. will it load the ORC files into Spark and then run a Spark job that applies
   the predicate push-down and returns the records?

(I am aware that hiveContext gives Spark only the metadata and the location of the data.)
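
A minimal sketch, assuming the Spark 1.6 Java API, of enabling ORC filter
push-down and inspecting the physical plan to see how the filter is handled;
the table and column names are illustrative, and hiveContext is assumed to
already exist.

import org.apache.spark.sql.DataFrame;

// Enable ORC predicate push-down before running the query.
hiveContext.setConf("spark.sql.orc.filterPushdown", "true");

// Filtered query over an ORC-backed table.
DataFrame young = hiveContext.sql("select * from people_orc where age < 15");

// The extended plan shows whether the predicate is pushed into the ORC scan
// or applied only as a Spark-side Filter operator.
young.explain(true);
young.show();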


Thanks

Manish


HiveContext

2016-06-30 Thread manish jaiswal
-- Forwarded message --
From: "manish jaiswal" 
Date: Jun 30, 2016 17:35
Subject: HiveContext
To: , , <
user-h...@spark.apache.org>
Cc:

Hi,


I am new to Spark. I found that using HiveContext we can connect to Hive and
run HiveQL queries. I ran it and it worked.

My doubt is: when we use hiveContext and run a Hive query like (select distinct
column from table),

how will it perform? Will it pull all the data stored in HDFS into the Spark
engine (memory) and perform the (select distinct column from table) itself, or
will it hand the query to Hive and get the result back from Hive?
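
A minimal sketch, assuming the Spark 1.6 Java API, of the kind of query being
described; the application name, table, and column names are illustrative. Hive
is consulted only for table metadata and data location, and the query itself
runs as Spark stages.

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

SparkConf conf = new SparkConf().setAppName("HiveContextExample");
SparkContext sc = new SparkContext(conf);

// HiveContext reads hive-site.xml from the classpath to locate the metastore.
HiveContext hiveContext = new HiveContext(sc);

// The metastore supplies the schema and HDFS location of my_table; the scan
// and the distinct aggregation are executed by Spark, not by Hive.
DataFrame distinctValues = hiveContext.sql("select distinct column1 from my_table");
distinctValues.show();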



Thanks