Re: Spark Streaming Small files in Hive

2017-10-29 Thread Siva Gudavalli
Hello Asmath, We had a similar challenge recently. When you write back to Hive, you are creating files on HDFS, and the number of files depends on your batch window. If you increase your batch window, let's say from 1 min to 5 mins, you will end up creating 5x fewer files. The other factor is your partitioning.
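
As an illustration (not from the original thread), a minimal sketch of capping the number of files written per micro-batch by coalescing before the insert; the session setup, table name, and partition count are assumptions:

    import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

    val spark = SparkSession.builder()
      .appName("streaming-to-hive")
      .enableHiveSupport()
      .getOrCreate()

    // batchDf stands for the DataFrame produced from one streaming batch
    def writeBatch(batchDf: DataFrame): Unit = {
      batchDf
        .coalesce(4)                     // fewer write tasks -> fewer files per batch
        .write
        .mode(SaveMode.Append)
        .insertInto("default.hlogs")     // hypothetical partitioned Hive table
    }

A longer batch window or a periodic compaction job reduces the file count further, at the cost of higher latency.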

Re: Orc predicate pushdown with Spark Sql

2017-10-27 Thread Siva Gudavalli
t it reads the file, but it should not read all the content, which is probably also not happening. On 24. Oct 2017, at 18:16, Siva Gudavalli <gudavalli.s...@yahoo.com.INVALID> wrote: Hello,

Re: Orc predicate pushdown with Spark Sql

2017-10-24 Thread Siva Gudavalli
92 DESC], output=[id#192])
+- ConvertToSafe
   +- Project [id#192]
      +- Filter (usr#199 = AA0YP)
         +- HiveTableScan [id#192,usr#199], MetastoreRelation default, hlogsv5, None, [(cdt#189 = 20171003),(usrpartkey#191 = hhhUsers)]

Please let me know if I am missing anything here. Thank you. On Monday,
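
The plan above shows the usr filter evaluated on top of HiveTableScan rather than pushed into the ORC reader. As a hedged sketch (not from the thread itself), two session settings commonly checked in this situation are spark.sql.orc.filterPushdown and spark.sql.hive.convertMetastoreOrc; the table and column names below mirror the plan but are otherwise illustrative:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("orc-predicate-pushdown")
      .enableHiveSupport()
      .config("spark.sql.orc.filterPushdown", "true")       // push filters into the ORC reader
      .config("spark.sql.hive.convertMetastoreOrc", "true")  // read Hive ORC tables through the data source path
      .getOrCreate()

    val ids = spark.sql(
      """SELECT id FROM hlogsv5
        |WHERE cdt = 20171003 AND usrpartkey = 'hhhUsers' AND usr = 'AA0YP'
        |ORDER BY id DESC""".stripMargin)
    ids.explain(true)   // check whether the usr predicate now shows up as a pushed filter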

Orc predicate pushdown with Spark Sql

2017-10-23 Thread Siva Gudavalli
Hello, I am working with Spark SQL to query a Hive managed table (in ORC format). I have my data organized by partitions and was asked to set indexes for every 50,000 rows by setting ('orc.row.index.stride'='50000'). Let's say, after evaluating the partition, there are around 50 files in which data is
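
For context, an illustrative sketch of the kind of table definition being described, issued through Spark SQL; the table, columns, and partition keys are assumptions, only the orc.row.index.stride property comes from the message:

    // run against a SparkSession built with .enableHiveSupport()
    spark.sql(
      """CREATE TABLE IF NOT EXISTS hlogsv5 (id INT, usr STRING)
        |PARTITIONED BY (cdt INT, usrpartkey STRING)
        |STORED AS ORC
        |TBLPROPERTIES ('orc.row.index.stride'='50000')""".stripMargin)

The row index stride controls how many rows each ORC row-group index entry covers, which is what lets predicate pushdown skip row groups inside a file.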

Partition and Sort by together

2017-10-12 Thread Siva Gudavalli
Hello, I have my data stored in Parquet file format. My data is already partitioned by date and key. Now I want the data in each file to be sorted by a new code column. The layout looks like:

date1
  -> key1
    -> paqfile1
    -> paqfile2
  -> key2
    -> paqfile1
    -> paqfile2
date2
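
A minimal sketch of one way to produce this layout with sorted files, assuming the partition columns are named date and key and the new column is code (the column names and input/output paths are placeholders, not from the original message):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("sorted-parquet").getOrCreate()
    import spark.implicits._

    val df = spark.read.parquet("/data/in")

    df.repartition($"date", $"key")        // group each (date, key) into the same tasks
      .sortWithinPartitions($"code")       // sort rows inside each output file by code
      .write
      .partitionBy("date", "key")          // keep the existing directory layout
      .parquet("/data/out")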

Re: how to deploy new code with checkpointing

2016-04-11 Thread Siva Gudavalli
e changes will break Java serialization. On Mon, Apr 11, 2016 at 4:30 PM, Siva Gudavalli <gss.su...@gmail.com> wrote: hello, i am writing a spark streaming application to read data from kafka. I am using no receiver approach and

how to deploy new code with checkpointing

2016-04-11 Thread Siva Gudavalli
Hello, I am writing a Spark Streaming application to read data from Kafka. I am using the no-receiver (direct) approach and have enabled checkpointing to make sure I am not reading messages again in case of failure (exactly-once semantics). I have a quick question: how does checkpointing need to be configured to
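
A hedged sketch of the usual checkpoint-recovery pattern for the direct (no-receiver) Kafka approach in the Spark 1.x streaming API; the checkpoint path, brokers, topic, and processing logic are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val checkpointDir = "hdfs:///user/app/checkpoints"   // placeholder path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("kafka-direct-checkpointing")
      val ssc = new StreamingContext(conf, Seconds(60))
      ssc.checkpoint(checkpointDir)
      val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, Map("metadata.broker.list" -> "broker:9092"), Set("events"))
      stream.foreachRDD(rdd => println(rdd.count()))     // placeholder processing
      ssc
    }

    // Recover from the checkpoint if one exists, otherwise build a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()

As the reply above notes, recovery from a checkpoint relies on Java serialization of the DStream graph, so deploying changed application code generally means discarding the old checkpoint (or tracking Kafka offsets yourself, for example in ZooKeeper or a database).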

Spark sql-1.4.1 DataFrameWrite.jdbc() SaveMode.Append

2015-11-24 Thread Siva Gudavalli
Ref: https://issues.apache.org/jira/browse/SPARK-11953 In Spark 1.3.1 we have two methods, createJDBCTable and insertIntoJDBC. They are replaced with write.jdbc() in Spark 1.4.1. createJDBCTable allows performing CREATE TABLE, i.e. DDL on the table, followed by INSERT (DML); insertIntoJDBC
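
For reference, a sketch of the two styles side by side; the URL, table name, and DataFrame df are placeholders, and the 1.3-style calls shown in comments are the deprecated DataFrame methods referred to above:

    import java.util.Properties
    import org.apache.spark.sql.SaveMode

    val url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"   // placeholder JDBC URL

    // Spark 1.3.x style (deprecated in 1.4):
    // df.createJDBCTable(url, "MY_TABLE", allowExisting = false)  // CREATE TABLE then INSERT
    // df.insertIntoJDBC(url, "MY_TABLE", overwrite = false)       // INSERT into an existing table

    // Spark 1.4.x style:
    val props = new Properties()
    props.setProperty("user", "app_user")
    props.setProperty("password", "app_password")
    df.write.mode(SaveMode.Append).jdbc(url, "MY_TABLE", props)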

spark 1.4.1 to oracle 11g write to an existing table

2015-11-23 Thread Siva Gudavalli
Hi, I am trying to write a DataFrame from Spark 1.4.1 to Oracle 11g. I am using dataframe.write.mode(SaveMode.Append).jdbc(url, tablename, properties). This always tries to create a table; I would like to insert records into an existing table instead of creating a new one each time.
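
A minimal sketch of the append call in question against Oracle, with the JDBC driver passed explicitly; the URL, credentials, and table name are placeholders. Whether Spark 1.4.1 still attempts a CREATE TABLE depends on its internal table-existence check, which is the behavior described in this message:

    import java.util.Properties
    import org.apache.spark.sql.SaveMode

    val props = new Properties()
    props.setProperty("user", "app_user")            // placeholder credentials
    props.setProperty("password", "app_password")
    props.setProperty("driver", "oracle.jdbc.OracleDriver")

    df.write
      .mode(SaveMode.Append)
      .jdbc("jdbc:oracle:thin:@//dbhost:1521/ORCL", "EXISTING_TABLE", props)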