Re: Spark Streaming with Files

2021-04-30 Thread muru
Yes, set trigger(once=True) on all your streaming queries and Spark will treat
each run as a batch job. You can then use any scheduler (e.g. Airflow) to run
it on whatever time window you need. With checkpointing enabled, the next run
will pick up processing files from the last checkpoint.
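
A minimal PySpark sketch of that run-once pattern (the source path, schema and
checkpoint location below are placeholders, not taken from this thread):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("runOnceFileStream").getOrCreate()

# File sources need an explicit schema; the landing folder is a placeholder.
df = (spark.readStream
      .schema("id INT, payload STRING, event_ts TIMESTAMP")
      .json("s3://my-bucket/landing/"))

# trigger(once=True) drains whatever is new since the last checkpoint and then
# stops, so the external scheduler (e.g. Airflow) decides how often it runs.
query = (df.writeStream
         .format("parquet")
         .option("path", "s3://my-bucket/output/")
         .option("checkpointLocation", "s3://my-bucket/checkpoints/landing_ingest/")
         .trigger(once=True)
         .start())

query.awaitTermination()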

On Fri, Apr 23, 2021 at 8:13 AM Mich Talebzadeh 
wrote:

> Interesting.
>
> If we go back to the classic on-premise Lambda architecture, you could use the
> Flume API with Kafka to add files to HDFS on a time-series basis.
>
> Most of the major CDC vendors do exactly that. Oracle GoldenGate (OGG) classic
> gets data from the Oracle redo logs and sends it to subscribers. One can
> deploy OGG for Big Data to enable these files to be read and processed for
> Kafka, Hive, HDFS etc.
>
> So let us assume that we create these files and stream them onto object
> storage in the cloud. Then we can use Spark Structured Streaming (SSS) to act
> as the ETL tool. Assuming the files arrive every 10 minutes, we can still
> read them but ensure that we only trigger SSS reads every 4 hours.
>
>  writeStream. \
>  outputMode('append'). \
>  option("truncate", "false"). \
>  foreachBatch(sendToSink). \
>  trigger(processingTime='14400 seconds'). \
>  queryName('readFiles'). \
>  start()
>
> This will ensure that Spark only processes them every 4 hours.
>
>
> HTH
>
>
>
>
>
> On Fri, 23 Apr 2021 at 15:40, ayan guha  wrote:
>
>> Hi
>>
>> In one of the Spark Summit demos, it was suggested that we should think of
>> batch jobs in a streaming pattern, using "run once" on a schedule.
>> I find this idea very interesting, and I understand how this can be achieved
>> for sources like Kafka, Kinesis or similar; in fact we have implemented
>> this model for the Cosmos DB change feed.
>>
>> My question is: can this model extend to file-based sources? I understand
>> it can for append-only file streams. The use case I have is: a CDC tool
>> like AWS DMS or SharePlex writing changes to a stream of files,
>> in date-based folders, so it just goes on like T1, T2 etc. folders. Also,
>> let's assume files are written every 10 mins, but I want to process them
>> every 4 hours.
>> Can I use the streaming method so that it manages checkpoints on its own?
>>
>> Best - Ayan
>> --
>> Best Regards,
>> Ayan Guha
>>
>


Re: Spark Streaming with Files

2021-04-23 Thread Mich Talebzadeh
Interesting.

If we go back to the classic on-premise Lambda architecture, you could use the
Flume API with Kafka to add files to HDFS on a time-series basis.

Most of the major CDC vendors do exactly that. Oracle GoldenGate (OGG) classic
gets data from the Oracle redo logs and sends it to subscribers. One can
deploy OGG for Big Data to enable these files to be read and processed for
Kafka, Hive, HDFS etc.

So let us assume that we create these files and stream them onto object
storage in the cloud. Then we can use Spark Structured Streaming (SSS) to act
as the ETL tool. Assuming the files arrive every 10 minutes, we can still
read them but ensure that we only trigger SSS reads every 4 hours.

 # streamingDF stands for the file-source streaming DataFrame and sendToSink
 # for the foreachBatch function; the checkpoint option is added here for
 # completeness (both names are placeholders)
 streamingDF.writeStream. \
 outputMode('append'). \
 option("truncate", "false"). \
 option("checkpointLocation", checkpointPath). \
 foreachBatch(sendToSink). \
 trigger(processingTime='14400 seconds'). \
 queryName('readFiles'). \
 start()

This will ensure that Spark only processes them every 4 hours.
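
The thread never shows sendToSink itself; a minimal sketch of what such a
foreachBatch function could look like (the sink path is just a placeholder):

def sendToSink(batch_df, batch_id):
    # Each trigger hands foreachBatch a plain DataFrame plus a batch id,
    # so any batch writer can be used inside it.
    batch_df.write.mode("append").parquet("s3://my-bucket/processed/")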


HTH





On Fri, 23 Apr 2021 at 15:40, ayan guha  wrote:

> Hi
>
> In one of the Spark Summit demos, it was suggested that we should think of
> batch jobs in a streaming pattern, using "run once" on a schedule.
> I find this idea very interesting, and I understand how this can be achieved
> for sources like Kafka, Kinesis or similar; in fact we have implemented
> this model for the Cosmos DB change feed.
>
> My question is: can this model extend to file-based sources? I understand
> it can for append-only file streams. The use case I have is: a CDC tool
> like AWS DMS or SharePlex writing changes to a stream of files,
> in date-based folders, so it just goes on like T1, T2 etc. folders. Also,
> let's assume files are written every 10 mins, but I want to process them
> every 4 hours.
> Can I use the streaming method so that it manages checkpoints on its own?
>
> Best - Ayan
> --
> Best Regards,
> Ayan Guha
>


Spark Streaming with Files

2021-04-23 Thread ayan guha
Hi

In one of the Spark Summit demos, it was suggested that we should think of
batch jobs in a streaming pattern, using "run once" on a schedule.
I find this idea very interesting, and I understand how this can be achieved
for sources like Kafka, Kinesis or similar; in fact we have implemented
this model for the Cosmos DB change feed.

My question is: can this model extend to file-based sources? I understand
it can for append-only file streams. The use case I have is: a CDC tool
like AWS DMS or SharePlex writing changes to a stream of files,
in date-based folders, so it just goes on like T1, T2 etc. folders. Also,
let's assume files are written every 10 mins, but I want to process them
every 4 hours.
Can I use the streaming method so that it manages checkpoints on its own?

Best - Ayan
-- 
Best Regards,
Ayan Guha


Re: Spark Streaming Small files in Hive

2017-10-29 Thread Siva Gudavalli
Hello Asmath,

We had a similar challenge recently.

When you write back to Hive, you are creating files on HDFS, and the number of
files depends on your batch window. If you increase your batch window, let's
say from 1 min to 5 mins, you will end up creating 5x fewer files.

The other factor is your partitioning. For instance, if your Spark application
is working with 5 partitions, you can repartition to 1; this will again reduce
the number of files by 5x.
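
As a rough PySpark sketch of that idea (the original code in this thread is
Scala, and the table names below are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Whatever DataFrame the batch produced; here read back from a staging table.
df = spark.table("staging.events_raw")

# coalesce(1) collapses the job to a single write task, so each Hive partition
# touched by the insert gets one file instead of one file per task.
(df.coalesce(1)
   .write
   .mode("append")
   .insertInto("warehouse.events"))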

You can also create a staging table to hold the small files, and once a decent
amount of data has accumulated you can compact it into larger files and load
them into your final Hive table.

hope this helps.

Regards
Shiv


> On Oct 29, 2017, at 11:03 AM, KhajaAsmath Mohammed  
> wrote:
> 
> Hi,
> 
> I am using Spark Streaming to write data back into Hive with the below code
> snippet:
> 
> 
> eventHubsWindowedStream.map(x => EventContent(new String(x)))
> 
>   .foreachRDD(rdd => {
> 
> val sparkSession = SparkSession.builder.enableHiveSupport.getOrCreate
> 
> import sparkSession.implicits._
> 
> 
> rdd.toDS.write.mode(org.apache.spark.sql.SaveMode.Append).insertInto(hiveTableName)
> 
>   })
> 
> 
> The Hive table is partitioned by year, month and day, so we end up getting
> less data for some days, which in turn results in smaller files inside Hive.
> Since the data is being written in smaller files, there is a big performance
> hit on Impala/Hive when reading it. Is there a way to merge files while
> inserting data into Hive?
> 
> It would also be really helpful if anyone could provide suggestions on how
> to design this in a better way. We cannot use HBase/Kudu in this scenario
> due to space issues with the clusters.
> 
> Thanks,
> 
> Asmath



Spark Streaming Small files in Hive

2017-10-29 Thread KhajaAsmath Mohammed
Hi,

I am using Spark Streaming to write data back into Hive with the below code
snippet:


eventHubsWindowedStream.map(x => EventContent(new String(x)))
  .foreachRDD(rdd => {
    val sparkSession = SparkSession.builder.enableHiveSupport.getOrCreate
    import sparkSession.implicits._

    rdd.toDS.write
      .mode(org.apache.spark.sql.SaveMode.Append)
      .insertInto(hiveTableName)
  })

The Hive table is partitioned by year, month and day, so we end up getting
less data for some days, which in turn results in smaller files inside Hive.
Since the data is being written in smaller files, there is a big performance
hit on Impala/Hive when reading it. Is there a way to merge files while
inserting data into Hive?

It would also be really helpful if anyone could provide suggestions on how
to design this in a better way. We cannot use HBase/Kudu in this scenario
due to space issues with the clusters.

Thanks,

Asmath


spark streaming python files not packaged in assembly jar

2015-01-15 Thread jamborta
Hi all, 

I just discovered that the streaming folder in pyspark is not included in the
assembly jar (spark-assembly-1.2.0-hadoop2.3.0.jar), but is included in the
python folder. Is there any reason why?

Thanks,


