Re: [pyspark] Read multiple files in parallel into a single dataframe

2018-05-04 Thread Irving Duran
I could be wrong, but I think you can use a wildcard.

df = spark.read.format('csv').load('/path/to/file*.csv.gz')

Thank You,

Irving Duran


On Fri, May 4, 2018 at 4:38 AM Shuporno Choudhury <
shuporno.choudh...@gmail.com> wrote:

> Hi,
>
> I want to read multiple files in parallel into 1 dataframe. But the files
> have random names and don't conform to any pattern (so I can't use a
> wildcard). Also, the files can be in different directories.
> If I provide the file names in a list to the dataframe reader, it reads
> them sequentially.
> E.g.:
> df=spark.read.format('csv').load(['/path/to/file1.csv.gz','/path/to/file2.csv.gz','/path/to/file3.csv.gz'])
> This reads the files sequentially. What can I do to read the files in
> parallel?
> I noticed that Spark reads files in parallel when given a directory
> location directly. How can that be extended to multiple random files?
> Suppose my system has 4 cores; how can I make Spark read 4 files at a
> time?
>
> Please suggest.
>


[pyspark] Read multiple files in parallel into a single dataframe

2018-05-04 Thread Shuporno Choudhury
Hi,

I want to read multiple files in parallel into 1 dataframe. But the files
have random names and don't conform to any pattern (so I can't use a
wildcard). Also, the files can be in different directories.
If I provide the file names in a list to the dataframe reader, it reads
them sequentially.
E.g.:
df=spark.read.format('csv').load(['/path/to/file1.csv.gz','/path/to/file2.csv.gz','/path/to/file3.csv.gz'])
This reads the files sequentially. What can I do to read the files in
parallel?
I noticed that Spark reads files in parallel when given a directory
location directly. How can that be extended to multiple random files?
Suppose my system has 4 cores; how can I make Spark read 4 files at a
time?

Please suggest.


Re: read multiple files

2016-09-27 Thread Mich Talebzadeh
Hi Divya,

There are a number of ways you can do this.

Get today's date in epoch format. These are my package imports:

import java.util.Calendar
import org.joda.time._
import java.math.BigDecimal
import java.sql.{Timestamp, Date}
import org.joda.time.format.DateTimeFormat

// Get epoch time now

scala> val epoch = System.currentTimeMillis
epoch: Long = 1474996552292

//get thirty days ago in epoch time

scala> val thirtydaysago = epoch - (30 * 24 * 60 * 60 * 1000L)
thirtydaysago: Long = 1472404552292

// Note the L for Long at the end

// Define a function to convert epoch millis to a date string, to double-check
// that it is indeed 30 days ago

scala> def timeToStr(epochMillis: Long): String = {
     | DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss").print(epochMillis)}
timeToStr: (epochMillis: Long)String


scala> timeToStr(epoch)
res4: String = 2016-09-27 18:15:52

So you need to pick the files from file_thirtydaysago up to file_epoch, i.e. those whose epoch timestamp falls between thirtydaysago and epoch.
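
A rough sketch of that filtering step (not from the original thread), assuming the batch-<epoch> files from the question below sit directly under InputFolder on a filesystem the driver can list, and that the numeric suffix uses the same epoch unit as the cutoffs (convert between millis and seconds as needed):

// Sketch only: keep batch-<epoch> entries from the last 30 days and read them together
import java.io.File

val epoch = System.currentTimeMillis
val thirtydaysago = epoch - (30 * 24 * 60 * 60 * 1000L)

val wanted = new File("InputFolder").listFiles
  .map(_.getName)
  .filter(_.startsWith("batch-"))
  .map(name => (name, name.stripPrefix("batch-").toLong))
  .filter { case (_, ts) => ts >= thirtydaysago && ts <= epoch }
  .map { case (name, _) => "InputFolder/" + name }

// sc.textFile takes a comma-separated list of paths, so one call reads them all
val rdd = sc.textFile(wanted.mkString(","))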

Regardless, I think you can do better by partitioning the directories. With
a file created every 5 minutes you will have 288 files generated daily
(12*24). Just partition into a sub-directory per day. Flume can do that for
you, or you can do it in a shell script.
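
For illustration, such a daily layout could then be read back with a short list of day directories. A sketch only, assuming hypothetical sub-directories named InputFolder/yyyy-MM-dd (not the naming used in the question):

// Sketch only: build the last 30 day-directories and read them in one call
import org.joda.time.LocalDate
import org.joda.time.format.DateTimeFormat

val fmt = DateTimeFormat.forPattern("yyyy-MM-dd")
val today = new LocalDate()
val dayDirs = (0 to 30).map(d => "InputFolder/" + fmt.print(today.minusDays(d)))

// Again relying on textFile accepting a comma-separated list of paths
val rdd = sc.textFile(dayDirs.mkString(","))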

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 27 September 2016 at 15:53, Peter Figliozzi 
wrote:

> If you're up for a fancy but excellent solution:
>
>    - Store your data in Cassandra.
>    - Use the expiring data feature (TTL) so data will automatically be
>    removed a month later.
>    - Now in your Spark process, just read from the database and you don't
>    have to worry about the timestamp.
>    - You'll still have all your old files if you need to refer back to them.
>
> Pete
>
> On Tue, Sep 27, 2016 at 2:52 AM, Divya Gehlot 
> wrote:
>
>> Hi,
>> The input data files for my Spark job are generated every five minutes, and
>> the file names follow the epoch time convention, as below:
>>
>> InputFolder/batch-147495960
>> InputFolder/batch-147495990
>> InputFolder/batch-147496020
>> InputFolder/batch-147496050
>> InputFolder/batch-147496080
>> InputFolder/batch-147496110
>> InputFolder/batch-147496140
>> InputFolder/batch-147496170
>> InputFolder/batch-147496200
>> InputFolder/batch-147496230
>>
>> As per the requirement, I need to read one month of data back from the
>> current timestamp.
>>
>> Would really appreciate it if anybody could help me.
>>
>> Thanks,
>> Divya
>>
>
>


Re: read multiple files

2016-09-27 Thread Peter Figliozzi
If you're up for a fancy but excellent solution:

   - Store your data in Cassandra.
   - Use the expiring data feature (TTL) so data will automatically be
   removed a month later.
   - Now in your Spark process, just read from the database and you don't
   have to worry about the timestamp.
   - You'll still have all your old files if you need to refer back to them.

Pete

On Tue, Sep 27, 2016 at 2:52 AM, Divya Gehlot 
wrote:

> Hi,
> The input data files for my Spark job are generated every five minutes, and
> the file names follow the epoch time convention, as below:
>
> InputFolder/batch-147495960
> InputFolder/batch-147495990
> InputFolder/batch-147496020
> InputFolder/batch-147496050
> InputFolder/batch-147496080
> InputFolder/batch-147496110
> InputFolder/batch-147496140
> InputFolder/batch-147496170
> InputFolder/batch-147496200
> InputFolder/batch-147496230
>
> As per the requirement, I need to read one month of data back from the current timestamp.
>
> Would really appreciate it if anybody could help me.
>
> Thanks,
> Divya
>


read multiple files

2016-09-27 Thread Divya Gehlot
Hi,
The input data files for my Spark job are generated every five minutes, and
the file names follow the epoch time convention, as below:

InputFolder/batch-147495960
InputFolder/batch-147495990
InputFolder/batch-147496020
InputFolder/batch-147496050
InputFolder/batch-147496080
InputFolder/batch-147496110
InputFolder/batch-147496140
InputFolder/batch-147496170
InputFolder/batch-147496200
InputFolder/batch-147496230

As per the requirement, I need to read one month of data back from the current timestamp.

Would really appreciate it if anybody could help me.

Thanks,
Divya


Re: Read multiple files from S3

2015-05-21 Thread Akhil Das
textFile does read all files in a directory.

We have modified the Spark Streaming code base to read nested files from S3;
you can check this function
https://github.com/sigmoidanalytics/spark-modified/blob/8074620414df6bbed81ac855067600573a7b22ca/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L206
which does that, and implement something similar for your use case.

Or, if your job is just a batch job and you don't mind processing file by
file, then maybe you can iterate over your list, create an sc.textFile for
each file entry, and do the computing there too. Something like:

for (file <- fileNames) {

  // Create the SparkContext
  // do sc.textFile(file)
  // do your computing
  // sc.stop()

}
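
As an alternative to looping, the path argument of sc.textFile accepts a comma-separated list of paths, so the whole list can be read into a single RDD in one call. A minimal sketch (not from the original reply), assuming the FileNames list and s3_bucket variable from the code quoted below, and using s3n:// purely as a placeholder URI scheme to adapt to your setup:

// Sketch only: join the collected S3 keys into one comma-separated path string
import scala.collection.JavaConverters._

val paths = FileNames.asScala
  .map(key => s"s3n://${s3_bucket}/${key}")
  .mkString(",")

// One RDD over all the listed files
val logs = sc.textFile(paths)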



Thanks
Best Regards

On Thu, May 21, 2015 at 1:45 AM, lovelylavs lxn130...@utdallas.edu wrote:

 Hi,

 I am trying to get a collection of files according to LastModifiedDate from
 S3

 List<String> FileNames = new ArrayList<String>();

 ListObjectsRequest listObjectsRequest = new ListObjectsRequest()
         .withBucketName(s3_bucket)
         .withPrefix(logs_dir);

 ObjectListing objectListing;

 do {
     objectListing = s3Client.listObjects(listObjectsRequest);
     for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
         if ((objectSummary.getLastModified().compareTo(dayBefore) > 0)
                 && (objectSummary.getLastModified().compareTo(dayAfter) < 1)
                 && objectSummary.getKey().contains(".log"))
             FileNames.add(objectSummary.getKey());
     }
     listObjectsRequest.setMarker(objectListing.getNextMarker());
 } while (objectListing.isTruncated());

 I would like to process these files using Spark.

 I understand that textFile reads a single text file. Is there any way to
 read all these files that are part of the List?

 Thanks for your help.








Read multiple files from S3

2015-05-20 Thread lovelylavs
Hi,

I am trying to get a collection of files according to LastModifiedDate from
S3

List<String> FileNames = new ArrayList<String>();

ListObjectsRequest listObjectsRequest = new ListObjectsRequest()
        .withBucketName(s3_bucket)
        .withPrefix(logs_dir);

ObjectListing objectListing;

do {
    objectListing = s3Client.listObjects(listObjectsRequest);
    for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
        if ((objectSummary.getLastModified().compareTo(dayBefore) > 0)
                && (objectSummary.getLastModified().compareTo(dayAfter) < 1)
                && objectSummary.getKey().contains(".log"))
            FileNames.add(objectSummary.getKey());
    }
    listObjectsRequest.setMarker(objectListing.getNextMarker());
} while (objectListing.isTruncated());

I would like to process these files using Spark.

I understand that textFile reads a single text file. Is there any way to
read all these files that are part of the List?

Thanks for your help.




