Re: S3 Zip File Loading Advice

2016-03-15 Thread Benjamin Kim
Hi Xinh, I tried to wrap it, but it still didn’t work. I got a "java.util.ConcurrentModificationException". All, I have been trying and trying with some help from a coworker, but it’s slow going. I have been able to gather a list of the S3 files I need to download. ### S3 Lists ### import

Re: S3 Zip File Loading Advice

2016-03-09 Thread Xinh Huynh
Could you wrap the ZipInputStream in a List, since a subtype of TraversableOnce[?] is required? case (name, content) => List(new ZipInputStream(content.open)) Xinh On Wed, Mar 9, 2016 at 7:07 AM, Benjamin Kim wrote: > Hi Sabarish, > > I found a similar posting online where
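
The wrapping idea can be sketched without Spark: flatMap needs each element mapped to a collection, so the function returns a List. This is only an illustration under stated assumptions. In-memory byte arrays stand in for the S3 file contents, each archive is assumed to hold a single CSV entry, and the names ZipLinesSketch, zipOf, linesOf and allLines are hypothetical. Unlike the suggestion above, it returns the lines already read from the archive rather than the open ZipInputStream itself.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.nio.charset.StandardCharsets.UTF_8
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}
import scala.io.Source

object ZipLinesSketch {
  // Hypothetical helper: build a one-entry zip archive in memory,
  // standing in for a zipped CSV file fetched from S3.
  def zipOf(entryName: String, text: String): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val zip = new ZipOutputStream(buffer)
    zip.putNextEntry(new ZipEntry(entryName))
    zip.write(text.getBytes(UTF_8))
    zip.closeEntry()
    zip.close()
    buffer.toByteArray
  }

  // Read every line of the first entry in the archive.
  // The lines are materialized into a List before the stream is closed.
  def linesOf(zipBytes: Array[Byte]): List[String] = {
    val zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))
    try {
      if (zis.getNextEntry == null) Nil
      else Source.fromInputStream(zis, "UTF-8").getLines().toList
    } finally zis.close()
  }

  // flatMap accepts the function because it returns a collection (List)
  // rather than a bare stream.
  def allLines(files: Seq[(String, Array[Byte])]): Seq[String] =
    files.flatMap { case (_, content) => linesOf(content) }
}
```

The same shape should carry over to an RDD of (name, content) pairs, with content.open supplying the input stream, but that part is not exercised here.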

Re: S3 Zip File Loading Advice

2016-03-09 Thread Benjamin Kim
Hi Sabarish, I found a similar posting online suggesting that I use the S3 listKeys: http://stackoverflow.com/questions/24029873/how-to-read-multiple-text-files-into-a-single-rdd. Is this what you were thinking? And, your assumption is correct. The zipped CSV file contains only a single file. I

Re: S3 Zip File Loading Advice

2016-03-09 Thread Jörn Franke
Oozie may be able to do this for you and integrate with Spark. > On 09 Mar 2016, at 06:03, Benjamin Kim wrote: > > I am wondering if anyone can help. > > Our company stores zipped CSV files in S3, which has been a big headache from > the start. I was wondering if anyone

Re: S3 Zip File Loading Advice

2016-03-09 Thread Sabarish Sasidharan
You can use S3's listKeys API and do a diff between consecutive listKeys to identify what's new. Are there multiple files in each zip? Single file archives are processed just like text as long as it is one of the supported compression formats. Regards Sab On Wed, Mar 9, 2016 at 10:33 AM,
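
The diff between consecutive listings amounts to a set difference. A minimal sketch, assuming each listing has already been collected into a Set of key strings; no S3 call is made here, and the object name, function name and key values are hypothetical.

```scala
object NewKeysSketch {
  // Keys present in the current listing but absent from the previous one.
  def newKeys(previous: Set[String], current: Set[String]): Set[String] =
    current.diff(previous)

  // Hypothetical example listings from two consecutive polls.
  val previousListing: Set[String] = Set("2016/03/01/00/a.zip")
  val currentListing: Set[String] =
    Set("2016/03/01/00/a.zip", "2016/03/01/01/b.zip")

  // Only the key that appeared since the previous poll remains.
  val fresh: Set[String] = newKeys(previousListing, currentListing)
}
```

Keeping only the previous listing between polls is enough for this approach, since each diff is taken against the most recent snapshot.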

Re: S3 Zip File Loading Advice

2016-03-08 Thread Hemant Bhanawat
https://issues.apache.org/jira/browse/SPARK-3586 talks about creating a file dstream which can monitor for new files recursively but this functionality is not yet added. I don't see an easy way out. You will have to create your folders based on timeline (looks like you are already doing that) and

S3 Zip File Loading Advice

2016-03-08 Thread Benjamin Kim
I am wondering if anyone can help. Our company stores zipped CSV files in S3, which has been a big headache from the start. I was wondering if anyone has created a way to iterate through several subdirectories (s3n://events/2016/03/01/00, s3n://2016/03/01/01, etc.) in S3 to find the newest