Re: [Rails] What's the best way to approach reading and parse large XLSX files?

2014-03-25 Thread Ted Kaplan
Creek is good; I'd also recommend Dullard, a gem that I wrote. Its output 
format may be more convenient for your case.

https://github.com/thirtyseven/dullard
http://rubygems.org/gems/dullard
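
A quick sketch of what reading with Dullard looks like, going by the gem's 
README (the file name is just an example):

require 'dullard'

workbook = Dullard::Workbook.new("inventory.xlsx")
workbook.sheets[0].rows.each do |row|
  # each row comes back as a plain array of cell values
  puts row.inspect
end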

-Ted

On Friday, October 11, 2013 1:35:39 PM UTC-7, Monserrat Foster wrote:

 I forgot to say after it reads all rows and writes the file, throws

   (600.1ms)  begin transaction
   (52.0ms)  commit transaction
 failed to allocate memory
 Redirected to http://localhost:3000/upload_files/110
 Completed 406 Not Acceptable in 1207471ms (ActiveRecord: 693.1ms)

 On Friday, October 11, 2013 4:03:12 PM UTC-4:30, Monserrat Foster wrote:

 This is an everyday, initially maybe a couple people at the same time 
 uploading and parsing files to generate the new one, but eventually it will 
 extend to other people, so...

 I used a logger and It does retrieve and save the files using the 
 comparation. But it takes forever, like 30min or so in generating the file. 
 The process starts as soon as the files are uploaded but it seems to be 
 taking most of the time into opening the file, once it's opened it takes 
 maybe 5min at most to generate the new file.

 Do you know where can i find an example on how to read an xlsx file with 
 nokogiri? I can't seem to find one

 On Friday, October 11, 2013 11:12:20 AM UTC-4:30, Walter Lee Davis wrote:


 On Oct 11, 2013, at 11:30 AM, Monserrat Foster wrote: 

  One 3+ row file and another with just over 200. How much memory 
 should I need for this not to take forever parsing? (I'm currently using my 
 computer as server and I can see ruby taking about 1GB in the task manager 
 when processing this (and it takes forever). 
  
  The 3+ row file is about 7MB, which is not that much (I think) 

 I have a collection of 1200 XML files, ranging in size from 3MB to 12MB 
 each (they're books, in TEI encoding) that I parse with Nokogiri on a 2GB 
 Joyent SmartMachine to convert them to XHTML and then on to Epub. This 
 process takes 17 minutes for the first pass, and 24 minutes for the second 
 pass. It does not crash, but the server is unable to do much of anything 
 else while the loop is running. 

 My question here was, is this something that is a self-serve web 
 service, or an admin-level (one-privileged-user-once-in-a-while) type 
 thing? In my case, there's one admin who adds maybe two or three books per 
 month to the collection, and the 40-minute do-everything loop was used only 
 for development purposes -- it was my test cycle as I checked all of the 
 titles against a validator to ensure that my adjustments to the transcoding 
 process didn't result in invalid code. I would not advise putting something 
 like this live against the world, as the potential for DOS is extremely 
 great. Anything that can pull the kinds of loads you get when you load a 
 huge file into memory and start fiddling with it should not be public! 

 Walter 

  
  On Friday, October 11, 2013 8:44:22 AM UTC-4:30, Walter Lee Davis 
 wrote: 
  
  On Oct 10, 2013, at 4:50 PM, Monserrat Foster wrote: 
  
   A coworker suggested I should use just basic OOP for this, to create 
 a class that reads files, and then another to load the files into memory. 
 Could please point me in the right direction for this (where can I read 
 about it)? I have no idea what's he talking about, as I've never done this 
 before. 
  
  How many of these files are you planning to parse at any one time? Do 
 you have the memory on your server to deal with this load? I can see this 
 approach working, but getting slow and process-bound very quickly. Lots of 
 edge cases to deal with when parsing big uploaded files. 
  
  Walter 
  
   
   I'll look up nokogiri and SAX 
   
   On Thursday, October 10, 2013 4:12:33 PM UTC-4:30, Walter Lee Davis 
 wrote: 
   On Oct 10, 2013, at 4:36 PM, Monserrat Foster wrote: 
   
Hello, I'm developing an app that basically, receives a 10MB or 
 less XLSX files with +3 rows or so, and another XLSX file with about 
 200rows, I have to read one row of the smallest file, look it up on the 
 largest file and write data from both files to a new one. 
   
   Wow. Do you have to do all this in a single request? 
   
   You may want to look at Nokogiri and its SAX parser. SAX parsers 
 don't care about the size of the document they operate on, because they 
 work one node at a time, and don't load the whole thing into memory at 
 once. There are some limitations on what kind of work a SAX parser can 
 perform, because it isn't able to see the entire document and know where 
 it is within the document at any point. But for certain kinds of problems, 
 it can be the only way to go. Sounds like you may need something like this. 
   
   Walter 
   

I just did a test reading a few rows from the largest file using 
 ROO (Spreadsheet doesn't support XSLX and Creek look good but I can't find 
 a way to read row by row) 
and it basically made my computer crash, 

Re: [Rails] What's the best way to approach reading and parse large XLSX files?

2013-10-11 Thread Martin Streicher


 I highly recommend the RubyXL gem. It opens XLSX files and seems very 
 reliable. I use it all the time.
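
 For example (a sketch of RubyXL's basic row access; note it loads the whole 
 workbook into memory, and the file name is made up):

 require 'rubyXL'

 workbook  = RubyXL::Parser.parse("material_list.xlsx")
 worksheet = workbook[0]

 worksheet.each do |row|
   next if row.nil?
   values = row.cells.map { |cell| cell && cell.value }
   puts values.inspect
 end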

 



Re: [Rails] What's the best way to approach reading and parse large XLSX files?

2013-10-11 Thread Walter Lee Davis

On Oct 10, 2013, at 4:50 PM, Monserrat Foster wrote:

 A coworker suggested I should use just basic OOP for this, to create a class 
 that reads files, and then another to load the files into memory. Could 
 please point me in the right direction for this (where can I read about it)? 
 I have no idea what's he talking about, as I've never done this before. 

How many of these files are you planning to parse at any one time? Do you have 
the memory on your server to deal with this load? I can see this approach 
working, but getting slow and process-bound very quickly. Lots of edge cases to 
deal with when parsing big uploaded files.

Walter

 
 I'll look up nokogiri and SAX
 
 On Thursday, October 10, 2013 4:12:33 PM UTC-4:30, Walter Lee Davis wrote:
 On Oct 10, 2013, at 4:36 PM, Monserrat Foster wrote: 
 
  Hello, I'm developing an app that basically, receives a 10MB or less XLSX 
  files with +3 rows or so, and another XLSX file with about 200rows, I 
  have to read one row of the smallest file, look it up on the largest file 
  and write data from both files to a new one. 
 
 Wow. Do you have to do all this in a single request? 
 
 You may want to look at Nokogiri and its SAX parser. SAX parsers don't care 
 about the size of the document they operate on, because they work one node at 
 a time, and don't load the whole thing into memory at once. There are some 
 limitations on what kind of work a SAX parser can perform, because it isn't 
 able to see the entire document and know where it is within the document at 
 any point. But for certain kinds of problems, it can be the only way to go. 
 Sounds like you may need something like this. 
 
 Walter 
 
  
  I just did a test reading a few rows from the largest file using ROO 
  (Spreadsheet doesn't support XSLX and Creek look good but I can't find a 
  way to read row by row) 
  and it basically made my computer crash, the server crashed, I tried 
  rebooting it and it said It was already started, anyway, it was a disaster. 
  
  So, my question was, is there gem that works best with large XLSX files or 
  is there another way to approach this withouth crashing my computer? 
  
  This is what I had (It's very possible I'm doing it wrong, help is welcome) 
  What i was trying to do here, was to process the files and create the new 
  XLS file after both of the XLSX files were uploaded: 
  
  
  require 'roo' 
  require 'spreadsheet' 
  require 'creek' 
  class UploadFiles  ActiveRecord::Base 
after_commit :process_files 
attr_accessible :inventory, :material_list 
has_one :inventory 
has_one :material_list 
has_attached_file :inventory, :url=/:current_user/inventory, 
  :path=:rails_root/tmp/users/uploaded_files/inventory/inventory.:extension
   
has_attached_file :material_list, :url=/:current_user/material_list, 
  :path=:rails_root/tmp/users/uploaded_files/material_list/material_list.:extension
   
validates_attachment_presence :material_list 
accepts_nested_attributes_for :material_list, :allow_destroy = true   
accepts_nested_attributes_for :inventory, :allow_destroy = true   
validates_attachment_content_type :inventory, :content_type = 
  [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet], 
  :message = Only .XSLX files are accepted as Inventory 
validates_attachment_content_type :material_list, :content_type = 
  [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet], 
  :message = Only .XSLX files are accepted as Material List 


def process_files 
  inventory =  Creek::Book.new(Rails.root.to_s + 
  /tmp/users/uploaded_files/inventory/inventory.xlsx) 
  material_list = Creek::Book.new(Rails.root.to_s + 
  /tmp/users/uploaded_files/material_list/material_list.xlsx) 
  inventory = inventory.sheets[0] 
  scl = Spreadsheet::Workbook.new 
  sheet1 = scl.create_worksheet 
  inventory.rows.each do |row| 
row.inspect 
sheet1.row(1).push(row) 
  end 
  
  sheet1.name = Site Configuration List 
  scl.write(Rails.root.to_s + 
  /tmp/users/generated/siteconfigurationlist.xls) 
end 
  end 
  
  
 
 

Re: [Rails] What's the best way to approach reading and parse large XLSX files?

2013-10-11 Thread Monserrat Foster
One 3+ row file and another with just over 200. How much memory should 
I need for this not to take forever parsing? (I'm currently using my 
computer as the server, and I can see Ruby taking about 1GB in the task manager 
when processing this, and it takes forever.)

The 3+ row file is about 7MB, which is not that much (I think).

On Friday, October 11, 2013 8:44:22 AM UTC-4:30, Walter Lee Davis wrote:


 On Oct 10, 2013, at 4:50 PM, Monserrat Foster wrote: 

  A coworker suggested I should use just basic OOP for this, to create a 
 class that reads files, and then another to load the files into memory. 
 Could please point me in the right direction for this (where can I read 
 about it)? I have no idea what's he talking about, as I've never done this 
 before. 

 How many of these files are you planning to parse at any one time? Do you 
 have the memory on your server to deal with this load? I can see this 
 approach working, but getting slow and process-bound very quickly. Lots of 
 edge cases to deal with when parsing big uploaded files. 

 Walter 

  
  I'll look up nokogiri and SAX 
  
  On Thursday, October 10, 2013 4:12:33 PM UTC-4:30, Walter Lee Davis 
 wrote: 
  On Oct 10, 2013, at 4:36 PM, Monserrat Foster wrote: 
  
   Hello, I'm developing an app that basically, receives a 10MB or less 
 XLSX files with +3 rows or so, and another XLSX file with about 
 200rows, I have to read one row of the smallest file, look it up on the 
 largest file and write data from both files to a new one. 
  
  Wow. Do you have to do all this in a single request? 
  
  You may want to look at Nokogiri and its SAX parser. SAX parsers don't 
 care about the size of the document they operate on, because they work one 
 node at a time, and don't load the whole thing into memory at once. There 
 are some limitations on what kind of work a SAX parser can perform, because 
 it isn't able to see the entire document and know where it is within the 
 document at any point. But for certain kinds of problems, it can be the 
 only way to go. Sounds like you may need something like this. 
  
  Walter 
  
   
   I just did a test reading a few rows from the largest file using ROO 
 (Spreadsheet doesn't support XSLX and Creek look good but I can't find a 
 way to read row by row) 
   and it basically made my computer crash, the server crashed, I tried 
 rebooting it and it said It was already started, anyway, it was a disaster. 
   
   So, my question was, is there gem that works best with large XLSX 
 files or is there another way to approach this withouth crashing my 
 computer? 
   
   This is what I had (It's very possible I'm doing it wrong, help is 
 welcome) 
   What i was trying to do here, was to process the files and create the 
 new XLS file after both of the XLSX files were uploaded: 
   
   
   require 'roo' 
   require 'spreadsheet' 
   require 'creek' 
   class UploadFiles  ActiveRecord::Base 
 after_commit :process_files 
 attr_accessible :inventory, :material_list 
 has_one :inventory 
 has_one :material_list 
 has_attached_file :inventory, :url=/:current_user/inventory, 
 :path=:rails_root/tmp/users/uploaded_files/inventory/inventory.:extension 

 has_attached_file :material_list, 
 :url=/:current_user/material_list, 
 :path=:rails_root/tmp/users/uploaded_files/material_list/material_list.:extension
  

 validates_attachment_presence :material_list 
 accepts_nested_attributes_for :material_list, :allow_destroy = true 
   
 accepts_nested_attributes_for :inventory, :allow_destroy = true   
 validates_attachment_content_type :inventory, :content_type = 
 [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet], 
 :message = Only .XSLX files are accepted as Inventory 
 validates_attachment_content_type :material_list, :content_type = 
 [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet], 
 :message = Only .XSLX files are accepted as Material List 
 
 
 def process_files 
   inventory =  Creek::Book.new(Rails.root.to_s + 
 /tmp/users/uploaded_files/inventory/inventory.xlsx) 
   material_list = Creek::Book.new(Rails.root.to_s + 
 /tmp/users/uploaded_files/material_list/material_list.xlsx) 
   inventory = inventory.sheets[0] 
   scl = Spreadsheet::Workbook.new 
   sheet1 = scl.create_worksheet 
   inventory.rows.each do |row| 
 row.inspect 
 sheet1.row(1).push(row) 
   end 
   
   sheet1.name = Site Configuration List 
   scl.write(Rails.root.to_s + 
 /tmp/users/generated/siteconfigurationlist.xls) 
 end 
   end 
   
   

Re: [Rails] What's the best way to approach reading and parse large XLSX files?

2013-10-11 Thread Walter Lee Davis

On Oct 11, 2013, at 11:30 AM, Monserrat Foster wrote:

 One 3+ row file and another with just over 200. How much memory should I 
 need for this not to take forever parsing? (I'm currently using my computer 
 as server and I can see ruby taking about 1GB in the task manager when 
 processing this (and it takes forever).
 
 The 3+ row file is about 7MB, which is not that much (I think) 

I have a collection of 1200 XML files, ranging in size from 3MB to 12MB each 
(they're books, in TEI encoding) that I parse with Nokogiri on a 2GB Joyent 
SmartMachine to convert them to XHTML and then on to Epub. This process takes 
17 minutes for the first pass, and 24 minutes for the second pass. It does not 
crash, but the server is unable to do much of anything else while the loop is 
running.

My question here was, is this something that is a self-serve web service, or an 
admin-level (one-privileged-user-once-in-a-while) type thing? In my case, 
there's one admin who adds maybe two or three books per month to the 
collection, and the 40-minute do-everything loop was used only for development 
purposes -- it was my test cycle as I checked all of the titles against a 
validator to ensure that my adjustments to the transcoding process didn't 
result in invalid code. I would not advise putting something like this live 
against the world, as the potential for DOS is extremely great. Anything that 
can pull the kinds of loads you get when you load a huge file into memory and 
start fiddling with it should not be public!
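
One way to keep a parse like this out of the web request entirely (not 
something spelled out in this thread, just a sketch assuming the sidekiq gem 
and a hypothetical worker class) is to enqueue it as a background job:

require 'sidekiq'

# Hypothetical worker; UploadFiles#process_files is the model method from the
# original post. The web request only enqueues the job and returns immediately.
class ProcessFilesWorker
  include Sidekiq::Worker

  def perform(upload_files_id)
    upload = UploadFiles.find(upload_files_id)
    upload.process_files
  end
end

# In the controller, instead of parsing inline:
#   ProcessFilesWorker.perform_async(@upload_files.id)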

Walter

 
 On Friday, October 11, 2013 8:44:22 AM UTC-4:30, Walter Lee Davis wrote:
 
 On Oct 10, 2013, at 4:50 PM, Monserrat Foster wrote: 
 
  A coworker suggested I should use just basic OOP for this, to create a 
  class that reads files, and then another to load the files into memory. 
  Could please point me in the right direction for this (where can I read 
  about it)? I have no idea what's he talking about, as I've never done this 
  before. 
 
 How many of these files are you planning to parse at any one time? Do you 
 have the memory on your server to deal with this load? I can see this 
 approach working, but getting slow and process-bound very quickly. Lots of 
 edge cases to deal with when parsing big uploaded files. 
 
 Walter 
 
  
  I'll look up nokogiri and SAX 
  
  On Thursday, October 10, 2013 4:12:33 PM UTC-4:30, Walter Lee Davis wrote: 
  On Oct 10, 2013, at 4:36 PM, Monserrat Foster wrote: 
  
   Hello, I'm developing an app that basically, receives a 10MB or less XLSX 
   files with +3 rows or so, and another XLSX file with about 200rows, I 
   have to read one row of the smallest file, look it up on the largest file 
   and write data from both files to a new one. 
  
  Wow. Do you have to do all this in a single request? 
  
  You may want to look at Nokogiri and its SAX parser. SAX parsers don't care 
  about the size of the document they operate on, because they work one node 
  at a time, and don't load the whole thing into memory at once. There are 
  some limitations on what kind of work a SAX parser can perform, because it 
  isn't able to see the entire document and know where it is within the 
  document at any point. But for certain kinds of problems, it can be the 
  only way to go. Sounds like you may need something like this. 
  
  Walter 
  
   
   I just did a test reading a few rows from the largest file using ROO 
   (Spreadsheet doesn't support XSLX and Creek look good but I can't find a 
   way to read row by row) 
   and it basically made my computer crash, the server crashed, I tried 
   rebooting it and it said It was already started, anyway, it was a 
   disaster. 
   
   So, my question was, is there gem that works best with large XLSX files 
   or is there another way to approach this withouth crashing my computer? 
   
   This is what I had (It's very possible I'm doing it wrong, help is 
   welcome) 
   What i was trying to do here, was to process the files and create the new 
   XLS file after both of the XLSX files were uploaded: 
   
   
   require 'roo' 
   require 'spreadsheet' 
   require 'creek' 
   class UploadFiles  ActiveRecord::Base 
 after_commit :process_files 
 attr_accessible :inventory, :material_list 
 has_one :inventory 
 has_one :material_list 
 has_attached_file :inventory, :url=/:current_user/inventory, 
   :path=:rails_root/tmp/users/uploaded_files/inventory/inventory.:extension

 has_attached_file :material_list, :url=/:current_user/material_list, 
   :path=:rails_root/tmp/users/uploaded_files/material_list/material_list.:extension

 validates_attachment_presence :material_list 
 accepts_nested_attributes_for :material_list, :allow_destroy = true   
 accepts_nested_attributes_for :inventory, :allow_destroy = true   
 validates_attachment_content_type :inventory, :content_type = 
   [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet], 
   :message = Only 

Re: [Rails] What's the best way to approach reading and parse large XLSX files?

2013-10-11 Thread Jordon Bedwell
On Fri, Oct 11, 2013 at 10:30 AM, Monserrat Foster
monsefos...@gmail.com wrote:
 One 3+ row file and another with just over 200. How much memory should I
 need for this not to take forever parsing? (I'm currently using my computer
 as server and I can see ruby taking about 1GB in the task manager when
 processing this (and it takes forever).

 The 3+ row file is about 7MB, which is not that much (I think)

Check for a memory leak.
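
A crude way to see whether memory grows without bound while the rows stream 
past (a sketch; GC.stat key names differ between Ruby versions, and the file 
name is made up):

require 'creek'

book  = Creek::Book.new("material_list.xlsx")
sheet = book.sheets[0]

sheet.rows.each_with_index do |row, i|
  # ... do the per-row work here ...
  if i % 1000 == 0
    GC.start
    live = GC.stat[:heap_live_slots] || GC.stat[:heap_live_num]
    puts "row #{i}: ~#{live} live objects"
  end
end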



Re: [Rails] What's the best way to approach reading and parse large XLSX files?

2013-10-11 Thread Donald Ziesig

On 10/11/2013 11:30 AM, Monserrat Foster wrote:
One 3+ row file and another with just over 200. How much memory 
should I need for this not to take forever parsing? (I'm currently 
using my computer as server and I can see ruby taking about 1GB in the 
task manager when processing this (and it takes forever).


The 3+ row file is about 7MB, which is not that much (I think)

On Friday, October 11, 2013 8:44:22 AM UTC-4:30, Walter Lee Davis wrote:


On Oct 10, 2013, at 4:50 PM, Monserrat Foster wrote:

 A coworker suggested I should use just basic OOP for this, to
create a class that reads files, and then another to load the
files into memory. Could please point me in the right direction
for this (where can I read about it)? I have no idea what's he
talking about, as I've never done this before.

How many of these files are you planning to parse at any one time?
Do you have the memory on your server to deal with this load? I
can see this approach working, but getting slow and process-bound
very quickly. Lots of edge cases to deal with when parsing big
uploaded files.

Walter


 I'll look up nokogiri and SAX

 On Thursday, October 10, 2013 4:12:33 PM UTC-4:30, Walter Lee
Davis wrote:
 On Oct 10, 2013, at 4:36 PM, Monserrat Foster wrote:

  Hello, I'm developing an app that basically, receives a 10MB
or less XLSX files with +3 rows or so, and another XLSX file
with about 200rows, I have to read one row of the smallest file,
look it up on the largest file and write data from both files to a
new one.

 Wow. Do you have to do all this in a single request?

 You may want to look at Nokogiri and its SAX parser. SAX parsers
don't care about the size of the document they operate on, because
they work one node at a time, and don't load the whole thing into
memory at once. There are some limitations on what kind of work a
SAX parser can perform, because it isn't able to see the entire
document and know where it is within the document at any point.
But for certain kinds of problems, it can be the only way to go.
Sounds like you may need something like this.

 Walter

 
  I just did a test reading a few rows from the largest file
using ROO (Spreadsheet doesn't support XSLX and Creek look good
but I can't find a way to read row by row)
  and it basically made my computer crash, the server crashed, I
tried rebooting it and it said It was already started, anyway, it
was a disaster.
 
  So, my question was, is there gem that works best with large
XLSX files or is there another way to approach this withouth
crashing my computer?
 
  This is what I had (It's very possible I'm doing it wrong,
help is welcome)
  What i was trying to do here, was to process the files and
create the new XLS file after both of the XLSX files were uploaded:
 
 
  require 'roo'
  require 'spreadsheet'
  require 'creek'
  class UploadFiles  ActiveRecord::Base
after_commit :process_files
attr_accessible :inventory, :material_list
has_one :inventory
has_one :material_list
has_attached_file :inventory,
:url=/:current_user/inventory,
:path=:rails_root/tmp/users/uploaded_files/inventory/inventory.:extension

has_attached_file :material_list,
:url=/:current_user/material_list,

:path=:rails_root/tmp/users/uploaded_files/material_list/material_list.:extension

validates_attachment_presence :material_list
accepts_nested_attributes_for :material_list, :allow_destroy
= true
accepts_nested_attributes_for :inventory, :allow_destroy =
true
validates_attachment_content_type :inventory, :content_type
=
[application/vnd.openxmlformats-officedocument.spreadsheetml.sheet],
:message = Only .XSLX files are accepted as Inventory
validates_attachment_content_type :material_list,
:content_type =
[application/vnd.openxmlformats-officedocument.spreadsheetml.sheet],
:message = Only .XSLX files are accepted as Material List
 
 
def process_files
  inventory =  Creek::Book.new(Rails.root.to_s +
/tmp/users/uploaded_files/inventory/inventory.xlsx)
  material_list = Creek::Book.new(Rails.root.to_s +
/tmp/users/uploaded_files/material_list/material_list.xlsx)
  inventory = inventory.sheets[0]
  scl = Spreadsheet::Workbook.new
  sheet1 = scl.create_worksheet
  inventory.rows.each do |row|
row.inspect
sheet1.row(1).push(row)
  end
 
  sheet1.name = Site Configuration List
  scl.write(Rails.root.to_s +
/tmp/users/generated/siteconfigurationlist.xls)
end
  end
 
 

Re: [Rails] What's the best way to approach reading and parse large XLSX files?

2013-10-11 Thread Monserrat Foster
Hi, the files automatically download in .XLSX format; I can't change them, 
and I can't force the users to change it in order to make my job easier. 
Thanks for the suggestion.

On Friday, October 11, 2013 11:34:39 AM UTC-4:30, donz wrote:

  On 10/11/2013 11:30 AM, Monserrat Foster wrote:
  
 One 3+ row file and another with just over 200. How much memory should 
 I need for this not to take forever parsing? (I'm currently using my 
 computer as server and I can see ruby taking about 1GB in the task manager 
 when processing this (and it takes forever). 

  The 3+ row file is about 7MB, which is not that much (I think) 

 On Friday, October 11, 2013 8:44:22 AM UTC-4:30, Walter Lee Davis wrote: 


 On Oct 10, 2013, at 4:50 PM, Monserrat Foster wrote: 

  A coworker suggested I should use just basic OOP for this, to create a 
 class that reads files, and then another to load the files into memory. 
 Could please point me in the right direction for this (where can I read 
 about it)? I have no idea what's he talking about, as I've never done this 
 before. 

 How many of these files are you planning to parse at any one time? Do you 
 have the memory on your server to deal with this load? I can see this 
 approach working, but getting slow and process-bound very quickly. Lots of 
 edge cases to deal with when parsing big uploaded files. 

 Walter 

  
  I'll look up nokogiri and SAX 
  
  On Thursday, October 10, 2013 4:12:33 PM UTC-4:30, Walter Lee Davis 
 wrote: 
  On Oct 10, 2013, at 4:36 PM, Monserrat Foster wrote: 
  
   Hello, I'm developing an app that basically, receives a 10MB or less 
 XLSX files with +3 rows or so, and another XLSX file with about 
 200rows, I have to read one row of the smallest file, look it up on the 
 largest file and write data from both files to a new one. 
  
  Wow. Do you have to do all this in a single request? 
  
  You may want to look at Nokogiri and its SAX parser. SAX parsers don't 
 care about the size of the document they operate on, because they work one 
 node at a time, and don't load the whole thing into memory at once. There 
 are some limitations on what kind of work a SAX parser can perform, because 
 it isn't able to see the entire document and know where it is within the 
 document at any point. But for certain kinds of problems, it can be the 
 only way to go. Sounds like you may need something like this. 
  
  Walter 
  
   
   I just did a test reading a few rows from the largest file using ROO 
 (Spreadsheet doesn't support XSLX and Creek look good but I can't find a 
 way to read row by row) 
   and it basically made my computer crash, the server crashed, I tried 
 rebooting it and it said It was already started, anyway, it was a disaster. 
   
   So, my question was, is there gem that works best with large XLSX 
 files or is there another way to approach this withouth crashing my 
 computer? 
   
   This is what I had (It's very possible I'm doing it wrong, help is 
 welcome) 
   What i was trying to do here, was to process the files and create the 
 new XLS file after both of the XLSX files were uploaded: 
   
   
   require 'roo' 
   require 'spreadsheet' 
   require 'creek' 
   class UploadFiles  ActiveRecord::Base 
 after_commit :process_files 
 attr_accessible :inventory, :material_list 
 has_one :inventory 
 has_one :material_list 
 has_attached_file :inventory, :url=/:current_user/inventory, 
 :path=:rails_root/tmp/users/uploaded_files/inventory/inventory.:extension 

 has_attached_file :material_list, 
 :url=/:current_user/material_list, 
 :path=:rails_root/tmp/users/uploaded_files/material_list/material_list.:extension
  

 validates_attachment_presence :material_list 
 accepts_nested_attributes_for :material_list, :allow_destroy = 
 true   
 accepts_nested_attributes_for :inventory, :allow_destroy = true   
 validates_attachment_content_type :inventory, :content_type = 
 [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet], 
 :message = Only .XSLX files are accepted as Inventory 
 validates_attachment_content_type :material_list, :content_type = 
 [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet], 
 :message = Only .XSLX files are accepted as Material List 
 
 
 def process_files 
   inventory =  Creek::Book.new(Rails.root.to_s + 
 /tmp/users/uploaded_files/inventory/inventory.xlsx) 
   material_list = Creek::Book.new(Rails.root.to_s + 
 /tmp/users/uploaded_files/material_list/material_list.xlsx) 
   inventory = inventory.sheets[0] 
   scl = Spreadsheet::Workbook.new 
   sheet1 = scl.create_worksheet 
   inventory.rows.each do |row| 
 row.inspect 
 sheet1.row(1).push(row) 
   end 
   
   sheet1.name = Site Configuration List 
   scl.write(Rails.root.to_s + 
 /tmp/users/generated/siteconfigurationlist.xls) 
 end 
   end 
   
   

Re: [Rails] What's the best way to approach reading and parse large XLSX files?

2013-10-11 Thread Monserrat Foster
This is an everyday thing; initially maybe a couple of people at the same time 
uploading and parsing files to generate the new one, but eventually it will 
extend to other people, so...

I used a logger, and it does retrieve and save the files using the 
comparison. But it takes forever, like 30 min or so to generate the file. 
The process starts as soon as the files are uploaded, but it seems to spend 
most of the time opening the file; once it's opened it takes 
maybe 5 min at most to generate the new file.

Do you know where I can find an example of how to read an XLSX file with 
Nokogiri? I can't seem to find one.

On Friday, October 11, 2013 11:12:20 AM UTC-4:30, Walter Lee Davis wrote:


 On Oct 11, 2013, at 11:30 AM, Monserrat Foster wrote: 

  One 3+ row file and another with just over 200. How much memory 
 should I need for this not to take forever parsing? (I'm currently using my 
 computer as server and I can see ruby taking about 1GB in the task manager 
 when processing this (and it takes forever). 
  
  The 3+ row file is about 7MB, which is not that much (I think) 

 I have a collection of 1200 XML files, ranging in size from 3MB to 12MB 
 each (they're books, in TEI encoding) that I parse with Nokogiri on a 2GB 
 Joyent SmartMachine to convert them to XHTML and then on to Epub. This 
 process takes 17 minutes for the first pass, and 24 minutes for the second 
 pass. It does not crash, but the server is unable to do much of anything 
 else while the loop is running. 

 My question here was, is this something that is a self-serve web service, 
 or an admin-level (one-privileged-user-once-in-a-while) type thing? In my 
 case, there's one admin who adds maybe two or three books per month to the 
 collection, and the 40-minute do-everything loop was used only for 
 development purposes -- it was my test cycle as I checked all of the titles 
 against a validator to ensure that my adjustments to the transcoding 
 process didn't result in invalid code. I would not advise putting something 
 like this live against the world, as the potential for DOS is extremely 
 great. Anything that can pull the kinds of loads you get when you load a 
 huge file into memory and start fiddling with it should not be public! 

 Walter 

  
  On Friday, October 11, 2013 8:44:22 AM UTC-4:30, Walter Lee Davis wrote: 
  
  On Oct 10, 2013, at 4:50 PM, Monserrat Foster wrote: 
  
   A coworker suggested I should use just basic OOP for this, to create a 
 class that reads files, and then another to load the files into memory. 
 Could please point me in the right direction for this (where can I read 
 about it)? I have no idea what's he talking about, as I've never done this 
 before. 
  
  How many of these files are you planning to parse at any one time? Do 
 you have the memory on your server to deal with this load? I can see this 
 approach working, but getting slow and process-bound very quickly. Lots of 
 edge cases to deal with when parsing big uploaded files. 
  
  Walter 
  
   
   I'll look up nokogiri and SAX 
   
   On Thursday, October 10, 2013 4:12:33 PM UTC-4:30, Walter Lee Davis 
 wrote: 
   On Oct 10, 2013, at 4:36 PM, Monserrat Foster wrote: 
   
Hello, I'm developing an app that basically, receives a 10MB or less 
 XLSX files with +3 rows or so, and another XLSX file with about 
 200rows, I have to read one row of the smallest file, look it up on the 
 largest file and write data from both files to a new one. 
   
   Wow. Do you have to do all this in a single request? 
   
   You may want to look at Nokogiri and its SAX parser. SAX parsers don't 
 care about the size of the document they operate on, because they work one 
 node at a time, and don't load the whole thing into memory at once. There 
 are some limitations on what kind of work a SAX parser can perform, because 
 it isn't able to see the entire document and know where it is within the 
 document at any point. But for certain kinds of problems, it can be the 
 only way to go. Sounds like you may need something like this. 
   
   Walter 
   

I just did a test reading a few rows from the largest file using ROO 
 (Spreadsheet doesn't support XSLX and Creek look good but I can't find a 
 way to read row by row) 
and it basically made my computer crash, the server crashed, I tried 
 rebooting it and it said It was already started, anyway, it was a disaster. 

So, my question was, is there gem that works best with large XLSX 
 files or is there another way to approach this withouth crashing my 
 computer? 

This is what I had (It's very possible I'm doing it wrong, help is 
 welcome) 
What i was trying to do here, was to process the files and create 
 the new XLS file after both of the XLSX files were uploaded: 


require 'roo' 
require 'spreadsheet' 
require 'creek' 
class UploadFiles  ActiveRecord::Base 
  after_commit :process_files 
  attr_accessible :inventory, 

Re: [Rails] What's the best way to approach reading and parse large XLSX files?

2013-10-11 Thread Monserrat Foster
I forgot to say: after it reads all rows and writes the file, it throws

   (600.1ms)  begin transaction
   (52.0ms)  commit transaction
failed to allocate memory
Redirected to http://localhost:3000/upload_files/110
Completed 406 Not Acceptable in 1207471ms (ActiveRecord: 693.1ms)

On Friday, October 11, 2013 4:03:12 PM UTC-4:30, Monserrat Foster wrote:

 This is an everyday, initially maybe a couple people at the same time 
 uploading and parsing files to generate the new one, but eventually it will 
 extend to other people, so...

 I used a logger and It does retrieve and save the files using the 
 comparation. But it takes forever, like 30min or so in generating the file. 
 The process starts as soon as the files are uploaded but it seems to be 
 taking most of the time into opening the file, once it's opened it takes 
 maybe 5min at most to generate the new file.

 Do you know where can i find an example on how to read an xlsx file with 
 nokogiri? I can't seem to find one

 On Friday, October 11, 2013 11:12:20 AM UTC-4:30, Walter Lee Davis wrote:


 On Oct 11, 2013, at 11:30 AM, Monserrat Foster wrote: 

  One 3+ row file and another with just over 200. How much memory 
 should I need for this not to take forever parsing? (I'm currently using my 
 computer as server and I can see ruby taking about 1GB in the task manager 
 when processing this (and it takes forever). 
  
  The 3+ row file is about 7MB, which is not that much (I think) 

 I have a collection of 1200 XML files, ranging in size from 3MB to 12MB 
 each (they're books, in TEI encoding) that I parse with Nokogiri on a 2GB 
 Joyent SmartMachine to convert them to XHTML and then on to Epub. This 
 process takes 17 minutes for the first pass, and 24 minutes for the second 
 pass. It does not crash, but the server is unable to do much of anything 
 else while the loop is running. 

 My question here was, is this something that is a self-serve web service, 
 or an admin-level (one-privileged-user-once-in-a-while) type thing? In my 
 case, there's one admin who adds maybe two or three books per month to the 
 collection, and the 40-minute do-everything loop was used only for 
 development purposes -- it was my test cycle as I checked all of the titles 
 against a validator to ensure that my adjustments to the transcoding 
 process didn't result in invalid code. I would not advise putting something 
 like this live against the world, as the potential for DOS is extremely 
 great. Anything that can pull the kinds of loads you get when you load a 
 huge file into memory and start fiddling with it should not be public! 

 Walter 

  
  On Friday, October 11, 2013 8:44:22 AM UTC-4:30, Walter Lee Davis 
 wrote: 
  
  On Oct 10, 2013, at 4:50 PM, Monserrat Foster wrote: 
  
   A coworker suggested I should use just basic OOP for this, to create 
 a class that reads files, and then another to load the files into memory. 
 Could please point me in the right direction for this (where can I read 
 about it)? I have no idea what's he talking about, as I've never done this 
 before. 
  
  How many of these files are you planning to parse at any one time? Do 
 you have the memory on your server to deal with this load? I can see this 
 approach working, but getting slow and process-bound very quickly. Lots of 
 edge cases to deal with when parsing big uploaded files. 
  
  Walter 
  
   
   I'll look up nokogiri and SAX 
   
   On Thursday, October 10, 2013 4:12:33 PM UTC-4:30, Walter Lee Davis 
 wrote: 
   On Oct 10, 2013, at 4:36 PM, Monserrat Foster wrote: 
   
Hello, I'm developing an app that basically, receives a 10MB or 
 less XLSX files with +3 rows or so, and another XLSX file with about 
 200rows, I have to read one row of the smallest file, look it up on the 
 largest file and write data from both files to a new one. 
   
   Wow. Do you have to do all this in a single request? 
   
   You may want to look at Nokogiri and its SAX parser. SAX parsers 
 don't care about the size of the document they operate on, because they 
 work one node at a time, and don't load the whole thing into memory at 
 once. There are some limitations on what kind of work a SAX parser can 
 perform, because it isn't able to see the entire document and know where 
 it is within the document at any point. But for certain kinds of problems, 
 it can be the only way to go. Sounds like you may need something like this. 
   
   Walter 
   

I just did a test reading a few rows from the largest file using 
 ROO (Spreadsheet doesn't support XSLX and Creek look good but I can't find 
 a way to read row by row) 
and it basically made my computer crash, the server crashed, I 
 tried rebooting it and it said It was already started, anyway, it was a 
 disaster. 

So, my question was, is there gem that works best with large XLSX 
 files or is there another way to approach this withouth crashing my 
 computer? 

This 

Re: [Rails] What's the best way to approach reading and parse large XLSX files?

2013-10-11 Thread Walter Lee Davis

On Oct 11, 2013, at 4:33 PM, Monserrat Foster wrote:

 This is an everyday, initially maybe a couple people at the same time 
 uploading and parsing files to generate the new one, but eventually it will 
 extend to other people, so...
 
 I used a logger and It does retrieve and save the files using the 
 comparation. But it takes forever, like 30min or so in generating the file. 
 The process starts as soon as the files are uploaded but it seems to be 
 taking most of the time into opening the file, once it's opened it takes 
 maybe 5min at most to generate the new file.
 
 Do you know where can i find an example on how to read an xlsx file with 
 nokogiri? I can't seem to find one

XLSX is just an Excel file expressed in XML. It's no different than parsing any 
other XML file. First, find a good basic example of file parsing with Nokogiri: 
http://nokogiri.org/tutorials/searching_a_xml_html_document.html Next, open up 
your file in a text editor, and look for the elements you want to access. You 
can use either XPath or CSS syntax to locate your elements, and Nokogiri allows 
you to access either attributes or content of any element you can locate. If 
you run into trouble with all the prefixes that Microsoft likes to litter their 
formats with, you can call remove_namespaces! to clean that right up.
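
As a rough illustration of that approach (a sketch, assuming the rubyzip gem 
for unpacking the .xlsx container; entry names follow the standard 
SpreadsheetML layout, and the file name is made up):

require 'zip'
require 'nokogiri'

Zip::File.open("inventory.xlsx") do |zip|
  # Cell text for string cells lives in a separate shared-strings table.
  shared = Nokogiri::XML(zip.read("xl/sharedStrings.xml"))
  shared.remove_namespaces!
  strings = shared.xpath("//si").map(&:text)

  sheet = Nokogiri::XML(zip.read("xl/worksheets/sheet1.xml"))
  sheet.remove_namespaces!
  sheet.xpath("//row").each do |row|
    values = row.xpath("c").map do |cell|
      v = cell.at_xpath("v")
      next nil unless v
      cell["t"] == "s" ? strings[v.text.to_i] : v.text  # "s" marks a shared-string cell
    end
    puts values.inspect
  end
end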

Walter


 
 On Friday, October 11, 2013 11:12:20 AM UTC-4:30, Walter Lee Davis wrote:
 
 On Oct 11, 2013, at 11:30 AM, Monserrat Foster wrote: 
 
  One 3+ row file and another with just over 200. How much memory should 
  I need for this not to take forever parsing? (I'm currently using my 
  computer as server and I can see ruby taking about 1GB in the task manager 
  when processing this (and it takes forever). 
  
  The 3+ row file is about 7MB, which is not that much (I think) 
 
 I have a collection of 1200 XML files, ranging in size from 3MB to 12MB each 
 (they're books, in TEI encoding) that I parse with Nokogiri on a 2GB Joyent 
 SmartMachine to convert them to XHTML and then on to Epub. This process takes 
 17 minutes for the first pass, and 24 minutes for the second pass. It does 
 not crash, but the server is unable to do much of anything else while the 
 loop is running. 
 
 My question here was, is this something that is a self-serve web service, or 
 an admin-level (one-privileged-user-once-in-a-while) type thing? In my case, 
 there's one admin who adds maybe two or three books per month to the 
 collection, and the 40-minute do-everything loop was used only for 
 development purposes -- it was my test cycle as I checked all of the titles 
 against a validator to ensure that my adjustments to the transcoding process 
 didn't result in invalid code. I would not advise putting something like this 
 live against the world, as the potential for DOS is extremely great. Anything 
 that can pull the kinds of loads you get when you load a huge file into 
 memory and start fiddling with it should not be public! 
 
 Walter 
 
  
  On Friday, October 11, 2013 8:44:22 AM UTC-4:30, Walter Lee Davis wrote: 
  
  On Oct 10, 2013, at 4:50 PM, Monserrat Foster wrote: 
  
   A coworker suggested I should use just basic OOP for this, to create a 
   class that reads files, and then another to load the files into memory. 
   Could please point me in the right direction for this (where can I read 
   about it)? I have no idea what's he talking about, as I've never done 
   this before. 
  
  How many of these files are you planning to parse at any one time? Do you 
  have the memory on your server to deal with this load? I can see this 
  approach working, but getting slow and process-bound very quickly. Lots of 
  edge cases to deal with when parsing big uploaded files. 
  
  Walter 
  
   
   I'll look up nokogiri and SAX 
   
   On Thursday, October 10, 2013 4:12:33 PM UTC-4:30, Walter Lee Davis 
   wrote: 
   On Oct 10, 2013, at 4:36 PM, Monserrat Foster wrote: 
   
Hello, I'm developing an app that basically, receives a 10MB or less 
XLSX files with +3 rows or so, and another XLSX file with about 
200rows, I have to read one row of the smallest file, look it up on the 
largest file and write data from both files to a new one. 
   
   Wow. Do you have to do all this in a single request? 
   
   You may want to look at Nokogiri and its SAX parser. SAX parsers don't 
   care about the size of the document they operate on, because they work 
   one node at a time, and don't load the whole thing into memory at once. 
   There are some limitations on what kind of work a SAX parser can perform, 
   because it isn't able to see the entire document and know where it is 
   within the document at any point. But for certain kinds of problems, it 
   can be the only way to go. Sounds like you may need something like this. 
   
   Walter 
   

I just did a test reading a few rows from the largest file using ROO 
(Spreadsheet doesn't support XSLX and Creek look good but I can't find 

[Rails] What's the best way to approach reading and parse large XLSX files?

2013-10-10 Thread Monserrat Foster
Hello, I'm developing an app that, basically, receives a 10MB-or-less XLSX 
file with 3+ rows or so, and another XLSX file with about 200 rows. I 
have to read one row of the smallest file, look it up in the largest file, 
and write data from both files to a new one. 

I just did a test reading a few rows from the largest file using Roo 
(Spreadsheet doesn't support XLSX, and Creek looks good but I can't find a 
way to read row by row), and it basically made my computer crash: the server 
crashed, I tried rebooting it and it said it was already started; anyway, it 
was a disaster.

So, my question was: is there a gem that works best with large XLSX files, or 
is there another way to approach this without crashing my computer?

This is what I had (it's very possible I'm doing it wrong; help is welcome).
What I was trying to do here was to process the files and create the new 
XLS file after both of the XLSX files were uploaded:


require 'roo'
require 'spreadsheet'
require 'creek'
class UploadFiles < ActiveRecord::Base
  after_commit :process_files
  attr_accessible :inventory, :material_list
  has_one :inventory
  has_one :material_list
  has_attached_file :inventory, :url => "/:current_user/inventory",
    :path => ":rails_root/tmp/users/uploaded_files/inventory/inventory.:extension"
  has_attached_file :material_list, :url => "/:current_user/material_list",
    :path => ":rails_root/tmp/users/uploaded_files/material_list/material_list.:extension"
  validates_attachment_presence :material_list
  accepts_nested_attributes_for :material_list, :allow_destroy => true
  accepts_nested_attributes_for :inventory, :allow_destroy => true
  validates_attachment_content_type :inventory, :content_type =>
    ["application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"],
    :message => "Only .XLSX files are accepted as Inventory"
  validates_attachment_content_type :material_list, :content_type =>
    ["application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"],
    :message => "Only .XLSX files are accepted as Material List"


  def process_files
    inventory = Creek::Book.new(Rails.root.to_s +
      "/tmp/users/uploaded_files/inventory/inventory.xlsx")
    material_list = Creek::Book.new(Rails.root.to_s +
      "/tmp/users/uploaded_files/material_list/material_list.xlsx")
    inventory = inventory.sheets[0]
    scl = Spreadsheet::Workbook.new
    sheet1 = scl.create_worksheet
    inventory.rows.each do |row|
      row.inspect
      sheet1.row(1).push(row)   # note: every input row is pushed onto output row 1
    end

    sheet1.name = "Site Configuration List"
    scl.write(Rails.root.to_s +
      "/tmp/users/generated/siteconfigurationlist.xls")
  end
end
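
(For what it's worth, Creek's rows enumerator does stream the sheet row by 
row rather than loading it all at once; a small sketch based on Creek's 
documented interface, with a made-up path. Each yielded row is a hash keyed 
by cell reference:)

require 'creek'

book  = Creek::Book.new("/tmp/users/uploaded_files/inventory/inventory.xlsx")
sheet = book.sheets[0]

sheet.rows.each do |row|
  # row looks like {"A1" => "SKU", "B1" => "Qty", ...}
  puts row.values.inspect
end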



Re: [Rails] What's the best way to approach reading and parse large XLSX files?

2013-10-10 Thread Walter Lee Davis
On Oct 10, 2013, at 4:36 PM, Monserrat Foster wrote:

 Hello, I'm developing an app that basically, receives a 10MB or less XLSX 
 files with +3 rows or so, and another XLSX file with about 200rows, I 
 have to read one row of the smallest file, look it up on the largest file and 
 write data from both files to a new one. 

Wow. Do you have to do all this in a single request?

You may want to look at Nokogiri and its SAX parser. SAX parsers don't care 
about the size of the document they operate on, because they work one node at a 
time, and don't load the whole thing into memory at once. There are some 
limitations on what kind of work a SAX parser can perform, because it isn't 
able to see the entire document and know where it is within the document at 
any point. But for certain kinds of problems, it can be the only way to go. 
Sounds like you may need something like this.
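
A bare-bones example of the kind of handler Walter means (a sketch; it isn't 
tied to the XLSX format, it just counts row elements as they stream past):

require 'nokogiri'

class RowCounter < Nokogiri::XML::SAX::Document
  attr_reader :count

  def initialize
    @count = 0
  end

  # Called once per opening tag; the parser never holds the whole tree in memory.
  def start_element(name, attrs = [])
    @count += 1 if name == "row"
  end
end

handler = RowCounter.new
Nokogiri::XML::SAX::Parser.new(handler).parse(File.open("sheet1.xml")) # example file
puts handler.count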

Walter

 
 I just did a test reading a few rows from the largest file using ROO 
 (Spreadsheet doesn't support XSLX and Creek look good but I can't find a way 
 to read row by row)
 and it basically made my computer crash, the server crashed, I tried 
 rebooting it and it said It was already started, anyway, it was a disaster.
 
 So, my question was, is there gem that works best with large XLSX files or is 
 there another way to approach this withouth crashing my computer?
 
 This is what I had (It's very possible I'm doing it wrong, help is welcome)
 What i was trying to do here, was to process the files and create the new XLS 
 file after both of the XLSX files were uploaded:
 
 
 require 'roo'
 require 'spreadsheet'
 require 'creek'
 class UploadFiles  ActiveRecord::Base
   after_commit :process_files
   attr_accessible :inventory, :material_list
   has_one :inventory
   has_one :material_list
   has_attached_file :inventory, :url=/:current_user/inventory, 
 :path=:rails_root/tmp/users/uploaded_files/inventory/inventory.:extension
   has_attached_file :material_list, :url=/:current_user/material_list, 
 :path=:rails_root/tmp/users/uploaded_files/material_list/material_list.:extension
   validates_attachment_presence :material_list
   accepts_nested_attributes_for :material_list, :allow_destroy = true  
   accepts_nested_attributes_for :inventory, :allow_destroy = true  
   validates_attachment_content_type :inventory, :content_type = 
 [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet], 
 :message = Only .XSLX files are accepted as Inventory
   validates_attachment_content_type :material_list, :content_type = 
 [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet], 
 :message = Only .XSLX files are accepted as Material List
   
   
   def process_files
 inventory =  Creek::Book.new(Rails.root.to_s + 
 /tmp/users/uploaded_files/inventory/inventory.xlsx)
 material_list = Creek::Book.new(Rails.root.to_s + 
 /tmp/users/uploaded_files/material_list/material_list.xlsx)
 inventory = inventory.sheets[0]
 scl = Spreadsheet::Workbook.new
 sheet1 = scl.create_worksheet
 inventory.rows.each do |row|
   row.inspect
   sheet1.row(1).push(row)
 end
 
 sheet1.name = Site Configuration List
 scl.write(Rails.root.to_s + 
 /tmp/users/generated/siteconfigurationlist.xls)
   end
 end
 
 



Re: [Rails] What's the best way to approach reading and parse large XLSX files?

2013-10-10 Thread Monserrat Foster
A coworker suggested I should use just basic OOP for this: to create a 
class that reads files, and then another to load the files into memory. 
Could you please point me in the right direction for this (where can I read 
about it)? I have no idea what he's talking about, as I've never done this 
before. 

I'll look up Nokogiri and SAX.

On Thursday, October 10, 2013 4:12:33 PM UTC-4:30, Walter Lee Davis wrote:

 On Oct 10, 2013, at 4:36 PM, Monserrat Foster wrote: 

  Hello, I'm developing an app that basically, receives a 10MB or less 
 XLSX files with +3 rows or so, and another XLSX file with about 
 200rows, I have to read one row of the smallest file, look it up on the 
 largest file and write data from both files to a new one. 

 Wow. Do you have to do all this in a single request? 

 You may want to look at Nokogiri and its SAX parser. SAX parsers don't 
 care about the size of the document they operate on, because they work one 
 node at a time, and don't load the whole thing into memory at once. There 
 are some limitations on what kind of work a SAX parser can perform, because 
 it isn't able to see the entire document and know where it is within the 
 document at any point. But for certain kinds of problems, it can be the 
 only way to go. Sounds like you may need something like this. 

 Walter 

  
  I just did a test reading a few rows from the largest file using ROO 
 (Spreadsheet doesn't support XSLX and Creek look good but I can't find a 
 way to read row by row) 
  and it basically made my computer crash, the server crashed, I tried 
 rebooting it and it said It was already started, anyway, it was a disaster. 
  
  So, my question was, is there gem that works best with large XLSX files 
 or is there another way to approach this withouth crashing my computer? 
  
  This is what I had (It's very possible I'm doing it wrong, help is 
 welcome) 
  What i was trying to do here, was to process the files and create the 
 new XLS file after both of the XLSX files were uploaded: 
  
  
  require 'roo' 
  require 'spreadsheet' 
  require 'creek' 
  class UploadFiles  ActiveRecord::Base 
after_commit :process_files 
attr_accessible :inventory, :material_list 
has_one :inventory 
has_one :material_list 
has_attached_file :inventory, :url=/:current_user/inventory, 
 :path=:rails_root/tmp/users/uploaded_files/inventory/inventory.:extension 

has_attached_file :material_list, 
 :url=/:current_user/material_list, 
 :path=:rails_root/tmp/users/uploaded_files/material_list/material_list.:extension
  

validates_attachment_presence :material_list 
accepts_nested_attributes_for :material_list, :allow_destroy = true   
accepts_nested_attributes_for :inventory, :allow_destroy = true   
validates_attachment_content_type :inventory, :content_type = 
 [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet], 
 :message = Only .XSLX files are accepted as Inventory 
validates_attachment_content_type :material_list, :content_type = 
 [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet], 
 :message = Only .XSLX files are accepted as Material List 


def process_files 
  inventory =  Creek::Book.new(Rails.root.to_s + 
 /tmp/users/uploaded_files/inventory/inventory.xlsx) 
  material_list = Creek::Book.new(Rails.root.to_s + 
 /tmp/users/uploaded_files/material_list/material_list.xlsx) 
  inventory = inventory.sheets[0] 
  scl = Spreadsheet::Workbook.new 
  sheet1 = scl.create_worksheet 
  inventory.rows.each do |row| 
row.inspect 
sheet1.row(1).push(row) 
  end 
  
  sheet1.name = Site Configuration List 
  scl.write(Rails.root.to_s + 
 /tmp/users/generated/siteconfigurationlist.xls) 
end 
  end 
  
  


