I'm not sure I completely understand the format you're trying to use,
but here's the first part of the algorithm, which seems to be the one
you don't know how to do.

The key is dyadic <;.1 (cut), which you can look up under ;. in the J Dictionary.
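
For a quick feel for what dyadic <;.1 does, here is a tiny session on made-up data. The names keys and items are hypothetical and used only for illustration. Each 1 in the left argument starts a new box, and anything before the first 1 is dropped:

```j
   1 0 0 1 0 <;.1 'abcde'
┌───┬──┐
│abc│de│
└───┴──┘
   0 1 0 1 0 <;.1 'abcde'      NB. leading 'a' comes before the first 1, so it is dropped
┌──┬──┐
│bc│de│
└──┴──┘
   keys =. 'k1';'k2'           NB. hypothetical marker names
   items =. 'k1';'a';'b';'k2';'c'
   items e. keys               NB. 1 wherever an item is one of the markers
1 0 0 1 0
   (items e. keys) <;.1 items
┌────────┬──────┐
│┌──┬─┬─┐│┌──┬─┐│
││k1│a│b│││k2│c││
│└──┴─┴─┘│└──┴─┘│
└────────┴──────┘
```

The dropping of leading items may also be a way to get rid of lines before the first marker, though I have not tried that on your real file.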

   ] attributes =. (<'Location'),~ 'Attribute '&,&.> ":&.>i.4
┌───────────┬───────────┬───────────┬───────────┬────────┐
│Attribute 0│Attribute 1│Attribute 2│Attribute 3│Location│
└───────────┴───────────┴───────────┴───────────┴────────┘
   NB. Append Attribute 0 to the beginning and split on attributes
   ] d1split =. (<;.1~ e.&attributes) ({.attributes) , <;._2 d1
┌───────────────────┬───────────────────┬───────────────────────────┬────────────────────────────┬─────────────────────────┬────────────────────┬──────────────────┬───────────────┐
│┌───────────┬─────┐│┌───────────┬─────┐│┌───────────┬───────┬─────┐│┌────────┬────┬───────┬────┐│┌───────────┬─────┬─────┐│┌───────────┬──────┐│┌───────────┬────┐│┌────────┬────┐│
││Attribute 0│alpha│││Attribute 1│bravo│││Attribute 2│charlie│delta│││Location│echo│foxtrot│golf│││Attribute 3│hotel│india│││Attribute 1│juliet│││Attribute 2│kilo│││Location│lima││
│└───────────┴─────┘│└───────────┴─────┘│└───────────┴───────┴─────┘│└────────┴────┴───────┴────┘│└───────────┴─────┴─────┘│└───────────┴──────┘│└───────────┴────┘│└────────┴────┘│
└───────────────────┴───────────────────┴───────────────────────────┴────────────────────────────┴─────────────────────────┴────────────────────┴──────────────────┴───────────────┘
   NB. Merge multipart attributes except for location
   ({. ([,<@:(;:^:_1)@]^:((<'Location')~:[)) }.)&.> d1split
┌───────────────────┬───────────────────┬───────────────────────────┬────────────────────────────┬─────────────────────────┬────────────────────┬──────────────────┬───────────────┐
│┌───────────┬─────┐│┌───────────┬─────┐│┌───────────┬─────────────┐│┌────────┬────┬───────┬────┐│┌───────────┬───────────┐│┌───────────┬──────┐│┌───────────┬────┐│┌────────┬────┐│
││Attribute 0│alpha│││Attribute 1│bravo│││Attribute 2│charlie delta│││Location│echo│foxtrot│golf│││Attribute 3│hotel india│││Attribute 1│juliet│││Attribute 2│kilo│││Location│lima││
│└───────────┴─────┘│└───────────┴─────┘│└───────────┴─────────────┘│└────────┴────┴───────┴────┘│└───────────┴───────────┘│└───────────┴──────┘│└───────────┴────┘│└────────┴────┘│
└───────────────────┴───────────────────┴───────────────────────────┴────────────────────────────┴─────────────────────────┴────────────────────┴──────────────────┴───────────────┘
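
The merging in that last line is done by ;:^:_1 , the inverse of ;: (words), which joins a list of boxed strings with single spaces. A minimal illustration, using two of the values from the sample data:

```j
   ;:^:_1 'charlie';'delta'
charlie delta
   < ;:^:_1 'charlie';'delta'   NB. boxed again, as in the merged result
┌─────────────┐
│charlie delta│
└─────────────┘
```

The ^:((<'Location')~:[) part applies that join only when the first box of the group is not Location, so the Location values stay in separate boxes.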

I'm confused about how you are handling multiple values for one of the
attributes (like Attribute 1 here). If you give more detail I can give
some hints on that. Of course, if you can do it yourself, that's even
better!

Marshall

On Tue, Oct 23, 2012 at 01:10:55PM -0700, Bill Harris wrote:
> I get to J little enough these days so I'm a bit rusty when it comes
> to the interesting stuff, and I'm stuck on a particular problem.
> 
> I start with a PDF report.  I run it through pdftotext and then
> format/zulu's a2b to get a file that is mostly of the form
> 
> value
> attribute
> value
> attribute
> value
> .
> .
> .
> value
> value
> attribute
> value
> .
> .
> .
> 
> The first value of each entry has no explicit attribute name, although
> "entry name" would be a suitable attribute name.  Some attributes span
> multiple rows, and attributes may be of any reasonable length and do
> include whitespace.  I know the set of attribute names, and some
> include whitespace, too.  Some entries don't use all attributes.
> 
> There's one other complication: one attribute (call it 'location,' if
> you will) has multiple rows that indicate multiple locations.  I need
> to duplicate the full entry for each location listed in that entry.
> 
> For others' use, I want to output a csv file that has one entry per
> row and each attribute in a separate column, with empty cells where
> the attribute wasn't used.  I can then sort, search, and aggregate
> inside J, as I wish, to process further myself.
> 
> Here's an example bit of data:
> 
> d1=: 0 : 0
> alpha
> Attribute 1
> bravo
> Attribute 2
> charlie
> delta
> Location
> echo
> foxtrot
> golf
> Attribute 3
> hotel
> india
> Attribute 1
> juliet
> Attribute 2
> kilo
> Location
> lima
> )
> 
> Here's what I think I want it to look like at an intermediate step:
> 
> d2 =: 0 : 0
> Attribute 0: alpha
> Attribute 1: bravo
> Attribute 2: charlie delta
> Location: echo
> Attribute 3: hotel
> Attribute 0: alpha
> Attribute 1: bravo
> Attribute 2: charlie delta
> Location: foxtrot
> Attribute 3: hotel
> Attribute 0: alpha
> Attribute 1: bravo
> Attribute 2: charlie delta
> Location: golf
> Attribute 3: hotel
> Attribute 0: india
> Attribute 1: juliet
> Attribute 2: kilo
> Location: lima
> Attribute 3:
> )
> 
> Attribute 0 is always a one-liner, so I detect its value by backing up
> one from 'Attribute 1'.  (I didn't pick the file format. :-) )
> 
> There are about 20-40 lines at the start that I need to
> drop--everything before the first instance of a value for Attribute 0.
> 
> The final result, ready for analysis, would look something like
> 
> d3 =: 4 5 $  <;._2 d2
> 
> Better, it would look like that with everything up to and including
> the first ':' elided (the value entries can include multiple colons)
> and with the attributes as a header row.  I can manage the header, and
> I'm pretty sure I can manage stripping out attribute names.
> 
> I've looked at JfC chapter 23 as a potentially useful spot, but I
> haven't yet seen the light.  Suggestions of fruitful paths forward?
> 
> Thanks,
> 
> Bill
> -- 
> Bill Harris
> http://facilitatedsystems.com/weblog/
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm