Hmm, I wonder whether this would have worked on the scraped ratings data that I had to clean. I did that with XPath in XSLT, but I might take a look and see.
I have 3 different movie data sets from different sources: the one I just posted, containing 3m movies; one with 180k movies (created with JSONiq running against Freebase); and another with about 50k movies. I have managed to cast all of them in XML, so there is plenty of data to experiment with.

I look forward to your trip, I'll be around for a few months myself.

On Wed, Jul 1, 2015 at 2:59 PM, daniela florescu <[email protected]> wrote:

> Ihe,
>
> transforming XQuery to be able to do data cleaning has been a LONG desire
> of mine.
>
> Helena Galhardas was a PhD student of mine. She is now a professor in
> Lisbon.
>
> She and her students wrote the data cleaning package in Zorba - it's 100%
> clean XQuery,
> so you can reuse it for other engines.
>
> Let me know how it goes.
>
> On the 7th I am leaving to Europe for 3-4 months.
>
> I will certainly visit London often.
>
> Hope we can talk, best
> Dana
>
> On Jul 1, 2015, at 11:54 AM, daniela florescu <[email protected]> wrote:
>
> Ihe,
>
> before you load anything anywhere, you need to do data cleaning on this
> data
> if you do integration from the Web and data has no unique ids…..
>
> In particular entity resolution…
>
> Literature is full of data cleaning and entity resolution algorithms.
>
> One that you will find familiar (because it looks very much like XQuery
> :-) is here:
>
> http://www.inesc-id.pt/ficheiros/publicacoes/1259.pdf
>
> Best regards
> Dana
>
> On Jul 1, 2015, at 10:04 AM, Ihe Onwuka <[email protected]> wrote:
>
> You will note that the data doesn't have a unique id. Title certainly
> isn't unique, if you consider how many movies there have been called Batman
> or Treasure Island.
>
> Now I may encounter data about this movie from another source that covers
> different facets, for example its box office takings or movie reviews.
>
> So it's a classic semantic web application. I want to amalgamate disparate
> data about the same fact in one entity.
> As I said I have a transformation
> that does this but it doesn't scale very well because I have to search the
> entire movie base to find the best match. To overcome this I have to adopt
> a mapReduce-ish approach to solve the problem.
>
> The thinking is a graphical representation would eliminate that problem
> because a graph gives me a persistent data structure already indexed for
> retrieval via several different axes, whereas indexes constructed in the
> XSLT transformation for the same purpose are ephemeral and would need to
> be reconstructed every time you ran the transformation.
>
> On Wed, Jul 1, 2015 at 12:46 PM, Peter Hunsberger <
> [email protected]> wrote:
>
>> Should be pretty straightforward to import that into Neo4J or Titan.
>> Neo might be simplest, in particular via conversion of the data into JSON.
>> However, Titan might give you other capabilities such as using Hadoop type
>> processing either for import or for subsequent analytics. Without knowing
>> more about the business requirements can't really give you much more than
>> that...
>>
>> Peter Hunsberger
>>
>> On Wed, Jul 1, 2015 at 11:32 AM, Ihe Onwuka <[email protected]> wrote:
>>
>>> I would like to convert the XML snippet below to a multi-relational
>>> graph representation.
>>> One way is to transform a triple store via RDF. Another which I am less
>>> familiar with is to transform to graphML followed by a subsequent import
>>> into some graph database tool.
>>>
>>> The graphical representation is desirable for processing rather than
>>> visualization reasons. Chiefly I have a matching algorithm implemented in
>>> XSLT which works fine but doesn't scale well, a problem that I think can be
>>> solved with a graphical representation.
>>>
>>> I am keen to hear from my elders and betters on the subject.
>>>
>>> <movie title="20000 lieues sous les mers">
>>>   <actors>
>>>     <person name="Méliès, Georges"/>
>>>   </actors>
>>>   <alias>
>>>     <title title="20,000 Leagues Under the Sea " year="1907"/>
>>>     <title title="Amid the Workings of the Deep " year="1907"/>
>>>     <title title="Deux cent mille lieues sous les mers " year="1907"/>
>>>     <title title="Le cauchemar d'un pêcheur " year="1907"/>
>>>     <title title="Under the Seas " year="1907"/>
>>>   </alias>
>>>   <directors>
>>>     <person name="Méliès, Georges"/>
>>>   </directors>
>>>   <genres>
>>>     <tag name="adventure"/>
>>>     <tag name="fantasy"/>
>>>     <tag name="sci-fi"/>
>>>     <tag name="short"/>
>>>   </genres>
>>>   <keywords>
>>>     <tag name="based-on-novel"/>
>>>     <tag name="dream"/>
>>>     <tag name="fish"/>
>>>     <tag name="number-in-title"/>
>>>     <tag name="submarine"/>
>>>     <tag name="undersea-monster"/>
>>>     <tag name="underwater"/>
>>>   </keywords>
>>>   <producers>
>>>     <person name="Méliès, Georges"/>
>>>   </producers>
>>> </movie>
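For anyone following the thread, here is a rough sketch of the XML-to-graph step being discussed, written in Python rather than XSLT or XQuery purely for brevity. It flattens one <movie> element of the shape quoted above into (subject, edge label, object) tuples, which could then be serialised as RDF triples or GraphML edges. The edge labels, the function name movie_to_edges and the shortened sample are my own assumptions, not anything agreed in the thread. It also uses the title as the node key, which the thread already notes is not unique, and it ignores the year attribute on aliases, so treat it as an illustration only.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the snippet quoted in the thread (assumed shape).
SAMPLE = """<movie title="20000 lieues sous les mers">
  <actors><person name="Méliès, Georges"/></actors>
  <alias>
    <title title="20,000 Leagues Under the Sea " year="1907"/>
  </alias>
  <directors><person name="Méliès, Georges"/></directors>
  <genres><tag name="adventure"/><tag name="sci-fi"/></genres>
</movie>"""

# Hypothetical mapping from container element to edge label.
EDGE_LABELS = {
    "actors": "actor",
    "alias": "alias",
    "directors": "director",
    "genres": "genre",
    "keywords": "keyword",
    "producers": "producer",
}


def movie_to_edges(movie):
    """Return (subject, label, object) tuples for one <movie> element."""
    # Assumption: the title stands in for a node id, although it is not unique.
    subject = movie.get("title")
    edges = []
    for group in movie:
        # Fall back to the element name if no label is mapped.
        label = EDGE_LABELS.get(group.tag, group.tag)
        for item in group:
            # <person> and <tag> carry @name, alias <title> carries @title.
            value = item.get("name") or item.get("title")
            if value is None:
                continue
            # Alias titles in the sample have trailing spaces, so trim them.
            edges.append((subject, label, value.strip()))
    return edges


edges = movie_to_edges(ET.fromstring(SAMPLE))
```

Looping over edges and printing each tuple gives an edge list that a graph database bulk loader or an RDF serialiser could take as input. A real run would need a proper node identifier first, which is where the entity resolution step Dana mentions comes in.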
_______________________________________________
[email protected]
http://x-query.com/mailman/listinfo/talk
