Re: [HACKERS] Netflix Prize data
Mark Woodward wrote:
> I tried to cluster the data along a particular index but had to cancel
> it after 3 hours.

If the data is in random order, it's faster to do SELECT * INTO foo_sorted
FROM foo ORDER BY bar, then CREATE INDEX, than to run CLUSTER. That's
because CLUSTER does a full index scan of the table, which is slower than
a seqscan + sort if the table is not already clustered.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
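
A minimal sketch of the ordered-copy approach described above; foo,
foo_sorted, and bar are the placeholder names from the message, and the
index name is assumed:

    -- Copy the table in the desired physical order, then index the copy;
    -- often faster than CLUSTER when the source table is in random order.
    SELECT * INTO foo_sorted FROM foo ORDER BY bar;
    CREATE INDEX foo_sorted_bar_idx ON foo_sorted (bar);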
Re: [HACKERS] Netflix Prize data
> "Mark Woodward" <[EMAIL PROTECTED]> writes:
>> The rating, however, is one char 1~9. Would making it a char(1) buy
>> anything?
>
> No, that would actually hurt because of the length word for the char
> field. Even if you used the "char" type, which really is only one byte,
> you wouldn't win anything because of alignment issues. Personally I'd
> just go for three ints and a date, rather than trying to be cute with
> the rating.

Actually, the date is just days, right? I don't actually need it that
much. So, create a smallint for the date and do this:
smalldate = rdate - date '1970-01-01'. And use a smallint for the rating
as well:

 Column | Type     | Modifiers
--------+----------+-----------
 movie  | integer  |
 client | integer  |
 day    | smallint |
 rating | smallint |
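
A sketch of that conversion, assuming the columns of the original
"ratings" table (item, client, rating, rdate as text); the target table
name is made up:

    -- Days since 1970-01-01 fit comfortably in a smallint for 2005 dates.
    CREATE TABLE ratings_small AS
    SELECT item                                          AS movie,
           client,
           (rdate::date - DATE '1970-01-01')::smallint   AS day,
           rating::smallint                              AS rating
    FROM   ratings;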
Re: [HACKERS] Netflix Prize data
"Mark Woodward" <[EMAIL PROTECTED]> writes:
> The rating, however, is one char 1~9. Would making it a char(1) buy
> anything?

No, that would actually hurt because of the length word for the char
field. Even if you used the "char" type, which really is only one byte,
you wouldn't win anything because of alignment issues. Personally I'd
just go for three ints and a date, rather than trying to be cute with
the rating.

			regards, tom lane
Re: [HACKERS] Netflix Prize data
> "Greg Sabino Mullane" <[EMAIL PROTECTED]> writes:
>
>> CREATE TABLE rating (
>>   movie SMALLINT NOT NULL,
>>   person INTEGER NOT NULL,
>>   rating SMALLINT NOT NULL,
>>   viewed DATE NOT NULL
>> );
>
> You would probably be better off putting the two smallints first
> followed by the integer and date. Otherwise both the integer and the
> date field will have an extra two bytes of padding, wasting 4 bytes of
> space.
>
> If you reorder the fields that way you'll be down to 28 bytes of tuple
> header overhead and 12 bytes of data. There's actually another 4 bytes
> in the form of the line pointer, so a total of 44 bytes per record.
> I.e., almost 73% of the disk I/O you're seeing is actually per-record
> overhead.

That's good advice; however, it is said that Netflix has more than 64K
movies, so while the test data may work with a smallint, I doubt the
overall system would. The rating, however, is one char, 1~9. Would making
it a char(1) buy anything?

I wonder, if I started screwing around with movie ID and rating, whether
I could move them into one int: one byte for the rating, three bytes for
the movie ID. That could reduce the data size by at least half a gig.
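
A purely illustrative sketch of that packing idea (the expressions are
hypothetical, not part of any proposed schema); the rating goes in the
high byte and the movie ID in the low three bytes:

    -- Pack a small rating (<= 127) and a movie id (< 16777216) into one
    -- 4-byte integer, then unpack them again.
    SELECT (rating << 24) | movie               AS packed,
           ((rating << 24) | movie) >> 24       AS rating_back,
           ((rating << 24) | movie) & 16777215  AS movie_back
    FROM   (SELECT 5 AS rating, 12345 AS movie) AS t;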
Re: [HACKERS] Netflix Prize data
"Greg Sabino Mullane" <[EMAIL PROTECTED]> writes:

> CREATE TABLE rating (
>   movie SMALLINT NOT NULL,
>   person INTEGER NOT NULL,
>   rating SMALLINT NOT NULL,
>   viewed DATE NOT NULL
> );

You would probably be better off putting the two smallints first followed
by the integer and date. Otherwise both the integer and the date field
will have an extra two bytes of padding, wasting 4 bytes of space.

If you reorder the fields that way you'll be down to 28 bytes of tuple
header overhead and 12 bytes of data. There's actually another 4 bytes in
the form of the line pointer, so a total of 44 bytes per record. I.e.,
almost 73% of the disk I/O you're seeing is actually per-record overhead.

--
Gregory Stark
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Netflix Prize data
> "Mark Woodward" <[EMAIL PROTECTED]> writes:
>> The one thing I notice is that it is REAL slow.
>
> How fast is your disk? Counting on my fingers, I estimate you are
> scanning the table at about 47MB/sec, which might or might not be
> disk-limited...
>
>> I'm using 8.1.4. The "rdate" field looks something like: "2005-09-06"
>
> So why aren't you storing it as type "date"?

You are assuming I gave it any thought at all. :-)

I converted it to a date type (create table ratings2 as ...):

[EMAIL PROTECTED]:~/netflix/download$ time psql -c "select count(*) from ratings" netflix
   count
-----------
 100480507
(1 row)

real    1m29.852s
user    0m0.002s
sys     0m0.005s

That's about the right increase based on the reduction in data size. OK,
I guess I am crying wolf; 47MB/sec isn't all that bad for the system.
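
A sketch of the conversion being referred to; the ratings2 name is from
the message, while the column list is assumed from the original schema:

    -- Rewrite the table with the text date parsed into a 4-byte date.
    CREATE TABLE ratings2 AS
    SELECT item, client, rating, rdate::date AS rdate
    FROM   ratings;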
Re: [HACKERS] Netflix Prize data
>> I signed up for the Netflix Prize. (www.netflixprize.com)
>> and downloaded their data and have imported it into PostgreSQL.
>> Here is how I created the table:
>
> I signed up as well, but have the table as follows:
>
> CREATE TABLE rating (
>   movie SMALLINT NOT NULL,
>   person INTEGER NOT NULL,
>   rating SMALLINT NOT NULL,
>   viewed DATE NOT NULL
> );
>
> I also recommend not loading the entire file until you get further
> along in the algorithm solution. :)
>
> Not that I have time to really play with this

As luck would have it, I wrote a recommendations system based on music
ratings a few years ago. After reading the NYT article, it seems as
though one or more of the guys behind "Net Perceptions" is either helping
them or built their system, I'm not sure. I wrote my system because Net
Perceptions was too slow and did a lousy job.

I think the notion of "communities" in general is an interesting study in
statistics, but everything I've seen in the form of bad recommendations
shows that while [N] people may share certain tastes, that doesn't
necessarily mean that what one likes the others do. This is especially
flawed with movie rentals because it is seldom a 1:1 ratio of movies to
people. There are often multiple people in a household, and movies are
almost always watched by multiple people.

Anyway, good luck! (Not better than me, of course :-)
Re: [HACKERS] Netflix Prize data
> I signed up for the Netflix Prize. (www.netflixprize.com)
> and downloaded their data and have imported it into PostgreSQL.
> Here is how I created the table:

I signed up as well, but have the table as follows:

CREATE TABLE rating (
  movie SMALLINT NOT NULL,
  person INTEGER NOT NULL,
  rating SMALLINT NOT NULL,
  viewed DATE NOT NULL
);

I also recommend not loading the entire file until you get further along
in the algorithm solution. :)

Not that I have time to really play with this...

--
Greg Sabino Mullane [EMAIL PROTECTED]
PGP Key: 0x14964AC8 200610041827
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
Re: [HACKERS] Netflix Prize data
"Mark Woodward" <[EMAIL PROTECTED]> writes:

> I'm using 8.1.4. The "rdate" field looks something like: "2005-09-06".
> So, the raw data is 23 bytes, the date string will probably be rounded
> up to 12 bytes, that's 24 bytes per row of data. What is the overhead
> per variable? per row?
>
> Is there any advantage to using "varchar(10)" over "text"?

I'll second the "use a date" comment. But to answer the questions: text
and varchar are handled identically in almost every respect. The overhead
per variable-width field (like text or varchar) is 4 bytes. The overhead
per row depends on a few factors, but figure 28 bytes.

--
Gregory Stark
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Netflix Prize data
"Mark Woodward" <[EMAIL PROTECTED]> writes:
> The one thing I notice is that it is REAL slow.

How fast is your disk? Counting on my fingers, I estimate you are
scanning the table at about 47MB/sec, which might or might not be
disk-limited...

> I'm using 8.1.4. The "rdate" field looks something like: "2005-09-06"

So why aren't you storing it as type "date"?

			regards, tom lane
Re: [HACKERS] Netflix Prize data
Mark,

On 10/4/06 1:43 PM, "Mark Woodward" <[EMAIL PROTECTED]> wrote:

> [EMAIL PROTECTED]:~/netflix$ time psql netflix -c "select count(*) from
> ratings"
>    count
> -----------
>  100480507
> (1 row)
>
> real    2m6.270s
> user    0m0.004s
> sys     0m0.005s

I think you are getting about 40MB/s on your sequential scan of about 5GB
of heap data in this case. I calculate the size of the data as:

  3 integers (12 bytes), one text date field (10 bytes?) and tuple
  overhead (24 bytes) = 46 bytes per row

  100 million rows x 46 bytes/row = 4.6 GB

- Luke
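
One way to check that estimate against the actual on-disk size (the size
functions below are available in 8.1, which the original poster is
running; the table name is from his schema):

    -- Heap size only (indexes excluded), and average bytes per row.
    SELECT pg_size_pretty(pg_relation_size('ratings')) AS heap_size,
           pg_relation_size('ratings') / 100480507.0   AS bytes_per_row;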
[HACKERS] Netflix Prize data
I signed up for the Netflix Prize (www.netflixprize.com), downloaded
their data, and imported it into PostgreSQL. Here is how I created the
table:

        Table "public.ratings"
 Column | Type    | Modifiers
--------+---------+-----------
 item   | integer |
 client | integer |
 rating | integer |
 rdate  | text    |
Indexes:
    "ratings_client" btree (client)
    "ratings_item" btree (item)

[EMAIL PROTECTED]:~/netflix$ time psql netflix -c "select count(*) from ratings"
   count
-----------
 100480507
(1 row)

real    2m6.270s
user    0m0.004s
sys     0m0.005s

The one thing I notice is that it is REAL slow. I know it is, in fact,
100 million records, but I don't think PostgreSQL is usually this slow.
I'm going to check with some other machines to see if there is a problem
with my test machine or if something is weird about PostgreSQL and large
numbers of rows. I tried to cluster the data along a particular index but
had to cancel it after 3 hours.

I'm using 8.1.4. The "rdate" field looks something like "2005-09-06". So,
the raw data is 23 bytes; the date string will probably be rounded up to
12 bytes, so that's 24 bytes per row of data. What is the overhead per
variable? Per row?

Is there any advantage to using "varchar(10)" over "text"?