Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines()

2013-05-28 Thread Alister
On Tue, 28 May 2013 08:31:35 +0100, Fábio Santos wrote:

> On 28 May 2013 04:19, "Bryan Britten"  wrote:
>> I'm not familiar with using read(4096), I'll have to look into that.
>> When I tried to just save the file, my computer just sat in limbo for
>> some time and didn't seem to want to process the command.
> 
> That's just file.read with an integer argument. You can read a file by
> chunks by repeatedly calling that function until you get the empty
> string.
> 
>> Based on my *extremely* limited knowledge of JSON, that's definitely
>> the type of file this is. Here is a snippet of what is seen when you
>> log in:
> ...
> That's json. It's pretty big, but not big enough to stall a slow
> computer for more than half a second.
> 
> -
> 
> I've looked for documentation on that method on twitter.
> 
> It seems that it's part of the twitter streaming api.
> 
> https://dev.twitter.com/docs/streaming-apis
> 
> What this means is that the requests aren't supposed to end. They are
> supposed to be read gradually, using the lines to split the response
> into meaningful chunks. That's why you can't read the data and why your
> browser never gets around to downloading it. Both urlopen and your
> browser block while waiting for the request to end.

Are we overlooking the obvious? Why not use one of the Python twitter 
modules to isolate your app from the nitty-gritty details of the twitter 
stream:

https://dev.twitter.com/docs/twitter-libraries

-- 
Given sufficient time, what you put off doing today will get done by 
itself.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines()

2013-05-28 Thread Bryan Britten
Thanks to everyone for the help and insight. I think for now I'll just back 
away from this file and go back to something much easier to practice with. 


Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines()

2013-05-28 Thread Fábio Santos
On 28 May 2013 04:19, "Bryan Britten"  wrote:
> I'm not familiar with using read(4096), I'll have to look into that. When
> I tried to just save the file, my computer just sat in limbo for some time
> and didn't seem to want to process the command.

That's just file.read with an integer argument. You can read a file by
chunks by repeatedly calling that function until you get the empty string.
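
In code, that read-until-empty pattern might look like this (a minimal
sketch; the function name is mine, not from the thread):

```python
def read_in_chunks(fileobj, chunk_size=4096):
    """Yield successive chunks from a file-like object.

    read() returns an empty bytes/str object once the end of the
    file (or stream) is reached, which is the loop's stop signal.
    """
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:          # empty result means end of file
            return
        yield chunk
```

This works the same on a local file opened in binary mode and on the
file-like object returned by urlopen.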

> Based on my *extremely* limited knowledge of JSON, that's definitely the
> type of file this is. Here is a snippet of what is seen when you log in:
> ...

That's JSON. It's pretty big, but not big enough to stall a slow computer
for more than half a second.

-

I've looked for documentation on that method on twitter.

It seems that it's part of the twitter streaming api.

https://dev.twitter.com/docs/streaming-apis

What this means is that the requests aren't supposed to end. They are
supposed to be read gradually, using the lines to split the response into
meaningful chunks. That's why you can't read the data and why your browser
never gets around to downloading it. Both urlopen and your browser block
while waiting for the request to end.

Here's more info on streaming requests on their docs:

https://dev.twitter.com/docs/streaming-apis/processing

For streaming requests in python, I would point you to the requests
library, but I am not sure it handles streaming requests.
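
For what it's worth, requests (a third-party library) does handle
streaming: passing stream=True to requests.get keeps the body from being
downloaded up front, and iter_lines() then yields one line at a time as it
arrives. A minimal sketch, with the per-line parsing factored into a plain
function so it can be shown without hitting the network (the function name
is mine):

```python
import json

def iter_tweets(line_iter):
    """Parse one JSON object per non-empty line, yielding dicts one at a time."""
    for line in line_iter:
        if line:                    # the stream sends blank keep-alive lines
            yield json.loads(line)

# With requests, the call would look roughly like:
#   resp = requests.get(url, stream=True, auth=(user, password))
#   for tweet in iter_tweets(resp.iter_lines()):
#       ...  # each tweet is a dict; only one is held in memory at a time
```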


Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines()

2013-05-27 Thread Bryan Britten
On Monday, May 27, 2013 7:58:05 PM UTC-4, Dave Angel wrote:
> On 05/27/2013 04:47 PM, Bryan Britten wrote:
>> Hey, everyone!
>>
>> I'm very new to Python and have only been using it for a couple of days,
>> but have some experience in programming (albeit mostly statistical
>> programming in SAS or R), so I'm hoping someone can answer this question
>> in a technical way, but without using an abundant amount of jargon.
>>
>> The issue I'm having is that I'm trying to pull information from a
>> website to practice Python with, but I'm having trouble getting the data
>> in a timely fashion. If I use the following code:
>>
>> import json
>> import urllib
>>
>> urlStr = "https://stream.twitter.com/1/statuses/sample.json"
>>
>> twtrDict = [json.loads(line) for line in urllib.urlopen(urlStr)]
>>
>> I get a memory issue. I'm running 32-bit Python 2.7 with 4 gigs of RAM
>> if that helps at all.
>
> Which OS?

I'm operating on Windows 7.

> The first question I'd ask is how big this file is.  I can't tell, since
> it needs a user name & password to actually get the file.

If you have Twitter, you can just use your log-in information to access the 
file.

> But it's not unusual to need at least double that space in memory, and in
> Windoze you're limited to two gig max, regardless of how big your
> hardware might be.
>
> If you separately fetch the file, then you can experiment with it,
> including cutting it down to a dozen lines, and see if you can deal with
> that much.
>
> How could you fetch it?  With wget, with a browser (and saveAs), with a
> simple loop which uses read(4096) repeatedly and writes each block to a
> local file.  Don't forget to use 'wb', as you don't know yet what line
> endings it might use.

I'm not familiar with using read(4096); I'll have to look into that. When I 
tried to just save the file, my computer just sat in limbo for some time and 
didn't seem to want to process the command.

> Once you have an idea what the data looks like, you can answer such
> questions as whether it's json at all, whether the lines each contain a
> single json record, or what.

Based on my *extremely* limited knowledge of JSON, that's definitely the type 
of file this is. Here is a snippet of what is seen when you log in:

{"created_at":"Tue May 28 03:09:23 +0000 2013","id":339216806461972481,"id_str":"339216806461972481","text":"RT @aleon_11: Sigo creyendo que las noches lluviosas me acercan mucho m\u00e1s a ti!","source":"\u003ca href=\"http:\/\/blackberry.com\/twitter\" rel=\"nofollow\"\u003eTwitter for BlackBerry\u00ae\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":310910123,"id_str":"310910123","name":"\u2661","screen_name":"LaMarielita_","location":"","url":null,"description":"MERCADOLOGA & PUBLICISTA EN PROCESO, AMO A MI DIOS & MI FAMILIA\u2665 ME ENCANTA REIRME , MOLESTAR & HABLAR :D BFF, pancho, ale & china :) LY\u2661","protected":false,"followers_count":506,"friends_count":606,"listed_count":1,"created_at":"Sat Jun 04 15:24:19 +0000 2011","favourites_count":207,"utc_offset":-25200,"time_zone":"Mountain Time (US & Canada)","geo_enabled":false,"verified":false,"statuses_count":17241,"lang":"es","contributors_enabled":false,"is_translator":false,"profile_background_color":"FF6699","profile_background_image_url":"http:\/\/a0.twimg.com\/images\/themes\/theme11\/bg.gif","profile_background_image_url_https":"https:\/\/si0.twimg.com\/images\/themes\/theme11\/bg.gif","profile_background_tile":true,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/3720425493\/13a48910e56ca34edeea07ff04075c77_normal.jpeg","profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/3720425493\/13a48910e56ca34edeea07ff04075c77_normal.jpeg","profile_link_color":"B40B43","profile_sidebar_border_color":"CC3366","profile_sidebar_fill_color":"E5507E","profile_text_color":"362720","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweeted_status":{"created_at":"Tue May 28 02:57:40 +0000 2013","id":339213856922537984,"id_str":"339213856922537984","text":"Sigo creyendo que las noches lluviosas me acercan mucho m\u00e1s a ti!","source":"web","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":105252134,"id_str":"105252134","name":"Alejandra Le\u00f3n","screen_name":"aleon_11","location":"Guatemala","url":null,"description":"La vida se disfruta m\u00e1s, cuando no se le pone tanta importancia.","protecte

Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines()

2013-05-27 Thread Dave Angel

On 05/27/2013 04:47 PM, Bryan Britten wrote:

Hey, everyone!

I'm very new to Python and have only been using it for a couple of days, but 
have some experience in programming (albeit mostly statistical programming in 
SAS or R) so I'm hoping someone can answer this question in a technical way, 
but without using an abundant amount of jargon.

The issue I'm having is that I'm trying to pull information from a website to 
practice Python with, but I'm having trouble getting the data in a timely 
fashion. If I use the following code:


import json
import urllib

urlStr = "https://stream.twitter.com/1/statuses/sample.json"

twtrDict = [json.loads(line) for line in urllib.urlopen(urlStr)]


I get a memory issue. I'm running 32-bit Python 2.7 with 4 gigs of RAM if that 
helps at all.


Which OS?

The first question I'd ask is how big this file is.  I can't tell, since 
it needs a user name & password to actually get the file.  But it's not 
unusual to need at least double that space in memory, and in Windoze 
you're limited to two gig max, regardless of how big your hardware might be.


If you separately fetch the file, then you can experiment with it, 
including cutting it down to a dozen lines, and see if you can deal with 
that much.


How could you fetch it?  With wget, with a browser (and saveAs), with a 
simple loop which uses read(4096) repeatedly and writes each block to a 
local file.  Don't forget to use 'wb', as you don't know yet what line 
endings it might use.
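
That simple loop might be sketched like so (Python 3 spelling shown; in
the Python 2 of this thread the opener was urllib.urlopen, and the
function name here is my own):

```python
def save_stream(source, dest_path, block_size=4096):
    """Copy a file-like object to a local file in fixed-size blocks.

    The destination is opened in 'wb' so that whatever line endings the
    data uses pass through untranslated.
    """
    with open(dest_path, "wb") as out:
        while True:
            block = source.read(block_size)
            if not block:          # empty result means end of stream
                break
            out.write(block)

# Usage sketch (Python 3):
#   from urllib.request import urlopen
#   save_stream(urlopen(urlStr), "sample.json")
```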


Once you have an idea what the data looks like, you can answer such 
questions as whether it's json at all, whether the lines each contain a 
single json record, or what.


For all we know, the file might be a few terabytes in size.


--
DaveA


Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines()

2013-05-27 Thread Fábio Santos
On 27 May 2013 22:36, "Bryan Britten"  wrote:
>
> Try to not sigh audibly as I ask what I'm sure are two asinine questions.
>
> 1) How is this approach different from twtrDict = [json.loads(line) for
> line in urllib.urlopen(urlStr)]?
>

The suggested approach made use of generators. Just because you can iterate
over something, that doesn't mean it is all in memory ;)

Check out the difference between range() and xrange() in Python 2.
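
A quick way to see the difference (Python 3 shown, where range() already
behaves like Python 2's xrange()):

```python
import sys

# A list comprehension materializes every element up front and keeps
# them all in memory:
squares_list = [n * n for n in range(100000)]

# A generator expression produces values one at a time, on demand:
squares_gen = (n * n for n in range(100000))

# The generator object itself stays small no matter how many values it
# can yield; the list grows with its element count.
print(sys.getsizeof(squares_list), sys.getsizeof(squares_gen))
```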


Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines()

2013-05-27 Thread Denis McMahon
On Mon, 27 May 2013 14:29:38 -0700, Bryan Britten wrote:

> Try to not sigh audibly as I ask what I'm sure are two asinine
> questions.
> 
> 1) How is this approach different from twtrDict = [json.loads(line) for
> line in urllib.urlopen(urlStr)]?
> 
> 2) How do I tell how many JSON objects are on each line?

Your code at (1) creates a single list of all the JSON objects.

The code you replied to loaded each object, assumed you did something 
with it, and then overwrote it with the next one.

As for (2): either inspection, or errors from the JSON parser.

-- 
Denis McMahon, denismfmcma...@gmail.com


Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines()

2013-05-27 Thread Bryan Britten
Try to not sigh audibly as I ask what I'm sure are two asinine questions. 

1) How is this approach different from twtrDict = [json.loads(line) for line in 
urllib.urlopen(urlStr)]?

2) How do I tell how many JSON objects are on each line?





Re: Reading *.json from URL - json.loads() versus urllib.urlopen.readlines()

2013-05-27 Thread Roy Smith
In article <10be5c62-4c58-4b4f-b00a-82d85ee4e...@googlegroups.com>,
 Bryan Britten  wrote:

> If I use the following code:
> 
> 
> import urllib
> 
> urlStr = "https://stream.twitter.com/1/statuses/sample.json"
> 
> fileHandle = urllib.urlopen(urlStr)
> 
> twtrText = fileHandle.readlines()
> 
> 
> It takes hours (upwards of 6 or 7, if not more) to finish computing the last 
> command.

I'm not surprised!  readlines() reads in the ENTIRE file in one gulp.  
That's a lot of tweets!

> With that being said, my question is whether there is a more efficient manner 
> to do this.

In general, when reading a large file, you want to iterate over lines of 
the file and process each one.  Something like:

for line in urllib.urlopen(urlStr):
   twtrDict = json.loads(line)

You still need to download and process all the data, but at least you 
don't need to store it in memory all at once.  There is an assumption 
here that there's exactly one json object per line.  If that's not the 
case, things might get a little more complicated.
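
For that more complicated case, the standard library's
json.JSONDecoder.raw_decode can pull several objects out of one string,
since it reports the index where each parse stopped. A sketch (the helper
name is mine, not from the thread):

```python
import json

def iter_json_objects(text):
    """Yield every JSON object in a string, even when several share a line.

    raw_decode returns the parsed object together with the index where
    parsing stopped, so we resume from that index until the string is
    exhausted.
    """
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        if text[pos].isspace():    # skip whitespace between objects
            pos += 1
            continue
        obj, pos = decoder.raw_decode(text, pos)
        yield obj
```

This also answers the earlier question of how many JSON objects are on a
line: count what the iterator yields.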


Reading *.json from URL - json.loads() versus urllib.urlopen.readlines()

2013-05-27 Thread Bryan Britten
Hey, everyone! 

I'm very new to Python and have only been using it for a couple of days, but 
have some experience in programming (albeit mostly statistical programming in 
SAS or R) so I'm hoping someone can answer this question in a technical way, 
but without using an abundant amount of jargon.

The issue I'm having is that I'm trying to pull information from a website to 
practice Python with, but I'm having trouble getting the data in a timely 
fashion. If I use the following code:


import json
import urllib

urlStr = "https://stream.twitter.com/1/statuses/sample.json"

twtrDict = [json.loads(line) for line in urllib.urlopen(urlStr)]


I get a memory issue. I'm running 32-bit Python 2.7 with 4 gigs of RAM if that 
helps at all.

If I use the following code:


import urllib

urlStr = "https://stream.twitter.com/1/statuses/sample.json"

fileHandle = urllib.urlopen(urlStr)

twtrText = fileHandle.readlines()


It takes hours (upwards of 6 or 7, if not more) to finish computing the last 
command.

With that being said, my question is whether there is a more efficient manner 
to do this. I'm worried that if it's taking this long to process the 
.readlines() command, trying to work with the data is going to be a 
computational nightmare.

Thanks in advance for any insights or advice!