YouTube pages rely on JavaScript to create the <video> element. Your browser's XPath tool works because it operates on the rendered page, after JavaScript has done its work.
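To make the difference concrete, here is a minimal sketch (Python 3, standard library only) of what a parser that does not run scripts sees. The RAW_HTML string is a hypothetical, heavily abbreviated stand-in for the downloaded page, and the script body in it is a made-up placeholder, not YouTube's real code. TagCounter and count_tags are illustrative names, not part of Scrapy.

```python
from html.parser import HTMLParser

# Hypothetical, abbreviated stand-in for the HTML that a crawler downloads.
# The <script> body is a placeholder, not the real player code.
RAW_HTML = """
<div id="player">
  <div id="player-api" class="player-width player-height off-screen-target player-api"></div>
  <script>var placeholder = {"args": {"video_id": "1EFnX1UkXVU"}};</script>
</div>
"""


class TagCounter(HTMLParser):
    """Count start tags in static markup. No JavaScript is executed."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags(html):
    """Return a dict mapping tag name to the number of start tags found."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


counts = count_tags(RAW_HTML)
# The <video> element only exists after a browser has run the scripts,
# so a static parse of the served markup finds none.
print("video elements:", counts.get("video", 0))
print("script elements:", counts.get("script", 0))
```

Any XPath that targets <video> comes back empty on markup like this, while the <script> element, and whatever data it carries, is still there to be parsed.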
Scrapy itself does not interpret JavaScript; it is not a browser, so it can only work on what is in the HTML source code at the time the web page is fetched.

You can see, for example, that the element with ID "player-api", which contains "movie_player" in your screenshot, is empty in the source code:

<div id="player-api" class="player-width player-height off-screen-target player-api"></div>

You can also see that this #player-api element is followed by <script> elements. While it is not straightforward to read what this JavaScript code does, you can use js2xml to parse it (disclaimer: I wrote and maintain js2xml).

Below is an example of js2xml usage in scrapy shell: it parses the JavaScript statements from the <script> elements in #player, and then extracts dicts from them. The main script has an "args" key, which itself contains a url_encoded_fmt_stream_map key holding some URLs for the video you may be after. I'm using urlparse to decode what looks like a query string (the full scrapy shell session is at https://gist.github.com/redapple/8269818915cc2c337dc2).

$ scrapy shell "https://www.youtube.com/watch?v=1EFnX1UkXVU"
2014-12-30 15:18:09+0100 [default] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=1EFnX1UkXVU> (referer: None)

In [1]: import js2xml

In [2]: import urlparse

In [3]: import pprint

In [4]: for script in response.css('#player script').xpath('string()').extract():
   ...:     jstree = js2xml.parse(script)
   ...:     data = js2xml.jsonlike.getall(jstree)
   ...:     for d in data:
   ...:         pprint.pprint(d)
   ...:
{}
{'args': {'account_playback_token': 'QUFFLUhqa0sweExRZno5OHZEaGcwWVVQaXAxVWh0NUNFZ3xBQ3Jtc0tseE9DRUw3cFVRbkFGN1hub2VmQlNERGl3WjFIQV84aTI0b0lxZnhwdDZKRl96N1g5eWN3dkZER1pFbVM4dS1FeWJoc1FJeTBXdS0tbU5LY1NsWngtSHY1R0hoTl9xdy1iWUNoam1nRFM2czEweVdMNA==', 'adaptive_fmts': 'size=1280x720&clen=51269588&fps=15&itag=136&init=0-709...bitrate=80798', 'allow_embed': '1', 'allow_ratings': '1', 'atc':
'a=3&b=nhjwMM7ySu8wj8OhutnokFK8Dvs&c=1419949090&d=1&e=1EFnX1UkXVU&c3a=28&c1a=1&hh=hKbH2J9f2WwblpFs2hvo0H17oZo', 'author': 'Michael Herman', 'avg_rating': '4.948387146', 'c': 'WEB', 'cc3_module': '1', 'cc_asr': '1', 'cc_font': 'Arial Unicode MS, arial, verdana, _sans', 'cc_fonts_url': 'https://s.ytimg.com/yts/swfbin/player-vfly1u_c5/fonts708.swf', 'cc_load_policy': '2', 'cc_module': 'https://s.ytimg.com/yts/swfbin/player-vfly1u_c5/subtitle_module.swf', 'cl': '82697338', 'cr': 'FR', 'csi_page_type': 'watch,watch7', 'dash': '1', 'dashmpd': 'http://manifest.googlevideo.com/api/...', 'enablecsi': '1', 'enablejsapi': 1, 'eventid': 'IrSiVP-kC4v4cKrwgRg', 'fexp': '900718,927622,931342,932404,938809,9405699,9406022,940927,940940,941004,943917,947209,947218,948124,952302,952605,952901,955110,955301,957103,957105,957201', 'fmt_list': '22/1280x720/9/0/115,43/640x360/99/0/0,18/640x360/9/0/115,5/426x240/7/0/0,36/426x240/99/1/0,17/256x144/99/1/0', 'hl': 'en_US', 'host_language': 'en', 'idpj': '-6', 'iurl': 'https://i.ytimg.com/vi/1EFnX1UkXVU/hqdefault.jpg', 'iurlhq': 'https://i.ytimg.com/vi/1EFnX1UkXVU/hqdefault.jpg', 'iurlmaxres': 'https://i.ytimg.com/vi/1EFnX1UkXVU/maxresdefault.jpg', 'iurlmq': 'https://i.ytimg.com/vi/1EFnX1UkXVU/mqdefault.jpg', 'iurlsd': 'https://i.ytimg.com/vi/1EFnX1UkXVU/sddefault.jpg', 'iv3_module': '1', 'iv_invideo_url': 'https://www.youtube.com/annotations_invideo?cta=2&video_id=1EFnX1UkXVU', 'iv_load_policy': '1', 'iv_module': 'https://s.ytimg.com/yts/swfbin/player-vfly1u_c5/iv_module.swf', 'keywords': 'Scrapy,Python,scraping,python scrapy,web scraping', 'ldpj': '-25', 'length_seconds': '717', 'loaderUrl': 'https://www.youtube.com/watch?v=1EFnX1UkXVU', 'no_get_video_log': '1', 'of': 'lNeUuIm8BRrYa4UFYW3Vbw', 'plid': 'AAULb6kfjbEHoNwt', 'pltype': 'contentugc', 'probe_url': 
'http://r5---sn-5hn7ym7z.googlevideo.com/videogoodput?id=o-ACe-sIXL0cLvgJC4v5mIahOxT1PHw4zDPr8ZGMCgqwQI&source=goodput&range=0-99999&expire=1419952690&ip=89.84.122.217&ms=pm&mm=35&nh=EAk&sparams=id,source,range,expire,ip,ms,mm,nh&signature=3B4094AEE2FC1C0142BCEDB115F785607DEC0CF1.04988A5889C0348F50D45D76A7D6831155C91407&key=cms1', 'ptk': 'youtube_none', 'ssl': '1', 'storyboard_spec': 'https://i.ytimg.com/sb/1EFnX1UkXVU/storyboard3_L$L/$N.jpg|48#27#100#10#10#0#default#28F7DFM7_rVji4ZXj1Inr3KDPBE|80#45#145#10#10#5000#M$M#oy8NWkx8UFfdFYJoDyKoK-F6EUo|160#90#145#5#5#5000#M$M#RPAH69FExaDD6f0lYwoCjc64vI8', 't': '1', 'thumbnail_url': 'https://i.ytimg.com/vi/1EFnX1UkXVU/default.jpg', 'timestamp': '1419949090', 'title': 'Scraping Web Pages with Scrapy', 'tmi': '1', 'token': '1', 'ttsurl': 'https://www.youtube.com/api/timedtext?...', 'ucid': 'UCt7yOnL7bI7yCa1Xe_GTjJQ', 'url_encoded_fmt_stream_map': 'fallback_host=tc.v18.cache4.googlevideo.com&quality=hd720...', 'video_id': '1EFnX1UkXVU', 'view_count': '52035', 'vq': 'auto', 'watermark': ',https://s.ytimg.com/yts/img/watermark/youtube_watermark-vflHX6b6E.png,https://s.ytimg.com/yts/img/watermark/youtube_hd_watermark-vflAzLcD6.png'}, 'assets': {'css': '//s.ytimg.com/yts/cssbin/www-player-vflPfi1TF.css', 'html': '/html5_player_template', 'js': '//s.ytimg.com/yts/jsbin/html5player-en_US-vflw4H1P-/html5player.js'}, 'attrs': {'id': 'movie_player'}, 'html5': False, 'messages': {'player_fallback': ['Adobe Flash Player or an HTML5 supported browser is required for video playback.<br><a href="http://get.adobe.com/flashplayer/">Get the latest Flash Player </a><br><a href="/html5">Learn more about upgrading to an HTML5 browser</a>']}, 'min_version': '8.0.0', 'params': {'allowfullscreen': 'true', 'allowscriptaccess': 'always', 'bgcolor': '#000000'}, 'sts': 16427, 'url': 'https://s.ytimg.com/yts/swfbin/player-vfly1u_c5/watch_as3.swf', 'url_v8': 'https://s.ytimg.com/yts/swfbin/player-vfly1u_c5/cps.swf', 'url_v9as2': 
'https://s.ytimg.com/yts/swfbin/player-vfly1u_c5/cps.swf'}
[]

In [5]: for script in response.css('#player script').xpath('string()').extract():
   ...:     jstree = js2xml.parse(script)
   ...:     data = js2xml.jsonlike.getall(jstree)
   ...:     for d in data:
   ...:         try:
   ...:             if d:
   ...:                 pprint.pprint(urlparse.parse_qsl(d.get("args", {}).get("url_encoded_fmt_stream_map", "")))
   ...:         except:
   ...:             pass
   ...:
[('fallback_host', 'tc.v18.cache4.googlevideo.com'), ('quality', 'hd720'), ('itag', '22'), ('type', 'video/mp4; codecs="avc1.64001F, mp4a.40.2"'), ('url', 'http://r3---sn-25ge7n7d.googlevideo.com/videoplayback?dur=716.985&id=o-AMERlvuyknt71bMvL2Sjki6y2WsGz0TDKn11unO3_SQy&mm=31&ip=89.84.122.217&key=yt5&itag=22&mime=video%2Fmp4&source=youtube&ms=au&fexp=900718%2C927622%2C931342%2C932404%2C938809%2C9405699%2C9406022%2C940927%2C940940%2C941004%2C943917%2C947209%2C947218%2C948124%2C952302%2C952605%2C952901%2C955110%2C955301%2C957103%2C957105%2C957201&mv=m&mt=1419949043&sver=3&initcwndbps=872500&sparams=dur%2Cid%2Cinitcwndbps%2Cip%2Cipbits%2Citag%2Cmime%2Cmm%2Cms%2Cmv%2Cratebypass%2Csource%2Cupn%2Cexpire&ratebypass=yes&signature=75A8510F49A9C73C72BC4F4A8759320481305D26.EA7ABB7DD01D7B4BA5228ABD8DF8DD47AB73A3A1&expire=1419970690&upn=5QFvFRIqKzs&ipbits=0,fallback_host=tc.v20.cache6.googlevideo.com'), ('quality', 'medium'), ('itag', '43'), ('type', 'video/webm; codecs="vp8.0, vorbis"'), ('url',
'http://r3---sn-25ge7n7d.googlevideo.com/videoplayback?dur=0.000&id=o-AMERlvuyknt71bMvL2Sjki6y2WsGz0TDKn11unO3_SQy&mm=31&ip=89.84.122.217&key=yt5&itag=43&mime=video%2Fwebm&source=youtube&ms=au&fexp=900718%2C927622%2C931342%2C932404%2C938809%2C9405699%2C9406022%2C940927%2C940940%2C941004%2C943917%2C947209%2C947218%2C948124%2C952302%2C952605%2C952901%2C955110%2C955301%2C957103%2C957105%2C957201&mv=m&mt=1419949043&sver=3&initcwndbps=872500&sparams=dur%2Cid%2Cinitcwndbps%2Cip%2Cipbits%2Citag%2Cmime%2Cmm%2Cms%2Cmv%2Cratebypass%2Csource%2Cupn%2Cexpire&ratebypass=yes&signature=E17363F74C7068BEB4DB31FC90AEF2EA70A3C233.F634AC2BD1B5A6B27E1DDFB4FB09DE7C04D1DF0E&expire=1419970690&upn=5QFvFRIqKzs&ipbits=0,fallback_host=tc.v13.cache4.googlevideo.com'), ('quality', 'medium'), ('itag', '18'), ('type', 'video/mp4; codecs="avc1.42001E, mp4a.40.2"'), ('url', 'http://r3---sn-25ge7n7d.googlevideo.com/videoplayback?dur=716.985&id=o-AMERlvuyknt71bMvL2Sjki6y2WsGz0TDKn11unO3_SQy&mm=31&ip=89.84.122.217&key=yt5&itag=18&mime=video%2Fmp4&source=youtube&ms=au&fexp=900718%2C927622%2C931342%2C932404%2C938809%2C9405699%2C9406022%2C940927%2C940940%2C941004%2C943917%2C947209%2C947218%2C948124%2C952302%2C952605%2C952901%2C955110%2C955301%2C957103%2C957105%2C957201&mv=m&mt=1419949043&sver=3&initcwndbps=872500&sparams=dur%2Cid%2Cinitcwndbps%2Cip%2Cipbits%2Citag%2Cmime%2Cmm%2Cms%2Cmv%2Cratebypass%2Csource%2Cupn%2Cexpire&ratebypass=yes&signature=78201511AECE7F328D67AA08EC40E22777C62616.6B0C1787F391F30F1D28D8C2BCD6E67C71F1BB5F&expire=1419970690&upn=5QFvFRIqKzs&ipbits=0,fallback_host=tc.v4.cache4.googlevideo.com'), ('quality', 'small'), ('itag', '5'), ('type', 'video/x-flv'), ('url', 
'http://r3---sn-25ge7n7d.googlevideo.com/videoplayback?dur=716.983&id=o-AMERlvuyknt71bMvL2Sjki6y2WsGz0TDKn11unO3_SQy&mm=31&ip=89.84.122.217&key=yt5&itag=5&mime=video%2Fx-flv&source=youtube&ms=au&fexp=900718%2C927622%2C931342%2C932404%2C938809%2C9405699%2C9406022%2C940927%2C940940%2C941004%2C943917%2C947209%2C947218%2C948124%2C952302%2C952605%2C952901%2C955110%2C955301%2C957103%2C957105%2C957201&mv=m&mt=1419949043&sver=3&initcwndbps=872500&sparams=dur%2Cid%2Cinitcwndbps%2Cip%2Cipbits%2Citag%2Cmime%2Cmm%2Cms%2Cmv%2Csource%2Cupn%2Cexpire&signature=DE27A5283FB425F79CC1ACBB67D0B20FF07D5BD5.DBACE3E830A573BF4092AC442C99278D4CFF549F&expire=1419970690&upn=5QFvFRIqKzs&ipbits=0,fallback_host=tc.v4.cache5.googlevideo.com'), ('quality', 'small'), ('itag', '36'), ('type', 'video/3gpp; codecs="mp4v.20.3, mp4a.40.2"'), ('url', 'http://r3---sn-25ge7n7d.googlevideo.com/videoplayback?dur=717.125&id=o-AMERlvuyknt71bMvL2Sjki6y2WsGz0TDKn11unO3_SQy&mm=31&ip=89.84.122.217&key=yt5&itag=36&mime=video%2F3gpp&source=youtube&ms=au&fexp=900718%2C927622%2C931342%2C932404%2C938809%2C9405699%2C9406022%2C940927%2C940940%2C941004%2C943917%2C947209%2C947218%2C948124%2C952302%2C952605%2C952901%2C955110%2C955301%2C957103%2C957105%2C957201&mv=m&mt=1419949043&sver=3&initcwndbps=872500&sparams=dur%2Cid%2Cinitcwndbps%2Cip%2Cipbits%2Citag%2Cmime%2Cmm%2Cms%2Cmv%2Csource%2Cupn%2Cexpire&signature=E9DD3B41DDA5B39F12D7311682DEB24A376F04C9.0C3EEEFED598AF77E877D361B57385CE5941303F&expire=1419970690&upn=5QFvFRIqKzs&ipbits=0,fallback_host=tc.v9.cache5.googlevideo.com'), ('quality', 'small'), ('itag', '17'), ('type', 'video/3gpp; codecs="mp4v.20.3, mp4a.40.2"'), ('url', 
'http://r3---sn-25ge7n7d.googlevideo.com/videoplayback?dur=717.217&id=o-AMERlvuyknt71bMvL2Sjki6y2WsGz0TDKn11unO3_SQy&mm=31&ip=89.84.122.217&key=yt5&itag=17&mime=video%2F3gpp&source=youtube&ms=au&fexp=900718%2C927622%2C931342%2C932404%2C938809%2C9405699%2C9406022%2C940927%2C940940%2C941004%2C943917%2C947209%2C947218%2C948124%2C952302%2C952605%2C952901%2C955110%2C955301%2C957103%2C957105%2C957201&mv=m&mt=1419949043&sver=3&initcwndbps=872500&sparams=dur%2Cid%2Cinitcwndbps%2Cip%2Cipbits%2Citag%2Cmime%2Cmm%2Cms%2Cmv%2Csource%2Cupn%2Cexpire&signature=E4199F944FC4A5A1DFBAD4562EB628E62B53FD27.FA0A2D69378E3AB8B4E50FD55A2F64CA7A048EA1&expire=1419970690&upn=5QFvFRIqKzs&ipbits=0')]

On Tuesday, December 30, 2014 6:49:51 AM UTC+1, Gaurang shah wrote:
>
> Following is the details.
> Os: Windows 7 64 bit
> Python 2.7
> Scrapy 0.25.1
>
> I don't understand the last question. I am using selector provided by
> scrapy to get the node using xpath. Following is the code.
>
> selector = Selector(response)
> view_count = selector.xpath("//div[@class='watch-view-count']/text()")[0].extract().strip()
> video_url = selector.xpath("//video[contains(@class,'html5-main-video')]/@src").extract()
>
>
> Gaurang Shah
> Blog: qtp-help.blogspot.com
> Mobile: +91 738756556
>
> On Tue, Dec 30, 2014 at 1:24 AM, bruce <[email protected]>
> wrote:
>
>> Hey Gaurang,
>>
>> What's the OS, version of python, version of scrapy you're using?
>>
>> Does scrapy use urlib? or better, if you know, what lib does scrapy use
>> for the url/xpath processing?
>>
>>
>>
>> On Mon, Dec 29, 2014 at 11:32 AM, Gaurang shah <[email protected]> wrote:
>>
>>> Sorry guys, Forgot to mentioned. All these xpath is able to identify the
>>> elemenet using firepath add-on of firefox.
>>>
>>> *//video *
>>> *//video[contains(@class,'html5-main-video')]/@src*
>>>
>>> *//div[@class='html5-video-container']/video/@src*
>>>
>>> *//div[@id='movie_player']/div[1]/video/@src*
>>>
>>> *//div[@id='player-api']/div[1]/div[1]/video/@src*
>>>
>>> *However none of them is working in scrapy ???*
>>>
>>> Gaurang Shah
>>> Blog: qtp-help.blogspot.com
>>> Mobile: +91 738756556
>>>
>>> On Mon, Dec 29, 2014 at 9:41 PM, bruce <[email protected]>
>>> wrote:
>>>
>>>> Are you able to effectively create an xpath using your browser's
>>>> xpath/dev tools?
>>>>
>>>> in firefox, you can use dom inspector, there are others as well, not
>>>> sure of your browser..
>>>>
>>>> In other words, is the issue with the "video" element, or something
>>>> else in your xpath?
>>>>
>>>> If you can resolve the xpath with a separate tool, that should give you
>>>> direction to solve the issue.
>>>>
>>>>
>>>>
>>>> On Mon, Dec 29, 2014 at 7:38 AM, Gaurang shah <[email protected]> wrote:
>>>>
>>>>> Hi Guys,
>>>>>
>>>>> I am trying to scrap the youtube site. And somehow the xpath which
>>>>> fetches the video src is not working in scrapy.
>>>>>
>>>>> Url: https://www.youtube.com/watch?v=1EFnX1UkXVU
>>>>>
>>>>>
>>>>> following xpaths is not working
>>>>> *//video *
>>>>> *//video[contains(@class,'html5-main-video')]/@src*
>>>>>
>>>>>
>>>>> <https://lh3.googleusercontent.com/--_vqbGQxgWg/VKFLFyraflI/AAAAAAAACLY/2352f1VU0ds/s1600/Image%2B004.jpg>
>>>>> I am able to retrive xpath till,* //div[@id='player-api']*, after
>>>>> that it's dead end. scrapy is not able to find any more node in this.
>>>>> However there are nodes inside that as well.
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "scrapy-users" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.
