Re: [tor-dev] GSoC: Ahmia.fi - Search Engine for Hidden Services
Hi,

Some updates. Ahmia now has a fresh YaCy back-end installed. Unfortunately, I messed up the Solr setup, so we might eventually have to wipe the index and re-crawl everything. For now, at least, it works.

Then some good news. I created a milestone on GitHub. It lists all the main features, and I will try to develop them as fast as I can :)

https://github.com/juhanurmi/ahmia/issues?milestone=1&page=1&state=open

Currently, I have written some code that gathers popularity stats and new domains from Tor2web nodes and saves them to ahmia.fi. Furthermore, I have built a tool that checks backlinks from the public WWW! This data is useful for the popularity measurements.

I am already pushing code to GitHub :)

Cheers,
Juha

___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
Re: [tor-dev] GSoC: Ahmia.fi - Search Engine for Hidden Services
On 25.04.2014 17:27, George Kadianakis wrote:
> Juha Nurmi writes:
>
>> On 22.04.2014 17:35, George Kadianakis wrote:
>>> Enjoy GSoC :)
>>
>> I will :)
>>
>>> BTW, looking again at your proposal, I see that you are going
>>> to do both popularity tracking and backlinks.
>>
>> Yes, another crawler gathers backlinks from the public WWW and I
>> will start gathering the URL clicks from the users.
>>
>>> How are these two technologies going to interact with each
>>> other? That is, how will the indexer consider the output of
>>> those two features?
>>
>> The Django front-end re-sorts the answers from the YaCy back-end.
>>
>> See https://ahmia.fi/static/gsoc/re_sort.jpg
>>
>> I have this idea in mind: https://ahmia.fi/static/gsoc/sorter.py
>>
>> The result is sorted according to the YaCy result index, the number
>> of backlinks and the number of clicks, all of which are scaled.
>>
>> Note the scaling: p_info.backlinks = 1 / (float(index) + 1) etc.
>>
>> sum_function = 3.0*self.yacy + 2.0*self.backlinks + 1.0*self.clicks
>>
>> where 3, 2 and 1 are test coefficients. I will optimize these and
>> make a better model if necessary. However, clicks are easily
>> spoofed, so they have to get a small coefficient.
>
> That makes sense.
>
> BTW, what is the 'yacy' score? Is it just the order that YaCy's
> indexer chose for each result? Or does YaCy actually expose a
> score for each result? How is the score derived? Or do you treat it
> as a blackbox and assume it's the most accurate of backlinks and
> popularity.

I am using only the order information.

BTW, we are migrating the YaCy servers (Mikko installed new ones) and took down the old system. There should be a working crawler and fresh full-text search results soon :)

-Juha
Re: [tor-dev] GSoC: Ahmia.fi - Search Engine for Hidden Services
On 22.04.2014 17:35, George Kadianakis wrote:
> Enjoy GSoC :)

I will :)

> BTW, looking again at your proposal, I see that you are going to
> do both popularity tracking and backlinks.

Yes, another crawler gathers backlinks from the public WWW and I will start gathering the URL clicks from the users.

> How are these two technologies going to interact with each other?
> That is, how will the indexer consider the output of those two
> features?

The Django front-end re-sorts the answers from the YaCy back-end.

See https://ahmia.fi/static/gsoc/re_sort.jpg

I have this idea in mind: https://ahmia.fi/static/gsoc/sorter.py

The result is sorted according to the YaCy result index, the number of backlinks and the number of clicks, all of which are scaled.

Note the scaling: p_info.backlinks = 1 / (float(index) + 1) etc.

sum_function = 3.0*self.yacy + 2.0*self.backlinks + 1.0*self.clicks

where 3, 2 and 1 are test coefficients. I will optimize these and make a better model if necessary. However, clicks are easily spoofed, so they have to get a small coefficient.

> Also, with your newly acquired knowledge about backlinks, how long
> is it going to take you to incorporate them in ahmia? Are you
> actually going to do it during the "Use an another crawler to
> search .onion pages from the public Internet" phase?

We can test it once the popularity tracking and the backlinks crawler are working.

-Juha
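The re-sorting described in the message above can be sketched roughly as follows. This is only an illustration of the stated idea, not the contents of sorter.py: the `Result` class, the `rank_scores` and `re_sort` names, and the choice to derive each signal's score from the result's position in a per-signal ordering are assumptions. Only the 1 / (index + 1) scaling and the 3, 2, 1 test coefficients come from the thread.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical container for one search result (not from sorter.py)."""
    url: str        # assumed unique within one result list
    backlinks: int  # raw backlink count from the public-WWW crawler
    clicks: int     # raw click count from popularity tracking


def rank_scores(urls):
    """Scale a position in an ordering to 1 / (index + 1), as in the thread."""
    return {url: 1.0 / (float(index) + 1) for index, url in enumerate(urls)}


def re_sort(yacy_results, w_yacy=3.0, w_backlinks=2.0, w_clicks=1.0):
    """Re-sort results (given in YaCy order) by the weighted sum of scaled signals."""
    # Only the order returned by YaCy is used, not any internal YaCy score.
    yacy = rank_scores([r.url for r in yacy_results])

    # Assumption: backlinks and clicks are scaled by rank in the same way.
    by_backlinks = sorted(yacy_results, key=lambda r: r.backlinks, reverse=True)
    by_clicks = sorted(yacy_results, key=lambda r: r.clicks, reverse=True)
    backlinks = rank_scores([r.url for r in by_backlinks])
    clicks = rank_scores([r.url for r in by_clicks])

    def sum_function(r):
        return (w_yacy * yacy[r.url]
                + w_backlinks * backlinks[r.url]
                + w_clicks * clicks[r.url])

    return sorted(yacy_results, key=sum_function, reverse=True)


if __name__ == "__main__":
    results = [
        Result("a.onion", backlinks=0, clicks=0),
        Result("b.onion", backlinks=50, clicks=100),
        Result("c.onion", backlinks=5, clicks=10),
    ]
    for r in re_sort(results):
        print(r.url)
```

Because every signal is reduced to a rank before weighting, a site cannot gain an arbitrarily large advantage by inflating its click count; it can at best reach the top of the click ordering, which carries the smallest coefficient.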
[tor-dev] GSoC: Ahmia.fi - Search Engine for Hidden Services
Hi,

I'm a student who is starting to work on the ahmia.fi search engine as part of Google Summer of Code. :)

The proposal is online here: https://ahmia.fi/gsoc/

In practice, I now have time and funding to develop my search engine. George is my primary mentor and Moritz is the backup mentor.

Today, I will submit all the required documents (the tax forms etc.) to Google. After that, I expect to speed up work on the code base on GitHub :)

Cheers,
Juha
Re: [tor-dev] GSoC - Search Engine for Hidden services
On 24.03.2014 13:57, George Kadianakis wrote:
> Replying to some new additions in the proposal:
>
>> Thanks asn!
>>
>> "Ask help from organizations that are crawling"
>>
>> Today I emailed DuckDuckGo and asked whether there is an easy way
>> to search for new .onions using their search engine.
>>
>> "Checking out the backlinks from public WWW"
>>
>> With a known onion address it is possible to estimate the
>> popularity of the address by checking the number of search results:
>>
>> https://duckduckgo.com/?q=%22http%3A%2F%2Fjlve2y45zacpbz6s.onion%22
>> and
>> https://www.google.com/#q=%22http:%2F%2Fjlve2y45zacpbz6s.onion%22
>> and
>> https://www.google.com/#q=link:http:%2F%2Fjlve2y45zacpbz6s.onion
>>
>> This way I will get a list that tells the popularity according
>> to links from the public WWW:
>>
>> onion address    number of WWW sites linking to it
>> xyz.onion        123
>> abc.onion        90
>> uio.onion        24
>> mre.onion        17
>>
>> Today I asked YaCy's developer how I could use this information.
>>
>> "Commenting features"
>>
>> I agree that commenting might be a mouth of madness because people
>> might write just some random crap there. Technically this would be
>> developed in the Django framework. Note that the priority of this
>> task is low (10). We could decide to leave the commenting feature
>> as the very last task or skip it.
>
> ACK wrt commenting.
>
> As far as backlinks are concerned, while I appreciate how rapid
> and easy your solution is, you might want to make it a bit more
> robust.
>
> The way you did it, you treat the 123 references to 'xyz.onion'
> as strictly better than the 90 references to 'abc.onion'. This is
> not the case in the real web, since the 123 references to
> 'xyz.onion' might be SEO and they might be coming from xyz.onion
> itself or related websites.
>
> Proper search engines assign weights to each backlink, according
> to how legit the search engine believes the linker to be. This has
> to do with how many backlinks the linker had, and how legit the
> HTML content of the linker looks, etc. You can find more
> heuristics that search engines use by skimming an SEO book or an
> SEO forum.
>
> It's up to you how deep you want to go into backlinking during
> GSoC, but IMO backlinking is a more reliable heuristic than
> popularity tracking. Up to you anyway!

We could test the reliability of the linkers too. As you said, there are multiple methods to do this. Because the number of .onions and linkers is relatively small, we can analyze the linking sites as well. Usually fewer than 10 sites link to an .onion site.

-Juha
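One way to picture the weighting George suggests in the message above is the sketch below. It is an assumption-laden illustration, not ahmia's crawler: the function names, the per-linker weight table and the default weight are hypothetical, and how a linker's legitimacy would actually be estimated is left open in the thread. The sketch shows only two points raised there: links from the .onion site itself are ignored, and each linking site counts once, scaled by how much it is trusted.

```python
from urllib.parse import urlparse


def host(url):
    """Return the lower-cased host name of a URL, or '' if it has none."""
    return (urlparse(url).hostname or "").lower()


def weighted_backlink_score(onion, linking_pages, linker_weight, default_weight=0.1):
    """Sum trust weights over distinct linking hosts, ignoring self-links.

    onion          -- the .onion host being scored, e.g. "xyz.onion"
    linking_pages  -- URLs of public-WWW pages that link to it
    linker_weight  -- hypothetical table: linking host -> trust in [0, 1]
    default_weight -- assumed weight for hosts that have not been assessed
    """
    hosts = {host(page) for page in linking_pages}
    hosts.discard("")             # drop URLs without a parsable host
    hosts.discard(onion.lower())  # links from the site itself do not count
    return sum(linker_weight.get(h, default_weight) for h in hosts)


if __name__ == "__main__":
    weights = {"blog.example.org": 1.0, "linkfarm.example.net": 0.25}
    pages = [
        "http://xyz.onion/about",
        "http://linkfarm.example.net/page1",
        "http://linkfarm.example.net/page2",
        "http://blog.example.org/post",
    ]
    print(weighted_backlink_score("xyz.onion", pages, weights))
```

With only a handful of linking sites per .onion, as Juha notes, such a weight table could plausibly be maintained by hand.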
Re: [tor-dev] GSoC - Search Engine for Hidden services
>>> Also, are you sure that 1-3 workdays are sufficient to design &
>>> implement a banned domain synchronizer between tor2web and
>>> ahmia?
>>
>> Well, I cannot know that. Let's put one workweek for that. I am
>> hoping to spend a workday or two with Tor2web and we get it
>> done.
>
> How is ahmia going to communicate with tor2web? Will the connection
> be authenticated? How will you block bad people from adding their
> own stuff to your blacklist?

One way to solve this is to download a list of working Tor2web nodes from GitHub. Nodes are added to that list manually. After that, I can download the information from the nodes every day.

In the other direction, the Tor2web software can download the list of banned domains from ahmia.fi. This is one easy way to handle the information exchange.
Re: [tor-dev] GSoC - Search Engine for Hidden services
On 17.03.2014 15:17, George Kadianakis wrote:
> But now that you don't have a "Search API" project, what are you
> going to do during the Globaleaks integration?

The search API was supposed to be a query API to ahmia's database. However, this is not a relevant feature at the moment.

> Also, are you sure that 1-3 workdays are sufficient to design &
> implement a banned domain synchronizer between tor2web and ahmia?

Well, I cannot know that. Let's put one workweek for that. I am hoping to spend a workday or two with Tor2web and we get it done.

> BTW, you are supposed to do your application in Google Melange,
> not in this mailing list (although I'm happy you posted your app
> here so that more people can comment on it!). The website is:
> https://www.google-melange.com/gsoc/homepage/google/gsoc2014 The
> deadline is in 4 days or so, I think.

Sure!

> PS: Nitpick but I should exercise my allergy to broken crypto and
> suggest to switch to a better hash algorithm (SHA256 or so) instead
> of MD5 for passing banned domain names around.

The reason we are publishing MD5 sums of the banned domains is that in some countries it is illegal to own or host a list of CP URLs. Anyway, if someone is looking for CP .onions he will find them...

However, I do not see any reason why we couldn't, together with Tor2web, switch to a better hash algorithm.

-Juha
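As a rough illustration of the hashed banned-domain list discussed in the message above, the sketch below publishes digests instead of plain domain names and checks a domain against them. The function names and the normalization step (trimming and lower-casing before hashing) are assumptions; the thread does not say how ahmia or Tor2web canonicalize a domain before hashing. `hashlib` is the Python standard-library module, and the algorithm is a parameter so that both sides could move from MD5 to SHA-256 together.

```python
import hashlib


def banned_digest(domain, algorithm="sha256"):
    """Hex digest of a domain name after an assumed normalization step."""
    normalized = domain.strip().lower()  # assumption: both sides normalize the same way
    return hashlib.new(algorithm, normalized.encode("ascii")).hexdigest()


def is_banned(domain, banned_digests, algorithm="sha256"):
    """Check a domain against a published set of digests."""
    return banned_digest(domain, algorithm) in banned_digests


if __name__ == "__main__":
    # The published list contains only digests, never the plain domain names.
    published = {banned_digest("bannedexample.onion")}
    print(is_banned("BannedExample.onion", published))  # same domain, different case
    print(is_banned("otherexample.onion", published))
```

Whichever algorithm is used, both parties have to hash exactly the same byte string, so the canonical form of a domain would need to be agreed on along with the hash function.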
Re: [tor-dev] GSoC - Search Engine for Hidden services
> And what would you like to do over the summer so that: a) Something
> useful and concrete comes out of only 3 months of work. b) Your
> work will also be useful after the summer ends.
>
> I would be interested to see some areas that you would like to work
> on over the summer, and how that would change the ahmia.fi user
> experience.

I have drafted a timetable of possible new features for ahmia.fi:

https://docs.google.com/document/d/1XB42HM4uESYBAnoHHRuaqKMP64VFDI91Qa-CtIuye2E/edit?usp=sharing

>> We would like to release the source code of ahmia.fi and develop
>> our search engine in GSoC.
>
> Yes, releasing the source code of ahmia.fi would be very useful in
> any case. It would be great if you could do that.

I released the source code of ahmia.fi:

https://github.com/juhanurmi/ahmia

-Juha