Re: [tor-dev] GSoC: Ahmia.fi - Search Engine for Hidden Services

2014-05-12 Thread Juha Nurmi

Hi,

Some updates: Ahmia now has a fresh YaCy back-end installed.
Unfortunately, I messed something up with Solr, so we may eventually
have to wipe the index and re-crawl everything. For now, at least, it
works.

Then some good news: I created a milestone on GitHub. It lists all
the main features, and I will try to develop them as fast as I can :)

https://github.com/juhanurmi/ahmia/issues?milestone=1&page=1&state=open

So far, I have written code that gathers popularity stats and new
domains from Tor2web nodes and saves them to ahmia.fi. I have also
built a tool that checks backlinks from the public WWW. This data
is useful for the popularity measurements.

I am already pushing code to GitHub :)

Cheers,
Juha
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


Re: [tor-dev] GSoC: Ahmia.fi - Search Engine for Hidden Services

2014-04-27 Thread Juha Nurmi

On 25.04.2014 17:27, George Kadianakis wrote:
> Juha Nurmi writes:
> 
>> On 22.04.2014 17:35, George Kadianakis wrote:
>>> Enjoy GSoC :)
>> 
>> I will :)
>> 
>>> BTW, looking again at your proposal, I see that you are going
>>> to do both popularity tracking and backlinks.
>> 
>> Yes, another crawler gathers backlinks from the public WWW and I
>> will start gathering the URL clicks from the users.
>> 
>>> How are these two technologies going to interact with each
>>> other? That is, how will the indexer consider the output of
>>> those two features?
>> 
>> The Django front-end re-sorts the results from the YaCy back-end.
>> 
>> See https://ahmia.fi/static/gsoc/re_sort.jpg
>> 
>> I have this idea in mind: https://ahmia.fi/static/gsoc/sorter.py
>> 
>> The results are sorted according to the YaCy result index, the
>> number of backlinks, and the number of clicks, each of which is
>> scaled.
>> 
>> Note the scaling:  p_info.backlinks = 1 / (float(index) + 1)
>> etc.
>> 
>> sum_function = 3.0*self.yacy + 2.0*self.backlinks +
>> 1.0*self.clicks
>> 
>> where 3, 2 and 1 are test coefficients. I will tune these and
>> build a better model if necessary. However, clicks are easily
>> spoofed, so they must have a small coefficient.
>> 
> 
> That makes sense.
> 
> BTW, what is the 'yacy' score? Is it just the order that YaCy's 
> indexer chose for each result? Or does YaCy actually expose a
> score for each result? How is the score derived? Or do you treat it
> as a black box and assume it's more accurate than backlinks and
> popularity?
> 

I am using only the order information.

BTW, we are migrating the YaCy servers (Mikko installed new ones) and
have taken down the old system. There should be a working crawler and
fresh full-text search results soon :)

-Juha


Re: [tor-dev] GSoC: Ahmia.fi - Search Engine for Hidden Services

2014-04-23 Thread Juha Nurmi

On 22.04.2014 17:35, George Kadianakis wrote:
> Enjoy GSoC :)

I will :)

> BTW, looking again at your proposal, I see that you are going to
> do both popularity tracking and backlinks.

Yes, another crawler gathers backlinks from the public WWW and I will
start gathering the URL clicks from the users.

> How are these two technologies going to interact with each other?
> That is, how will the indexer consider the output of those two
> features?

The Django front-end re-sorts the results from the YaCy back-end.

See https://ahmia.fi/static/gsoc/re_sort.jpg

I have this idea in mind: https://ahmia.fi/static/gsoc/sorter.py

The results are sorted according to the YaCy result index, the number
of backlinks, and the number of clicks, each of which is scaled.

Note the scaling:  p_info.backlinks = 1 / (float(index) + 1) etc.

sum_function = 3.0*self.yacy + 2.0*self.backlinks + 1.0*self.clicks

where 3, 2 and 1 are test coefficients. I will tune these and build
a better model if necessary. However, clicks are easily spoofed, so
they must have a small coefficient.
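
A minimal sketch of the re-sorting described above, not the actual
sorter.py. It assumes that backlinks and clicks are scaled the same way
as the YaCy index: the results are ranked on each signal and the rank is
mapped through 1 / (rank + 1). The names rank_scale, Result and re_sort
are made up for illustration; sum_function and the 3/2/1 test
coefficients come from the formula above.

```python
from dataclasses import dataclass


def rank_scale(index):
    """Map a zero-based rank to (0, 1]: rank 0 -> 1.0, rank 1 -> 0.5, ..."""
    return 1.0 / (float(index) + 1.0)


@dataclass
class Result:
    url: str
    yacy: float = 0.0       # scaled position in YaCy's own ordering
    backlinks: float = 0.0  # scaled position when ordered by backlink count (assumption)
    clicks: float = 0.0     # scaled position when ordered by click count (assumption)

    def sum_function(self, w_yacy=3.0, w_backlinks=2.0, w_clicks=1.0):
        # Test coefficients from the thread; clicks get the smallest weight
        # because they are the easiest signal to spoof.
        return (w_yacy * self.yacy
                + w_backlinks * self.backlinks
                + w_clicks * self.clicks)


def re_sort(yacy_urls, backlink_counts, click_counts):
    """Re-sort YaCy's ordered URL list using backlink and click counts.

    yacy_urls is the list in YaCy's order; the two count arguments are
    dicts mapping URL -> count (missing URLs count as 0).
    """
    results = {url: Result(url, yacy=rank_scale(i))
               for i, url in enumerate(yacy_urls)}

    # sorted() is stable, so ties keep YaCy's original order.
    by_backlinks = sorted(yacy_urls, key=lambda u: backlink_counts.get(u, 0),
                          reverse=True)
    by_clicks = sorted(yacy_urls, key=lambda u: click_counts.get(u, 0),
                       reverse=True)

    for i, url in enumerate(by_backlinks):
        results[url].backlinks = rank_scale(i)
    for i, url in enumerate(by_clicks):
        results[url].clicks = rank_scale(i)

    return sorted(yacy_urls, key=lambda u: results[u].sum_function(),
                  reverse=True)
```

Because only ranks are used, a site with a huge raw click count cannot
gain more than the top rank's 1.0 on that signal, which limits how far
spoofed clicks can move a result.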

> Also, with your newly acquired knowledge about backlinks, how long
> is it going to take you to incorporate them in ahmia? Are you
> actually going to do it during the "Use an another crawler to
> search .onion pages from the public Internet" phase?

We can test it once the popularity tracking and the backlink crawler
are working.

-Juha


[tor-dev] GSoC: Ahmia.fi - Search Engine for Hidden Services

2014-04-22 Thread Juha Nurmi

Hi,

I'm a student who is starting to work on the ahmia.fi search engine as
part of Google Summer of Code. :)

The proposal is online at https://ahmia.fi/gsoc/

In practice, I now have the time and funding to develop my search
engine. George is my primary mentor and Moritz is the backup mentor.

Today, I will submit all the required documents (the tax forms etc.)
to Google.

After that, I plan to speed up work on the code base on GitHub :)

Cheers,
Juha


Re: [tor-dev] GSoC - Search Engine for Hidden services

2014-03-25 Thread Juha Nurmi

On 24.03.2014 13:57, George Kadianakis wrote:
> Replying to some new additions in the proposal:
> 
>> Thanks asn!
>>
>> "Ask help from organizations that are crawling"
>>
>> Today I emailed DuckDuckGo and asked whether there is an easy way
>> to search for new .onions using their search engine.
>>
>> "Checking out the backlinks from public WWW"
>>
>> With a known onion address, it is possible to estimate the
>> popularity of the address by checking the number of search results:
>>
>> https://duckduckgo.com/?q=%22http%3A%2F%2Fjlve2y45zacpbz6s.onion%22
>> and
>> https://www.google.com/#q=%22http:%2F%2Fjlve2y45zacpbz6s.onion%22
>> and
>> https://www.google.com/#q=link:http:%2F%2Fjlve2y45zacpbz6s.onion
>>
>> This way I will get a list that tells the popularity according to
>> links from the public WWW:
>>
>> onion address & number of WWW sites that are linking to it
>> xyz.onion 123
>> abc.onion 90
>> uio.onion 24
>> mre.onion 17
>>
>> Today I asked YaCy's developer how I could use this information.
>>
>> "Commenting features"
>>
>> I agree that commenting might be a mouth of madness because people
>> might write just some random crap there. Technically this would be
>> developed in the Django framework. Note that the priority of this
>> task is low (10). We could decide to leave this commenting feature
>> as the very last task or skip it.
> 
> ACK wrt commenting.
> 
> As far as backlinks are concerned, while I appreciate how rapid
> and easy your solution is, you might want to make it a bit more
> robust.
> 
> The way you did it, you treat the 123 references to 'xyz.onion',
> as strictly better than the 90 references to 'abc.onion'. This is
> not the case in the real web, since the 123 references to
> 'xyz.onion' might be SEO and they might be coming from xyz.onion
> itself or related websites.
> 
> Proper search engines assign weights to each backlink, according
> to how legit the search engine believes the linker to be. This has
> to do with how many backlinks the linker has, and how legit the
> HTML content of the linker looks, etc. You can find more
> heuristics that search engines use by skimming an SEO book or an
> SEO forum.
> 
> It's up to you how deep you want to go into backlinking during
> GSoC, but IMO backlinking is a more reliable heuristic than
> popularity tracking. Up to you anyway!
> 

We could test the reliability of the linkers too. As you said, there
are multiple methods for doing this. Because the number of .onions and
linkers is relatively small, we can analyze the linking sites as well.
Usually fewer than 10 sites link to any given .onion site.
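
A rough sketch of what weighting the linkers could look like, under
stated assumptions rather than a settled design: each linking host
counts once, links from the .onion itself are ignored, and every host
contributes a trust weight instead of a flat 1. The function names, the
linker_weight table and the default weight of 0.5 are all hypothetical.

```python
from urllib.parse import urlparse


def host_of(url):
    """Return the lowercase host name of a URL ('' if it has none)."""
    return (urlparse(url).hostname or "").lower()


def weighted_backlinks(onion, linking_urls, linker_weight, default_weight=0.5):
    """Score one .onion from the public WWW pages that link to it.

    linker_weight maps a linking host to a trust value in [0, 1]
    (hypothetical, e.g. assigned after manually reviewing the site);
    hosts without an entry get default_weight.
    """
    onion = onion.lower()
    hosts = {host_of(u) for u in linking_urls}
    hosts.discard("")
    # Ignore links that come from the onion itself or its subdomains,
    # since a site linking to itself says nothing about its popularity.
    hosts = {h for h in hosts if h != onion and not h.endswith("." + onion)}
    return sum(linker_weight.get(h, default_weight) for h in hosts)
```

With so few linkers per .onion, the weights could plausibly be set by
hand at first; many pages on one spammy host then add up to no more
than that host's single weight.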

-Juha


Re: [tor-dev] GSoC - Search Engine for Hidden services

2014-03-17 Thread Juha Nurmi
>>> Also, are you sure that 1-3 workdays are sufficient to design &
>>>  implement a banned domain synchronizer between tor2web and
>>> ahmia?
>> 
>> Well, I cannot know that. Let's allocate one workweek for it. I am
>> hoping to spend a workday or two with Tor2web and get it done.
>> 
>> 
> 
> How is ahmia going to communicate with tor2web? Will the connection
> be authenticated? How will you block bad people from adding their
> own stuff to your blacklist?

One way to solve this is to download a list of working Tor2web nodes
from GitHub. These nodes are added to the GitHub list manually. After
that, I can download the information from the nodes every day. In the
other direction, the Tor2web software can download the list of banned
domains from ahmia.fi. This is one easy way to handle the information
exchange.
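
As an illustration of the daily pull from the nodes, here is a small
sketch. The report format (a dict of onion domain to visit count) and
the function name collect_onions are assumptions made for this example;
the actual Tor2web output is not specified in this thread. The download
itself is passed in as a function, so the sketch makes no claim about
how the nodes are contacted.

```python
def collect_onions(node_urls, fetch):
    """Merge per-node reports into one {onion: total_visits} mapping.

    node_urls is the manually maintained list of Tor2web nodes.
    fetch(node_url) is a caller-supplied function that returns a dict of
    {onion_domain: visit_count} for that node (hypothetical format).
    """
    totals = {}
    for node in node_urls:
        try:
            report = fetch(node)
        except Exception:
            # An unreachable or misbehaving node must not break the daily run.
            continue
        for onion, visits in report.items():
            totals[onion] = totals.get(onion, 0) + int(visits)
    return totals
```

Only nodes on the curated list are ever polled, so an arbitrary host
cannot push its own entries into the statistics.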


Re: [tor-dev] GSoC - Search Engine for Hidden services

2014-03-17 Thread Juha Nurmi
On 17.03.2014 15:17, George Kadianakis wrote:
> But now that you don't have a "Search API" project, what are you
> going to do during the Globaleaks integration?

The search API was supposed to be a query API for ahmia's database.
However, it is not a relevant feature at the moment.

> Also, are you sure that 1-3 workdays are sufficient to design & 
> implement a banned domain synchronizer between tor2web and ahmia?

Well, I cannot know that. Let's allocate one workweek for it. I am
hoping to spend a workday or two with Tor2web and get it done.

> BTW, you are supposed to do your application in Google Melange,
> not in this mailing list (although I'm happy you posted your app
> here so that more people can comment on it!). The website is: 
> https://www.google-melange.com/gsoc/homepage/google/gsoc2014 The
> deadline is in 4 days or so, I think.

Sure!

> PS: Nitpick but I should exercise my allergy to broken crypto and 
> suggest to switch to a better hash algorithm (SHA256 or so) instead
> of MD5 for passing banned domain names around.

The reason we publish MD5 sums of the banned domains is that in
some countries it is illegal to own or host a list of CP URLs. Anyway,
anyone who is looking for CP .onions will find them... However, I
do not see any reason why we, together with Tor2web, could not switch
to a better hash algorithm.
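
For illustration, a minimal sketch of publishing and checking digests of
banned domains with SHA-256 instead of MD5, using Python's standard
hashlib. The normalization (strip and lowercase) and the function names
are assumptions; ahmia.fi and Tor2web would have to agree on the exact
input format for the digests to match.

```python
import hashlib


def domain_digest(domain, algorithm="sha256"):
    """Hex digest of a normalized .onion domain name.

    Normalization (strip whitespace, lowercase) is an assumption; both
    sides of the exchange must hash exactly the same bytes.
    """
    normalized = domain.strip().lower().encode("ascii")
    return hashlib.new(algorithm, normalized).hexdigest()


def is_banned(domain, banned_digests, algorithm="sha256"):
    """Check a domain against a published set of digests."""
    return domain_digest(domain, algorithm) in banned_digests


# The published list contains only digests, never the domain names.
# "exampleexample12.onion" is a placeholder, not a real address.
banned = {domain_digest("exampleexample12.onion")}
```

A node can then call is_banned() on each requested domain without
anyone having to host the cleartext list.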

-Juha


Re: [tor-dev] GSoC - Search Engine for Hidden services

2014-03-13 Thread Juha Nurmi

> And what would you like to do over the summer so that: a) Something
> useful and concrete comes out of only 3 months of work. b) Your
> work will also be useful after the summer ends.
> 
> I would be interested to see some areas that you would like to work
> on over the summer, and how that would change the ahmia.fi user
> experience.

I have drafted a timetable of possible new features for ahmia.fi:

https://docs.google.com/document/d/1XB42HM4uESYBAnoHHRuaqKMP64VFDI91Qa-CtIuye2E/edit?usp=sharing

>> We would like to release the source code of ahmia.fi and develop
>> our search engine in GSoC.
>> 
> 
> Yes, releasing the source code of ahmia.fi would be very useful in
> any case. It would be great if you could do that.

I released the source code of ahmia.fi:
https://github.com/juhanurmi/ahmia

-Juha
