Re: When querying ElasticSearch, score is 0

2015-04-18 Thread Andrejs Abele
Thank you for the information.
Cheers,
Andrejs
On 04/18/2015 10:23 AM, Nick Pentreath wrote:
> ES-hadoop uses a scan & scroll search to efficiently retrieve large
> result sets. Scores are not tracked in a scan, and sorting is not
> supported; hence the 0 scores.
>
> http://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html#scroll-scan
>
>
>
> —
> Sent from Mailbox <https://www.dropbox.com/mailbox>
>
>
> On Thu, Apr 16, 2015 at 10:46 PM, Andrejs Abele
> <andrejs.ab...@insight-centre.org> wrote:
>
> Hi,
> I have data in my ElasticSearch server. When I query it using the REST
> interface, I get results and a score for each result, but when I run
> the same query in Spark using the ElasticSearch API, I get results and
> metadata, but the score shown is 0 for each record.
> My configuration is:
>
> ...
> val conf = new SparkConf()
>   .setMaster("local[6]")
>   .setAppName("DBpedia to ElasticSearch")
>   .set("es.index.auto.create", "true")
>   .set("es.field.read.empty.as.null","true")
>   .set("es.read.metadata","true")
>
> ...
> val sc = new SparkContext(conf) 
> val test = Map("query" -> "{\n\"query\":{\n \"fuzzy_like_this\" : {\n
> \"fields\" : [\"label\"],\n \"like_text\" : \"102nd Ohio Infantry\" }\n  } \n}")
> val myRDD = sc.esRDD("dbpedia/docs", test.get("query").get)
>
> Sample output:
> Map(id -> "http://dbpedia.org/resource/Alert,_Ohio", label -> "Alert,
> Ohio", category -> "Unincorporated communities in Ohio", abstract -> "Alert 
> is an unincorporated community in southern Morgan Township, Butler County, 
> Ohio, in the United States. It is located about ten miles southwest of 
> Hamilton on Howards Creek, a tributary of the Great Miami River in section 28 
> of R1ET3N of the Congress Lands. It is three miles west of Shandon and two 
> miles south of Okeana.", _metadata -> Map(_index -> dbpedia, _type -> docs, 
> _id -> AUy5aQs7895C6HE5GmG4, _score -> 0.0))
> As you can see _score is 0.
>
> Would appreciate any help,
>
> Cheers,
> Andrejs 
>
>



When querying ElasticSearch, score is 0

2015-04-16 Thread Andrejs Abele
Hi,
I have data in my ElasticSearch server. When I query it using the REST
interface, I get results and a score for each result, but when I run the
same query in Spark using the ElasticSearch API, I get results and
metadata, but the score shown is 0 for each record.
My configuration is:

...
val conf = new SparkConf()
  .setMaster("local[6]")
  .setAppName("DBpedia to ElasticSearch")
  .set("es.index.auto.create", "true")
  .set("es.field.read.empty.as.null","true")
  .set("es.read.metadata","true")

...
val sc = new SparkContext(conf) 
val test = Map("query" -> "{\n\"query\":{\n \"fuzzy_like_this\" : {\n \"fields\" :
[\"label\"],\n \"like_text\" : \"102nd Ohio Infantry\" }\n  } \n}")
val myRDD = sc.esRDD("dbpedia/docs", test.get("query").get)
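
The same query is easier to read as a triple-quoted Scala string; this is a sketch of the identical JSON (the variable name is mine), not a change in behavior:

val query = """{
  "query": {
    "fuzzy_like_this": {
      "fields": ["label"],
      "like_text": "102nd Ohio Infantry"
    }
  }
}"""
val myRDD = sc.esRDD("dbpedia/docs", query)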

Sample output:
Map(id -> "http://dbpedia.org/resource/Alert,_Ohio", label -> "Alert, Ohio",
category -> "Unincorporated communities in Ohio", abstract -> "Alert is an 
unincorporated community in southern Morgan Township, Butler County, Ohio, in 
the United States. It is located about ten miles southwest of Hamilton on 
Howards Creek, a tributary of the Great Miami River in section 28 of R1ET3N of 
the Congress Lands. It is three miles west of Shandon and two miles south of 
Okeana.", _metadata -> Map(_index -> dbpedia, _type -> docs, _id -> 
AUy5aQs7895C6HE5GmG4, _score -> 0.0))

As you can see _score is 0.

Would appreciate any help,

Cheers,
Andrejs 



save as JSON objects

2014-11-04 Thread Andrejs Abele
Hi,
Can someone please suggest the best way to output Spark data as a JSON
file (a file where each line is a JSON object)?
Cheers,
Andrejs
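
One possible approach, sketched here since the thread itself contains no answer: serialize each record to a JSON string and save the RDD as plain text, one object per line. This assumes the json4s library (bundled with Spark); the Entry case class and output path are made up for illustration.

import org.apache.spark.{SparkConf, SparkContext}
import org.json4s.DefaultFormats
import org.json4s.jackson.Serialization

case class Entry(label: String, category: String)

val sc = new SparkContext(
  new SparkConf().setAppName("json-out").setMaster("local[2]"))
val data = sc.parallelize(Seq(
  Entry("Alert, Ohio", "Unincorporated communities in Ohio")))

// Serialize per partition; each record becomes one JSON object on its own line.
val json = data.mapPartitions { it =>
  implicit val formats = DefaultFormats
  it.map(e => Serialization.write(e))
}
json.saveAsTextFile("/tmp/entries-json")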


Re: how idf is calculated

2014-10-31 Thread Andrejs Abele
I found my problem. I assumed, based on the TF-IDF article on Wikipedia,
that log base 10 is used, but as I found in this discussion, in Spark it
is actually ln (the natural logarithm).
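
A quick check in the Scala REPL (a verification sketch, not part of the original message) confirms that the printed values are natural logarithms of (m + 1) / (d(t) + 1):

scala> math.log(4.0 / 3.0)   // d(t) = 2, e.g. "a"
res0: Double = 0.28768207245178085

scala> math.log(4.0 / 2.0)   // d(t) = 1, e.g. "b"
res1: Double = 0.6931471805599453

scala> math.log(4.0)         // the third printed value
res2: Double = 1.3862943611198906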

Regards,
Andrejs

On Thu, Oct 30, 2014 at 10:49 PM, Ashic Mahtab  wrote:

> Hi Andrejs,
> The calculations are a bit different to what I've come across in Mining
> Massive Datasets (2nd ed., Ullman et al., Cambridge University Press),
> available here:
> http://www.mmds.org/
>
> Their calculation of IDF is as follows:
>
> IDF_i = log2(N / n_i)
>
> where N is the number of documents and n_i is the number of documents in
> which the word appears. This looks different to your IDF function.
>
> For TF, they use
>
> TF_ij = f_ij / max_k f_kj
>
> That is: for document j, the term frequency of term i in j is the number
> of times i appears in j divided by the maximum number of times any term
> appears in j. (Stop words are usually excluded when considering the
> maximum.)
>
> So, in your case:
>
> TF_a1 = 2 / 2 = 1
> TF_b1 = 1 / 2 = 0.5
> TF_c1 = 1 / 2 = 0.5
> TF_m1 = 2 / 2 = 1
> ...
>
> IDF_a = log2(3 / 2) ≈ 0.585
>
> So, TF_a1 * IDF_a ≈ 0.585
>
> Wikipedia mentions an adjustment to overcome biases for long documents, by
> calculating TF_ij = 0.5 + (0.5 * f_ij) / (max_k f_kj), but that doesn't
> change anything for TF_a1, as the value remains 1.
>
> In other words, my calculations don't agree with yours, and neither seems
> to agree with Spark :)
>
> Regards,
> Ashic.
>
> --
> Date: Thu, 30 Oct 2014 22:13:49 +
> Subject: how idf is calculated
> From: andr...@sindicetech.com
> To: u...@spark.incubator.apache.org
>
>
> Hi,
> I'm writing a paper and I need to calculate tf-idf. With your help I
> managed to get the results I needed, but the problem is that I need to be
> able to explain how each number was obtained. So I tried to understand how
> idf was calculated, and the numbers I get don't correspond to those I
> should get.
>
> I have 3 documents (each line is a document):
> a a b c m m
> e a c d e e
> d j k l m m c
>
> When I calculate tf, I get this:
> (1048576,[99,100,106,107,108,109],[1.0,1.0,1.0,1.0,1.0,2.0])
> (1048576,[97,98,99,109],[2.0,1.0,1.0,2.0])
> (1048576,[97,99,100,101],[1.0,1.0,1.0,3.0])
>
> idf is supposedly calculated as idf = log((m + 1) / (d(t) + 1)), where
> m - the number of documents (3 in my case)
> d(t) - the number of documents in which the term is present
> a: log(4/3) =0.1249387366
> b: log(4/2) =0.3010299957
> c: log(4/4) =0
> d: log(4/3) =0.1249387366
> e: log(4/2) =0.3010299957
> l: log(4/2) =0.3010299957
> m: log(4/3) =0.1249387366
>
> When I output the idf vector with
> `idf.idf.toArray.filter(_ > 0).distinct.foreach(println(_))`
> I get:
> 1.3862943611198906
> 0.28768207245178085
> 0.6931471805599453
>
> I understand why there are only 3 numbers, because only 3 are unique:
> log(4/2), log(4/3), log(4/4), but I don't understand how the numbers in
> idf were calculated.
>
> Best regards,
> Andrejs
>
>
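
For reference, here is the arithmetic from the exchange above reproduced in Scala, a sketch added for verification rather than code from the thread:

// MMDS-style TF-IDF for term "a" in document 1 ("a a b c m m"):
val tf_a1 = 2.0 / 2.0                            // f(a,1) / max_k f(k,1) = 1.0
val idf_a = math.log(3.0 / 2.0) / math.log(2.0)  // log2(N / n_a) ≈ 0.585
println(tf_a1 * idf_a)                           // ≈ 0.585

// Spark MLlib's convention for the same term (see the reply above):
println(math.log((3.0 + 1.0) / (2.0 + 1.0)))     // ln((m + 1) / (d(t) + 1)) ≈ 0.288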


how idf is calculated

2014-10-30 Thread Andrejs Abele
Hi,
I'm writing a paper and I need to calculate tf-idf. With your help I
managed to get the results I needed, but the problem is that I need to be
able to explain how each number was obtained. So I tried to understand how
idf was calculated, and the numbers I get don't correspond to those I
should get.

I have 3 documents (each line is a document):
a a b c m m
e a c d e e
d j k l m m c

When I calculate tf, I get this:
(1048576,[99,100,106,107,108,109],[1.0,1.0,1.0,1.0,1.0,2.0])
(1048576,[97,98,99,109],[2.0,1.0,1.0,2.0])
(1048576,[97,99,100,101],[1.0,1.0,1.0,3.0])

idf is supposedly calculated as idf = log((m + 1) / (d(t) + 1)), where
m - the number of documents (3 in my case)
d(t) - the number of documents in which the term is present
a: log(4/3) =0.1249387366
b: log(4/2) =0.3010299957
c: log(4/4) =0
d: log(4/3) =0.1249387366
e: log(4/2) =0.3010299957
l: log(4/2) =0.3010299957
m: log(4/3) =0.1249387366

When I output the idf vector with
`idf.idf.toArray.filter(_ > 0).distinct.foreach(println(_))`
I get:
1.3862943611198906
0.28768207245178085
0.6931471805599453

I understand why there are only 3 numbers, because only 3 are unique:
log(4/2), log(4/3), log(4/4), but I don't understand how the numbers in
idf were calculated.

Best regards,
Andrejs


Getting vector values

2014-10-30 Thread Andrejs Abele
Hi,

I'm new to MLlib and Spark. I'm trying to use tf-idf and use those values
for term ranking.
I'm getting tf values in vector format, but how can I get the values out
of the vector?

import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val sc = new SparkContext(conf)
val documents: RDD[Seq[String]] =
  sc.textFile("/home/andrejs/Datasets/dbpedia/test.txt").map(_.split(" ").toSeq)
documents.foreach(println(_))

// HashingTF hashes each term to one of 2^20 (= 1048576) buckets by default.
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)

tf.foreach(println(_))

My output is:
WrappedArray(a, a, b, c)
WrappedArray(e, a, c, d)

(1048576,[97,99,100,101],[1.0,1.0,1.0,1.0])
(1048576,[97,98,99],[2.0,1.0,1.0])

How can I get [97,99,100,101] out, and [1.0,1.0,1.0,1.0]?
And how can I map 100 to 1.0?

Some help is greatly appreciated,

Andrejs
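
One way to unpack these sparse vectors, sketched here since the thread contains no answer; it relies only on MLlib's public SparseVector fields indices and values:

import org.apache.spark.mllib.linalg.SparseVector

// Pair each hashed term index with its count, e.g. the first vector
// becomes Map(97 -> 1.0, 99 -> 1.0, 100 -> 1.0, 101 -> 1.0).
val pairs = tf.map { v =>
  val sv = v.asInstanceOf[SparseVector]
  sv.indices.zip(sv.values).toMap
}
pairs.foreach(println(_))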