Re: How to approach Indexing for a newbie?

2014-01-15 Thread Ralf Schmitt
joergpra...@gmail.com writes:

 4. You have to write a program that traverses your folders, picks up each
 document, and extracts fields from the document to get them indexed. 

Or you might use es-nozzle [1], which traverses your folders and indexes
documents into Elasticsearch. It uses Tika to extract content from
various file formats and incrementally synchronizes each folder's
content with the Elasticsearch index, i.e. it adds new documents to the
index and deletes documents from Elasticsearch once they have been
removed from the folder.
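
To make the idea of incremental synchronization concrete, here is a minimal Java sketch. It is not es-nozzle's actual implementation: it only walks a folder, compares the files it finds against the set of paths already indexed, and reports which documents would have to be indexed and which deleted. The class name FolderSync, its method names, and the plain Set<String> standing in for a real Elasticsearch index are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration only; es-nozzle itself works differently.
public class FolderSync {

    /** Returns the relative paths of all regular files below root, sorted. */
    public static Set<String> listFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .map(p -> root.relativize(p).toString().replace('\\', '/'))
                        .collect(Collectors.toCollection(TreeSet::new));
        }
    }

    /** Files present on disk but missing from the index: these need indexing. */
    public static Set<String> toIndex(Set<String> onDisk, Set<String> indexed) {
        Set<String> result = new TreeSet<>(onDisk);
        result.removeAll(indexed);
        return result;
    }

    /** Index entries whose file no longer exists: these need deleting. */
    public static Set<String> toDelete(Set<String> onDisk, Set<String> indexed) {
        Set<String> result = new TreeSet<>(indexed);
        result.removeAll(onDisk);
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        Set<String> onDisk = listFiles(root);
        // Stand-in for the document ids currently in the index (assumed empty here).
        Set<String> indexed = new TreeSet<>();
        System.out.println("index:  " + toIndex(onDisk, indexed));
        System.out.println("delete: " + toDelete(onDisk, indexed));
    }
}
```

This sketch only detects added and removed files. Detecting modified files would need extra bookkeeping, for example comparing modification times, which is left out here.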

Please visit http://brainbot.com/es-nozzle/doc/ for detailed
documentation. The code lives on GitHub:
https://github.com/brainbot-com/es-nozzle

Please let me know about any problems you run into if you give it a
try. I'm the author of es-nozzle.

Another option might be fsriver: https://github.com/dadoonet/fsriver

-- 
Cheers
Ralf

-- 
You received this message because you are subscribed to the Google Groups 
elasticsearch group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/87zjmxl012.fsf%40systemexit.de.
For more options, visit https://groups.google.com/groups/opt_out.


How to approach Indexing for a newbie?

2014-01-14 Thread ZenMaster80
I have a project that used an old search engine, and I would like to move 
things to Elasticsearch. I have been doing some reading and would like some 
perspective on how to approach the problem.
- I have bundles (folders) of text/html/pdf/img documents. Each folder holds 
an average of 50-100 documents, and each document is about 100K in size.
- The number of folders and documents can increase and decrease; it mostly 
increases, but only slightly.

I understand that the txt/html documents will need to be turned into JSON, and 
that I will somehow have to create an index and add these documents to it 
for indexing. There are some things I still don't fully understand:
1- How do I know how many indices I need?
2- How do I know how many shards to allocate when creating the index?
3- How do I know how many nodes I need, and how do I make things scale up 
and down? Is there a way to idle things when no indexing is happening? 
4- How do I add documents to the index for indexing? I always see examples 
with JSON snippets, but in reality I have something like 
folder1{doc1,doc2,..doc100}, folder2{docA...docN} ...
5- This is probably a dumb question... Is there a preferred language to use 
for the indexing calls? If I were to build an app that calls the REST API, 
which language should I use, if it matters at all?

Thanks again for the help.



Re: How to approach Indexing for a newbie?

2014-01-14 Thread joergpra...@gmail.com
1. Mostly, indexes are the result of a partition design outside ES, for
example by time, user, or data origin. The beauty of ES is that it can host
as many indexes as you wish.

2. If you know the maximum number of nodes (hosts) you want to dedicate to
ES, use that number as the number of shards. That way you make sure your
cluster can scale. If the number is not known, try to estimate the total
number of documents to be indexed, the total volume of those documents, and
the index volume per shard. Rule of thumb: a shard should be sized
so that it can fit into the Java heap and can be moved between nodes
in reasonable time (~1-10 GB).

3. You can scale up by adding nodes: just start ES on another host. Scaling
down is also easy: stop ES on a node.

4. You have to write a program that traverses your folders, picks up each
document, and extracts fields from the document to get them indexed. With
scrutmydocs.org you can experiment with how this works, using a file
traverser that is already prepared to handle quite a lot of file types
automatically.

5. You should consider using one of the standard clients. As ES supports
HTTP REST, and the standard clients are designed to support a comparable
set of features, it does not matter what language you use. Just pick your
favorite language. (My personal favorite is Java, where there is no need to
use HTTP REST; the native transport protocol can be used instead.)

Jörg
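
The sizing advice in point 2 boils down to simple arithmetic, sketched below in Java. Only the 50-100 documents per folder, the roughly 100K per document, and the ~1-10 GB shard range come from this thread. The folder count (2,000), the 5 GB target shard size, and the simplification that the index is about as large as the raw documents are assumptions chosen purely for illustration, as is the class name ShardEstimate.

```java
public class ShardEstimate {

    /** Smallest shard count (at least 1) that keeps every shard at or below targetShardBytes. */
    public static int shardsFor(long totalIndexBytes, long targetShardBytes) {
        if (totalIndexBytes < 0 || targetShardBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and the target positive");
        }
        // Ceiling division without floating point.
        long shards = (totalIndexBytes + targetShardBytes - 1) / targetShardBytes;
        return (int) Math.max(1, shards);
    }

    public static void main(String[] args) {
        long folders = 2_000;         // assumption: not stated in the thread
        long docsPerFolder = 100;     // upper end of the 50-100 range from the question
        long bytesPerDoc = 100_000;   // "about 100K" per document
        // Assumption: index volume is roughly equal to raw document volume.
        long totalBytes = folders * docsPerFolder * bytesPerDoc; // 20,000,000,000 bytes
        long targetShardBytes = 5_000_000_000L; // assumption: 5 GB, inside the ~1-10 GB rule of thumb

        System.out.println(shardsFor(totalBytes, targetShardBytes)); // prints 4
    }
}
```

If the maximum number of nodes is already known, the first half of point 2 applies instead and that node count is used directly as the shard count.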



Re: How to approach Indexing for a newbie?

2014-01-14 Thread ZenMaster80
Wow, this is exactly what I was looking for. I am a bit curious about #5. I 
assume there is a Java API for accessing ES; is there a link on how to get 
started using Java with ES? I would like to know how to import the ES 
framework/API into a Java project.

Thanks again, this is a great clarification!



Re: How to approach Indexing for a newbie?

2014-01-14 Thread ZenMaster80
Thanks. I added the .jar as a dependency in a simple Java project using 
Eclipse. 
I get this error when I try to run the program. Any clues?

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/lucene/util/Version
    at org.elasticsearch.Version.<clinit>(Version.java:42)
    at org.elasticsearch.node.internal.InternalNode.<init>(InternalNode.java:121)
    at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:159)
    at org.elasticsearch.node.NodeBuilder.node(NodeBuilder.java:166)
    at EntryPoint.main(EntryPoint.java:25)
Caused by: java.lang.ClassNotFoundException: org.apache.lucene.util.Version
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
    ... 5 more



On Tuesday, January 14, 2014 5:22:22 PM UTC-5, Jörg Prante wrote:

 To get an overview of what is possible, look at the Elasticsearch test 
 sources at 
 https://github.com/elasticsearch/elasticsearch/tree/master/src/test/java/org/elasticsearch

 There are many code snippets that are useful for learning how to use the 
 Java API.

 You can use Elasticsearch by adding the jar as a dependency in your 
 project (with Maven it is very easy).

 Jörg





Re: How to approach Indexing for a newbie?

2014-01-14 Thread InquiringMind


 Never mind, I just had to import more jars from /lib.


You can import all jars from /some_base_path/lib (for example) by adding 
a /* to the end of the path and then adding that to the value of the -cp / 
-classpath option, separating multiple paths with semicolons (on Windows; 
Unix-like systems use colons). That single * (and not *.jar) is shorthand 
that tells Java to include all jar files in the directory. So you do not need 
to add them one by one, and you never again have to worry when a future 
version of ES adds new jar files or renames existing ones. In fact, I've only 
discovered new jar files in ES versions when I read about them on this 
newsgroup; that little asterisk is like magic and saves me from ever worrying 
or caring about the exact set of jar files that are bundled with ES.

See the "Understanding class path wildcards" section at 
http://docs.oracle.com/javase/7/docs/technotes/tools/windows/classpath.html 
for the full details.
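
To check whether the wildcard really picked up the jar that the stack trace earlier in this thread complained about, a small diagnostic along these lines can help. It is a sketch that uses only the standard library; the class name ClasspathCheck is made up for illustration, and org.apache.lucene.util.Version is simply the class named in that NoClassDefFoundError.

```java
import java.io.File;

// Hypothetical diagnostic for illustration; not part of Elasticsearch.
public class ClasspathCheck {

    /** Returns true if the named class can be located on the current classpath. */
    public static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A trailing /* is expanded before main() runs, so the individual jars are listed here.
        String classPath = System.getProperty("java.class.path");
        for (String entry : classPath.split(File.pathSeparator)) {
            System.out.println(entry);
        }
        String lucene = "org.apache.lucene.util.Version";
        System.out.println(lucene + " loadable: " + isLoadable(lucene));
    }
}
```

On a Unix-like system this could be run as java -cp "/some_base_path/lib/*:." ClasspathCheck, with the quotes keeping the shell from expanding the asterisk itself. On Windows the separator would be a semicolon.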

Hope this helps!

Brian
