Re: How to approach Indexing for a newbie?
joergpra...@gmail.com writes:

> 4. You have to write a program that traverses your folders, picks up each document, and extracts fields from the document to get them indexed.

Or you might use es-nozzle [1], which traverses your folders and indexes documents into Elasticsearch. It uses Tika to extract content from various file formats and incrementally synchronizes the folders' content to the Elasticsearch index. That is, it updates your index with new documents and deletes documents from Elasticsearch if they have been removed from the folder.

Please visit http://brainbot.com/es-nozzle/doc/ for detailed documentation. The code lives on GitHub: https://github.com/brainbot-com/es-nozzle

Please let me know about any problems you run into if you give it a try. I'm the author of es-nozzle.

Another option might be fsriver: https://github.com/dadoonet/fsriver

--
Cheers
Ralf

--
You received this message because you are subscribed to the Google Groups elasticsearch group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/87zjmxl012.fsf%40systemexit.de.
For more options, visit https://groups.google.com/groups/opt_out.
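The incremental synchronization mentioned in this message comes down to comparing what is in the folders with what is already in the index. Below is a minimal Python sketch of that comparison. It is not es-nozzle's actual implementation; the function name `plan_sync` and the use of relative paths as document ids are assumptions made for illustration. It only detects added and removed files, not modified ones.

```python
from pathlib import Path


def plan_sync(root, indexed_ids):
    """Compare the files under root with the ids already in the index.

    Returns (to_index, to_delete) as sorted lists of relative paths.
    Hypothetical helper: assumes each document id is the file's path
    relative to root, written with forward slashes.
    """
    root = Path(root)
    on_disk = {
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file()
    }
    indexed = set(indexed_ids)
    to_index = sorted(on_disk - indexed)   # files not yet in the index
    to_delete = sorted(indexed - on_disk)  # indexed files no longer on disk
    return to_index, to_delete
```

A real tool would also need to notice changed files, for example by storing a modification time or checksum with each indexed document.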
How to approach Indexing for a newbie?
I have a project that used an old search engine, and I would like to move things to Elasticsearch. I have been doing some reading and would like some perspective on how to approach the problem.

- I have bundles (folders) of text/html/pdf/img documents. Each folder holds 50-100 documents on average, and each document is about 100 KB in size.
- The number of folders and documents can increase and decrease; mostly it increases, but only slightly.

I understand that the txt/html documents will need to be turned into JSON, and that I will have to create an index and add these documents to it for indexing. There are some things I still don't fully understand:

1- How do I know how many indices I need?
2- How do I know how many shards to allocate when creating the index?
3- How do I know how many nodes are needed, and how do I make things scale up and down? Is there a way to idle things when no indexing is happening?
4- How do I add documents to the index for indexing? I always see examples with JSON snippets, but in reality I have something like folder1{doc1,doc2,..doc100}, folder2{docA...docN} ...
5- This is probably a dumb question... Is there a preferred language for the indexing calls? If I were to build an app that calls the REST API, which language would I need to use, if any?

Thanks again for the help.
Re: How to approach Indexing for a newbie?
1. Mostly, indexes are the result of a partition design made outside ES, for example by time, user, or data origin. The beauty of ES is that it can host as many indexes as you wish.

2. If you know the maximum number of nodes (hosts) you want to dedicate to ES, use that number as the number of shards. That way you make sure your cluster can scale. If the number is not known, try to estimate the total number of documents to be indexed, the total volume of those documents, and an estimated index volume per shard. Rule of thumb: a shard should be sized so that it fits into the Java heap and can be moved between nodes in reasonable time (~1-10 GB).

3. You can scale up by adding nodes: just start ES on another host. Scaling down is just as easy: stop ES on a node.

4. You have to write a program that traverses your folders, picks up each document, and extracts fields from the document to get them indexed. With scrutmydocs.org you can experiment with how this works, using a file traverser that is already prepared to handle quite a lot of file types automatically.

5. You should consider using one of the standard clients. As ES supports HTTP REST, and the standard clients are designed to support a comparable set of features, it does not matter what language you use. Just pick your favorite language. (My personal favorite is Java, where there is no need to use HTTP REST; the native transport protocol can be used instead.)

Jörg
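The estimate in point 2 is simple arithmetic, sketched here in Python. The function `estimate_shards` and its `index_overhead` parameter (the ratio of index size to raw document size, which has to be measured on real data) are assumptions for illustration, as is the folder count used in the example that follows the code.

```python
import math


def estimate_shards(total_docs, avg_doc_bytes, target_shard_bytes,
                    index_overhead=1.0):
    """Estimate a shard count from total data volume and a per-shard target.

    index_overhead is a hypothetical factor for how much larger or smaller
    the index is than the raw documents; 1.0 means "same size".
    """
    total_bytes = total_docs * avg_doc_bytes * index_overhead
    # Always allocate at least one shard, and round up partial shards.
    return max(1, math.ceil(total_bytes / target_shard_bytes))
```

With a hypothetical 10,000 folders of 75 documents each at 100 KB per document, that is 750,000 documents and 75 GB of raw data. Aiming for 5 GB per shard, `estimate_shards(750_000, 100_000, 5 * 10**9)` returns 15.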
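Point 4 can be sketched in a few lines of Python (per point 5, any language with a JSON library would do). The sketch walks a folder tree and emits the newline-delimited JSON that the Elasticsearch `_bulk` endpoint accepts: one action line followed by one document line per file. The function names, the field names `folder`, `filename` and `content`, the index name `docs`, and the choice of the relative path as `_id` are all assumptions. Only .txt and .html files are read, and they are indexed as-is; PDFs and images would need a content extractor first. Depending on the ES version, the action line may need more metadata, such as a type.

```python
import json
from pathlib import Path

# Assumption: only these extensions are plain text and safe to read directly.
TEXT_SUFFIXES = {".txt", ".html"}


def bulk_lines(root, index_name="docs"):
    """Yield action/source JSON lines for every text file under root."""
    root = Path(root)
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in TEXT_SUFFIXES:
            continue
        rel = path.relative_to(root)
        # Action line: tells ES to index the next line under this id.
        action = {"index": {"_index": index_name, "_id": rel.as_posix()}}
        # Source line: the document itself, with hypothetical field names.
        source = {
            "folder": rel.parent.as_posix(),
            "filename": path.name,
            "content": path.read_text(encoding="utf-8", errors="replace"),
        }
        yield json.dumps(action)
        yield json.dumps(source)


def bulk_body(root, index_name="docs"):
    """Join the lines into one request body; every line ends with a newline."""
    return "".join(line + "\n" for line in bulk_lines(root, index_name))
```

The resulting string would be sent as the body of a POST request to the `_bulk` endpoint, either with a plain HTTP client or through one of the standard clients.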
Re: How to approach Indexing for a newbie?
Wow, this is exactly what I was looking for.

I am a bit curious about #5. I assume there is a Java API for accessing ES; is there a link on how to get started using Java with ES? I would like to know how to import the ES framework/API into a Java project.

Thanks again, this is a great clarification!

On Tuesday, January 14, 2014 4:17:31 PM UTC-5, Jörg Prante wrote:

> 5. You should consider using one of the standard clients. As ES supports HTTP REST, and the standard clients are designed to support a comparable set of features, it does not matter what language you use. Just pick your favorite language. (My personal favorite is Java, where there is no need to use HTTP REST, instead the native transport protocol can be used)
Re: How to approach Indexing for a newbie?
Thanks. I added the .jar as a dependency in a simple Java project using Eclipse. I get this error when I try to run the program. Any clues?

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/lucene/util/Version
    at org.elasticsearch.Version.<clinit>(Version.java:42)
    at org.elasticsearch.node.internal.InternalNode.<init>(InternalNode.java:121)
    at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:159)
    at org.elasticsearch.node.NodeBuilder.node(NodeBuilder.java:166)
    at EntryPoint.main(EntryPoint.java:25)
Caused by: java.lang.ClassNotFoundException: org.apache.lucene.util.Version
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
    ... 5 more

On Tuesday, January 14, 2014 5:22:22 PM UTC-5, Jörg Prante wrote:

> To get an overview of what is possible, look at the Elasticsearch test sources at https://github.com/elasticsearch/elasticsearch/tree/master/src/test/java/org/elasticsearch
>
> There are many code snippets that are useful for learning how to use the Java API.
>
> You can use Elasticsearch by adding the jar as a dependency in your project (with Maven it is very easy).
>
> Jörg
Re: How to approach Indexing for a newbie?
> Never mind, I just had to import more jars from /lib.

You can import all jars from /some_base_path/lib (for example) by adding /* to the end of the path, giving /some_base_path/lib/*, and then adding that to the value of the -cp / -classpath option. Multiple paths are separated with semicolons on Windows and with colons on Unix-like systems. In a Unix shell, quote the path so that the shell does not expand the asterisk itself.

That single * (and not *.jar) is shorthand that tells Java to include all jar files in the directory. You do not need to add them one by one, and you never again have to worry when a future version of ES adds new jar files or renames existing ones. In fact, I've only discovered new jar files in ES versions when I read about them on this newsgroup; that little asterisk is like magic and saves me from ever worrying or caring about the exact set of jar files that are bundled with ES.

See the "Understanding class path wildcards" section at http://docs.oracle.com/javase/7/docs/technotes/tools/windows/classpath.html for the full details.

Hope this helps!

Brian