High frequency terms in results document....

2015-02-15 Thread Shouvik Bardhan
Apologies if I have missed this in prior discussions, but I have looked all over. I
looked at the Luke code, and it does find high-frequency terms over the entire
index. I am trying to get the top N high-frequency terms in the documents
returned from a search result. I came across something called
FilterIndexReader, but I don't think it is part of the 4.x codebase. Any pointer
is appreciated.
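
A minimal sketch of one way to do this over a result set, not taken from the thread: iterate the hit documents' term vectors and count terms in a map. It assumes the field was indexed with term vectors enabled and uses the Lucene 4.x TermsEnum API; the class and method names (ResultTermCounter, topTerms) are made up for illustration.

    import java.util.*;
    import org.apache.lucene.index.*;
    import org.apache.lucene.search.*;
    import org.apache.lucene.util.BytesRef;

    public class ResultTermCounter {
        // Top-N terms across the documents matched by a query, read from
        // per-document term vectors (field must be indexed with term vectors).
        public static List<Map.Entry<String, Long>> topTerms(
                IndexSearcher searcher, Query query, String field, int n) throws Exception {
            IndexReader reader = searcher.getIndexReader();
            TopDocs hits = searcher.search(query, Math.max(1, reader.maxDoc()));
            Map<String, Long> counts = new HashMap<>();
            for (ScoreDoc sd : hits.scoreDocs) {
                Terms vector = reader.getTermVector(sd.doc, field);
                if (vector == null) {
                    continue; // this document has no term vector for the field
                }
                TermsEnum te = vector.iterator(null); // 4.x signature; 5.x drops the argument
                BytesRef term;
                while ((term = te.next()) != null) {
                    // on a single-document term vector, totalTermFreq() is the in-document frequency
                    counts.merge(term.utf8ToString(), te.totalTermFreq(), Long::sum);
                }
            }
            List<Map.Entry<String, Long>> sorted = new ArrayList<>(counts.entrySet());
            sorted.sort(Map.Entry.<String, Long>comparingByValue().reversed());
            return sorted.subList(0, Math.min(n, sorted.size()));
        }
    }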


JTRES 2015 CFP

2015-02-15 Thread Lukasz Ziarek
Dear Real-time Java Community,

 Remi and I are pleased to announce the release of the JTRES 2015 call for 
papers (below) and the JTRES 2015 website: http://jtres2015.univ-mlv.fr.  JTRES 
will be held in Paris on October 7th and 8th.  We hope to see you there and 
look forward to your submissions.

Regards,
Luke and Remi

   The 13th International Workshop on Java Technologies for
   Real-time and Embedded Systems - JTRES 2015

October 7th - 8th
Paris, France


 Call for Papers



MOTIVATION

Over 90% of all microprocessors are now used for real-time and
embedded applications. Embedded devices are deployed on a broad
diversity of distinct processor architectures and operating
systems. The application software for many embedded devices is custom
tailored, if not written entirely from scratch. The size of typical
embedded system software applications is growing exponentially from
year to year, with many of today's embedded systems comprising
multiple millions of lines of code. For all of these reasons, the
software portability, reuse, and modular composability benefits
offered by Java are especially valuable to developers of embedded
systems.

Both embedded and general-purpose software frequently need to comply
with real-time constraints. Higher-level programming languages and
middleware are needed to robustly and productively design, implement,
compose, integrate, validate, and enforce memory and real-time
constraints along with conventional functional requirements for
reusable software components. The Java programming language has become
an attractive choice because of its safety, productivity, relatively
low maintenance costs, and the availability of well-trained
developers.

Although Java features good software engineering characteristics,
traditional Java virtual machine (JVM) implementations are unsuitable
for deploying real-time software due to under-specification of thread
scheduling and synchronization semantics, unclear demand and
utilization of memory and CPU resources, and unpredictable
interference associated with automatic garbage collection and adaptive
compilation.

GOAL

Interest in real-time Java by both the academic research community and
commercial industry has been motivated by the need to manage the
complexity and costs associated with continually expanding embedded
real-time software systems. The goal of the workshop is to gather
researchers working on real-time and embedded Java to identify the
challenging problems that still need to be solved to ensure the
success of real-time Java as a technology, and to report results
and experience gained by researchers.

The Java ecosystem has outgrown the combination of Java as a programming
language and the JVM. For example, Android uses Java as the source
language and the Dalvik virtual machine for execution. Languages such
as Scala are compiled to Java bytecode and executed on the JVM. JTRES
welcomes submissions that apply such approaches to embedded and/or
real-time systems.

TOPICS OF INTEREST

Topics of interest to this workshop include, but are not limited to:

- New real-time programming paradigms and language features
- Industrial experience and practitioner reports
- Open source solutions for real-time Java
- Real-time design patterns and programming idioms
- High-integrity and safety critical system support
- Java-based real-time operating systems and processors
- Extensions to the RTSJ and SCJ
- Real-time and embedded virtual machines and execution environments
- Memory management and real-time garbage collection
- Scheduling frameworks, feasibility analysis, and timing analysis
- Multiprocessor and distributed real-time Java
- Real-time solutions for Android
- Languages other than Java on real-time or embedded JVMs
- Benchmarks and Open Source applications using real-time Java

SUBMISSION REQUIREMENTS

Participants are expected to submit a paper of at most 10 pages (ACM
Conference Format, i.e., two columns, 10-point font). Industrial
experience and practitioner reports may be submitted as 4-page short
papers, with no page limit for the references cited. Accepted papers will
be published in the ACM International Conference Proceedings Series
via the ACM Digital Library and must be presented by one of the authors
at JTRES.

LaTeX and Word templates can be found at:
http://www.acm.org/sigs/pubs/proceed/template.html

The ISBN for JTRES 2015 is TBD.

Papers describing open source projects shall include, in an appendix, a
description of how to obtain the source and how to run the experiments.

Papers should be submitted through EasyChair. Please use the
submission link: https://easychair.org/conferences/?conf=jtres2015

The best papers, as determined by the program committee, will be invited
for submission to a special issue of the journal Concurrency and
Computation: Practice and Experience.

IMPORTANT DATES

- Paper Submission: June 8, 2015
- Notification of

Re: Top 10 words

2015-02-15 Thread Denis Bazhenov
Either you index those words as a facet, or you calculate the top 10 words
on the fly. The latter approach can be efficient enough if you are able to read
those documents quickly. Computing the top 10 words is cheap in terms of memory
and CPU, because there is in fact no need to do a full sort
(see https://github.com/addthis/stream-lib).
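
A rough sketch of that on-the-fly counting, assuming the StreamSummary class (a Space-Saving top-k structure) from the stream-lib project linked above and a naive whitespace tokenizer; the sample texts and the capacity value are illustrative only.

    import java.util.List;
    import com.clearspring.analytics.stream.Counter;
    import com.clearspring.analytics.stream.StreamSummary;

    public class TopWords {
        public static void main(String[] args) {
            // Space-Saving sketch: bounded memory, no full sort of the vocabulary.
            StreamSummary<String> counts = new StreamSummary<>(1000); // capacity well above k

            String[] summaries = {
                "Companies use Lucene for searching",
                "Lucene search is fast",
                "Search engines index documents"
            };
            for (String text : summaries) {
                for (String word : text.toLowerCase().split("\\W+")) { // naive tokenizer
                    if (!word.isEmpty()) {
                        counts.offer(word);
                    }
                }
            }

            // Top 10 words with their (approximate) counts.
            for (Counter<String> c : counts.topK(10)) {
                System.out.println(c.getItem() + " -> " + c.getCount());
            }
        }
    }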

> On Feb 14, 2015, at 04:34, Maisnam Ns  wrote:
> 
> Hi Jigar,
> 
> Thanks for the clustering algorithm; I will see if it can be applied.
> 
> These are not known fields, as these documents are coming from some other
> search engine. Every time the user changes his search string the documents
> will vary, but I am assuming the worst case scenario here, say about 10
> documents. For faceted search we also need to know the facets in advance.
> 
> You search for a string and it gives a bunch of documents, each containing
> some summary of the document, and all I have to do is quickly find the top 10
> words from these documents' summaries, which will vary depending on the search
> query. The response time is the problem: it has to be just a few seconds,
> and memory is the issue here.
> 
> Again, thanks for that link; I will look into it. If you find some solution,
> please let me know.
> 
> Thanks
> 
> On Fri, Feb 13, 2015 at 11:12 PM, Jigar Shah  wrote:
> 
>> If those are known fields in the documents, you may extract words while
>> indexing and create facets. Lucene supports faceted search, which can give
>> you top-n counts for such fields and is much more efficient.
>>
>> Another option is to apply a clustering algorithm to the results, which can
>> provide the top-n words; you can refer to http://search.carrot2.org
>> 
>> 
>> 
>> 
>> On Fri, Feb 13, 2015 at 10:13 PM, Maisnam Ns  wrote:
>> 
>>> Hi,
>>> 
>>> Can someone help me with this use case:
>>> 
>>> 1. I have to search a string, and let's say the search engine (it is not
>>> Lucene) found this string in 100,000 documents. I need to find the top 10
>>> words occurring in these documents. As the documents are large, how do I
>>> further index these documents and find the top 10 words?
>>>
>>> 2. I am thinking of using a Lucene RAMDirectory or memory indexing to find
>>> the 10 most frequently occurring words.
>>> 3. Is this the right approach? Indexing and writing to disk would be almost
>>> overkill, and a user can search any number of times.
>>> 
>>> Thanks in advance.
>>> 
>> 

---
Denis Bazhenov 









Re: Top 10 words

2015-02-15 Thread Maisnam Ns
Hi Denis,

Looks good, and thanks for the links. One more question: once I have the
top ten, say 'Lucene' - 1000, 'search' - 789, I need to run a quick span query
on 'Lucene', e.g. to pull out phrases containing 'Lucene' such as 'Companies
use Lucene for searching'. I tried
http://sujitpal.blogspot.in/2009/02/summarization-with-lucene.html, but that
does not work in memory. Is there a Java library where I can do a quick
search with a span query?

Thanks
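
One possible in-memory approach, not from this thread: Lucene's lucene-memory module provides MemoryIndex, which can run a query, including span queries, against a single piece of text without writing anything to disk. A minimal sketch, with the field name and sample text purely illustrative:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;
    import org.apache.lucene.util.Version;

    public class InMemorySpanCheck {
        public static void main(String[] args) {
            // Most 4.x constructors want a Version; LUCENE_CURRENT is deprecated but works.
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

            // Index one summary entirely in memory.
            MemoryIndex index = new MemoryIndex();
            index.addField("summary", "Companies use Lucene for searching", analyzer);

            // "lucene" followed by "searching" within 3 positions, in order.
            SpanQuery query = new SpanNearQuery(
                    new SpanQuery[] {
                            new SpanTermQuery(new Term("summary", "lucene")),
                            new SpanTermQuery(new Term("summary", "searching"))
                    },
                    3,     // slop
                    true); // in order

            // MemoryIndex.search returns a score; anything > 0 means the spans matched.
            float score = index.search(query);
            System.out.println(score > 0 ? "match" : "no match");
        }
    }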



Re: Top 10 words

2015-02-15 Thread Maisnam Ns
Hi Jigar,

The link you shared, http://search.carrot2.org,
is really nice; many of its features actually match my requirements.

Thanks for sharing.

