Re: Adding pdf/word file using JSON/XML

Roland Everaert Wed, 12 Jun 2013 02:22:49 -0700

1) Being aggressive and insulting is not a way to help people understand
such a complex tool, or to help people in general.


2) I read the Solr feature page again, and it states that the interface is
REST-like, not RESTful as I first thought and told the devs. As the devs
told me, a RESTful interface doesn't use parameters in the URI/URL, so it is
my mistake. Hence we have no problem with the interface as it is.

Anyway, I still have a question regarding the /extract interface. It seems
that every time a file is updated in Solr, the Lucene document is recreated
from scratch, which means that any extra information we want
indexed/stored along with the file is erased if the request doesn't contain
it. Is there a parameter that changes that behaviour?
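
Until that is answered, the workaround this implies is to re-send every
metadata field with each update. A minimal sketch, assuming the client keeps
its own copy of the metadata. The helper names build_extract_url and
build_extract_request are hypothetical; the endpoint and the literal.<field>
parameter form are taken from the curl example quoted below.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint taken from the curl example in this thread; adjust for your setup.
EXTRACT_URL = "http://localhost:8080/solr/update/extract"


def build_extract_url(doc_id, metadata, base_url=EXTRACT_URL):
    """Return an /update/extract URL carrying the id and all metadata as literal.* parameters."""
    params = [("literal.id", doc_id)]
    # Sort the field names so the query string is stable and reproducible.
    params += [("literal." + name, value) for name, value in sorted(metadata.items())]
    return base_url + "?" + urlencode(params)


def build_extract_request(doc_id, metadata, file_bytes, content_type):
    """Build (but do not send) the POST request that uploads the file body."""
    return Request(
        build_extract_url(doc_id, metadata),
        data=file_bytes,
        headers={"Content-Type": content_type},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical document id, metadata and file content, for illustration only.
    req = build_extract_request("doc10", {"name": "BLAH"}, b"%PDF-1.4 ...", "application/pdf")
    print(req.full_url)
```

Passing the returned request to urllib.request.urlopen would send it. Because
the full metadata dictionary goes into every request, an update that replaces
the document does not silently drop fields.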



Regards,


Roland.


On Tue, Jun 11, 2013 at 4:35 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> "is it possible to index the file + metadata with a JSON/XML request?"
>
> You still aren't being clear as to what you are really trying to achieve
> here. I mean, just write a shell script that does the curl command, or
> write a Java program or application layer that uses SolrJ to talk to Solr
> and accepts JSON/XML/REST requests.
>
>
> "It seems that the only way to index a file with some metadata is to build
> a request that would look like the following example that uses curl."
>
> Curl is just a fancy way to do an HTTP request. You can do the same HTTP
> request from Java code (or Python or whatever.)
>
>
> "The developer would like to avoid using parameters in the url to pass
> arguments."
>
> Seriously?! What is THAT all about!!  I mean, really, HTTP and URLs and
> URL query parameters are part of the heart of the Internet infrastructure!
>
> If this whole thread is merely that you have an IDIOT who can't cope with
> passing HTTP URL query parameters, all I can say is... Wow!
>
> But use SolrJ and then at least it doesn't LOOK like they are URL Query
> parameters.
>
> Or, maybe this is just a case where the developer WANTS to use SOAP rather
> than a REST style of API.
>
> In any case, please clue us in as to what PROBLEM you are really trying to
> solve. Just use plain English and avoid getting caught up in what the
> solution might be.
>
> The real bottom line is that random application developers should not be
> talking directly to Solr anyway - they should be provided with an
> "application layer" that has a clean, application-oriented REST API and the
> gory details of the Solr API would be hidden inside the application layer.
>
>
> -- Jack Krupansky
>
> -----Original Message----- From: Roland Everaert
> Sent: Tuesday, June 11, 2013 8:48 AM
>
> To: solr-user@lucene.apache.org
> Subject: Re: Adding pdf/word file using JSON/XML
>
> We are working on an application that allows some users to add files (pdf,
> ms word, odt, etc), located on their local hard disk, to our internal
> system and allows other users to search for them. So we are considering
> Solr for the indexing and search functionalities of the system. Along with
> the file content, we want to index some metadata related to the file.
>
> It seems obvious that Solr couldn't import the file from the local disk of
> the user, so the system will have to import the file into a directory that
> Solr can reach and instruct Solr to index the file with the metadata, but
> is it possible to index the file + metadata with a JSON/XML request?
>
> It seems that the only way to index a file with some metadata is to build a
> request that would look like the following example that uses curl. The
> developer would like to avoid using parameters in the url to pass
> arguments.
>
> curl "http://localhost:8080/solr/update/extract?literal.id=doc10&literal.name=BLAH&defaultField=text" \
>   --data-binary @/path/to/file.pdf -H "Content-Type: application/pdf"
>
>
> Additionally, it seems that if a subsequent request is sent to the indexer
> to update the file, if the metadata are not passed to Solr with the
> request, they are deleted.
>
> Thanks for your help,
>
>
>
> Roland.
>
>
> On Mon, Jun 10, 2013 at 4:14 PM, Jack Krupansky <j...@basetechnology.com> wrote:
>
>  Sorry, but you are STILL not being clear!
>>
>> Are you asking if you can pass Solr parameters as XML fields? No.
>>
>> Are you asking if the file name and path can be indexed as metadata? To
>> some degree:
>>
>> curl "http://localhost:8983/solr/update/extract?literal.id=doc-1\
>> &commit=true&uprefix=attr_" -F "HelloWorld.docx=@HelloWorld.docx"
>>
>> Then the stream has a name that is indexed as metadata:
>>
>> <arr name="attr_meta">
>>  <str>stream_source_info</str>
>>  <str>HelloWorld.docx</str>
>>  <str>stream_content_type</str>
>>  <str>application/octet-stream</str>
>>  <str>stream_size</str>
>>  <str>10096</str>
>>  <str>stream_name</str>
>>  <str>HelloWorld.docx</str>
>>  <str>Content-Type</str>
>>  <str>application/vnd.openxmlformats-officedocument.wordprocessingml.document</str>
>> </arr>
>>
>> and
>>
>> <arr name="attr_stream_source_info">
>>  <str>HelloWorld.docx</str>
>> </arr>
>>
>> <arr name="attr_stream_name">
>>  <str>HelloWorld.docx</str>
>> </arr>
>>
>> Or, what is it that you are really trying to do?
>>
>> Simply tell us in plain language what problem you are trying to solve.
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Roland Everaert
>> Sent: Monday, June 10, 2013 9:23 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Adding pdf/word file using JSON/XML
>>
>>
>> Sorry if it was not clear.
>>
>> What I would like to know is how to construct an XML/JSON request that
>> provides any necessary information (supposedly the full path on disk) to
>> Solr to retrieve and index a pdf/ms word document.
>>
>> So, an XML request could look like this:
>>
>> <add>
>> <doc>
>> <field name="id">doc10</field>
>> <field name="name">BLAH</field>
>> <field name="path">/path/to/file.pdf</field>
>> </doc>
>> </add>
>>
>>
>> Regards,
>>
>>
>> Roland.
>>
>>
>> On Mon, Jun 10, 2013 at 3:12 PM, Gora Mohanty <g...@mimirtech.com> wrote:
>>
>>  On 10 June 2013 17:47, Roland Everaert <reveatw...@gmail.com> wrote:
>>
>>> > Hi,
>>> >
>>> > Based on the wiki, below is an example of how I am currently adding a
>>> > pdf file with an extra field called name:
>>> > curl "http://localhost:8080/solr/update/extract?literal.id=doc10&literal.name=BLAH&defaultField=text" \
>>> >   --data-binary @/path/to/file.pdf -H "Content-Type: application/pdf"
>>> >
>>> > Is it possible to add a file + any extra fields using a JSON or XML
>>> > request?
>>>
>>> It is not entirely clear what you are asking. Do you mean
>>> can one do the same as your example above for a PDF
>>> file, but with an XML or JSON file? If so, yes. Please see
>>> the examples in example/exampledocs/ of a Solr source
>>> tree, and
>>> http://wiki.apache.org/solr/ExtractingRequestHandler
>>>
>>> Regards,
>>> Gora
>>>
>>>
>>>
>>
>
