Re: [basex-talk] Can't add raw files explicitly or text index ms office docs

2015-11-26 Thread E. Wray Johnson
Thanks! Today is our Thanksgiving holiday, so I am not working.
I will look at this soon.

Consider a file filter that uses regular expression(s).
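
To illustrate what such a regex-based file filter could look like, here is
a minimal sketch in Python. It is not BaseX code: `filter_files` and its
parameters are hypothetical names, and the pattern is matched against the
path relative to the root folder.

```python
import os
import re


def filter_files(root, pattern):
    """Return the relative paths under root that match the regex pattern.

    Hypothetical helper for illustration only; not part of BaseX.
    """
    rx = re.compile(pattern, re.IGNORECASE)
    matches = []
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            # Normalise to '/'-separated relative paths on every platform.
            rel = os.path.relpath(os.path.join(folder, name), root)
            rel = rel.replace(os.sep, "/")
            if rx.search(rel):
                matches.append(rel)
    return sorted(matches)
```

A call such as `filter_files("/data", r"\.(xml|json)$")` would select XML
and JSON files at any depth, which a plain glob cannot express in one
pattern.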

Wray Johnson
(m) 704-293-9008

> On Nov 26, 2015, at 11:43 AM, "Christian Grün" wrote:
>
> Hi E. Wray,
>
> I have attached a small XQuery example that adds files, archives and
> archive contents to a database. It’s probably not the most efficient
> solution, so feel free to enhance it or ask more questions.
>
> I agree that your use case is an enticing one: We also use BaseX to
> process office files, and Rositsa Shadura wrote an interesting thesis
> on that topic [1]. As Dirk pointed out, it turned out that we didn’t
> want to choose one particular solution, and the XQuery approach is
> currently the most flexible one.
>
> Hope this helps,
> Christian
>
> [1] http://basex.org/about-us/publications


Re: [basex-talk] Can't add raw files explicitly or text index ms office docs

2015-11-25 Thread Dirk Kirsten
Hello,

Which problems did you encounter? This should be solvable with a small
XQuery expression, basically restating what you describe in natural
language in XQuery so that our processor understands it.

I don't think it would make any sense to add such a specific format.
There are simply way too many possible combinations: you want archive
files extracted, others might not. We would end up with a very complex
definition language, and what's the point if we already have a
standardized query language like XQuery, which can achieve the same
thing?

Cheers
Dirk

On 11/25/2015 05:38 PM, E. Wray Johnson wrote:
> Here is what I want to do: For a given folder and all its subfolders
> on my physical drive, mirror its contents, including the contents of
> archives, parsing xml, json, html, text, etc. using their respective
> parsers, skipping invalid files, and adding all other files as raw. I
> want archive files (*.zip, *.docx) to be added as raw; however, I want
> the text inside archive files like docx (ms-word) to be indexed, and
> any files inside the archives that match a filter to be indexed.
>
> Note: It would be nice if there was a single db:add method that
> allowed me to specify a map of filters to parsers with options, where
> all files that do not match a filter (or are invalid) will be
> optionally added as raw.

-- 
Dirk Kirsten, BaseX GmbH, http://basexgmbh.de
|-- Firmensitz: Blarerstrasse 56, 78462 Konstanz
|-- Registergericht Freiburg, HRB: 708285, Geschäftsführer:
|   Dr. Christian Grün, Dr. Alexander Holupirek, Michael Seiferle
`-- Phone: 0049 7531 28 28 676, Fax: 0049 7531 20 05 22



[basex-talk] Can't add raw files explicitly or text index ms office docs

2015-11-25 Thread E. Wray Johnson
Here is what I want to do: For a given folder and all its subfolders on my
physical drive, mirror its contents, including the contents of archives,
parsing xml, json, html, text, etc. using their respective parsers,
skipping invalid files, and adding all other files as raw. I want archive
files (*.zip, *.docx) to be added as raw; however, I want the text inside
archive files like docx (ms-word) to be indexed, and any files inside the
archives that match a filter to be indexed.
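
To make "the text inside archive files like docx" concrete: a .docx file
is a ZIP archive whose main body is stored in `word/document.xml`, with
the visible text in `w:t` elements. The following is a minimal Python
sketch of pulling that text out so it could be indexed. It is not BaseX
code, `docx_text` is a hypothetical name, and headers, footnotes and
nested text boxes are ignored for brevity.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(data: bytes) -> str:
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs; join them back together.
        runs = [t.text or "" for t in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

The archive itself can still be stored unchanged as a raw file; only the
extracted text would go into the text index.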

Note: It would be nice if there was a single db:add method that allowed me
to specify a map of filters to parsers with options, where all files that
do not match a filter (or are invalid) will be optionally added as raw.
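
The requested behaviour can be sketched as a lookup table. This is a
minimal Python illustration of the dispatch logic only, not a proposal
for the actual db:add signature: `PARSERS`, `classify` and `raw_fallback`
are hypothetical names, and the first matching filter wins.

```python
import json
import re
import xml.etree.ElementTree as ET

# Hypothetical filter-to-parser table: (regex filter, kind, parser).
# Each parser takes the file content as bytes.
PARSERS = [
    (re.compile(r"\.xml$", re.I), "xml", ET.fromstring),
    (re.compile(r"\.json$", re.I), "json", json.loads),
    (re.compile(r"\.txt$", re.I), "text", lambda b: b.decode("utf-8")),
]


def classify(path, data, raw_fallback=True):
    """Return (kind, parsed) for one file, given its path and bytes."""
    for rx, kind, parse in PARSERS:
        if rx.search(path):
            try:
                return kind, parse(data)
            except (ValueError, ET.ParseError):
                # Invalid content: handle it like an unmatched file.
                break
    # Unmatched or invalid files are kept as raw, or skipped entirely.
    return ("raw", data) if raw_fallback else ("skip", None)
```

With this table, `classify("bad.xml", b"<r>")` yields a raw entry instead
of an error, and passing `raw_fallback=False` skips such files.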