Re: Pushing a whole set of pdf-files to solr

2013-04-24 Thread sdspieg
I am still struggling with this. I have solr 4.2.1.2013.03.26.08.26.55
installed. So are you telling me that I should somehow install the older
version of that tool that comes with Solr 3.x? Because with the newer
version I get the errors I already mentioned. I suppose I may be an
atypical user, as I am running all of this under Windows and really just
want an easy way to get a whole bunch of files from a local folder
(on my hard drive) into my local Solr instance. Is there really no
easier way of doing this?

-Stephan 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Pushing-a-whole-set-of-pdf-files-to-solr-tp4025256p4058776.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Pushing a whole set of pdf-files to solr

2013-04-24 Thread sdspieg
(Just documenting my experiences.) I stopped and restarted Solr in the Tomcat
web application manager. Everything seems fine:
http://lucene.472066.n3.nabble.com/file/n4058786/4-25-2013_2-38-43_AM.png
And yet I still get that same error message.





Re: Pushing a whole set of pdf-files to solr

2013-04-24 Thread Jack Krupansky
Your errors may simply be due to improperly encoded documents, or to some
encoding that is not supported. Hard to say.


Start with a simple case, then build on success. I think you're just trying 
to do too much all at once. Do one PDF file first, then work up to a 
directory, and only when you've mastered all that successfully, then you can 
try large numbers of unknown documents from somewhere else.


Try a simple PDF, like one you create yourself by outputting from an Office 
app.


And try an Office (MS or Open) file as well.

-- Jack Krupansky




Re: Pushing a whole set of pdf-files to solr

2013-04-10 Thread sdspieg
Jack - I apologize for my ignorance here, but when you keep emphasizing 'new'
- does that mean that there is ANOTHER version of this tool than the one
that is built into solr-4.2.1? 
And on the encoding issue - I thought pdf was platform-agnostic? Or is the
problem on my windows system - i.e. that it extracts the (correctly encoded)
text into Win-1251, which solr then has a problem with? But can't I change
that somewhere then?





Re: Pushing a whole set of pdf-files to solr

2013-04-10 Thread Jack Krupansky

Yes, there is the version that comes with Solr 3.x.

I'm not aware of an encoding issue.

-- Jack Krupansky




Re: Pushing a whole set of pdf-files to solr

2013-04-09 Thread sdspieg
If anybody could still help me out with this, I'd really appreciate it.
Thanks!





Re: Pushing a whole set of pdf-files to solr

2013-04-09 Thread Furkan KAMACI
The Apache Solr 4 Cookbook says:

curl "http://localhost:8983/solr/update/extract?literal.id=1&commit=true" -F "myfile=@cookbook.pdf"

Is that what you want?
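
For anyone who would rather script this than type curl commands, here is a minimal Python sketch of the same request. It assumes a Solr instance at http://localhost:8983/solr with the /update/extract handler (Solr Cell) enabled. The helper names extract_url and post_pdf are made up for illustration. The sketch sends the PDF as the raw request body instead of a multipart form as curl -F does, so treat that part as an assumption to verify against your own setup.

```python
import urllib.parse
import urllib.request


def extract_url(solr_base, doc_id, commit=True):
    """Build the Solr Cell URL; urlencode joins the parameters with '&'."""
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return solr_base.rstrip("/") + "/update/extract?" + urllib.parse.urlencode(params)


def post_pdf(pdf_path, url):
    """POST the PDF bytes as the raw request body and return the HTTP status."""
    with open(pdf_path, "rb") as handle:
        body = handle.read()
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/pdf"}
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Calling post_pdf("cookbook.pdf", extract_url("http://localhost:8983/solr", "1")) would then correspond to the curl command above. Note the '&' between the two query parameters, which is easy to lose when a command is copied out of an e-mail.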




Re: Pushing a whole set of pdf-files to solr

2013-04-09 Thread Jack Krupansky
The newer release of SimplePostTool that ships with Solr 4.x makes it easy to
post PDF files from a directory, including automatically adding the file name
to a field. SolrCell is the underlying API it uses, so you can also call that
directly.


-- Jack Krupansky





Re: Pushing a whole set of pdf-files to solr

2013-04-09 Thread sdspieg
Thanks for those replies. I will look into them. But if anyone knows of a
site that describes step by step how a Windows user who has already
installed Solr (and Tomcat) can easily feed a folder (and subfolders) with
100s of pdfs into Solr, or would be willing to write down those steps,
I would really appreciate the reference. And I bet you there are lots of
people like me...





Re: Pushing a whole set of pdf-files to solr

2013-04-09 Thread sdspieg
I am able to run the java -jar post.jar -help command which I found here:
http://docs.lucidworks.com/display/solr/Running+Solr. But now how can I tell
post to post all pdf files in a certain folder (preferably recursively) to a
collection? Could anybody please post the exact command for that? 





Re: Pushing a whole set of pdf-files to solr

2013-04-09 Thread Gora Mohanty
On 10 April 2013 07:28, sdspieg sdsp...@mail.ru wrote:
> I am able to run the java -jar post.jar -help command which I found here:
> http://docs.lucidworks.com/display/solr/Running+Solr. But now how can I tell
> post to post all pdf files in a certain folder (preferably recursively) to a
> collection? Could anybody please post the exact command for that?
[...]

There are two options:
* I am not familiar with Microsoft Windows, but writing some kind of a batch
  script that recurses down a directory, and posts files to Solr should be easy.
* One could use the Solr DataImportHandler with FileDataSource to handle
   the filesystem traversal, and TikaEntityProcessor to handle the indexing of
   rich content. Please see:
   http://wiki.apache.org/solr/DataImportHandler
   http://wiki.apache.org/solr/TikaEntityProcessor
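
The first option does not have to be a Windows batch file; a short Python script would behave the same on any platform. The sketch below covers only the traversal step: it walks a folder recursively and yields each PDF together with an id derived from its path relative to the starting folder. The name find_pdfs and the id scheme are illustrative assumptions, and the actual upload to Solr is left to the caller.

```python
import os


def find_pdfs(root):
    """Yield (doc_id, full_path) for every .pdf under root, recursively."""
    for folder, _subfolders, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(".pdf"):
                full_path = os.path.join(folder, name)
                # Use the path relative to root, with forward slashes, as a
                # stable document id (an arbitrary choice for this sketch).
                relative = os.path.relpath(full_path, root)
                yield relative.replace(os.sep, "/"), full_path
```

Each yielded pair could then be handed to whatever posting routine is in use, for example one request per file to /update/extract with literal.id set to the id.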

Regards,
Gora


Re: Pushing a whole set of pdf-files to solr

2013-04-09 Thread sdspieg
Another progress report. I 'flattened' all the folders which contained the
pdf files with Fileboss and then moved the pdf files to the directory where
I found the post.jar file (in solr-4.2.1\solr-4.2.1\example\exampledocs). I
then ran java -Ddata=files -jar post.jar *.pdf and in the command window
it seemed to be working fine (these are just academic articles in pdf-format
that I downloaded with Zotero from EBSCO):

04/10/2013  12:20 AM       159,224 Vorontsov - 2012 - The Korea- Russia Gas Pipeline Project Past, Pres.pdf
04/10/2013  12:12 AM     3,885,056 Walker - 2012 - Asia competes for energy security.pdf
04/10/2013  12:45 AM        66,195 Whitmill - 2012 - Is UK Energy Policy Driving Energy Innovation - or.pdf
04/10/2013  12:29 AM     2,208,367 Wietfeld - 2011 - Understanding Middle East Gas Exporting Behavior.pdf
04/10/2013  12:59 AM     3,011,185 Wiseman - 2011 - Expanding Regional Renewable Governance.pdf
04/10/2013  12:38 AM       180,692 Woudhuysen - 2012 - Innovation in Energy Expressions of a Crisis, and.pdf
04/10/2013  12:49 AM       229,991 Yergin - 2012 - How Is Energy Remaking the World.pdf
04/10/2013  12:40 AM     3,397,328 Young - 2012 - Industrial Gases. (cover story).pdf
04/10/2013  01:36 AM        73,125 Zimmerer - 2011 - New Geographies of Energy Introduction to the Spe.pdf

... and so on, altogether some 300 articles.

But then when I looked in solr, I saw the following:
04:34:41 SEVERE SolrCore org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0xe3 (at char #10, byte #-1)
04:34:41 SEVERE SolrCore org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0xe3 (at char #10, byte #-1)

... and a lot more of those.

I'd like to think I made SOME progress, but it also seems like I'm still not
close to being there. Any suggestions from the experts here on what I am
doing wrong? 

Thanks!

-Stephan





Re: Pushing a whole set of pdf-files to solr

2013-04-09 Thread Gora Mohanty
On 10 April 2013 08:11, sdspieg sdsp...@mail.ru wrote:
> Another progress report. I 'flattened' all the folders which contained the
> pdf files with Fileboss and then moved the pdf files to the directory where
> I found the post.jar file (in solr-4.2.1\solr-4.2.1\example\exampledocs). I
> then ran java -Ddata=files -jar post.jar *.pdf and in the command window
> it seemed to be working fine (these are just academic articles in pdf-format
> that I downloaded with Zotero from EBSCO):
[...]

If it works, great, but it is not generally advisable to have a large number
of files under one directory. However, that is not the source of your error
here.

> But then when I looked in solr, I saw the following:
> 04:34:41
> SEVERE
> SolrCore
> org.apache.solr.common.SolrException: Invalid UTF-8 middle byte 0xe3 (at
> char #10, byte #-1)
[...]

Your files seem to have some encoding other than UTF-8: My random
guess would be Windows-1252. You need to convert the files to UTF-8.
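
One way to sanity-check this diagnosis is to look at where the error occurs: byte 0xe3 at character 10. Many PDF files begin with a version line such as %PDF-1.4 followed by a comment holding the bytes E2 E3 CF D3, which puts 0xE2 at offset 10 and 0xE3 directly after it. If these files start that way (an assumption, since the files themselves are not shown in this thread), the message might instead mean that the raw PDF bytes were being read as UTF-8 text. The Python sketch below reproduces the decoding failure on such a header; the sample bytes and the function name are illustrative only.

```python
# Typical opening bytes of a PDF: a version line, then a comment holding four
# bytes above 0x7F that mark the file as binary. This sample is an assumed
# example, not taken from the files discussed in the thread.
SAMPLE_HEADER = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"


def first_invalid_utf8_offset(data):
    """Return the offset where UTF-8 decoding fails, or None if it succeeds."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as error:
        return error.start
    return None
```

For the sample header the decoder stops at offset 10: 0xE2 announces a multi-byte sequence, and the next byte, 0xE3, is not a valid continuation byte. If that is what happened here, the thing to try would be sending the files to the extracting handler instead of the plain update handler, not re-encoding the PDFs.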

Regards,
Gora


Re: Pushing a whole set of pdf-files to solr

2013-04-09 Thread Jack Krupansky
The newer SimplePostTool can in fact recurse a directory of PDFs. Just get 
the usage for the tool. I'm sure it lists the command options.
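
As a rough illustration, the Python sketch below assembles such a command line. The -Dauto and -Drecursive system properties are written from memory of the Solr 4.x post.jar usage text, so treat them as assumptions and confirm them against the output of java -jar post.jar -help for your version. build_post_command and run_post are hypothetical helper names.

```python
import subprocess


def build_post_command(folder, post_jar="post.jar", recursive=True):
    """Return the argument list for running SimplePostTool on a folder."""
    command = ["java", "-Dauto"]  # assumed flag: guess content type per file
    if recursive:
        command.append("-Drecursive")  # assumed flag: descend into subfolders
    command += ["-jar", post_jar, folder]
    return command


def run_post(folder, post_jar="post.jar"):
    """Run the tool and return its exit code (requires java on the PATH)."""
    return subprocess.call(build_post_command(folder, post_jar))
```

Building the argument list separately from running it makes it easy to print the command first and check it by hand before posting a few hundred files.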


-- Jack Krupansky
