[jira] [Commented] (SOLR-6907) URLEncode documents directory in MorphlineMapperTest

2015-01-03 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263599#comment-14263599
 ] 

wolfgang hoschek commented on SOLR-6907:


+1 Looks reasonable to me.

 URLEncode documents directory in MorphlineMapperTest
 

 Key: SOLR-6907
 URL: https://issues.apache.org/jira/browse/SOLR-6907
 Project: Solr
  Issue Type: Bug
  Components: contrib - MapReduce, Tests
Reporter: Ramkumar Aiyengar
Priority: Minor

 Currently the test fails if the source is checked out into a directory whose 
 path contains, say, spaces.
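
For illustration, a minimal sketch of the kind of decoding the test needs (a hypothetical helper, not the actual MorphlineMapperTest code): a classpath URL such as .../my%20checkout/test-documents must be URL-decoded before it is used as a file path.

{code}
import java.io.File;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class DocumentsDirResolver {
  // Resolve a test-documents directory from the classpath; paths with spaces
  // arrive percent-encoded (e.g. "/my%20checkout/...") and must be decoded.
  public static File resolve(Class<?> clazz, String resource) throws Exception {
    String encoded = clazz.getResource(resource).getPath();
    return new File(URLDecoder.decode(encoded, StandardCharsets.UTF_8.name()));
  }
}
{code}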






[jira] [Commented] (SOLR-4509) Disable HttpClient stale check for performance and fewer spurious connection errors.

2014-11-25 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-4509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224815#comment-14224815
 ] 

wolfgang hoschek commented on SOLR-4509:


Would be good to remove that stale check also in solrj.
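
For reference, a hedged sketch of what turning the stale check off looks like on the HttpClient 4.3-era API that SolrJ builds on (not the actual SOLR-4509 patch; the builder calls are standard HttpClient, the class name is illustrative):

{code}
import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class NoStaleCheckClientFactory {
  // Build an HttpClient that skips the per-request stale-connection probe.
  public static CloseableHttpClient newClient() {
    RequestConfig config = RequestConfig.custom()
        .setStaleConnectionCheckEnabled(false)
        .build();
    return HttpClients.custom().setDefaultRequestConfig(config).build();
  }
}
{code}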

 Disable HttpClient stale check for performance and fewer spurious connection 
 errors.
 

 Key: SOLR-4509
 URL: https://issues.apache.org/jira/browse/SOLR-4509
 Project: Solr
  Issue Type: Improvement
  Components: search
 Environment: 5 node SmartOS cluster (all nodes living in same global 
 zone - i.e. same physical machine)
Reporter: Ryan Zezeski
Assignee: Mark Miller
Priority: Minor
 Fix For: 5.0, Trunk

 Attachments: IsStaleTime.java, SOLR-4509-4_4_0.patch, 
 SOLR-4509.patch, SOLR-4509.patch, SOLR-4509.patch, SOLR-4509.patch, 
 baremetal-stale-nostale-med-latency.dat, 
 baremetal-stale-nostale-med-latency.svg, 
 baremetal-stale-nostale-throughput.dat, baremetal-stale-nostale-throughput.svg


 By disabling the Apache HttpClient stale check I've witnessed a 2-4x 
 increase in throughput and a latency reduction of over 100 ms.  This patch was made in 
 the context of a project I'm leading, called Yokozuna, which relies on 
 distributed search.
 Here's the patch on Yokozuna: https://github.com/rzezeski/yokozuna/pull/26
 Here's a write-up I did on my findings: 
 http://www.zinascii.com/2013/solr-distributed-search-and-the-stale-check.html
 I'm happy to answer any questions or make changes to the patch to make it 
 acceptable.
 ReviewBoard: https://reviews.apache.org/r/28393/






[jira] [Commented] (SOLR-6212) upgrade Saxon-HE to 9.5.1-5 and reinstate Morphline tests that were affected under java 8/9 with 9.5.1-4

2014-06-29 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047223#comment-14047223
 ] 

wolfgang hoschek commented on SOLR-6212:


This is already fixed in the latest stable morphline release per 
http://kitesdk.org/docs/current/release_notes.html

 upgrade Saxon-HE to 9.5.1-5 and reinstate Morphline tests that were affected 
 under java 8/9 with 9.5.1-4
 

 Key: SOLR-6212
 URL: https://issues.apache.org/jira/browse/SOLR-6212
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.7, 5.0
Reporter: Michael Dodsworth
Assignee: Mark Miller
Priority: Minor

 From SOLR-1301:
 For posterity, there is a thread on the dev list where we are working 
 through an issue with Saxon on java 8 and ibm's j9. Wolfgang filed 
 https://saxonica.plan.io/issues/1944 upstream. (Saxon is pulled in via 
 cdk-morphlines-saxon).
 Due to this issue, several Morphline tests were made to be 'ignored' in java 
 8+. The Saxon issue has been fixed in 9.5.1-5, so we should upgrade and 
 reinstate those tests.






[jira] [Commented] (SOLR-5109) Solr 4.4 will not deploy in Glassfish 4.x

2014-06-29 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047391#comment-14047391
 ] 

wolfgang hoschek commented on SOLR-5109:


FWIW, morphlines currently won't work with guava-16 or guava-17 because of 
incompatible API changes to Guava's Closeables class in those two releases. 
However, a fix for this will ship soon in kite-morphlines 0.15.0. See 
https://github.com/kite-sdk/kite/commit/0ab2795872e4e5721f477d79e5049371a17ab8db
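
For illustration, the API change in question and a shim that works on both sides of it (the Guava calls are real; the wrapper class is a hypothetical sketch):

{code}
// Guava <= 15: Closeables.closeQuietly(c);   // removed in Guava 16
// Guava >= 16: Closeables.close(c, true);    // swallows the IOException
import com.google.common.io.Closeables;
import java.io.Closeable;
import java.io.IOException;

public final class QuietCloser {
  public static void closeQuietly(Closeable c) {
    try {
      Closeables.close(c, true); // available in both old and new Guava releases
    } catch (IOException swallowed) {
      // not reached: with swallowIOException=true the exception is logged, not thrown
    }
  }
}
{code}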

 Solr 4.4 will not deploy in Glassfish 4.x
 -

 Key: SOLR-5109
 URL: https://issues.apache.org/jira/browse/SOLR-5109
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.4
 Environment: Glassfish 4.x
Reporter: jamon camisso
Priority: Blocker
  Labels: guava
 Attachments: LUCENE-5109.patch, guava-15.0-SNAPSHOT.jar


 The bundled Guava 14.0.1 JAR blocks deploying Solr 4.4 in Glassfish 4.x.
 This failure is a known issue with upstream Guava and is described here:
 https://code.google.com/p/guava-libraries/issues/detail?id=1433
 Building Guava guava-15.0-SNAPSHOT.jar from master and bundling it in Solr 
 allows for a successful deployment.
 Until the Guava developers release version 15 using their HEAD or even an RC 
 tag seems like the only way to resolve this.
 This is frustrating since it was proposed that Guava be removed as a 
 dependency before Solr 4.0 was released and yet it remains and blocks 
 upgrading: https://issues.apache.org/jira/browse/SOLR-3601






[jira] [Comment Edited] (SOLR-5109) Solr 4.4 will not deploy in Glassfish 4.x

2014-06-29 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047394#comment-14047394
 ] 

wolfgang hoschek edited comment on SOLR-5109 at 6/30/14 5:36 AM:
-

Another potential issue is that Hadoop ships guava-11.0.2 on the classpath 
of the task tracker (the JVM that runs the job), so this old Guava version will 
clash with any other Guava version that happens to be on the classpath.


was (Author: whoschek):
Another potential issue is that hadoop ships with guava-12.0.1 on the classpath 
of the task tracker (the JVM that runs the job). So this old guava version will 
race with any other guava version that happens to be on the classpath.

 Solr 4.4 will not deploy in Glassfish 4.x
 -

 Key: SOLR-5109
 URL: https://issues.apache.org/jira/browse/SOLR-5109
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.4
 Environment: Glassfish 4.x
Reporter: jamon camisso
Priority: Blocker
  Labels: guava
 Attachments: LUCENE-5109.patch, guava-15.0-SNAPSHOT.jar


 The bundled Guava 14.0.1 JAR blocks deploying Solr 4.4 in Glassfish 4.x.
 This failure is a known issue with upstream Guava and is described here:
 https://code.google.com/p/guava-libraries/issues/detail?id=1433
 Building Guava guava-15.0-SNAPSHOT.jar from master and bundling it in Solr 
 allows for a successful deployment.
 Until the Guava developers release version 15 using their HEAD or even an RC 
 tag seems like the only way to resolve this.
 This is frustrating since it was proposed that Guava be removed as a 
 dependency before Solr 4.0 was released and yet it remains and blocks 
 upgrading: https://issues.apache.org/jira/browse/SOLR-3601






[jira] [Commented] (SOLR-5109) Solr 4.4 will not deploy in Glassfish 4.x

2014-06-29 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047394#comment-14047394
 ] 

wolfgang hoschek commented on SOLR-5109:


Another potential issue is that Hadoop ships guava-12.0.1 on the classpath 
of the task tracker (the JVM that runs the job), so this old Guava version will 
clash with any other Guava version that happens to be on the classpath.
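
A quick diagnostic sketch for this situation (the class name is illustrative; the reflection calls are standard JDK): print which Guava jar the task JVM actually loaded, which shows whether Hadoop's bundled Guava won out.

{code}
import com.google.common.io.Closeables;

public class GuavaVersionProbe {
  public static void main(String[] args) {
    // Prints the jar that the Closeables class was loaded from, e.g.
    // .../hadoop/lib/guava-11.0.2.jar versus the job's own Guava jar.
    System.out.println(
        Closeables.class.getProtectionDomain().getCodeSource().getLocation());
  }
}
{code}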

 Solr 4.4 will not deploy in Glassfish 4.x
 -

 Key: SOLR-5109
 URL: https://issues.apache.org/jira/browse/SOLR-5109
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.4
 Environment: Glassfish 4.x
Reporter: jamon camisso
Priority: Blocker
  Labels: guava
 Attachments: LUCENE-5109.patch, guava-15.0-SNAPSHOT.jar


 The bundled Guava 14.0.1 JAR blocks deploying Solr 4.4 in Glassfish 4.x.
 This failure is a known issue with upstream Guava and is described here:
 https://code.google.com/p/guava-libraries/issues/detail?id=1433
 Building Guava guava-15.0-SNAPSHOT.jar from master and bundling it in Solr 
 allows for a successful deployment.
 Until the Guava developers release version 15 using their HEAD or even an RC 
 tag seems like the only way to resolve this.
 This is frustrating since it was proposed that Guava be removed as a 
 dependency before Solr 4.0 was released and yet it remains and blocks 
 upgrading: https://issues.apache.org/jira/browse/SOLR-3601






Re: Adding Morphline support to DIH - worth the effort?

2014-06-11 Thread Wolfgang Hoschek
From our perspective we don’t really see use cases for DIH anymore.

Morphlines was developed primarily with Lucene in mind (even though it doesn’t 
require Lucene).

Flume Morphline Solr Sink handles streaming ingestion into Solr in reliable, 
scalable, flexible and loosely coupled ways, in separate processes. Neither 
Flume nor Morphlines requires Hadoop.

MapReduceIndexerTool uses Morphlines for reliable, scalable and flexible batch 
ingestion on Hadoop.

On Hadoop, even the JDBC/SQL portion of DIH now seems mostly covered by a 
combination of Sqoop and MapReduceIndexerTool, and perhaps a bit of Hive.

I’m not sure what the use cases for DIH still are these days.

(I wrote most of the Morphlines framework, Flume Morphline Solr Sink, 
MapReduceIndexerTool and the hbase-indexer-morphline integration.)

Just my 0.02c,
Wolfgang.

On Jun 11, 2014, at 1:05 PM, Dyer, James james.d...@ingramcontent.com wrote:

 Mikhail,
  
 It would be nice if the DIH could be run separately from Solr (SOLR-853 and 
 others).  I think a lot of us have already expressed support for this, and at 
 one time I was looking into what it would take to complete.  Then again, 
 having watched the solr morphline sink be created for Flume, I realized there 
 are other teams out there possibly building an awesome DIH killer.  If that 
 happens, then we just saved ourselves a boatload of work, right?  I think if 
 someone out there can create a nice POC that uses a different tool, that 
 would be a great first step.
  
 But there is also SOLR-3671 which was just committed as a follow-on to 
 SOLR-2382.  This makes DIH able to send documents to places other than Solr.  
 Turns out someone here is using DIH to import to Mongo.  (See SOLR-5981 for 
 details).  So we already have one side of the functionality to generalize DIH.
  
 James Dyer
 Ingram Content Group
 (615) 213-4311
  
 From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com] 
 Sent: Wednesday, June 11, 2014 11:56 AM
 To: dev@lucene.apache.org
 Subject: Re: Adding Morphline support to DIH - worth the effort?
  
 James,
 Don't you think that spawning DIH 2.0 as a separate war is a priority?
  
 
 On Wed, Jun 11, 2014 at 6:39 PM, Dyer, James james.d...@ingramcontent.com 
 wrote:
 Alexandre,
 
 I think that writing a new entity processor for DIH is a much less risky 
 thing to commit than, say, SOLR-4799.  Entity Processors work as plug-ins and 
 they aren't likely to break anything else.  So a Morphline EntityProcessor is 
 much more likely to be evaluated and committed.
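 
 For a sense of the plug-in shape being discussed, a hedged sketch of a DIH entity processor (the EntityProcessorBase base class and the nextRow() contract are real DIH API; the morphline wiring is omitted and the emitted row is purely illustrative):
 
{code}
import java.util.HashMap;
import java.util.Map;
import org.apache.solr.handler.dataimport.EntityProcessorBase;

public class MorphlineEntityProcessor extends EntityProcessorBase {
  private boolean done = false;

  @Override
  public Map<String, Object> nextRow() {
    if (done) {
      return null; // null signals the end of this entity's rows
    }
    done = true;
    Map<String, Object> row = new HashMap<>();
    row.put("id", "example-1"); // a real implementation would emit morphline output here
    row.put("text", "row produced by a morphline pipeline (illustrative)");
    return row;
  }
}
{code}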
 
 But like anything else, you're going to need to explain what the need is and 
  what this new e.p. buys the user community.  There need to be unit tests, 
  etc.
 
 Besides this, if you can show how a morphline e.p. can be a step towards 
 migrating away from DIH entirely, then that would be a plus.  Perhaps create 
 a new solr example along the lines of the dih solr example that demonstrates 
 to users this new way forward.  This would go a long way in convincing the 
 community we have a viable alternative to dih.
 
 James Dyer
 Ingram Content Group
 (615) 213-4311
 
 
 -Original Message-
 From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
 Sent: Tuesday, June 10, 2014 9:55 PM
 To: dev@lucene.apache.org
 Subject: Re: Adding Morphline support to DIH - worth the effort?
 
 Ripples in the pond again. Spreading and dying. Understandable, but
 still somewhat annoying.
 
 So, what would be the minimal viable next step to move this
 conversation forward? Something for 4.11 as opposed to 5.0?
 
 Does anyone with commit status have a feeling for what - minimal -
 deliverable they would put their own weight behind?
 
 Regards,
Alex.
 Personal website: http://www.outerthoughts.com/
 Current project: http://www.solr-start.com/ - Accelerating your Solr 
 proficiency
 
 
 On Mon, Jun 9, 2014 at 10:50 AM, david.w.smi...@gmail.com
 david.w.smi...@gmail.com wrote:
  One of the ideas over DIH discussed earlier is making it standalone.
 
  Yeah; my beef with the DIH is that it’s tied to Solr.  But I’d rather see
  something other than the DIH outside Solr; it’s not worthy IMO.  Why have
  something Solr specific even?  A great pipeline shouldn’t tie itself to any
  end-point.  There are a variety of solutions out there that I tried.  There
  are the big 3 open-source ETLs (Kettle, Clover, Talend) and they aren't
  quite ideal in one way or another.  And Spring-Integration.  And some
  half-baked data pipelines like OpenPipe & Open Pipeline.  I never got around
  to taking a good look at Findwise’s open-sourced Hydra but I learned enough
  to know to my surprise it was configured in code versus a config file (like
  all the others) and that's a big turn-off to me.  Today I read through most
  of the Morphlines docs and a few choice source files and I’m
  super-impressed.  But as you note it’s missing a lot of other stuff.  I
  think something great could be built using it as a core piece.
 
  ~ David Smiley
  

[jira] [Commented] (SOLR-6126) MapReduce's GoLive script should support replicas

2014-06-02 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015266#comment-14015266
 ] 

wolfgang hoschek commented on SOLR-6126:


[~dsmiley] It uses the --zk-host CLI option to fetch the Solr URLs of each 
replica from zk - see extractShardUrls(). This info gets passed via the 
Options.shardUrls parameter into the go-live phase. In the go-live phase the 
segments of each shard are explicitly merged via a separate REST merge request 
per replica into the corresponding replica. The result is that each input 
segment is explicitly merged N times where N is the replication factor. Each 
such merge reads from HDFS and writes to HDFS.

(BTW, I'll be unreachable on a transatlantic flight very soon.)
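
For illustration, a minimal sketch of the kind of per-replica merge request the go-live phase issues (the MERGEINDEXES core admin action and its core/indexDir parameters are standard Solr; the class and method names are hypothetical, not the actual go-live code):

{code}
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class GoLiveMergeSketch {
  // Ask one replica's core to merge in the index directory produced for its shard.
  // replicaBaseUrl is the Solr webapp root, e.g. http://host:8983/solr
  public static void mergeInto(String replicaBaseUrl, String coreName, String hdfsIndexDir)
      throws Exception {
    String url = replicaBaseUrl + "/admin/cores?action=mergeindexes"
        + "&core=" + URLEncoder.encode(coreName, "UTF-8")
        + "&indexDir=" + URLEncoder.encode(hdfsIndexDir, "UTF-8");
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    try (InputStream in = conn.getInputStream()) {
      System.out.println(coreName + " <- " + hdfsIndexDir + " : HTTP " + conn.getResponseCode());
    }
  }
}
{code}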

 MapReduce's GoLive script should support replicas
 -

 Key: SOLR-6126
 URL: https://issues.apache.org/jira/browse/SOLR-6126
 Project: Solr
  Issue Type: Improvement
  Components: contrib - MapReduce
Reporter: David Smiley

 The GoLive feature of the MapReduce contrib module is pretty cool.  But a 
 comment in there indicates that it doesn't support replicas.  Every 
 production SolrCloud setup I've seen has had replicas!
 I wonder what is needed to support this.  For GoLive to work, it assumes a 
 shared file system (be it HDFS or whatever, like a SAN).  If perhaps the 
 replicas in such a system read from the very same network disk location, then 
 all we'd need to do is send a commit() to replicas; right?  






[jira] [Commented] (SOLR-6126) MapReduce's GoLive script should support replicas

2014-06-01 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-6126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015092#comment-14015092
 ] 

wolfgang hoschek commented on SOLR-6126:


The comment in the code is a bit outdated. The code does actually support 
replicas.

 MapReduce's GoLive script should support replicas
 -

 Key: SOLR-6126
 URL: https://issues.apache.org/jira/browse/SOLR-6126
 Project: Solr
  Issue Type: Improvement
  Components: contrib - MapReduce
Reporter: David Smiley

 The GoLive feature of the MapReduce contrib module is pretty cool.  But a 
 comment in there indicates that it doesn't support replicas.  Every 
 production SolrCloud setup I've seen has had replicas!
 I wonder what is needed to support this.  For GoLive to work, it assumes a 
 shared file system (be it HDFS or whatever, like a SAN).  If perhaps the 
 replicas in such a system read from the very same network disk location, then 
 all we'd need to do is send a commit() to replicas; right?  






[jira] [Commented] (SOLR-5848) Morphlines is not resolving

2014-03-12 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932328#comment-13932328
 ] 

wolfgang hoschek commented on SOLR-5848:


Going forward I'd recommend upgrading to version 0.12.0 rather than dealing 
with 0.11.0: 0.12.0 is compatible and brings some nice performance 
improvements and a couple of new features - see 
http://kitesdk.org/docs/current/release_notes.html

 Morphlines is not resolving
 ---

 Key: SOLR-5848
 URL: https://issues.apache.org/jira/browse/SOLR-5848
 Project: Solr
  Issue Type: Bug
Reporter: Dawid Weiss
Assignee: Mark Miller
Priority: Critical
 Fix For: 4.8, 5.0


 This version of morphlines does not resolve for me and Grant.
 {code}
 ::
 ::  UNRESOLVED DEPENDENCIES ::
 ::
 :: org.kitesdk#kite-morphlines-saxon;0.11.0: not found
 :: org.kitesdk#kite-morphlines-hadoop-sequencefile;0.11.0: not found
 {code}
 Has this been deleted from Cloudera's repositories or something? This would 
 be pretty bad -- maven repos should be immutable...






[jira] [Commented] (SOLR-5848) Morphlines is not resolving

2014-03-12 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932378#comment-13932378
 ] 

wolfgang hoschek commented on SOLR-5848:


Sounds good. Thx!

 Morphlines is not resolving
 ---

 Key: SOLR-5848
 URL: https://issues.apache.org/jira/browse/SOLR-5848
 Project: Solr
  Issue Type: Bug
Reporter: Dawid Weiss
Assignee: Mark Miller
Priority: Critical
 Fix For: 4.8, 5.0


 This version of morphlines does not resolve for me and Grant.
 {code}
 ::
 ::  UNRESOLVED DEPENDENCIES ::
 ::
 :: org.kitesdk#kite-morphlines-saxon;0.11.0: not found
 :: org.kitesdk#kite-morphlines-hadoop-sequencefile;0.11.0: not found
 {code}
 Has this been deleted from Cloudera's repositories or something? This would 
 be pretty bad -- maven repos should be immutable...






[jira] [Created] (SOLR-5786) MapReduceIndexerTool --help text is missing large parts of the help text

2014-02-27 Thread wolfgang hoschek (JIRA)
wolfgang hoschek created SOLR-5786:
--

 Summary: MapReduceIndexerTool --help text is missing large parts 
of the help text
 Key: SOLR-5786
 URL: https://issues.apache.org/jira/browse/SOLR-5786
 Project: Solr
  Issue Type: Bug
  Components: contrib - MapReduce
Affects Versions: 4.7
Reporter: wolfgang hoschek
Assignee: Mark Miller
 Fix For: 4.8


As already mentioned repeatedly and at length, this is a regression introduced 
by the fix in https://issues.apache.org/jira/browse/SOLR-5605

Here is the diff of --help output before SOLR-5605 vs after SOLR-5605:

{code}
130,235c130
  lucene  segments  left  in   this  index.  Merging
  segments involves reading  and  rewriting all data
  in all these  segment  files, potentially multiple
  times,  which  is  very  I/O  intensive  and  time
  consuming. However, an  index  with fewer segments
  can later be merged  faster,  and  it can later be
  queried  faster  once  deployed  to  a  live  Solr
  serving shard. Set  maxSegments  to  1 to optimize
  the index for low query  latency. In a nutshell, a
  small maxSegments  value  trades  indexing latency
  for subsequently improved query  latency. This can
  be  a  reasonable  trade-off  for  batch  indexing
  systems. (default: 1)
   --fair-scheduler-pool STRING
  Optional tuning knob  that  indicates  the name of
  the fair scheduler  pool  to  submit  jobs to. The
  Fair Scheduler is a  pluggable MapReduce scheduler
  that provides a way to  share large clusters. Fair
  scheduling is a method  of  assigning resources to
  jobs such that all jobs  get, on average, an equal
  share of resources  over  time.  When  there  is a
  single job  running,  that  job  uses  the  entire
  cluster. When  other  jobs  are  submitted,  tasks
  slots that free up are  assigned  to the new jobs,
  so that each job gets  roughly  the same amount of
  CPU time.  Unlike  the  default  Hadoop scheduler,
  which forms a queue of  jobs, this lets short jobs
  finish in reasonable time  while not starving long
  jobs. It is also an  easy  way  to share a cluster
  between multiple of users.  Fair  sharing can also
  work with  job  priorities  -  the  priorities are
  used as  weights  to  determine  the  fraction  of
  total compute time that each job gets.
   --dry-run  Run in local mode  and  print  documents to stdout
  instead of loading them  into  Solr. This executes
  the  morphline  in  the  client  process  (without
  submitting a job  to  MR)  for  quicker turnaround
  during early  trial & debug  sessions. (default:
  false)
   --log4j FILE   Relative or absolute  path  to  a log4j.properties
  config file on the  local  file  system. This file
  will  be  uploaded  to   each  MR  task.  Example:
  /path/to/log4j.properties
   --verbose, -v  Turn on verbose output. (default: false)
   --show-non-solr-cloud  Also show options for  Non-SolrCloud  mode as part
  of --help. (default: false)
 
 Required arguments:
   --output-dir HDFS_URI  HDFS directory to  write  Solr  indexes to. Inside
  there one  output  directory  per  shard  will  be
  generated.Example: hdfs://c2202.mycompany.
  com/user/$USER/test
   --morphline-file FILE  Relative or absolute path  to  a local config file
  that contains one  or  more  morphlines.  The file
  must be  UTF-8  encoded.  Example:
  /path/to/morphline.conf
 
 Cluster arguments:
   Arguments that provide information about your Solr cluster. 
 
   --zk-host STRING   The address of a ZooKeeper  ensemble being used by
  a SolrCloud cluster. This  ZooKeeper ensemble will
  be examined  to  determine  the  number  of output

[jira] [Commented] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest

2014-02-27 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13914549#comment-13914549
 ] 

wolfgang hoschek commented on SOLR-5605:


Correspondingly, I filed https://issues.apache.org/jira/browse/SOLR-5786

Look, as you know, I wrote almost all of the original solr-mapreduce contrib, 
and I know this code inside out. To be honest, this kind of repetitive 
ignorance is tiresome at best and completely turns me off.

 MapReduceIndexerTool fails in some locales -- seen in random failures of 
 MapReduceIndexerToolArgumentParserTest
 ---

 Key: SOLR-5605
 URL: https://issues.apache.org/jira/browse/SOLR-5605
 Project: Solr
  Issue Type: Bug
Reporter: Hoss Man
Assignee: Mark Miller
 Fix For: 4.7, 5.0


 I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest 
 which is reproducible with any seed -- all that matters is the locale.
 The problem sounded familiar, and a quick search verified that jenkins has in 
 fact hit this a couple of times in the past -- Uwe commented on the list that 
 this is due to a real problem in one of the third-party dependencies (that 
 does the argument parsing) that will affect usage on some systems.
 If working around the bug in the arg parsing lib isn't feasible, 
 MapReduceIndexerTool should fail cleanly if the locale isn't one we know is 
 supported






[jira] [Updated] (SOLR-5786) MapReduceIndexerTool --help output is missing large parts of the help text

2014-02-27 Thread wolfgang hoschek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wolfgang hoschek updated SOLR-5786:
---

Summary: MapReduceIndexerTool --help output is missing large parts of the 
help text  (was: MapReduceIndexerTool --help text is missing large parts of the 
help text)

 MapReduceIndexerTool --help output is missing large parts of the help text
 --

 Key: SOLR-5786
 URL: https://issues.apache.org/jira/browse/SOLR-5786
 Project: Solr
  Issue Type: Bug
  Components: contrib - MapReduce
Affects Versions: 4.7
Reporter: wolfgang hoschek
Assignee: Mark Miller
 Fix For: 4.8


 As already mentioned repeatedly and at length, this is a regression 
 introduced by the fix in https://issues.apache.org/jira/browse/SOLR-5605
 Here is the diff of --help output before SOLR-5605 vs after SOLR-5605:
 {code}
 130,235c130
   lucene  segments  left  in   this  index.  Merging
   segments involves reading  and  rewriting all data
   in all these  segment  files, potentially multiple
   times,  which  is  very  I/O  intensive  and  time
   consuming. However, an  index  with fewer segments
   can later be merged  faster,  and  it can later be
   queried  faster  once  deployed  to  a  live  Solr
   serving shard. Set  maxSegments  to  1 to optimize
   the index for low query  latency. In a nutshell, a
   small maxSegments  value  trades  indexing latency
   for subsequently improved query  latency. This can
   be  a  reasonable  trade-off  for  batch  indexing
   systems. (default: 1)
--fair-scheduler-pool STRING
   Optional tuning knob  that  indicates  the name of
   the fair scheduler  pool  to  submit  jobs to. The
   Fair Scheduler is a  pluggable MapReduce scheduler
   that provides a way to  share large clusters. Fair
   scheduling is a method  of  assigning resources to
   jobs such that all jobs  get, on average, an equal
   share of resources  over  time.  When  there  is a
   single job  running,  that  job  uses  the  entire
   cluster. When  other  jobs  are  submitted,  tasks
   slots that free up are  assigned  to the new jobs,
   so that each job gets  roughly  the same amount of
   CPU time.  Unlike  the  default  Hadoop scheduler,
   which forms a queue of  jobs, this lets short jobs
   finish in reasonable time  while not starving long
   jobs. It is also an  easy  way  to share a cluster
   between multiple of users.  Fair  sharing can also
   work with  job  priorities  -  the  priorities are
   used as  weights  to  determine  the  fraction  of
   total compute time that each job gets.
--dry-run  Run in local mode  and  print  documents to stdout
   instead of loading them  into  Solr. This executes
   the  morphline  in  the  client  process  (without
   submitting a job  to  MR)  for  quicker turnaround
   during early  trial & debug  sessions. (default:
   false)
--log4j FILE   Relative or absolute  path  to  a log4j.properties
   config file on the  local  file  system. This file
   will  be  uploaded  to   each  MR  task.  Example:
   /path/to/log4j.properties
--verbose, -v  Turn on verbose output. (default: false)
--show-non-solr-cloud  Also show options for  Non-SolrCloud  mode as part
   of --help. (default: false)
  
  Required arguments:
--output-dir HDFS_URI  HDFS directory to  write  Solr  indexes to. Inside
   there one  output  directory  per  shard  will  be
   generated.Example: hdfs://c2202.mycompany.
   com/user/$USER/test
--morphline-file FILE  Relative or absolute path  to  a local config file
   that contains one  or  more  morphlines.  The file
   must be  UTF-8

[jira] [Updated] (SOLR-5786) MapReduceIndexerTool --help output is missing large parts of the help text

2014-02-27 Thread wolfgang hoschek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wolfgang hoschek updated SOLR-5786:
---

Description: 
As already mentioned repeatedly and at length, this is a regression introduced 
by the fix in https://issues.apache.org/jira/browse/SOLR-5605

Here is the diff of --help output before SOLR-5605 vs after SOLR-5605:

{code}
130,235c130
  lucene  segments  left  in   this  index.  Merging
  segments involves reading  and  rewriting all data
  in all these  segment  files, potentially multiple
  times,  which  is  very  I/O  intensive  and  time
  consuming. However, an  index  with fewer segments
  can later be merged  faster,  and  it can later be
  queried  faster  once  deployed  to  a  live  Solr
  serving shard. Set  maxSegments  to  1 to optimize
  the index for low query  latency. In a nutshell, a
  small maxSegments  value  trades  indexing latency
  for subsequently improved query  latency. This can
  be  a  reasonable  trade-off  for  batch  indexing
  systems. (default: 1)
   --fair-scheduler-pool STRING
  Optional tuning knob  that  indicates  the name of
  the fair scheduler  pool  to  submit  jobs to. The
  Fair Scheduler is a  pluggable MapReduce scheduler
  that provides a way to  share large clusters. Fair
  scheduling is a method  of  assigning resources to
  jobs such that all jobs  get, on average, an equal
  share of resources  over  time.  When  there  is a
  single job  running,  that  job  uses  the  entire
  cluster. When  other  jobs  are  submitted,  tasks
  slots that free up are  assigned  to the new jobs,
  so that each job gets  roughly  the same amount of
  CPU time.  Unlike  the  default  Hadoop scheduler,
  which forms a queue of  jobs, this lets short jobs
  finish in reasonable time  while not starving long
  jobs. It is also an  easy  way  to share a cluster
  between multiple of users.  Fair  sharing can also
  work with  job  priorities  -  the  priorities are
  used as  weights  to  determine  the  fraction  of
  total compute time that each job gets.
   --dry-run  Run in local mode  and  print  documents to stdout
  instead of loading them  into  Solr. This executes
  the  morphline  in  the  client  process  (without
  submitting a job  to  MR)  for  quicker turnaround
  during early  trial & debug  sessions. (default:
  false)
   --log4j FILE   Relative or absolute  path  to  a log4j.properties
  config file on the  local  file  system. This file
  will  be  uploaded  to   each  MR  task.  Example:
  /path/to/log4j.properties
   --verbose, -v  Turn on verbose output. (default: false)
   --show-non-solr-cloud  Also show options for  Non-SolrCloud  mode as part
  of --help. (default: false)
 
 Required arguments:
   --output-dir HDFS_URI  HDFS directory to  write  Solr  indexes to. Inside
  there one  output  directory  per  shard  will  be
  generated.Example: hdfs://c2202.mycompany.
  com/user/$USER/test
   --morphline-file FILE  Relative or absolute path  to  a local config file
  that contains one  or  more  morphlines.  The file
  must be  UTF-8  encoded.  Example:
  /path/to/morphline.conf
 
 Cluster arguments:
   Arguments that provide information about your Solr cluster. 
 
   --zk-host STRING   The address of a ZooKeeper  ensemble being used by
  a SolrCloud cluster. This  ZooKeeper ensemble will
  be examined  to  determine  the  number  of output
  shards to create  as  well  as  the  Solr  URLs to
  merge the output shards into  when using the --go-
  live option. Requires that  you  also  pass the --
  collection to merge the shards

[jira] [Commented] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest

2014-02-27 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915037#comment-13915037
 ] 

wolfgang hoschek commented on SOLR-5605:


bq. Are you not a committer? At Apache, those who do decide.

Yes, but you've clearly been assigned to upstream this stuff and I have plenty 
of other things to attend to these days.

bq. I did not realize Patrick's patch did not include the latest code updates 
from MapReduce. 

Might be good to pay more attention, also to CDH-14804?

bq. I had and still have bigger concerns around the usability of this code in 
Solr than this issue. It is very, very far from easy for someone to get started 
with this contrib right now. 

The usability is fine downstream where maven automatically builds a job jar 
that includes the necessary dependency jars inside of the lib dir of the MR job 
jar. Hence no startup script or extra steps are required downstream, just one 
(fat) jar. If it's not usable upstream it may be because no corresponding 
packaging system has been used upstream, for reasons that escape me.

bq. which is why none of these smaller issues concern me very much at this point.

I'm afraid ignorance never helps.

 MapReduceIndexerTool fails in some locales -- seen in random failures of 
 MapReduceIndexerToolArgumentParserTest
 ---

 Key: SOLR-5605
 URL: https://issues.apache.org/jira/browse/SOLR-5605
 Project: Solr
  Issue Type: Bug
Reporter: Hoss Man
Assignee: Mark Miller
 Fix For: 4.7, 5.0


 I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest 
 which is reproducible with any seed -- all that matters is the locale.
 The problem sounded familiar, and a quick search verified that jenkins has in 
 fact hit this a couple of times in the past -- Uwe commented on the list that 
 this is due to a real problem in one of the third-party dependencies (that 
 does the argument parsing) that will affect usage on some systems.
 If working around the bug in the arg parsing lib isn't feasible, 
 MapReduceIndexerTool should fail cleanly if the locale isn't one we know is 
 supported






[jira] [Comment Edited] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest

2014-02-27 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915037#comment-13915037
 ] 

wolfgang hoschek edited comment on SOLR-5605 at 2/27/14 9:23 PM:
-

bq. Are you not a committer? At Apache, those who do decide.

Yes, but you've clearly been assigned to upstream those contribs and I have 
plenty of other things to attend to these days.

bq. I did not realize Patrick's patch did not include the latest code updates 
from MapReduce. 

Might be good to pay more attention, also to CDH-14804?

bq. I had and still have bigger concerns around the usability of this code in 
Solr than this issue. It is very, very far from easy for someone to get started 
with this contrib right now. 

The usability is fine downstream where maven automatically builds a job jar 
that includes the necessary dependency jars inside of the lib dir of the MR job 
jar. Hence no startup script or extra steps are required downstream, just one 
(fat) jar. If it's not usable upstream it may be because no corresponding 
packaging system has been used upstream, for reasons that escape me.

bq. which is why none of these smaller issues concern me very much at this point.

I'm afraid ignorance never helps.


was (Author: whoschek):
bq. Are you not a committer? At Apache, those who do decide.

Yes, but you've clearly been assigned to upstream this stuff and I have plenty 
of other things to attend to these days.

bq. I did not realize Patricks patch did not include the latest code updates 
from MapReduce. 

Might be good to pay more attention, also to CDH-14804?

bq. I had and still have bigger concerns around the usability of this code in 
Solr than this issue. It is very, very far from easy for someone to get started 
with this contrib right now. 

The usability is fine downstream where maven automatically builds a job jar 
that includes the necessary dependency jars inside of the lib dir of the MR job 
jar. Hence no startup script or extra steps are required downstream, just one 
(fat) jar. If it's not usable upstream it may be because no corresponding 
packaging system has been used upstream, for reasons that escape me.

bq. which is why non of these smaller issues concern me very much at this point.

I'm afraid ignorance never helps.

 MapReduceIndexerTool fails in some locales -- seen in random failures of 
 MapReduceIndexerToolArgumentParserTest
 ---

 Key: SOLR-5605
 URL: https://issues.apache.org/jira/browse/SOLR-5605
 Project: Solr
  Issue Type: Bug
Reporter: Hoss Man
Assignee: Mark Miller
 Fix For: 4.7, 5.0


 I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest 
 which is reproducible with any seed -- all that matters is the locale.
 The problem sounded familiar, and a quick search verified that jenkins has in 
 fact hit this a couple of times in the past -- Uwe commented on the list that 
 this is due to a real problem in one of the third-party dependencies (that 
 does the argument parsing) that will affect usage on some systems.
 If working around the bug in the arg parsing lib isn't feasible, 
 MapReduceIndexerTool should fail cleanly if the locale isn't one we know is 
 supported






[jira] [Commented] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest

2014-02-25 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911744#comment-13911744
 ] 

wolfgang hoschek commented on SOLR-5605:


I have looked, have you? I have fixed this one before. Have you? 

Pls take the time to diff before vs. after to see that some parts of the docs are 
missing while others are present (b/c of the funny missing buffer flush). It 
is not the same. This is a regression. Thx.

 MapReduceIndexerTool fails in some locales -- seen in random failures of 
 MapReduceIndexerToolArgumentParserTest
 ---

 Key: SOLR-5605
 URL: https://issues.apache.org/jira/browse/SOLR-5605
 Project: Solr
  Issue Type: Bug
Reporter: Hoss Man
Assignee: Mark Miller
 Fix For: 4.7, 5.0


 I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest 
 which is reproducible with any seed -- all that matters is the locale.
 The problem sounded familiar, and a quick search verified that jenkins has in 
 fact hit this a couple of times in the past -- Uwe commented on the list that 
 this is due to a real problem in one of the third-party dependencies (that 
 does the argument parsing) that will affect usage on some systems.
 If working around the bug in the arg parsing lib isn't feasible, 
 MapReduceIndexerTool should fail cleanly if the locale isn't one we know is 
 supported






[jira] [Reopened] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest

2014-02-19 Thread wolfgang hoschek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wolfgang hoschek reopened SOLR-5605:



Without this, the --help text is screwed. See 
https://issues.apache.org/jira/secure/EditComment!default.jspa?id=12687301&commentId=13862272

 MapReduceIndexerTool fails in some locales -- seen in random failures of 
 MapReduceIndexerToolArgumentParserTest
 ---

 Key: SOLR-5605
 URL: https://issues.apache.org/jira/browse/SOLR-5605
 Project: Solr
  Issue Type: Bug
Reporter: Hoss Man
Assignee: Mark Miller
 Fix For: 4.7, 5.0


 I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest 
 which is reproducible with any seed -- all that matters is the locale.
 The problem sounded familiar, and a quick search verified that jenkins has in 
 fact hit this a couple of times in the past -- Uwe commented on the list that 
 this is due to a real problem in one of the third-party dependencies (that 
 does the argument parsing) that will affect usage on some systems.
 If working around the bug in the arg parsing lib isn't feasible, 
 MapReduceIndexerTool should fail cleanly if the locale isn't one we know is 
 supported






[jira] [Commented] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest

2014-02-19 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905806#comment-13905806
 ] 

wolfgang hoschek commented on SOLR-5605:


Yes, as already mentioned, otherwise some of the --help text doesn't show up in 
the output because there's a change related to buffer flushing in 
argparse4j-0.4.2.

 MapReduceIndexerTool fails in some locales -- seen in random failures of 
 MapReduceIndexerToolArgumentParserTest
 ---

 Key: SOLR-5605
 URL: https://issues.apache.org/jira/browse/SOLR-5605
 Project: Solr
  Issue Type: Bug
Reporter: Hoss Man
Assignee: Mark Miller
 Fix For: 4.7, 5.0


 I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest 
 which is reproducible with any seed -- all that matters is the locale.
 The problem sounded familiar, and a quick search verified that jenkins has in 
 fact hit this a couple of times in the past -- Uwe commented on the list that 
 this is due to a real problem in one of the third-party dependencies (that 
 does the argument parsing) that will affect usage on some systems.
 If working around the bug in the arg parsing lib isn't feasible, 
 MapReduceIndexerTool should fail cleanly if the locale isn't one we know is 
 supported






Re: Welcome Benson Margulies as Lucene/Solr committer!

2014-01-28 Thread Wolfgang Hoschek
Welcome on board!

Wolfgang.

On Jan 26, 2014, at 4:32 PM, Erick Erickson wrote:

 Good to have you aboard!
 
 Erick
 
 On Sat, Jan 25, 2014 at 10:52 PM, Mark Miller markrmil...@gmail.com wrote:
 Welcome!
 
 - Mark
 
 http://about.me/markrmiller
 
 On Jan 25, 2014, at 4:40 PM, Michael McCandless luc...@mikemccandless.com 
 wrote:
 
 I'm pleased to announce that Benson Margulies has accepted to join our
 ranks as a committer.
 
 Benson has been involved in a number of Lucene/Solr issues over time
 (see 
 http://jirasearch.mikemccandless.com/search.py?index=jirachg=ddsa1=allUsersa2=Benson+Margulies
 ), most recently on debugging tricky analysis issues.
 
 Benson, it is tradition that you introduce yourself with a brief bio.
 I know you're heavily involved in other Apache projects already...
 
 Once your account is set up, you should then be able to add yourself
 to the who we are page on the website as well.
 
 Congratulations and welcome!
 
 Mike McCandless
 
 http://blog.mikemccandless.com
 



[jira] [Commented] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest

2014-01-04 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862272#comment-13862272
 ] 

wolfgang hoschek commented on SOLR-5605:


Thanks for getting to the bottom of this! 

Looks like we'll now be good on upgrade to argparse4j-0.4.3, except we'll also 
need to apply CDH-16434 to MapReduceIndexerTool.java because there's a change 
related to flushing in 0.4.2:

-parser.printHelp(new PrintWriter(System.out));  
+parser.printHelp();

Otherwise some of the --help text doesn't show up in the output :-(
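
A hedged sketch of the pitfall, assuming argparse4j 0.4.2+ on the classpath (the parser calls are real argparse4j API; the program and option names are just for illustration):

{code}
import java.io.PrintWriter;
import net.sourceforge.argparse4j.ArgumentParsers;
import net.sourceforge.argparse4j.impl.Arguments;
import net.sourceforge.argparse4j.inf.ArgumentParser;

public class HelpFlushDemo {
  public static void main(String[] args) {
    ArgumentParser parser = ArgumentParsers.newArgumentParser("MapReduceIndexerTool");
    parser.addArgument("--dry-run").action(Arguments.storeTrue());

    // Since 0.4.2 printHelp(PrintWriter) no longer flushes for you:
    PrintWriter pw = new PrintWriter(System.out);
    parser.printHelp(pw);
    pw.flush();          // without this, part of the help text can be lost

    parser.printHelp();  // or let argparse4j write (and flush) to stdout itself
  }
}
{code}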

 MapReduceIndexerTool fails in some locales -- seen in random failures of 
 MapReduceIndexerToolArgumentParserTest
 ---

 Key: SOLR-5605
 URL: https://issues.apache.org/jira/browse/SOLR-5605
 Project: Solr
  Issue Type: Bug
Reporter: Hoss Man

 I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest 
 which is reproducible with any seed -- all that matters is the locale.
 The problem sounded familiar, and a quick search verified that jenkins has in 
 fact hit this a couple of times in the past -- Uwe commented on the list that 
 this is due to a real problem in one of the third-party dependencies (that 
 does the argument parsing) that will affect usage on some systems.
 If working around the bug in the arg parsing lib isn't feasible, 
 MapReduceIndexerTool should fail cleanly if the locale isn't one we know is 
 supported






[jira] [Comment Edited] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest

2014-01-04 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862272#comment-13862272
 ] 

wolfgang hoschek edited comment on SOLR-5605 at 1/4/14 11:42 AM:
-

Thanks for getting to the bottom of this! 

Looks like we'll now be good on upgrade to argparse4j-0.4.3, except we'll also 
need to apply CDH-16434 to MapReduceIndexerTool.java because there's a change 
related to flushing in 0.4.2:

{code}
-parser.printHelp(new PrintWriter(System.out));  
+parser.printHelp();
{code}

Otherwise some of the --help text doesn't show up in the output :-(


was (Author: whoschek):
Thanks for getting to the bottom of this! 

Looks like we'll now be good on upgrade to argparse4j-0.4.3, except we'll also 
need to apply CDH-16434 to MapReduceIndexerTool.java because there's a change 
related to flushing in 0.4.2:

-parser.printHelp(new PrintWriter(System.out));  
+parser.printHelp();

Otherwise some of the --help text doesn't show up in the output :-(

 MapReduceIndexerTool fails in some locales -- seen in random failures of 
 MapReduceIndexerToolArgumentParserTest
 ---

 Key: SOLR-5605
 URL: https://issues.apache.org/jira/browse/SOLR-5605
 Project: Solr
  Issue Type: Bug
Reporter: Hoss Man

 I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest 
 which is reproducible with any seed -- all that matters is the locale.
 The problem sounded familiar, and a quick search verified that jenkins has in 
 fact hit this a couple of times in the past -- Uwe commented on the list that 
 this is due to a real problem in one of the third-party dependencies (that 
 does the argument parsing) that will affect usage on some systems.
 If working around the bug in the arg parsing lib isn't feasible, 
 MapReduceIndexerTool should fail cleanly if the locale isn't one we know is 
 supported






[jira] [Commented] (SOLR-5584) Update to Guava 15.0

2014-01-04 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862273#comment-13862273
 ] 

wolfgang hoschek commented on SOLR-5584:


As mentioned above, morphlines was designed to run fine with any Guava version 
>= 11.0.2. 

But the Hadoop task tracker always puts guava-11.0.2 on the classpath of any MR 
job that it executes, so solr-mapreduce would need to figure out how to 
override or reorder that.
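
One possible knob for that, sketched under the assumption of an MR2/YARN cluster (the property names are real Hadoop settings, though which one applies depends on the Hadoop version in use):

{code}
import org.apache.hadoop.conf.Configuration;

public class PreferUserClasspath {
  // Ask the MR framework to put the job's own jars ahead of the cluster's
  // bundled ones (e.g. guava-11.0.2) on the task classpath.
  public static Configuration configure(Configuration conf) {
    conf.setBoolean("mapreduce.job.user.classpath.first", true); // MR2 / YARN
    conf.setBoolean("mapreduce.user.classpath.first", true);     // older MR1 name
    return conf;
  }
}
{code}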

 Update to Guava 15.0
 

 Key: SOLR-5584
 URL: https://issues.apache.org/jira/browse/SOLR-5584
 Project: Solr
  Issue Type: Improvement
Reporter: Mark Miller
Assignee: Mark Miller
Priority: Minor
 Fix For: 5.0, 4.7









Re: The Old Git Discussion

2014-01-02 Thread Wolfgang Hoschek
+1

On Jan 2, 2014, at 10:53 PM, Simon Willnauer wrote:

 +1
 
 On Thu, Jan 2, 2014 at 9:51 PM, Mark Miller markrmil...@gmail.com wrote:
 bzr is dying; Emacs needs to move
 
 
 http://lists.gnu.org/archive/html/emacs-devel/2014-01/msg5.html
 
 Interesting thread.
 
 For similar reasons, I think that Lucene and Solr should eventually move to
 Git. It's not GitHub, but it's a lot closer. The new Apache projects I see
  are all choosing Git. It's the winner's road I think. I don't know that there
 is a big hurry right now, but I think it's inevitable that we should switch.
 
 --
 - Mark
 



[jira] [Commented] (SOLR-5584) Update to Guava 15.0

2013-12-30 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858699#comment-13858699
 ] 

wolfgang hoschek commented on SOLR-5584:


What exactly is failing for you? Morphlines was designed to run fine with any 
Guava version >= 11.0.2. At least it did last I checked...

 Update to Guava 15.0
 

 Key: SOLR-5584
 URL: https://issues.apache.org/jira/browse/SOLR-5584
 Project: Solr
  Issue Type: Improvement
Reporter: Mark Miller
Assignee: Mark Miller
Priority: Minor
 Fix For: 5.0, 4.7









[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-25 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856657#comment-13856657
 ] 

wolfgang hoschek commented on SOLR-1301:


Also see https://issues.cloudera.org/browse/CDK-262


 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.
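
 To make the converter step above concrete, a hedged sketch (the SolrDocumentConverter contract is paraphrased from the description; the method signature and field names here are assumptions, not the contrib's actual API):

{code}
import org.apache.hadoop.io.Text;
import org.apache.solr.common.SolrInputDocument;

public class CsvLineConverter {
  // Turn one Hadoop (key, value) pair (here an id and a CSV line) into a
  // SolrInputDocument to be batched into the EmbeddedSolrServer.
  public SolrInputDocument convert(Text key, Text value) {
    String[] fields = value.toString().split(",");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", key.toString());
    doc.addField("title_s", fields.length > 0 ? fields[0] : "");
    doc.addField("body_t", fields.length > 1 ? fields[1] : "");
    return doc;
  }
}
{code}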






[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-15 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848097#comment-13848097
 ] 

wolfgang hoschek edited comment on SOLR-1301 at 12/16/13 2:27 AM:
--

Might be best to write a program that generates the list of files and then 
explicitly provide that file list to the MR job, e.g. via the --input-list 
option. For example you could use the HDFS version of the Linux file system 
'find' command for that (HdfsFindTool doc and code here: 
https://github.com/cloudera/search/tree/master_1.1.0/search-mr#hdfsfindtool)
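For illustration, a rough sketch of such a file-list generator using the Hadoop FileSystem API (assumes a Hadoop 2.x style API; the input path argument and the hidden-file filtering rule are assumptions for this example):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// Recursively lists files under an HDFS directory and prints one URI per line,
// suitable for redirecting into a file that is then passed to the MR job.
public class ListInputFiles {
  public static void main(String[] args) throws IOException {
    Path root = new Path(args[0]); // e.g. hdfs://namenode/user/foo/indir
    FileSystem fs = root.getFileSystem(new Configuration());
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(root, true); // true = recursive
    while (it.hasNext()) {
      LocatedFileStatus status = it.next();
      String name = status.getPath().getName();
      if (!name.startsWith("_") && !name.startsWith(".")) { // skip hidden files
        System.out.println(status.getPath().toUri());
      }
    }
  }
}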




was (Author: whoschek):
Might be best to write a program that generates the list of files and then 
explicitly provide that file list to the MR job, e.g. via the --input-list 
option. For example you could use the HDFS version of the Linux file system 
'find' command for that (HdfsFindTool doc and code here: 
https://github.com/cloudera/search/tree/master_1.1.0/search-mr)



 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-15 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13848775#comment-13848775
 ] 

wolfgang hoschek commented on SOLR-1301:


bq. it would be convenient if we could ignore the underscore (_) hidden files 
in hdfs as well as the . hidden files when reading input files from hdfs.

+1
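For illustration, a minimal sketch of the kind of filter being suggested here, based on Hadoop's PathFilter interface (where exactly the indexer tool would register such a filter is an assumption, not settled by this comment):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Skips "hidden" HDFS entries, i.e. names starting with "_" (e.g. _SUCCESS, _logs) or ".".
public class HiddenFileFilter implements PathFilter {
  @Override
  public boolean accept(Path path) {
    String name = path.getName();
    return !name.startsWith("_") && !name.startsWith(".");
  }
}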

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-13 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13848097#comment-13848097
 ] 

wolfgang hoschek commented on SOLR-1301:


Might be best to write a program that generates the list of files and then 
explicitly provide that file list to the MR job, e.g. via the --input-list 
option. For example you could use the HDFS version of the Linux file system 
'find' command for that (HdfsFindTool doc and code here: 
https://github.com/cloudera/search/tree/master_1.1.0/search-mr)



 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-09 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13843443#comment-13843443
 ] 

wolfgang hoschek commented on SOLR-1301:


I'm not aware of anything needing jersey except perhaps hadoop pulls that in.

The combined dependencies of all morphline modules are listed here: 
http://cloudera.github.io/cdk/docs/current/cdk-morphlines/cdk-morphlines-all/dependencies.html

The dependencies of each individual morphline module are listed here: 
http://cloudera.github.io/cdk/docs/current/cdk-morphlines/cdk-morphlines-all/dependencies.html

The source and POMs are here, as usual: 
https://github.com/cloudera/cdk/tree/master/cdk-morphlines

By the way, a somewhat separate issue is that it seems to me that the ivy 
dependencies for solr-morphlines-core, solr-morphlines-cell and 
solr-map-reduce are a bit backwards upstream, in that solr-morphlines-core pulls 
in a ton of dependencies that it doesn't need; those deps should rather be 
pulled in by solr-map-reduce (which is essentially an out-of-the-box 
app). Would be good to organize ivy and mvn upstream in such a way that 

* solr-map-reduce should depend on solr-morphlines-cell plus cdk-morphlines-all 
plus xyz
* solr-morphlines-cell should depend on solr-morphlines-core plus xyz
* solr-morphlines-core should depend on cdk-morphlines-core plus xyz 

More concretely, FWIW, to see how the deps look in production releases 
downstream, review the following POMs: 

https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-solr-core/pom.xml

and

https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-solr-cell/pom.xml

and

https://github.com/cloudera/search/blob/master_1.1.0/search-mr/pom.xml

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored

[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-09 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13843443#comment-13843443
 ] 

wolfgang hoschek edited comment on SOLR-1301 at 12/9/13 7:30 PM:
-

I'm not aware of anything needing jersey except perhaps hadoop pulls that in.

The combined dependencies of all morphline modules are listed here: 
http://cloudera.github.io/cdk/docs/current/cdk-morphlines/cdk-morphlines-all/dependencies.html

The dependencies of each individual morphline module are listed here: 
http://cloudera.github.io/cdk/docs/current/dependencies.html

The source and POMs are here, as usual: 
https://github.com/cloudera/cdk/tree/master/cdk-morphlines

By the way, a somewhat separate issue is that it seems to me that the ivy 
dependencies for solr-morphlines-core, solr-morphlines-cell and 
solr-map-reduce are a bit backwards upstream, in that currently 
solr-morphlines-core pulls in a ton of dependencies that it doesn't need; 
those deps should rather be pulled in by solr-map-reduce (which is 
essentially an out-of-the-box app that bundles user-level deps). 
Correspondingly, it would be good to organize ivy and mvn upstream in such a way 
that 

* solr-map-reduce should depend on solr-morphlines-cell plus cdk-morphlines-all 
minus cdk-morphlines-solr-cell (now upstream) minus cdk-morphlines-solr-core 
(now upstream) plus xyz
* solr-morphlines-cell should depend on solr-morphlines-core plus xyz
* solr-morphlines-core should depend on cdk-morphlines-core plus xyz 

More concretely, FWIW, to see how the deps look in production releases 
downstream, review the following POMs: 

https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-solr-core/pom.xml

and

https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-solr-cell/pom.xml

and

https://github.com/cloudera/search/blob/master_1.1.0/search-mr/pom.xml


was (Author: whoschek):
I'm not aware of anything needing jersey except perhaps hadoop pulls that in.

The combined dependencies of all morphline modules are listed here: 
http://cloudera.github.io/cdk/docs/current/cdk-morphlines/cdk-morphlines-all/dependencies.html

The dependencies of each individual morphline module are listed here: 
http://cloudera.github.io/cdk/docs/current/cdk-morphlines/cdk-morphlines-all/dependencies.html

The source and POMs are here, as usual: 
https://github.com/cloudera/cdk/tree/master/cdk-morphlines

By the way, a somewhat separate issue is that it seems to me that the ivy 
dependencies for solr-morphlines-core, solr-morphlines-cell and 
solr-map-reduce are a bit backwards upstream, in that solr-morphlines-core pulls 
in a ton of dependencies that it doesn't need; those deps should rather be 
pulled in by solr-map-reduce (which is essentially an out-of-the-box 
app). Would be good to organize ivy and mvn upstream in such a way that 

* solr-map-reduce should depend on solr-morphlines-cell plus cdk-morphlines-all 
plus xyz
* solr-morphlines-cell should depend on solr-morphlines-core plus xyz
* solr-morphlines-core should depend on cdk-morphlines-core plus xyz 

More concretely, FWIW, to see how the deps look in production releases 
downstream, review the following POMs: 

https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-solr-core/pom.xml

and

https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-solr-cell/pom.xml

and

https://github.com/cloudera/search/blob/master_1.1.0/search-mr/pom.xml

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS

[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-09 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13843523#comment-13843523
 ] 

wolfgang hoschek commented on SOLR-1301:


Apologies for the confusion. We are upstreaming cdk-morphlines-solr-cell into 
the solr contrib solr-morphlines-cell, cdk-morphlines-solr-core into 
the solr contrib solr-morphlines-core, and search-mr into the solr 
contrib solr-map-reduce. Once the upstreaming is done these old modules will go 
away. Next, downstream will be made identical to upstream, plus perhaps some 
critical fixes as necessary, and at that point the upstream/downstream terms will 
apply in the way folks usually think about them. We are not quite there yet, 
but getting there...

cdk-morphlines-all is simply a convenience pom that includes all the other 
morphline poms so there's less to type for users who like a bit more auto magic.

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-06 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13842034#comment-13842034
 ] 

wolfgang hoschek commented on SOLR-1301:


There are also some important fixes downstream in 0.9.0 of cdk-morphlines-core 
and cdk-morphlines-solr-cell that would be good to merge upstream (solr locator 
race, solr cell bug, etc.). Also, there are new morphline module jars to add 
with 0.9.0 and jars to update (plus upstream is still missing some morphline 
modules from 0.8 as well).

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-06 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13842034#comment-13842034
 ] 

wolfgang hoschek edited comment on SOLR-1301 at 12/7/13 2:57 AM:
-

There are also some important fixes downstream in 0.9.0 of 
cdk-morphlines-solr-core and cdk-morphlines-solr-cell that would be good to 
merge upstream (solr locator race, solr cell bug, etc.). Also, there are new 
morphline module jars to add with 0.9.0 and jars to update (plus upstream is 
still missing some morphline modules from 0.8 as well).


was (Author: whoschek):
There are also some important fixes downstream in 0.9.0 of cdk-morphlines-core 
and cdk-morphlines-solr-cell that would be good to merge upstream (solr locator 
race, solr cell bug, etc.). Also, there are new morphline module jars to add 
with 0.9.0 and jars to update (plus upstream is still missing some morphline 
modules from 0.8 as well).

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-04 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13839308#comment-13839308
 ] 

wolfgang hoschek commented on SOLR-1301:


There are also some fixes downstream in cdk-morphlines-core and 
cdk-morphlines-solr-cell that would be good to push upstream.


 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-04 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13839311#comment-13839311
 ] 

wolfgang hoschek commented on SOLR-1301:


Minor nit: could remove 
jobConf.setBoolean(ExtractingParams.IGNORE_TIKA_EXCEPTION, false) in 
MorphlineBasicMiniMRTest + MorphlineGoLiveMiniMRTest because such a flag is 
no longer needed, and removing it drops an unnecessary dependency on tika.



 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-04 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13839556#comment-13839556
 ] 

wolfgang hoschek commented on SOLR-1301:


FWIW, a current printout of --help showing the CLI options is here: 
https://github.com/cloudera/search/tree/master_1.0.0/search-mr


 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-04 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13839556#comment-13839556
 ] 

wolfgang hoschek edited comment on SOLR-1301 at 12/5/13 12:55 AM:
--

FWIW, a current printout of --help showing the CLI options is here: 
https://github.com/cloudera/search/tree/master_1.1.0/search-mr



was (Author: whoschek):
FWIW, a current printout of --help showing the CLI options is here: 
https://github.com/cloudera/search/tree/master_1.0.0/search-mr


 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [JENKINS] Lucene-Solr-trunk-Linux (32bit/jdk1.8.0-ea-b117) - Build # 8549 - Still Failing!

2013-12-03 Thread Wolfgang Hoschek

On Dec 3, 2013, at 12:11 AM, Uwe Schindler wrote:

 Looks like Java's service loader lookup impl has become more strict in Java8.
 This issue on Java 8 is kind of unfortunate because morphlines and solr-mr
 doesn't actually use JAXP at all.
 
 For the time being might be best to disable testing on Java8 for this 
 contrib,
 in order to get a stable build and make progress on other issues.
 
 A couple of options that come to mind in how to deal with this longer term:
 
 1) Remove the dependency on cdk-morphlines-saxon (which pulls in the
 saxon jar)
 
 
 What is the effect of this? I would prefer this!

The effect is that the convertHTML, xquery and xslt commands won't be available 
anymore: 

http://cloudera.github.io/cdk/docs/current/cdk-morphlines/morphlinesReferenceGuide.html#/cdk-morphlines-saxon

 
 2) Replace all Solr calls to JAXP XPathFactory.newInstance() with a little
 helper that first tries to use one of a list of well known XPathFactory
 subclasses, and only if that fails falls back to the generic
 XPathFactory.newInstance(). E.g. use something like
 
 XPathFactory.newInstance(XPathFactory.DEFAULT_OBJECT_MODEL_URI,
 com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl,
 ClassLoader.getSystemClassLoader());
 
 This is a hack, just because of this craziness, I don't want to have non 
 conformant code in Solr Core!

This is actually quite common practice because the JAXP service loader 
mechanism is a bit flawed. Also, most XSLT, XPath and StAX implementations 
have serious bugs in various areas. Thus many XML-intensive apps that require 
reliability and predictable behavior explicitly choose a JAXP 
implementation that's known to work for them, rather than hoping for the best 
with some potentially buggy default impl. JAXP pluggability really only exists 
for simple XPath use cases. The good news is that Solr Config et al. seems to 
fit into that simple pluggable bucket.
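For illustration, a sketch of the kind of helper meant in option 2 above; the helper class/method names and the candidate list are made up for this example, only the JAXP calls themselves are standard:

import javax.xml.xpath.XPathFactory;
import javax.xml.xpath.XPathFactoryConfigurationException;

// Tries a known-good XPathFactory implementation first and only falls back to the
// generic JAXP service-loader lookup if none of the candidates is available.
public final class XPathFactoryHelper {

  private static final String[] CANDIDATES = {
    "com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl" // JDK built-in impl
  };

  public static XPathFactory newXPathFactory() {
    for (String className : CANDIDATES) {
      try {
        return XPathFactory.newInstance(
            XPathFactory.DEFAULT_OBJECT_MODEL_URI,
            className,
            XPathFactoryHelper.class.getClassLoader());
      } catch (XPathFactoryConfigurationException e) {
        // candidate not present in this JVM; try the next one
      }
    }
    return XPathFactory.newInstance(); // generic service-loader based lookup
  }

  private XPathFactoryHelper() {}
}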

 
 There are 14 such XPathFactory.newInstance() calls in the Solr codebase.
 
 Definite -1
 
 3) Somehow remove the META-INF/services/javax.xml.xpath.XPathFactory
 file from the saxon jar (this is what's causing this, and we don't need that 
 file,
 but it's not clear how to remove it, realistically)
 
 The only correct way to solve this: File a bug in Jackson and apply (1). 
 Jackson violates the standards. And this violation fails in a number of JVMs 
 (not only in Java 8, also IBM J9 is affected).

I'll file a bug with saxon and see what Mike Kay's take is. Meanwhile, we could 
remove the saxon jar or disable tests on java8 & J9 to be able to move forward 
on this.

 Because of this I don't want to have Jackson in Solr at all (you have to 
 know, I am a fan of XSLT and XPath, but Jackson is the worst implementation I 
 have seen and I avoid it whenever possible - Only if you need XPath2 / XSLT 2 
 you may want to use it).

All XML libs have bugs but most XML intensive apps use saxon in production 
rather than other impls, at least from what I've seen over the years. Anyway, 
just my 2 cents.

Wolfgang.

 
 Uwe
 
 On Dec 2, 2013, at 4:41 PM, Mark Miller wrote:
 
 Uwe mentioned this in IRC - I guess Saxon doesn’t play nice with java 8.
 
 http://stackoverflow.com/questions/7914915/syntax-error-in-javax-xml-xpath-xpathfactory-provider-configuration-file-of-saxo
 
 - Mark
 
 On Dec 2, 2013, at 7:06 PM, Policeman Jenkins Server
 jenk...@thetaphi.de wrote:
 
 Build: http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Linux/8549/
 Java: 32bit/jdk1.8.0-ea-b117 -server -XX:+UseSerialGC
 
 3 tests failed.
 FAILED:
 
 junit.framework.TestSuite.org.apache.solr.hadoop.MorphlineReducerTest
 
 Error Message:
 1 thread leaked from SUITE scope at
 org.apache.solr.hadoop.MorphlineReducerTest: 1) Thread[id=17,
 name=Thread-4, state=TIMED_WAITING, group=TGRP-
 MorphlineReducerTest] at sun.misc.Unsafe.park(Native Method) 
 at
 java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
 at
 java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNa
 nos(AbstractQueuedSynchronizer.java:1037) at
 java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNa
 nos(AbstractQueuedSynchronizer.java:1328) at
 java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
 at org.apache.solr.hadoop.HeartBeater.run(HeartBeater.java:108)
 
 Stack Trace:
 com.carrotsearch.randomizedtesting.ThreadLeakError: 1 thread leaked
 from SUITE scope at org.apache.solr.hadoop.MorphlineReducerTest:
  1) Thread[id=17, name=Thread-4, state=TIMED_WAITING, group=TGRP-
 MorphlineReducerTest]
   at sun.misc.Unsafe.park(Native Method)
   at
 java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
   at
 java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNa
 nos(AbstractQueuedSynchronizer.java:1037)
   at
 java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNa
 nos(AbstractQueuedSynchronizer.java:1328)
   at
 

Re: [JENKINS] Lucene-Solr-trunk-Linux (32bit/jdk1.8.0-ea-b117) - Build # 8549 - Still Failing!

2013-12-03 Thread Wolfgang Hoschek
FYI, I filed this saxon ticket: https://saxonica.plan.io/issues/1944

On Dec 3, 2013, at 12:52 AM, Wolfgang Hoschek wrote:

 
 On Dec 3, 2013, at 12:11 AM, Uwe Schindler wrote:
 
 Looks like Java's service loader lookup impl has become more strict in 
 Java8.
 This issue on Java 8 is kind of unfortunate because morphlines and solr-mr
 doesn't actually use JAXP at all.
 
 For the time being might be best to disable testing on Java8 for this 
 contrib,
 in order to get a stable build and make progress on other issues.
 
 A couple of options that come to mind in how to deal with this longer term:
 
 1) Remove the dependency on cdk-morphlines-saxon (which pulls in the
 saxon jar)
 
 
 What is the effect of this? I would prefer this!
 
 The effect is that the convertHTML, xquery and xslt commands won't be 
 available anymore: 
 
 http://cloudera.github.io/cdk/docs/current/cdk-morphlines/morphlinesReferenceGuide.html#/cdk-morphlines-saxon
 
 
 2) Replace all Solr calls to JAXP XPathFactory.newInstance() with a little
 helper that first tries to use one of a list of well known XPathFactory
 subclasses, and only if that fails falls back to the generic
 XPathFactory.newInstance(). E.g. use something like
 
 XPathFactory.newInstance(XPathFactory.DEFAULT_OBJECT_MODEL_URI,
 com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl,
 ClassLoader.getSystemClassLoader());
 
 This is a hack, just because of this craziness, I don't want to have non 
 conformant code in Solr Core!
 
 This is actually quite common practice because the JAXP service loader 
 mechanism is a bit flawed. Also, most XSLT, XPath and StAX implementations 
 have serious bugs in various areas. Thus many XML-intensive apps that require 
 reliability and predictable behavior explicitly choose a JAXP 
 implementation that's known to work for them, rather than hoping for the best 
 with some potentially buggy default impl. JAXP pluggability really only 
 exists for simple XPath use cases. The good news is that Solr Config et al. 
 seems to fit into that simple pluggable bucket.
 
 
 There are 14 such XPathFactory.newInstance() calls in the Solr codebase.
 
 Definite -1
 
 3) Somehow remove the META-INF/services/javax.xml.xpath.XPathFactory
 file from the saxon jar (this is what's causing this, and we don't need 
 that file,
 but it's not clear how to remove it, realistically)
 
 The only correct way to solve this: File a bug in Saxon and apply (1). 
 Saxon violates the standards. And this violation fails in a number of JVMs 
 (not only in Java 8, also IBM J9 is affected).
 
 I'll file a bug with saxon and see what Mike Kay's take is. Meanwhile, we 
 could remove the saxon jar or disable tests on java8 & J9 to be able to move 
 forward on this.
 
 Because of this I don't want to have Saxon in Solr at all (you have to 
 know, I am a fan of XSLT and XPath, but Saxon is the worst implementation 
 I have seen and I avoid it whenever possible - only if you need XPath2 / 
 XSLT 2 you may want to use it).
 
 All XML libs have bugs but most XML intensive apps use saxon in production 
 rather than other impls, at least from what I've seen over the years. Anyway, 
 just my 2 cents.
 
 Wolfgang.
 
 
 Uwe
 
 On Dec 2, 2013, at 4:41 PM, Mark Miller wrote:
 
 Uwe mentioned this in IRC - I guess Saxon doesn’t play nice with java 8.
 
 http://stackoverflow.com/questions/7914915/syntax-error-in-javax-xml-xpath-xpathfactory-provider-configuration-file-of-saxo
 
 - Mark
 
 On Dec 2, 2013, at 7:06 PM, Policeman Jenkins Server
 jenk...@thetaphi.de wrote:
 
 Build: http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Linux/8549/
 Java: 32bit/jdk1.8.0-ea-b117 -server -XX:+UseSerialGC
 
 3 tests failed.
 FAILED:
 
 junit.framework.TestSuite.org.apache.solr.hadoop.MorphlineReducerTest
 
 Error Message:
 1 thread leaked from SUITE scope at org.apache.solr.hadoop.MorphlineReducerTest:
   1) Thread[id=17, name=Thread-4, state=TIMED_WAITING, group=TGRP-MorphlineReducerTest]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
        at org.apache.solr.hadoop.HeartBeater.run(HeartBeater.java:108)
 
 Stack Trace:
 com.carrotsearch.randomizedtesting.ThreadLeakError: 1 thread leaked from SUITE scope at org.apache.solr.hadoop.MorphlineReducerTest:
   1) Thread[id=17, name=Thread-4, state=TIMED_WAITING, group=TGRP-MorphlineReducerTest]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java

Re: [JENKINS] Lucene-Solr-trunk-Linux (32bit/jdk1.8.0-ea-b117) - Build # 8549 - Still Failing!

2013-12-03 Thread Wolfgang Hoschek
Actually, Mike's opinion has changed because now Saxon doesn't need to support 
Java5 anymore - https://saxonica.plan.io/issues/1944

Wolfgang.

On Dec 3, 2013, at 2:07 AM, Dawid Weiss wrote:

 I'll file a bug with saxon and see what Mike Kay's take is
 
 I think Mike has already expressed his opinion on the subject in that
 stack overflow topic... :)
 
 Dawid
 
 
 On Tue, Dec 3, 2013 at 9:52 AM, Wolfgang Hoschek whosc...@cloudera.com 
 wrote:
 
 On Dec 3, 2013, at 12:11 AM, Uwe Schindler wrote:
 
 Looks like Java's service loader lookup impl has become more strict in 
 Java8.
 This issue on Java 8 is kind of unfortunate because morphlines and solr-mr
 don't actually use JAXP at all.
 
 For the time being might be best to disable testing on Java8 for this 
 contrib,
 in order to get a stable build and make progress on other issues.
 
 A couple of options that come to mind in how to deal with this longer term:
 
 1) Remove the dependency on cdk-morphlines-saxon (which pulls in the
 saxon jar)
 
 
 What is the effect of this? I would prefer this!
 
 The effect is that the convertHTML, xquery and xslt commands won't be 
 available anymore:
 
 http://cloudera.github.io/cdk/docs/current/cdk-morphlines/morphlinesReferenceGuide.html#/cdk-morphlines-saxon
 
 
 2) Replace all Solr calls to JAXP XPathFactory.newInstance() with a little
 helper that first tries to use one of a list of well known XPathFactory
 subclasses, and only if that fails falls back to the generic
 XPathFactory.newInstance(). E.g. use something like
 
 XPathFactory.newInstance(XPathFactory.DEFAULT_OBJECT_MODEL_URI,
 com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl,
 ClassLoader.getSystemClassLoader());
 
 This is a hack, just because of this craziness, I don't want to have non 
 conformant code in Solr Core!
 
 This is actually quite common practice because the JAXP service loader 
 mechanism is a bit flawed. Also, most XSLT, XPath, and StAX 
 implementations have serious bugs in various areas. Thus many XML intensive 
 apps that require reliability and predictable behavior explicitly choose a 
 JAXP implementation that's known to work for them, rather than hoping 
 for the best with some potentially buggy default impl. JAXP pluggability 
 really only exists for simple XPath use cases. The good news is that Solr 
 Config et al seem to fit into that simple pluggable bucket.
 
 
 There are 14 such XPathFactory.newInstance() calls in the Solr codebase.
 
 Definite -1
 
 3) Somehow remove the META-INF/services/javax.xml.xpath.XPathFactory
 file from the saxon jar (this is what's causing this, and we don't need 
 that file,
 but it's not clear how to remove it, realistically)
 
 The only correct way to solve this: File a bug in Saxon and apply (1). 
 Saxon violates the standards. And this violation fails in a number of 
 JVMs (not only in Java 8, also IBM J9 is affected).
 
 I'll file a bug with saxon and see what Mike Kay's take is. Meanwhile, we 
 could remove the saxon jar or disable tests on java8 & J9 to be able to move 
 forward on this.
 
 Because of this I don't want to have Saxon in Solr at all (you have to 
 know, I am a fan of XSLT and XPath, but Saxon is the worst implementation 
 I have seen and I avoid it whenever possible - only if you need XPath2 / 
 XSLT 2 you may want to use it).
 
 All XML libs have bugs but most XML intensive apps use saxon in production 
 rather than other impls, at least from what I've seen over the years. 
 Anyway, just my 2 cents.
 
 Wolfgang.
 
 
 Uwe
 
 On Dec 2, 2013, at 4:41 PM, Mark Miller wrote:
 
 Uwe mentioned this in IRC - I guess Saxon doesn’t play nice with java 8.
 
 http://stackoverflow.com/questions/7914915/syntax-error-in-javax-xml-xpath-xpathfactory-provider-configuration-file-of-saxo
 
 - Mark
 
 On Dec 2, 2013, at 7:06 PM, Policeman Jenkins Server
 jenk...@thetaphi.de wrote:
 
 Build: http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Linux/8549/
 Java: 32bit/jdk1.8.0-ea-b117 -server -XX:+UseSerialGC
 
 3 tests failed.
 FAILED:
 
 junit.framework.TestSuite.org.apache.solr.hadoop.MorphlineReducerTest
 
 Error Message:
 1 thread leaked from SUITE scope at org.apache.solr.hadoop.MorphlineReducerTest:
   1) Thread[id=17, name=Thread-4, state=TIMED_WAITING, group=TGRP-MorphlineReducerTest]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
        at org.apache.solr.hadoop.HeartBeater.run(HeartBeater.java:108)
 
 Stack Trace:
 com.carrotsearch.randomizedtesting.ThreadLeakError: 1 thread leaked from SUITE scope at org.apache.solr.hadoop.MorphlineReducerTest:
   1) Thread

[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-03 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13837976#comment-13837976
 ] 

wolfgang hoschek commented on SOLR-1301:


bq. module/dir names

I propose morphlines-solr-core and morphlines-solr-cell as names. Thoughts?

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.
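
 As a rough illustration of the converter role described above, a CSV-style
 converter could look like the following sketch (the class name, method
 signature, and field names are assumptions for illustration, not the exact
 SolrDocumentConverter API of the patch):

   import java.util.Collection;
   import java.util.Collections;

   import org.apache.hadoop.io.LongWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.solr.common.SolrInputDocument;

   // Turns one Hadoop (key, value) record into Solr documents; the record writer
   // collects these into a batch that is periodically submitted to the
   // EmbeddedSolrServer, which commits when the output format is closed.
   public class CsvLineConverter {

     public Collection<SolrInputDocument> convert(LongWritable key, Text value) {
       String[] cols = value.toString().split(",");
       SolrInputDocument doc = new SolrInputDocument();
       doc.addField("id", key.toString());                 // hypothetical schema fields
       doc.addField("text", cols.length > 1 ? cols[1] : value.toString());
       return Collections.singletonList(doc);
     }
   }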



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-03 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13837979#comment-13837979
 ] 

wolfgang hoschek commented on SOLR-1301:


+1 to map-reduce-indexer module name/dir.

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [JENKINS] Lucene-Solr-trunk-Linux (32bit/jdk1.8.0-ea-b117) - Build # 8549 - Still Failing!

2013-12-03 Thread Wolfgang Hoschek
Hi Uwe,

There is no need for the saxon jar to be in the WAR. The mr contrib module is 
intended to be run in a separate process.

The saxon jar should only be pulled in by the MR contrib module aka 
map-reduce-indexer contrib module. If that's not the case that's a packaging 
bug that we should fix. 

For some more background, here is how the morphline dependency graph looks 
downstream: 
http://cloudera.github.io/cdk/docs/current/cdk-morphlines/cdk-morphlines-all/dependencies.html

Wolfgang.

On Dec 3, 2013, at 5:14 AM, Uwe Schindler wrote:

 Wolfgang,
 
 does this problem affect all the hadoop modules (because the saxon jar is in 
 all the modules' classpaths)? If yes, I have to disable all of them with IBM J9 
 and Oracle Java 8.
 My biggest problem is the fact that this could also affect the release of 
 Solr. If the saxon.jar is in the WAR file of Solr, then it breaks the whole of 
 Solr. But as it is a module, it should be loaded by the SolrResourceLoader 
 from the core's lib folder, so all should be fine, if installed.
 
 I hope the huge Hadoop stuff is not in the WAR (not only because of this 
 issue) and needs to be installed by the user in the instance's lib folder!!!
 
 Uwe
 
 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de
 
 
 -Original Message-
 From: dawid.we...@gmail.com [mailto:dawid.we...@gmail.com] On Behalf
 Of Dawid Weiss
 Sent: Tuesday, December 03, 2013 12:10 PM
 To: dev@lucene.apache.org
 Subject: Re: [JENKINS] Lucene-Solr-trunk-Linux (32bit/jdk1.8.0-ea-b117) -
 Build # 8549 - Still Failing!
 
 Ha! Thanks for filing the issue, Wolfgang.
 
 D.
 
 On Tue, Dec 3, 2013 at 12:01 PM, Wolfgang Hoschek
 whosc...@cloudera.com wrote:
 Actually, Mike's opinion has changed because now Saxon doesn't need to
 support Java5 anymore - https://saxonica.plan.io/issues/1944
 
 Wolfgang.
 
 On Dec 3, 2013, at 2:07 AM, Dawid Weiss wrote:
 
 I'll file a bug with saxon and see what Mike Kay's take is
 
 I think Mike has already expressed his opinion on the subject in that
 stack overflow topic... :)
 
 Dawid
 
 
 On Tue, Dec 3, 2013 at 9:52 AM, Wolfgang Hoschek
 whosc...@cloudera.com wrote:
 
 On Dec 3, 2013, at 12:11 AM, Uwe Schindler wrote:
 
 Looks like Java's service loader lookup impl has become more strict in
 Java8.
 This issue on Java 8 is kind of unfortunate because morphlines and
 solr-mr don't actually use JAXP at all.
 
 For the time being might be best to disable testing on Java8 for
 this contrib, in order to get a stable build and make progress on other
 issues.
 
 A couple of options that come to mind in how to deal with this longer
 term:
 
 1) Remove the dependency on cdk-morphlines-saxon (which pulls in
 the saxon jar)
 
 
 What is the effect of this? I would prefer this!
 
 The effect is that the convertHTML, xquery and xslt commands won't be
 available anymore:
 
 http://cloudera.github.io/cdk/docs/current/cdk-morphlines/morphlinesReferenceGuide.html#/cdk-morphlines-saxon
 
 
 2) Replace all Solr calls to JAXP XPathFactory.newInstance() with
 a little helper that first tries to use one of a list of well
 known XPathFactory subclasses, and only if that fails falls back
 to the generic XPathFactory.newInstance(). E.g. use something like
 
 
 XPathFactory.newInstance(XPathFactory.DEFAULT_OBJECT_MODEL_URI,
 com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl,
 ClassLoader.getSystemClassLoader());
 
 This is a hack, just because of this craziness, I don't want to have non
 conformant code in Solr Core!
 
 This is actually quite common practice because the JAXP service loader
 mechanism is a bit flawed. Also, most XSLT, XPath, and StAX
 implementations have serious bugs in various areas. Thus many XML
 intensive apps that require reliability and predictable behavior explicitly
 choose a JAXP implementation that's known to work for them,
 rather than hoping for the best with some potentially buggy default impl.
 JAXP pluggability really only exists for simple XPath use cases. The good
 news is that Solr Config et al seem to fit into that simple pluggable bucket.
 
 
 There are 14 such XPathFactory.newInstance() calls in the Solr
 codebase.
 
 Definite -1
 
 3) Somehow remove the
 META-INF/services/javax.xml.xpath.XPathFactory
 file from the saxon jar (this is what's causing this, and we don't
 need that file, but it's not clear how to remove it,
 realistically)
 
 The only correct way to solve this: File a bug in Saxon and apply (1).
 Saxon violates the standards. And this violation fails in a number of JVMs
 (not only in Java 8, also IBM J9 is affected).
 
 I'll file a bug with saxon and see what Mike Kay's take is. Meanwhile, we
 could remove the saxon jar or disable tests on java8 & J9 to be able to move
 forward on this.
 
 Because of this I don't want to have Saxon in Solr at all (you have to
 know, I am a fan of XSLT and XPath, but Saxon is the worst implementation
 I have seen and I avoid

[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-03 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13837976#comment-13837976
 ] 

wolfgang hoschek edited comment on SOLR-1301 at 12/3/13 6:40 PM:
-

bq. module/dir names

I propose morphlines-solr-core and morphlines-solr-cell as names. This avoids 
confusion by fitting nicely with the existing naming pattern, which is 
cdk-morphlines-solr-core and cdk-morphlines-solr-cell. 
(https://github.com/cloudera/cdk/tree/master/cdk-morphlines). Thoughts?


was (Author: whoschek):
bq. module/dir names

I propose morphlines-solr-core and morphlines-solr-cell as names. Thoughts?

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-03 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13838054#comment-13838054
 ] 

wolfgang hoschek commented on SOLR-1301:


bq. The problem with these two names is that the artifact names will have 
solr- prepended, and then solr will occur twice in their names: 
solr-morphlines-solr-core-4.7.0.jar, solr-morphlines-solr-cell-4.7.0.jar. Yuck.

Ah, argh. In this light, what Mark suggested seems good to me as well.

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-03 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13838064#comment-13838064
 ] 

wolfgang hoschek commented on SOLR-1301:


+1 on  Steve's suggestion as well. Thanks for helping out!

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-03 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13838305#comment-13838305
 ] 

wolfgang hoschek edited comment on SOLR-1301 at 12/3/13 11:11 PM:
--

Upon a bit more reflection might be better to call the contrib map-reduce and 
the artifact solr-map-reduce. This keeps the door open to potentially later 
add things like a Hadoop SolrInputFormat, i.e. read from solr via MR, rather 
than just write to solr via MR.


was (Author: whoschek):
Upon a bit more reflection might be better to call the contrib map-reduce and 
the artifact solr-map-reduce. This keeps the door upon to potentially later 
add things like a Hadoop SolrInputFormat, i.e. read from solr via MR, rather 
than just write to solr via MR.

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-03 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13838305#comment-13838305
 ] 

wolfgang hoschek commented on SOLR-1301:


Upon a bit more reflection might be better to call the contrib map-reduce and 
the artifact solr-map-reduce. This keeps the door open to potentially later 
add things like a Hadoop SolrInputFormat, i.e. read from solr via MR, rather 
than just write to solr via MR.

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-12-02 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13837068#comment-13837068
 ] 

wolfgang hoschek commented on SOLR-1301:


There is also a known issue: Morphlines don't work on Windows because 
the Guava ClassPath utility doesn't handle Windows path conventions. For 
example, see 
http://mail-archives.apache.org/mod_mbox/flume-dev/201310.mbox/%3c5acffcd9-4ad7-4e6e-8365-ceadfac78...@cloudera.com%3E
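
For context, the Guava utility in question is presumably com.google.common.reflect.ClassPath;
a minimal usage sketch is shown below (the package filter is hypothetical, and this is
not the actual Morphlines code):

  import java.io.IOException;

  import com.google.common.reflect.ClassPath;

  public class ClasspathScanDemo {
    public static void main(String[] args) throws IOException {
      // ClassPath.from() enumerates classes by parsing the class loader's classpath
      // entries; that entry parsing is where the Windows path handling was reported
      // to break.
      ClassPath cp = ClassPath.from(Thread.currentThread().getContextClassLoader());
      for (ClassPath.ClassInfo info : cp.getTopLevelClasses()) {
        if (info.getName().startsWith("com.cloudera.cdk.morphline")) { // hypothetical filter
          System.out.println(info.getName());
        }
      }
    }
  }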

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 5.0, 4.7

 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [JENKINS] Lucene-Solr-trunk-Linux (32bit/jdk1.8.0-ea-b117) - Build # 8549 - Still Failing!

2013-12-02 Thread Wolfgang Hoschek
Looks like Java's service loader lookup impl has become more strict in Java8. 
This issue on Java 8 is kind of unfortunate because morphlines and solr-mr 
don't actually use JAXP at all. 

For the time being might be best to disable testing on Java8 for this contrib, 
in order to get a stable build and make progress on other issues.

A couple of options that come to mind in how to deal with this longer term:

1) Remove the dependency on cdk-morphlines-saxon (which pulls in the saxon jar)

or 

2) Replace all Solr calls to JAXP XPathFactory.newInstance() with a little 
helper that first tries to use one of a list of well known XPathFactory 
subclasses, and only if that fails falls back to the generic 
XPathFactory.newInstance(). E.g. use something like 

XPathFactory.newInstance(XPathFactory.DEFAULT_OBJECT_MODEL_URI,
com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl, 
ClassLoader.getSystemClassLoader());

There are 14 such XPathFactory.newInstance() calls in the Solr codebase.

or 

3) Somehow remove the META-INF/services/javax.xml.xpath.XPathFactory file from 
the saxon jar (this is what's causing this, and we don't need that file, but 
it's not clear how to remove it, realistically)

Approach 2) might be best.

Thoughts?
Wolfgang.

On Dec 2, 2013, at 4:41 PM, Mark Miller wrote:

 Uwe mentioned this in IRC - I guess Saxon doesn’t play nice with java 8.
 
 http://stackoverflow.com/questions/7914915/syntax-error-in-javax-xml-xpath-xpathfactory-provider-configuration-file-of-saxo
 
 - Mark
 
 On Dec 2, 2013, at 7:06 PM, Policeman Jenkins Server jenk...@thetaphi.de 
 wrote:
 
 Build: http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Linux/8549/
 Java: 32bit/jdk1.8.0-ea-b117 -server -XX:+UseSerialGC
 
 3 tests failed.
 FAILED:  
 junit.framework.TestSuite.org.apache.solr.hadoop.MorphlineReducerTest
 
 Error Message:
 1 thread leaked from SUITE scope at 
 org.apache.solr.hadoop.MorphlineReducerTest: 1) Thread[id=17, 
 name=Thread-4, state=TIMED_WAITING, group=TGRP-MorphlineReducerTest] 
 at sun.misc.Unsafe.park(Native Method) at 
 java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)   
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
  at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
  at 
 java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) 
 at org.apache.solr.hadoop.HeartBeater.run(HeartBeater.java:108)
 
 Stack Trace:
 com.carrotsearch.randomizedtesting.ThreadLeakError: 1 thread leaked from 
 SUITE scope at org.apache.solr.hadoop.MorphlineReducerTest: 
   1) Thread[id=17, name=Thread-4, state=TIMED_WAITING, 
 group=TGRP-MorphlineReducerTest]
at sun.misc.Unsafe.park(Native Method)
at 
 java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
at org.apache.solr.hadoop.HeartBeater.run(HeartBeater.java:108)
  at __randomizedtesting.SeedInfo.seed([FA8A1D94A2BB2925]:0)
 
 
 FAILED:  
 junit.framework.TestSuite.org.apache.solr.hadoop.MorphlineReducerTest
 
 Error Message:
 There are still zombie threads that couldn't be terminated:1) 
 Thread[id=17, name=Thread-4, state=TIMED_WAITING, 
 group=TGRP-MorphlineReducerTest] at sun.misc.Unsafe.park(Native 
 Method) at 
 java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)   
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
  at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
  at 
 java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) 
 at org.apache.solr.hadoop.HeartBeater.run(HeartBeater.java:108)
 
 Stack Trace:
 com.carrotsearch.randomizedtesting.ThreadLeakError: There are still zombie 
 threads that couldn't be terminated:
   1) Thread[id=17, name=Thread-4, state=TIMED_WAITING, 
 group=TGRP-MorphlineReducerTest]
at sun.misc.Unsafe.park(Native Method)
at 
 java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
at org.apache.solr.hadoop.HeartBeater.run(HeartBeater.java:108)
  at 

Re: Welcome Joel Bernstein

2013-10-04 Thread Wolfgang Hoschek
Welcome Joel!

Wolfgang.

On Oct 3, 2013, at 9:56 AM, Erick Erickson wrote:

 Welcome Joel!
 
 On Thu, Oct 3, 2013 at 9:33 AM, Martijn v Groningen
 martijn.v.gronin...@gmail.com wrote:
 Welcome Joel!
 
 
 On 3 October 2013 15:45, Shawn Heisey s...@elyograg.org wrote:
 
 On 10/2/2013 11:24 PM, Grant Ingersoll wrote:
 The Lucene PMC is happy to welcome Joel Bernstein as a committer on the
 Lucene and Solr project.  Joel has been working on a number of issues on 
 the
 project and we look forward to his continued contributions going forward.
 
 Welcome to the project!  Best of luck to you!
 
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org
 
 
 
 
 --
 Met vriendelijke groet,
 
 Martijn van Groningen
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org
 


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Welcome back, Wolfgang Hoschek!

2013-09-26 Thread Wolfgang Hoschek
Thanks to all! Looking forward to more contributions.

Wolfgang.

On Sep 26, 2013, at 3:21 AM, Uwe Schindler wrote:

 Hi,
 
 I'm pleased to announce that after a long abstinence, Wolfgang Hoschek 
 rejoined the Lucene/Solr committer team. He is working now at Cloudera and 
 plans to help with the integration of Solr and Hadoop.
 Wolfgang originally wrote the MemoryIndex, which is used by the classical 
 Lucene highlighter and ElasticSearch's percolator module.
 
 Looking forward to new contributions.
 
 Welcome back & heavy committing! :-)
 Uwe
 
 P.S.: Wolfgang, as soon as you have setup your subversion access, you should 
 add yourself back to the committers list on the website as well.
 
 -
 Uwe Schindler
 uschind...@apache.org 
 Apache Lucene PMC Chair / Committer
 Bremen, Germany
 http://lucene.apache.org/
 
 
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org
 


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-09-16 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13768629#comment-13768629
 ] 

wolfgang hoschek commented on SOLR-1301:


cdk-morphlines-solr-core and cdk-morphlines-solr-cell should remain separate 
and be available through separate maven modules so that clients such as Flume 
Solr Sink and HBase Indexer can continue to choose to depend (or not depend) on 
them. For example, not everyone wants Tika and its dependency chain.

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 4.5, 5.0

 Attachments: commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar, README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-09-16 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13768662#comment-13768662
 ] 

wolfgang hoschek commented on SOLR-1301:


Seems like the patch still misses tika-xmp.

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 4.5, 5.0

 Attachments: commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar, README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SolrRecordWriter.java


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-09-10 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13763618#comment-13763618
 ] 

wolfgang hoschek commented on SOLR-1301:


FYI, one thing that's definitely off in that ad hoc ivy.xml above is that it 
should use com.typesafe rather than org.skife.com.typesafe.config. Use version 
1.0.2 of it. See http://search.maven.org/#search%7Cga%7C1%7Ctypesafe-config

Maybe best to wait for Mark to post our full ivy.xml, though. 

(Moving all our solr-mr dependencies from Cloudera Search Maven to Ivy was a 
bit of a beast). 

 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 4.5, 5.0

 Attachments: commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar, README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-09-10 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13763636#comment-13763636
 ] 

wolfgang hoschek commented on SOLR-1301:


By the way, docs and the downstream code for our solr-mr contrib submission are 
here: https://github.com/cloudera/search/tree/master/search-mr



 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 4.5, 5.0

 Attachments: commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar, README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.

2013-09-10 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13763644#comment-13763644
 ] 

wolfgang hoschek commented on SOLR-1301:


This new solr-mr contrib uses morphlines for ETL from MapReduce into Solr. To 
get started, here are some pointers for morphlines background material and code:

code:

https://github.com/cloudera/cdk/tree/master/cdk-morphlines

blog post:


http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/

reference guide:


http://cloudera.github.io/cdk/docs/current/cdk-morphlines/morphlinesReferenceGuide.html

slides:

http://www.slideshare.net/cloudera/using-morphlines-for-onthefly-etl

talk recording:

http://www.youtube.com/watch?v=iR48cRSbW6A


 Add a Solr contrib that allows for building Solr indexes via Hadoop's 
 Map-Reduce.
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: New Feature
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 4.5, 5.0

 Attachments: commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, 
 hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, 
 log4j-1.2.15.jar, README.txt, SOLR-1301-hadoop-0-20.patch, 
 SOLR-1301-hadoop-0-20.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java


 This patch contains  a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses a custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4661) Reduce default maxMerge/ThreadCount for ConcurrentMergeScheduler

2013-01-08 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13547367#comment-13547367
 ] 

wolfgang hoschek commented on LUCENE-4661:
--

Might be good to experiment with Linux block device read-ahead settings 
(/sbin/blockdev --setra) and to ensure the use of a file system that does write-behind 
(e.g. ext4 or xfs). Larger buffer sizes typically allow for more concurrent sequential 
streams even on spinning disks.

 Reduce default maxMerge/ThreadCount for ConcurrentMergeScheduler
 

 Key: LUCENE-4661
 URL: https://issues.apache.org/jira/browse/LUCENE-4661
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.1, 5.0


 I think our current defaults (maxThreadCount=#cores/2,
 maxMergeCount=maxThreadCount+2) are too high ... I've frequently found
 merges falling behind and then slowing each other down when I index on
 a spinning-magnets drive.
 As a test, I indexed all of English Wikipedia with term-vectors (=
 heavy on merging), using 6 threads ... at the defaults
 (maxThreadCount=3, maxMergeCount=5, for my machine) it took 5288 sec
 to index  wait for merges  commit.  When I changed to
 maxThreadCount=1, maxMergeCount=2, indexing time sped up to 2902
 seconds (45% faster).  This is on a spinning-magnets disk... basically
 spinning-magnets disk don't handle the concurrent IO well.
 Then I tested an OCZ Vertex 3 SSD: at the current defaults it took
 1494 seconds and at maxThreadCount=1, maxMergeCount=2 it took 1795 sec
 (20% slower).  Net/net the SSD can handle merge concurrency just fine.
 I think we should change the defaults: spinning magnet drives are hurt
 by the current defaults more than SSDs are helped ... apps that know
 their IO system is fast can always increase the merge concurrency.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [jira] Field constructor, avoiding String.intern()

2007-02-23 Thread Wolfgang Hoschek


On Feb 23, 2007, at 10:28 AM, James Kennedy wrote:



True. However, in the case where you are processing Documents one at a time
and discarding them (e.g. we use a HitCollector to process all documents from
a search), or memory is not an issue, it would be nice to have the ability
to disable the interning for performance's sake.


I don't know how much it would increase overall throughput in a
variety of use cases, but one approach could be to add a copy-like-this
factory method such as Field.createField(Reader) to Field.java,
analogous to the method Term.createTerm(String text) that was added to
Term.java some time ago for a similar reason.


This would guarantee that the name continues to be interned yet
allows avoiding the interning overhead in use cases where a field
with the same parametrization (yet a different content String/Reader)
is constructed many times, which is probably the most common case
where intern() overhead might matter.


For example, something like

  Field f1 = ...
  Field f2 = f1.createSimilarField(reader);

analogous to the existing Term.createTerm:

  /**
   * Optimized construction of new Terms by reusing the same field as
   * this Term - avoids field.intern() overhead.
   * @param text The text of the new term (field is implicitly the same
   *             as this Term instance)
   * @return A new Term
   */
  public Term createTerm(String text) {
    return new Term(field, text, false);
  }

Wolfgang.






Robert Engels wrote:


I don't think it is just the performance gain of equals() where
intern() matters.

It also reduces memory consumption dramatically when working with
large collections of documents in memory - although this could also
be done with constants, there is nothing in Java to enforce it (thus
the use of intern()).


On Feb 23, 2007, at 12:02 PM, James Kennedy wrote:



In our case, we're trying to optimize document() retrieval and we
found that disabling the String interning in the Field constructor
improved performance dramatically. I agree that interning should be
an option on the constructor. For document retrieval, at least for a
small number of fields, the performance gain of using equals() on
interned strings is no match for the performance loss of interning
the field name of each field.



Wolfgang Hoschek-2 wrote:


I noticed that, too, but in my case the difference was often much
more extreme: it was one of the primary bottlenecks on indexing. This
is the primary reason why MemoryIndex.addField(...) navigates around
the problem by taking a parameter of type String fieldName instead
of type Field:

  public void addField(String fieldName, TokenStream stream) {
    /*
     * Note that this method signature avoids having a user call new
     * o.a.l.d.Field(...) which would be much too expensive due to the
     * String.intern() usage of that class.
     */

Wolfgang.

On Feb 14, 2006, at 1:42 PM, Tatu Saloranta wrote:


After profiling in-memory indexing, I noticed that
calls to String.intern() showed up surprisingly high;
especially the one from the Field() constructor. This is
understandable due to the overhead String.intern() has
(being a native and synchronized method; overhead is
incurred even if the String is already interned), and the
fact that this essentially gets called once per
document+field combination.

Now, it would be quite easy to improve things a bit
(in theory), such that most intern() calls could be
avoided, transparently to the calling app; for example,
for each IndexWriter one could use a simple
HashMap for caching interned Strings. This approach
is more than twice as fast as directly calling
intern(). One could also use a per-thread cache, or a
global one; all of which would probably be faster.
However, the Field constructor hard-codes the call to
intern(), so it would be necessary to add a new
constructor that indicates that the field name is known to
be interned.
There would also need to be a way to invoke the
new optional functionality.
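
A minimal sketch of the HashMap-based interning cache described above (the
class and method names are illustrative, not an actual Lucene API):

  import java.util.HashMap;
  import java.util.Map;

  /**
   * Caches interned field names so the native, synchronized String.intern()
   * is paid at most once per distinct name. Not thread-safe; intended to be
   * owned by a single IndexWriter or a single thread.
   */
  final class FieldNameInternCache {

    private final Map<String, String> cache = new HashMap<String, String>();

    /** Returns the canonical interned instance for the given field name. */
    String intern(String name) {
      String interned = cache.get(name);
      if (interned == null) {
        interned = name.intern(); // pay the intern() cost only once per name
        cache.put(name, interned);
      }
      return interned;
    }
  }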

Has anyone tried this approach to see if speedup is
worth the hassle (in my case it'd probably be
something like 2 - 3%, assuming profiler's 5% for
intern() is accurate)?

-+ Tatu +-


__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around
http://mail.yahoo.com

-- 
--

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




--- 
--

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





--
View this message in context: http://www.nabble.com/Field-
constructor%2C-avoiding-String.intern%28%29-tf1123597.html#a9123600
Sent from the Lucene - Java Developer mailing list archive at
Nabble.com.


 
-

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional

Re: [jira] Commented: (LUCENE-794) Beginnings of a span based highlighter

2007-02-05 Thread Wolfgang Hoschek



I need to read the TokenStream at least twice
I used the horribly hacky but quick-for-me method of adding a
method to MemoryIndex that accepts a List of Tokens. Any ideas?


I'm not sure about modifying MemoryIndex. It should be easy enough
to create a subclass of TokenStream (CachedTokenStream
perhaps?) which takes a real TokenStream in its constructor and
delegates all next() calls to it (and also records them in a List)
for the first use. This can then be rewound and re-used to
run through the same set of tokens held in the list from the first
run.




Yes, as Mark points out this can be done without API change via the
existing MemoryIndex.addField(String fieldName, TokenStream stream).


The TokenStream could be constructed along similar lines as done in
MemoryIndex.keywordTokenStream(Collection), or perhaps along similar
lines as in
org.apache.lucene.index.memory.AnalyzerUtil.getTokenCachingAnalyzer(Analyzer).


If needed, an IndexReader can be created from a MemoryIndex via
MemoryIndex.createSearcher().getIndexReader(), again without API change.
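
A rough sketch of the CachedTokenStream idea mentioned above, assuming the
Token-returning TokenStream.next() API of that era (the class name and the
rewind() helper are illustrative only):

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;

  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenStream;

  /**
   * Records tokens from an underlying stream on the first pass and can be
   * rewound to replay the same tokens a second time, e.g. for feeding both
   * a MemoryIndex and a highlighter from a single analysis pass.
   */
  public class CachedTokenStream extends TokenStream {

    private final TokenStream input;
    private final List<Token> cache = new ArrayList<Token>();
    private boolean caching = true; // still on the first pass?
    private int pos = 0;            // replay position on later passes

    public CachedTokenStream(TokenStream input) {
      this.input = input;
    }

    public Token next() throws IOException {
      if (caching) {
        Token t = input.next();
        if (t == null) {
          caching = false; // first pass exhausted; later reads replay the cache
          return null;
        }
        cache.add(t);
        return t;
      }
      return pos < cache.size() ? cache.get(pos++) : null;
    }

    /** Rewinds to the beginning so the cached tokens can be consumed again. */
    public void rewind() {
      caching = false;
      pos = 0;
    }

    public void close() throws IOException {
      input.close();
    }
  }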


Wolfgang.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-129) Finalizers are non-canonical

2007-01-05 Thread wolfgang hoschek (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462579
 ] 

wolfgang hoschek commented on LUCENE-129:
-

Just to clarify: The empty finalize() method body in MemoryIndex measurably 
improves performance of this class, and it does not harm correctness because 
MemoryIndex does not require the superclass semantics with respect to concurrency.

 Finalizers are non-canonical
 

 Key: LUCENE-129
 URL: https://issues.apache.org/jira/browse/LUCENE-129
 Project: Lucene - Java
  Issue Type: Bug
  Components: Other
Affects Versions: unspecified
 Environment: Operating System: other
 Platform: All
Reporter: Esmond Pitt
 Assigned To: Michael McCandless
Priority: Minor
 Fix For: 2.1


 The canonical form of a Java finalizer is:
 protected void finalize() throws Throwable
 {
   try
   {
     // ... local code to finalize this class
   }
   catch (Throwable t)
   {
   }
   super.finalize(); // finalize base class.
 }
 The finalizers in IndexReader, IndexWriter, and FSDirectory don't conform. This
 is probably minor or null in effect, but the principle is important.
 As a matter of fact FSDirectory.finalize() is entirely redundant and could be
 removed, as it doesn't do anything that RandomAccessFile.finalize would do
 automatically.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-550) InstanciatedIndex - faster but memory consuming index

2006-11-21 Thread wolfgang hoschek (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-550?page=comments#action_12451817 ] 

wolfgang hoschek commented on LUCENE-550:
-

 All Lucene unit tests have been adapted to work with my alternate index. 
 Everything but proximity queries pass. 

Sounds like you're almost there :-)

Regarding indexing performance with MemoryIndex: Performance is more than good 
enough. I've observed and measured that often the bottleneck is not the 
MemoryIndex itself, but rather the Analyzer type (e.g. StandardAnalyzer), or 
the I/O for the input files, or term lower casing 
(http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265809), or something else 
entirely.

Regarding query performance with MemoryIndex: Some queries are more efficient 
than others. For example, fuzzy queries are much less efficient than wildcard 
queries, which in turn are much less efficient than simple term queries. Such 
effects seem partly inherent due to the nature of the query type, partly a 
function of the chosen data structure (RAMDirectory, MemoryIndex, II, ...), and 
partly a consequence of the overall Lucene API design.

The query mix found in testqueries.txt is more intended for correctness testing 
than benchmarking. Therein, certain query types dominate over others, and thus, 
conclusions about the performance of individual aspects cannot easily be drawn.

Wolfgang.


 InstanciatedIndex - faster but memory consuming index
 -

 Key: LUCENE-550
 URL: http://issues.apache.org/jira/browse/LUCENE-550
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Affects Versions: 1.9
Reporter: Karl Wettin
 Attachments: class_diagram.png, class_diagram.png, 
 instanciated_20060527.tar, InstanciatedIndexTermEnum.java, 
 lucene.1.9-karl1.jpg, lucene2-karl_20060722.tar.gz, 
 lucene2-karl_20060723.tar.gz


 After fixing the bugs, it's now 4.5 - 5 times the speed. This is true for 
 both at index and query time. Sorry if I got your hopes up too much. There 
 are still things to be done though. Might not have time to do anything with 
 this until next month, so here is the code if anyone wants a peek.
 Not good enough for Jira yet, but if someone wants to fool around with it, 
 here it is. The implementation passes a TermEnum - TermDocs - Fields - 
 TermVector comparation against the same data in a Directory.
 When it comes to features, offsets don't exists and positions are stored ugly 
 and has bugs.
 You might notice that norms are float[] and not byte[]. That is me who 
 refactored it to see if it would do any good. Bit shifting don't take many 
 ticks, so I might just revert that.
 I belive the code is quite self explaining.
 InstanciatedIndex ii = ..
 ii.new InstanciatedIndexReader();
 ii.addDocument(s).. replace IndexWriter for now.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-550) InstanciatedIndex - faster but memory consuming index

2006-11-21 Thread wolfgang hoschek (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-550?page=comments#action_12451768 ] 

wolfgang hoschek commented on LUCENE-550:
-

Ok. That means a basic test passes. For some more exhaustive tests, run all the 
queries in 

src/test/org/apache/lucene/index/memory/testqueries.txt

against matching files such as 

String[] files = listFiles(new String[] {
  "*.txt", //"*.html", "*.xml", "xdocs/*.xml", 
  "src/java/test/org/apache/lucene/queryParser/*.java",
  "src/java/org/apache/lucene/index/memory/*.java",
});
 

See testMany() for details. Repeat for various analyzer, stopword and toLowerCase 
settings, such as 

boolean toLowerCase = true;
//boolean toLowerCase = false;
//Set stopWords = null;
Set stopWords = StopFilter.makeStopSet(StopAnalyzer.ENGLISH_STOP_WORDS);

Analyzer[] analyzers = new Analyzer[] { 
    //new SimpleAnalyzer(),
    //new StopAnalyzer(),
    //new StandardAnalyzer(),
    PatternAnalyzer.DEFAULT_ANALYZER,
    //new WhitespaceAnalyzer(),
    //new PatternAnalyzer(PatternAnalyzer.NON_WORD_PATTERN, false, null),
    //new PatternAnalyzer(PatternAnalyzer.NON_WORD_PATTERN, true, stopWords),
    //new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS),
};
 


 InstanciatedIndex - faster but memory consuming index
 -

 Key: LUCENE-550
 URL: http://issues.apache.org/jira/browse/LUCENE-550
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Affects Versions: 1.9
Reporter: Karl Wettin
 Attachments: class_diagram.png, class_diagram.png, 
 instanciated_20060527.tar, InstanciatedIndexTermEnum.java, 
 lucene.1.9-karl1.jpg, lucene2-karl_20060722.tar.gz, 
 lucene2-karl_20060723.tar.gz


 After fixing the bugs, it's now 4.5 - 5 times the speed. This is true for 
 both at index and query time. Sorry if I got your hopes up too much. There 
 are still things to be done though. Might not have time to do anything with 
 this until next month, so here is the code if anyone wants a peek.
 Not good enough for Jira yet, but if someone wants to fool around with it, 
 here it is. The implementation passes a TermEnum - TermDocs - Fields - 
 TermVector comparation against the same data in a Directory.
 When it comes to features, offsets don't exists and positions are stored ugly 
 and has bugs.
 You might notice that norms are float[] and not byte[]. That is me who 
 refactored it to see if it would do any good. Bit shifting don't take many 
 ticks, so I might just revert that.
 I belive the code is quite self explaining.
 InstanciatedIndex ii = ..
 ii.new InstanciatedIndexReader();
 ii.addDocument(s).. replace IndexWriter for now.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-550) InstanciatedIndex - faster but memory consuming index

2006-11-21 Thread wolfgang hoschek (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-550?page=comments#action_12451731 ] 

wolfgang hoschek commented on LUCENE-550:
-

Another question: when running the driver in test mode (checking for equality of 
query results against RAMDirectory), does InstantiatedIndex pass all tests? That 
would be great!

 InstanciatedIndex - faster but memory consuming index
 -

 Key: LUCENE-550
 URL: http://issues.apache.org/jira/browse/LUCENE-550
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Affects Versions: 1.9
Reporter: Karl Wettin
 Attachments: class_diagram.png, class_diagram.png, 
 instanciated_20060527.tar, InstanciatedIndexTermEnum.java, 
 lucene.1.9-karl1.jpg, lucene2-karl_20060722.tar.gz, 
 lucene2-karl_20060723.tar.gz


 After fixing the bugs, it's now 4.5 - 5 times the speed. This is true for 
 both at index and query time. Sorry if I got your hopes up too much. There 
 are still things to be done though. Might not have time to do anything with 
 this until next month, so here is the code if anyone wants a peek.
 Not good enough for Jira yet, but if someone wants to fool around with it, 
 here it is. The implementation passes a TermEnum - TermDocs - Fields - 
 TermVector comparation against the same data in a Directory.
 When it comes to features, offsets don't exists and positions are stored ugly 
 and has bugs.
 You might notice that norms are float[] and not byte[]. That is me who 
 refactored it to see if it would do any good. Bit shifting don't take many 
 ticks, so I might just revert that.
 I belive the code is quite self explaining.
 InstanciatedIndex ii = ..
 ii.new InstanciatedIndexReader();
 ii.addDocument(s).. replace IndexWriter for now.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-550) InstanciatedIndex - faster but memory consuming index

2006-11-21 Thread wolfgang hoschek (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-550?page=comments#action_12451730 ] 

wolfgang hoschek commented on LUCENE-550:
-

What's the benchmark configuration? For example, is throughput bounded by 
indexing or querying?  Measuring N queries against a single preindexed document 
vs. 1 precompiled query against N documents? See the line

boolean measureIndexing = false; // toggle this to measure query performance

in my driver. If measuring indexing, what kind of analyzer / token filter chain 
is used? If measuring queries, what kind of query types are in the mix, with 
which relative frequencies? 

You may want to experiment with modifying/commenting/uncommenting various parts 
of the driver setup, for any given target scenario. Would it be possible to 
post the benchmark code, test data, queries for analysis?


 InstanciatedIndex - faster but memory consuming index
 -

 Key: LUCENE-550
 URL: http://issues.apache.org/jira/browse/LUCENE-550
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Affects Versions: 1.9
Reporter: Karl Wettin
 Attachments: class_diagram.png, class_diagram.png, 
 instanciated_20060527.tar, InstanciatedIndexTermEnum.java, 
 lucene.1.9-karl1.jpg, lucene2-karl_20060722.tar.gz, 
 lucene2-karl_20060723.tar.gz


 After fixing the bugs, it's now 4.5 - 5 times the speed. This is true for 
 both at index and query time. Sorry if I got your hopes up too much. There 
 are still things to be done though. Might not have time to do anything with 
 this until next month, so here is the code if anyone wants a peek.
 Not good enough for Jira yet, but if someone wants to fool around with it, 
 here it is. The implementation passes a TermEnum - TermDocs - Fields - 
 TermVector comparation against the same data in a Directory.
 When it comes to features, offsets don't exists and positions are stored ugly 
 and has bugs.
 You might notice that norms are float[] and not byte[]. That is me who 
 refactored it to see if it would do any good. Bit shifting don't take many 
 ticks, so I might just revert that.
 I belive the code is quite self explaining.
 InstanciatedIndex ii = ..
 ii.new InstanciatedIndexReader();
 ii.addDocument(s).. replace IndexWriter for now.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: MemoryIndex

2006-05-02 Thread Wolfgang Hoschek
MemoryIndex was designed to maximize performance for a specific use  
case: pure in-memory datastructure, at most one document per  
MemoryIndex instance, any number of fields, high frequency reads,  
high frequency index writes, no thread-safety required, optional  
support for storing offsets.


I briefly considered extending it to the multi-document case, but  
eventually refrained from doing so, because I didn't really need such  
functionality myself (no itch). Here are some issues to consider when  
attempting such an extension:


- The internal datastructure would probably look quite different.
- Datastructure/algorithmic trade-offs regarding time vs. space, read
vs. write frequency, common vs. less common use cases.
- Hence, it may well turn out that there's not much to reuse.
- A priori, it isn't clear whether a new solution would be
significantly faster than normal RAMDirectory usage. Thus...
- Need a benchmark suite to evaluate the chosen trade-offs.
- Need tests to ensure correctness (in practice, meaning it behaves
just like the existing alternative).


I'd say it's a non-trivial undertaking. For example, right now, I
don't have time for such an effort. That doesn't mean it's impossible
or shouldn't be done, of course. If someone would like to run with it
that would be great, but in light of the above issues, I'd suggest
doing it in a new class (say MultiMemoryIndex or similar).


I believe Mark has done some initial work in that direction, based on
an independent (and different) implementation strategy.


Wolfgang.

On May 2, 2006, at 12:25 AM, Robert Engels wrote:

Along the lines of LUCENE-550, what about having a MemoryIndex that accepts
multiple documents, then writes the index once at the end, in the Lucene file
format (so it could be merged), during close?

When adding documents using an IndexWriter, a new segment is created for
each document, and then the segments are periodically merged in memory
and/or with disk segments. It seems that when constructing an index or
updating a lot of documents in an existing index, the write, read, merge
cycle is inefficient, and if the documents/field information were maintained
in order (TreeMaps), greater efficiency would be realized.

With a memory index, the memory needed during update will increase
dramatically, but this could still be bounded, and a disk-based index
segment written when too many documents are in the memory index (max
buffered documents).

Does this sound like an improvement? Has anyone else tried something like
this?




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Optimizing/minimizing memory usage of memory-based indexes

2006-02-11 Thread Wolfgang Hoschek


Initially it might, but probably eventually not. I was
thinking Lucene formats might also be a bit more compact
than vanilla hash maps, but I guess that depends on
many factors. But I will probably want to play with
actual queries later on, based on frequencies.


OK.




In the latter case, are you using
org.apache.lucene.store.RAMDirectory or
org.apache.lucene.index.memory.MemoryIndex?


I'm using RAMDirectory. Should I be using MemoryIndex
maybe instead (I'll check it out)?



The main constraint is that a MemoryIndex instance can only hold
*one* Lucene document (though it can have any number of fields).
MemoryIndex is designed to be a transient, throw-away data structure
for streaming / publish-subscribe use cases. If it's applicable,
MemoryIndex has better performance but worse memory consumption than
RAMDirectory. I can't tell whether that may or may not be an issue
for your case.
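
A minimal usage sketch of the single-document, throw-away pattern described
above, assuming a pre-3.x Lucene classpath (the field names, text and query
string are made up for illustration):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.memory.MemoryIndex;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Query;

  public class MemoryIndexExample {
    public static void main(String[] args) throws Exception {
      StandardAnalyzer analyzer = new StandardAnalyzer();

      // One transient, single-document index: add fields, query, throw away.
      MemoryIndex index = new MemoryIndex();
      index.addField("title", "Salmon fishing manual", analyzer);
      index.addField("abstract", "How to catch salmon in rivers", analyzer);

      Query query = new QueryParser("abstract", analyzer).parse("+salmon +rivers");
      float score = index.search(query); // 0.0f means the document does not match
      System.out.println("score=" + score);
    }
  }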


Wolfgang.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Advanced query language

2005-12-18 Thread Wolfgang Hoschek

On Dec 17, 2005, at 2:36 PM, Paul Elschot wrote:


Gentlemen,

While maintaining my bookmarks I ran into this:
Case Study: Enabling Low-Cost XML-Aware Searching
Capable of Complex Querying:
http://www.idealliance.org/papers/xmle02/dx_xmle02/papers/03-02-08/03-02-08.html


Some loose thoughts:

In the system described there a Lucene document is used for each
low level xml construct, even when it contains very few characters  
of text.

The resulting Lucene indexes are at least 2.5 times the size of the
original document, which is not a surprise given this document  
structure.

Normal index size is about one third of  the indexed text.

I don't know about the XQuery standard, but I was wondering
whether this unusual document structure and the non straightforward
fit between Lucene queries and XQuery queries are related.


Seems that a lot of metadata beyond the actual text is stored. For  
example, node type, ancestors, parent, number of children, etc., for  
each element and attribute. If the fulltext is relatively small, as  
is often the case in quite structured XML such as the shakespeare  
collection, that should significantly increase storage space.


For example, romeo and juliet goes along the following lines:

<SPEECH>
<SPEAKER>FRIAR LAURENCE</SPEAKER>
<LINE>Not in a grave,</LINE>
<LINE>To lay one in, another out to have.</LINE>
</SPEECH>

<SPEECH>
<SPEAKER>ROMEO</SPEAKER>
<LINE>I pray thee, chide not; she whom I love now</LINE>
<LINE>Doth grace for grace and love for love allow;</LINE>
<LINE>The other did not so.</LINE>
</SPEECH>

<SPEECH>
<SPEAKER>FRIAR LAURENCE</SPEAKER>
<LINE>O, she knew well</LINE>
<LINE>Thy love did read by rote and could not spell.</LINE>
<LINE>But come, young waverer, come, go with me,</LINE>
<LINE>In one respect I'll thy assistant be;</LINE>
<LINE>For this alliance may so happy prove,</LINE>
<LINE>To turn your households' rancour to pure love.</LINE>
</SPEECH>





As for the  joines and iterations over items from the stream of XML
results: iteration over matching XML constructs should be no problem
in Lucene. Joins in Lucene are normally done via boolean filters,
so I was wondering how XQuery joins fit these.


Similar to SQL. The engine constructs a logical execution plan for
the query, and rewrites it into an optimized physical plan as deemed
appropriate, perhaps guided by statistics, using a nested loop, hash
join, or any other more sophisticated strategy.


Wolfgang.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Advanced query language

2005-12-17 Thread Wolfgang Hoschek
 over matching XML constructs should be no problem
in Lucene. Joins in Lucene are normally done via boolean filters,
so I was wondering how XQuery joins fit these.
The case study above has a note a the end of par 5.3:
The Search Result list that comes back could then be organized
by document id to group together all the results for a single XML
document. This is not provided by default, but has been done with
extension to this code.

Regards,
Paul Elschot

On Friday 16 December 2005 03:45, Wolfgang Hoschek wrote:


I think implementing an XQuery Full-Text engine is far beyond the
scope of Lucene.

Implementing a building block for the fulltext aspect of it would be
more manageable. Unfortunately The W3C fulltext drafts
indiscriminately mix and mingle two completely different languages
into a single language, without clear boundaries. That's why most
practical folks implement XQuery fulltext search via extension
functions rather than within XQuery itself. This also allows for much
more detailed tokenization, configuration and extensibility than what
would be possible with the W3C draft.

Wolfgang.

On Dec 15, 2005, at 4:20 PM, [EMAIL PROTECTED] wrote:



Mark,

This is very cool. When I was at TripleHop we did something very
similar where both query and results conformed to an XML Schema and
we used XML over HTTP as our main vehicle to do remote/federated
searches with quick rendering with stylesheets.

That however is the first piece of the puzzle. If you really want
to go beyond search (in the traditional sense) and be able to
perform more complex operations such as joines and iterations over
items from the stream of XML results you are getting you should
consider implementing an XQuery Full-Text engine with Lucene
adopting the now standard XQuery language.

Here is the pointer to the working draft on the W3C working draft
on XQuery 1.0 and XPath 2.0 Full-Text:
http://www.w3.org/TR/xquery-full-text/

Now I'm part of the task force editing this draft so your comments
are very much welcomed.

-- J.D.


http://www.inperspective.com/lucene/LXQueryV0_1.zip

I've implemented just a few queries (Boolean, Term, FilteredQuery,
BoostingQuery ...) but other queries are fairly trivial to add.
At this stage I am more interested in feedback on parser design/
approach
rather than trying to achieve complete coverage of all the Lucene
Query
types or debating the choice of tag names.

Please see the readme.txt in the package for more details.

Cheers
Mark




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


While maintaining my bookmarks I ran into this:
Case Study: Enabling Low-Cost XML-Aware Searching
Capable of Complex Querying:
http://www.idealliance.org/papers/xmle02/dx_xmle02/papers/03-02-08/03-02-08.html


Some loose thoughts:

In the system described there a Lucene document is used for each
low level xml construct, even when it contains very few characters  
of text.

The resulting Lucene indexes are at least 2.5 times the size of the
original document, which is not a surprise given this document  
structure.

Normal index size is about one third of  the indexed text.

I don't know about the XQuery standard, but I was wondering
whether this unusual document structure and the non straightforward
fit between Lucene queries and XQuery queries are related.

As for the  joines and iterations over items from the stream of XML
results: iteration over matching XML constructs should be no problem
in Lucene. Joins in Lucene are normally done via boolean filters,
so I was wondering how XQuery joins fit these.
The case study above has a note a the end of par 5.3:
The Search Result list that comes back could then be organized
by document id to group together all the results for a single XML
document. This is not provided by default, but has been done with
extension to this code.

Regards,
Paul Elschot

On Friday 16 December 2005 03:45, Wolfgang Hoschek wrote:


I think implementing an XQuery Full-Text engine is far beyond the
scope of Lucene.

Implementing a building block for the fulltext aspect of it would be
more manageable. Unfortunately The W3C fulltext drafts
indiscriminately mix and mingle two completely different languages
into a single language, without clear boundaries. That's why most
practical folks implement XQuery fulltext search via extension
functions rather than within XQuery itself. This also allows for much
more detailed tokenization, configuration and extensibility than what
would be possible with the W3C draft.

Wolfgang.

On Dec 15, 2005, at 4:20 PM, [EMAIL PROTECTED] wrote:



Mark,

This is very cool. When I was at TripleHop we did something very
similar where both query and results conformed to an XML Schema and
we used XML over HTTP as our main vehicle to do remote/federated
searches with quick rendering with stylesheets.

That however is the first piece of the puzzle. If you really want

Re: Advanced query language

2005-12-15 Thread Wolfgang Hoschek
I think implementing an XQuery Full-Text engine is far beyond the  
scope of Lucene.


Implementing a building block for the fulltext aspect of it would be  
more manageable. Unfortunately The W3C fulltext drafts  
indiscriminately mix and mingle two completely different languages  
into a single language, without clear boundaries. That's why most  
practical folks implement XQuery fulltext search via extension  
functions rather than within XQuery itself. This also allows for much  
more detailed tokenization, configuration and extensibility than what  
would be possible with the W3C draft.


Wolfgang.

On Dec 15, 2005, at 4:20 PM, [EMAIL PROTECTED] wrote:


Mark,

This is very cool. When I was at TripleHop we did something very  
similar where both query and results conformed to an XML Schema and  
we used XML over HTTP as our main vehicle to do remote/federated  
searches with quick rendering with stylesheets.


That however is the first piece of the puzzle. If you really want  
to go beyond search (in the traditional sense) and be able to  
perform more complex operations such as joines and iterations over  
items from the stream of XML results you are getting you should  
consider implementing an XQuery Full-Text engine with Lucene  
adopting the now standard XQuery language.


Here is the pointer to the working draft on the W3C working draft  
on XQuery 1.0 and XPath 2.0 Full-Text:

http://www.w3.org/TR/xquery-full-text/

Now I'm part of the task force editing this draft so your comments  
are very much welcomed.


-- J.D.


http://www.inperspective.com/lucene/LXQueryV0_1.zip

I've implemented just a few queries (Boolean, Term, FilteredQuery,
BoostingQuery ...) but other queries are fairly trivial to add.
At this stage I am more interested in feedback on parser design/ 
approach
rather than trying to achieve complete coverage of all the Lucene  
Query

types or debating the choice of tag names.

Please see the readme.txt in the package for more details.

Cheers
Mark



___
How much free photo storage do you get? Store your holiday
snaps for FREE with Yahoo! Photos http://uk.photos.yahoo.com

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


http://www.inperspective.com/lucene/LXQueryV0_1.zip

I've implemented just a few queries (Boolean, Term, FilteredQuery,
BoostingQuery ...) but other queries are fairly trivial to add.
At this stage I am more interested in feedback on parser design/ 
approach
rather than trying to achieve complete coverage of all the Lucene  
Query

types or debating the choice of tag names.

Please see the readme.txt in the package for more details.

Cheers
Mark



___
How much free photo storage do you get? Store your holiday
snaps for FREE with Yahoo! Photos http://uk.photos.yahoo.com

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Advanced query language

2005-12-15 Thread Wolfgang Hoschek
Right now the Sun STAX impl is decidedly buggy compared to xerces SAX  
(and it's not faster either). The most complete, reliable and  
efficient STAX impl seems to be woodstox.


Wolfgang.

On Dec 15, 2005, at 7:22 PM, Yonik Seeley wrote:


Agreed, that is a significant downside.
StAX is included in Java 6, but that doesn't help too much given the
Java 1.4 req.

-Yonik

On 12/15/05, Wolfgang Hoschek [EMAIL PROTECTED] wrote:


STAX would probably make coding easier, but unfortunately complicates
the packaging side: one must ship at least two additional external
jars (stax interfaces and impl) for it to become usable. Plus, STAX
is quite underspecified (I wrote a STAX parser + serializer impl
lately), so there's room for runtime surprises with different impls.
The primary advantage of SAX is that everything is included in JDK >=
1.4, and that impls tend to be more mature. SAX bottom line: more
hassle early on, less hassle later.

Wolfgang.

On Dec 15, 2005, at 5:47 PM, Yonik Seeley wrote:



On 12/15/05, markharw00d [EMAIL PROTECTED] wrote:



At this stage I am more interested in feedback on parser design/
approach




Excellent idea.
While SAX is fast, I've found callback interfaces more difficult to
deal with while generating nested object graphs... it normally
requires one to maintain state in stack(s).

Have you considered a pull-parser like StAX or XPP?  They are as  
fast
as SAX, and allow you to ask for the next XML event you are  
interested
in, eliminating the need to keep track of where you are by other  
means

(the place in your own code and normal variables do that).  It
normally turns into much more natural code.

-Yonik



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Advanced query language

2005-12-06 Thread Wolfgang Hoschek
That's basically what I'm implementing with Nux, except that the  
syntax and calling conventions are a bit different, and that Lucene  
analyzers can optionally be specified, which makes it a lot more  
powerful (but also a bit more complicated).


Wolfgang.

On Dec 6, 2005, at 10:48 AM, Incze Lajos wrote:


Maybe, I'm a bit late with this, but.

There is an ongoing effort at w3c to define a fulltext
search language that could extend their xpath and xquery
languages (which clearly makes sense).

These are the current documents on the topic:

http://www.w3.org/TR/2005/WD-xquery-full-text-20051103/
http://www.w3.org/TR/2005/WD-xmlquery-full-text-use-cases-20051103/

incze

(This case, the query language itself is not xml, as has to
serve as a selection criteria in an xpath or xquery expression,
but xml conform, so may be embedded in any xml doc.)

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Advanced query language

2005-12-05 Thread Wolfgang Hoschek


Hopefully that makes sense to someone besides just me.  It's certainly a
lot more complexity than a simple one-to-one mapping, but it seems to me
like the flexibility is worth spending the extra time to design/build it.




Makes perfect sense to me, and it doesn't seem any more complex than  
what's been proposed before. Actually, this may be a quite  
straightforward, compact and extensible way of doing it all.


Though, I'd be careful with proposing a variety of equivalent  
syntaxes as it may easily lead to more confusion than good. Let's  
start with one canonical syntax. If desired, other (more pleasant)  
syntaxes may then be converted to that as part of a preprocessing step.


Wolfgang.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Advanced query language

2005-12-05 Thread Wolfgang Hoschek


Hopefully that makes sense to someone besides just me.  It's certainly a
lot more complexity than a simple one-to-one mapping, but it seems to me
like the flexibility is worth spending the extra time to design/build it.





Makes perfect sense to me, and it doesn't seem any more complex  
than what's been proposed before. Actually, this may be a quite  
straightforward, compact and extensible way of doing it all.


Though, I'd be careful with proposing a variety of equivalent  
syntaxes as it may easily lead to more confusion than good. Let's  
start with one canonical syntax. If desired, other (more pleasant)  
syntaxes may then be converted to that as part of a preprocessing  
step.




I should add that I'd love to see a powerful, extensible yet easy to  
read XML based query syntax, and make that available to users of  
XQuery fulltext search.


Here is an example fulltext XQuery that finds all books authored by  
James that have something to do with 'salmon fishing manuals', sorted  
by relevance


declare namespace lucene = "java:nux.xom.pool.FullTextUtil";

declare variable $query := "+salmon~ +fish* manual~";
(: any arbitrary Lucene query can go here :)
(: declare variable $query as xs:string external; :)

for $book in /books/book[author="James" and lucene:match(abstract, $query) > 0.0]
let $score := lucene:match($book/abstract, $query)
order by $score descending
return $book


Now, instead of handing a quite limited lucene query string to  
lucene:match($query), as above, I'd love to pass it an XML query  
blurb that makes all of lucene's power accessible without the user  
having to construct query objects himself.
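
For contrast, a rough sketch of what constructing roughly the same query
programmatically looks like (assuming the old BooleanQuery.add(Query, required,
prohibited) signature; the field name and terms are illustrative only):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.FuzzyQuery;
  import org.apache.lucene.search.PrefixQuery;

  public class ProgrammaticQueryExample {
    // Roughly the query string "+salmon~ +fish* manual~" built by hand.
    public static BooleanQuery buildSalmonQuery() {
      BooleanQuery q = new BooleanQuery();
      q.add(new FuzzyQuery(new Term("abstract", "salmon")), true, false);  // required
      q.add(new PrefixQuery(new Term("abstract", "fish")), true, false);   // required
      q.add(new FuzzyQuery(new Term("abstract", "manual")), false, false); // optional
      return q;
    }
  }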


Consider it an additional use case beyond what Erik and others  
brought up so far...


Wolfgang.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: open source YourKit licence

2005-12-02 Thread Wolfgang Hoschek
Yonik, I haven't been terribly active lately, but I've been voted in  
as committer as well... :-)


http://marc.theaimsgroup.com/?l=lucene-dev&w=2&r=1&s=hoschek+committer&q=b


Cheers,
Wolfgang.

On Dec 2, 2005, at 2:53 PM, Yonik Seeley wrote:


~yonik/yourkit/



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene does NOT use UTF-8.

2005-08-31 Thread Wolfgang Hoschek

On Aug 30, 2005, at 12:47 PM, Doug Cutting wrote:


Yonik Seeley wrote:

I've been looking around... do you have a pointer to the source  
where just the suffix is converted from UTF-8?
I understand the index format, but I'm not sure I understand the  
problem that would be posed by the prefix length being a byte count.




TermBuffer.java:66

Things could work fine if the prefix length were a byte count.  A  
byte buffer could easily be constructed that contains the full byte  
sequence (prefix + suffix), and then this could be converted to a  
String.  The inefficiency would be if prefix were re-converted from  
UTF-8 for each term, e.g., in order to compare it to the target.   
Prefixes are frequently longer than suffixes, so this could be  
significant.  Does that make sense?  I don't know whether it would  
actually be significant, although TermBuffer.java was added  
recently as a measurable performance enhancement, so this is  
performance critical code.


We need to stop discussing this in the abstract and start coding  
alternatives and benchmarking them.  Is  
java.nio.charset.CharsetEncoder fast enough?  Will moving things  
through CharBuffer and ByteBuffer be too slow?  Should Lucene keep  
maintaining its own UTF-8 implementation for performance?  I don't  
know, only some experiments will tell.


Doug



I don't know if it matters for Lucene usage. But if using  
CharsetEncoder/CharBuffer/ByteBuffer should turn out to be a  
significant problem, it's probably due to startup/init time of these  
methods for individually converting many small strings, not  
inherently due to UTF-8 usage. I'm confident that a custom UTF-8  
implementation can almost completely eliminate these issues. I've  
done this before for binary XML with great success, and it could  
certainly be done for lucene just as well. Bottom line: It's probably  
an issue that can be dealt with via proper impl; it probably  
shouldn't dictate design directions.
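
For illustration, here is a rough sketch (not Lucene code; the class name,  
buffer sizing and error handling are made up) of how reusing a single  
CharsetEncoder with pre-allocated buffers avoids most of the per-string  
setup cost that dominates when many small strings are converted individually:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public final class Utf8Encoder {

    private final CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
    private ByteBuffer bytes = ByteBuffer.allocate(256); // grown on demand

    /** Encodes the string into the internal buffer; returns the number of UTF-8 bytes.
     *  (Malformed surrogate handling is ignored here for brevity.) */
    public int encode(String s) {
        if (bytes.capacity() < 3 * s.length()) {          // 3 bytes is the worst case per UTF-16 code unit
            bytes = ByteBuffer.allocate(3 * s.length() + 16);
        }
        bytes.clear();
        encoder.reset();
        CharBuffer chars = CharBuffer.wrap(s);
        encoder.encode(chars, bytes, true);               // cannot overflow given the sizing above
        encoder.flush(bytes);
        return bytes.position();
    }
}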


Wolfgang.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[ANN] Nux-1.3 released

2005-08-03 Thread Wolfgang Hoschek

The Nux-1.3 release has been uploaded to

http://dsd.lbl.gov/nux/

Nux is an open-source Java toolkit making efficient and powerful XML  
processing easy.



Changelog:

•Upgraded to saxonb-8.5 (saxon-8.4 and 8.3 should continue  
to work as well).


•Upgraded to xom-1.1-rc1 (with compatible performance  
patches). Plain xom-1.0 should continue to work as well, albeit less  
efficiently.


•Numerous bnux Binary XML performance enhancements for  
serialization and deserialization (UTF-8 character encoding, buffer  
management, symbol table, pack sorting, cache locality, etc).  
Overall, bnux is now about twice as fast, and, perhaps more  
importantly, has a much more uniform performance profile, no matter  
what kind of document flavour is thrown at it. It routinely delivers  
50-100 MB/sec deserialization performance, and 30-70 MB/sec  
serialization performance (commodity PC 2004). It is roughly 5-10  
times faster than xom-1.1 with xerces-2.7.1 (which, in turn, is  
faster than saxonb-8.5, dom4j-1.6.1 and xerces-2.7.1 DOM). Further,  
preliminary measurements indicate bnux deserialization and  
serialization to be consistently 2-3 times faster than Sun's  
FastInfoSet implementation, using XOM. Saxon's PTree could not be  
tested as it is only available in the commercial version. The only  
remaining area with substantial potential for performance improvement  
seems to be complex namespace handling. This might be addressed by  
slightly restructuring private XOM internals in a future version.


•BinaryXMLTest now also has command line support for testing  
and benchmarking Saxon, DOM and FastInfoSet (besides bnux and XOM).


•Rewrote XQueryCommand. The new nux/bin/fire-xquery is a  
more powerful, flexible and reliable command line test tool that runs  
a given XQuery against a set of files and prints the result sequence.  
In addition, it supports schema validation, XInclude (via XOM), an  
XQuery update facility, malformed HTML parsing (via TagSoup) and much  
more. It's available for Unix and Windows, and works like any other  
decent Unix command line tool.


•Removed ValidationCommand (made obsolete by the fire-xquery  
functionality).


•Added experimental XQuery in-place update functionality.  
Comments on the usefulness of the current behaviour are especially  
welcome, as are suggestions for potential improvements.


•Added nux.xom.xquery.ResultSequenceSerializer, which  
serializes an XQuery/XPath2 result sequence onto a given output  
stream, using various configurable serialization options such  
encoding and indentation. Implements the W3C XQuery/XSLT2  
Serialization Draft Spec. Also implements an alternative wrapping  
algorithm that ensures that any arbitrary result sequence can always  
be output as a well-formed XML document.


•Added XQueryFactory.createXQuery(File file, URI baseURI)  
and XQueryPool.getXQuery(File file, URI baseURI) to allow for  
separation of the location of the query file and input XML files.


•The default XQuery DocumentURIResolver now recognizes the  
.bnux file extension as binary XML, and parses it accordingly. For  
example, a query can be 'doc("samples/data/articles.xml.bnux")/articles/*'


•Added FileUtil.listFiles(). Returns the URIs of all files  
whose path matches at least one of the given inclusion wildcard or  
regular expressions but none of the given exclusion wildcard or  
regular expressions; starting from the given directory, optionally  
with recursive directory traversal, insensitive to underlying  
operating system conventions.


•XOMUtil.Normalizer now uses XML whitespace definition  
rather than Java whitespace definition.


•Added XOMUtil.Normalizer.STRIP, which removes Texts that  
consist of whitespace-only (boundary whitespace), retaining other  
strings unchanged.


•Added AnalyzerUtil.getPorterStemmerAnalyzer() for English  
language stemming on full text search.


•Added XOMUtil.toDocument(String xml) convenience method to  
parse a string.


•Moved XOMUtil.toByteArray() and XOMUtil.toString() into  
class FileUtil. The old methods remain available but have been  
deprecated.


•Added jar-bnux ant target to optionally build a minimal  
jar file (20 KB) for binary XML only.


•Added more test documents to samples/data directory.

•Updated license blurbs to 2005.
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Analyzer as an Interface?

2005-07-19 Thread Wolfgang Hoschek

On Jul 19, 2005, at 12:58 PM, Daniel Naber wrote:


Hi,

currently Analyzer is an abstract class. Shouldn't we make it an  
Interface?
Currently that's not possible, but it will be as soon as the  
deprecated

method is removed (i.e. after Lucene 1.9).

Regards
 Daniel



Daniel, what's the use case that would make this a significant  
improvement over extending and overriding the single abstract method?  
Classes that implement multiple interfaces? For consistency, similar  
thoughts would apply to TokenStream, IndexReader/Writer, etc. Also  
note that once it's become an interface the API is effectively frozen  
forever. With abstract classes the option remains open to later add  
methods with a default impl. (e.g. tokenStream(String fieldName,  
String text) or whatever).
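
For illustration, a minimal sketch of the usual subclassing (assuming the  
1.4-era API); with an abstract class such a subclass keeps compiling even if  
a default method like tokenStream(String, String) is added later:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.TokenStream;

public class MyAnalyzer extends Analyzer {
    // the single method a custom analyzer has to provide:
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new LowerCaseTokenizer(reader); // splits on letter runs, lowercases
    }
}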


Thanks,
Wolfgang.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene vs. Ruby/Odeum

2005-06-02 Thread Wolfgang Hoschek

 poor java startup time

For the one's really keen on reducing startup time the Jolt Java VM  
daemon may perhaps be of some interest:

http://www.dystance.net/software/jolt/index.html

I played with it a year ago when I was curious to see what could be  
done about startup time in the context of simple unix-scriptable  
command line XML webservice clients (the ones that require tons of  
jars as dependencies and take ages to initialize). Startup time went  
from 3-5 secs to zero. Feels like ls - you hit ENTER and the  
program completes *instantly*. Of course there's a catch. It requires  
some more work, and it's not a general solution wrt. isolation,  
security, reliability, etc. but for a simple command line lucene  
query tool it might just do fine, FWIW.


Long-term Sun's MVM might be a more comprehensive solution, with some  
luck.


Wolfgang.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene vs. Ruby/Odeum

2005-06-01 Thread Wolfgang Hoschek
As an aside, in my performance testing of Lucene using JProfiler,  
it seems
to me that the only way to improve Lucene's performance greatly can  
come

from 2 areas

1. optimizing the JVM array/looping/JIT constructs/capabilities to  
avoid

bounds checking/improve performance
2. improve function call overhead

Other than that, other changes will require a significant change in  
the code

structure (manually unrolling loops), at the sacrifice of
readability/maintainability.



Just curious: are you more happy with JProfiler than with the JDK 1.5  
profiler?


I haven't used JProfiler in quite a while but my impression back then  
was that its overheads tend to significantly perturb measurement  
results. When I switched to the low-level JDK 1.5 profiler, CPU tuning  
efforts got a lot more targeted and meaningful.


So, in my experience, the least perturbing and most accurate profiler  
is the one built into JDK 1.5: run java with the
'-server -agentlib:hprof=cpu=samples,depth=10' flags for long enough  
to collect enough samples to be statistically meaningful, then study  
the trace log and correlate its hotspot trailer with its call stack  
headers (grep is your friend, a GUI isn't really needed). For a  
background article on hprof see
http://java.sun.com/developer/technicalArticles/Programming/HPROF.html


Wolfgang.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: contrib/queryParsers/surround

2005-05-28 Thread Wolfgang Hoschek
Cool stuff. Once this has stabilized and settled down I might start  
exposing the surround language from XQuery/XPath as an experimental  
match facility.


Wolfgang.

On May 28, 2005, at 10:07 AM, Paul Elschot wrote:





On Saturday 28 May 2005 17:06, Erik Hatcher wrote:






On May 28, 2005, at 10:04 AM, Paul Elschot wrote:





Dear readers,

I've started moving the surround query language
http://issues.apache.org/bugzilla/show_bug.cgi?id=34331
into the directory named by the title in my working copy of the  
lucene

trunk. When the tests pass I'll repost it there.
In case someone  needs this earlier, please holler.






As for naming conventions and where this should live in contrib,
consider that a user will only want a single query parser and more
than that would be unneeded bloat in her application.  The contrib
pieces are all packaged as a separate JAR per directory under  
contrib.


My recommendation would be to put your wonderful surround parser and
supporting infrastructure under contrib/surround.

I'm very much looking forward to having this available!






Meanwhile the tests pass again with some expected standard output.

A little bit of deprecation is left in the CharStream (getLine and
getColumn) in the parser. Would you have any idea how to deal with  
that?


I'll leave the build.xml stand alone with constants for the  
environment.

It was derived from a lucene build.xml of a few eons ago, so
I hope someone can still integrate it...

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]










-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[ANN] nux-1.2 release

2005-05-25 Thread Wolfgang Hoschek

The nux-1.2 release has been uploaded to

http://dsd.lbl.gov/nux/

Nux is an open-source Java XML toolset geared towards embedded use in  
high-throughput XML messaging middleware such as large-scale Peer-to- 
Peer infrastructures, message queues, publish-subscribe and  
matchmaking systems for Blogs/newsfeeds, text chat, data acquisition  
and distribution systems, application level routers, firewalls,  
classifiers, etc. It is not an XML database, and does not attempt to  
be one.



Changelog:

XQuery/XPath: Added optional fulltext search via Apache Lucene  
engine. Similar to Google search, it is easy to use, powerful,  
efficient and goes far beyond what can be done with standard XPath  
regular expressions and string manipulation functions. It is similar  
in intent but not directly related to preliminary W3C fulltext search  
drafts. Rather than targetting fulltext search of infrequent queries  
over huge persistent data archives (historic search), Nux targets  
fulltext search of huge numbers of queries over comparatively small  
transient realtime data (prospective search). See FullTextUtil and  
MemoryIndex.


Example fulltext XQuery that finds all books authored by James that  
have something to do with 'salmon fishing manuals', sorted by relevance:


declare namespace lucene = "java:nux.xom.pool.FullTextUtil";
declare variable $query := "+salmon~ +fish* manual~";
(: any arbitrary Lucene query can go here :)
(: declare variable $query as xs:string external; :)
for $book in /books/book[author="James" and lucene:match(abstract, $query) > 0.0]
let $score := lucene:match($book/abstract, $query)
order by $score descending
return $book


Example fulltext XQuery that matches on extracted sentences:

declare namespace lucene = "java:nux.xom.pool.FullTextUtil";
for $book in /books/book
for $s in lucene:sentences($book/abstract, 0)
return
if (lucene:match($s, "+salmon~ +fish* manual~") > 0.0)
then normalize-space($s)
else ()

It is designed to enable maximum efficiency for on-the-fly  
matchmaking combining structured and fuzzy fulltext search in  
realtime streaming applications such as XQuery based XML message  
queues, publish-subscribe systems for Blogs/newsfeeds, text chat,  
data acquisition and distribution systems, application level routers,  
firewalls, classifiers, etc.


Arbitrary Lucene fulltext queries can be run from Java or from XQuery/ 
XPath/XSLT via a simple extension function. The former approach is  
more flexible whereas the latter is more convenient. Lucene analyzers  
can split on whitespace, normalize to lower case for case  
insensitivity, ignore common terms with little discriminatory value  
such as "he", "in", "and" (stop words), reduce the terms to their  
natural linguistic root form such as "fishing" being reduced to  
"fish" (stemming), resolve synonyms/inflexions/thesauri (upon  
indexing and/or querying), etc. Also see Lucene Query Syntax as well  
as Query Parser Rules.


Background: The first prototype was put together over the weekend.  
The functionality worked just fine, except that it took ages to index  
and search text in a high-frequency environment. Subsequently I wrote  
a complete reimplementation of the Lucene interfaces and contributed  
that back to Lucene (the bits in org.apache.lucene.index.memory.*).  
Next, I placed a smart cache in front of it (the bits in  
nux.xom.pool.FullTextUtil / FullTextPool). The net effect is that  
fulltext queries over realtime data now run some three orders of  
magnitude faster while preserving the same general functionality  
(e.g. 10-50 queries/sec ballpark). In fact, you'll probably  
notice little or no overhead when adding fulltext search to your  
streaming apps. See MemoryIndexBenchmark and XQueryBenchmark.
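
For illustration, a rough sketch of the prospective search pattern described  
above, using the MemoryIndex contributed to Lucene (the field name, sample  
text and analyzer choice are just assumptions for the example):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class ProspectiveSearchExample {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        // the saved/standing query, parsed once:
        Query query = QueryParser.parse("+salmon~ +fish* manual~", "abstract", analyzer);

        // index one transient message on the fly and score it against the query:
        MemoryIndex index = new MemoryIndex();
        index.addField("abstract", "salmon fishing manuals for beginners", analyzer);

        float score = index.search(query); // > 0.0 means the message matches
        System.out.println("score=" + score);
    }
}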


Explore and enjoy, perhaps using the queries and sample data from the  
samples/fulltext directory as a starting point.


Wolfgang.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Add Term.createTerm to avoid 99% of String.intern() calls

2005-05-18 Thread Wolfgang Hoschek
For the MemoryIndex, I'm seeing large performance overheads due to  
repetitive temporary string interning of o.a.l.index.Term.
For example, consider a FuzzyTermQuery or similar, scanning all terms  
via TermEnum in the index: 40% of the time is spent in String.intern 
() of new Term(). [Allocating temporary memory and  
FuzzyTermEnum.termCompare are less of a problem according to profiling].

Note that the field name would only need to be interned once, not  
time and again for each term. But the non-interning Term constructor  
is private and hence not accessible from o.a.l.index.memory.*.  
TermBuffer isn't what I'm looking for, and it's private anyway. The  
best solution I came up with is to have an additional safe public  
method in Term.java:

  /** Constructs a term with the given text and the same interned  
field name as
   * this term (minimizes interning overhead). */
  public Term createTerm(String txt) { // WH
  return new Term(field, txt, false);
  }

Besides dramatically improving performance, this has the benefit of  
keeping the non-interning constructor private.
Comments/opinions, anyone?

Here's a sketch of how it can be used:
public Term term() {
...
if (cachedTerm == null)
cachedTerm = new Term((String) sortedFields[j].getKey(), "");
return cachedTerm.createTerm((String) info.sortedTerms[i].getKey());
}

public boolean next() {
...
if (...) cachedTerm = null;
}
I'll send the full patch for MemoryIndex if this is accepted.
Wolfgang.
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene vs. Ruby/Odeum

2005-05-17 Thread Wolfgang Hoschek
Right. One doesn't need to run those benchmarks to immediately see  
that most time is spent in VM startup, class loading, hotspot  
compilation rather than anything Lucene related. Even a simple  
System.out.println(hello) typically takes some 0.3 secs on a fast  
box and JVM.

Wolfgang.
On May 17, 2005, at 7:33 AM, Scott Ganyo wrote:
Interesting, but questionable.  I can imagine three problems with  
the write-up just off-hand:

1) JVM startup time.  As the author noted, this can be an issue  
with short-running Java applications.

2) JVM warm-up time.  The HotSpot VM is designed to optimize itself  
and become faster over time rather than being the fastest right out  
of the blocks.

3) Data access patterns.  It is possible (I don't know) that Odeum  
is designed for quick one-time search on the data without reading  
and caching the index like Lucene does for subsequent queries.

In each case, there is a common theme:  Lucene and Java are  
designed to perform better for longer-running applications... not  
start, lookup, and terminate utilities.

S
On May 16, 2005, at 9:41 PM, Otis Gospodnetic wrote:

Some interesting stuff...
http://www.zedshaw.com/projects/ruby_odeum/performance.html
http://blog.innerewut.de/articles/2005/05/16/ruby-odeum-vs-apache- 
lucene

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: [Performance] Streaming main memory indexing of single strings

2005-05-03 Thread Wolfgang Hoschek
Here's a performance patch for MemoryIndex.MemoryIndexReader that 
caches the norms for a given field, avoiding repeated recomputation of 
the norms. Recall that, depending on the query, norms() can be called 
over and over again with mostly the same parameters. Thus, replace 
public byte[] norms(String fieldName) with the following code:

		/** performance hack: cache norms to avoid repeated expensive 
calculations */
		private byte[] cachedNorms;
		private String cachedFieldName;
		private Similarity cachedSimilarity;
		
		public byte[] norms(String fieldName) {
			byte[] norms = cachedNorms;
			Similarity sim = getSimilarity();
			if (fieldName != cachedFieldName || sim != cachedSimilarity) { // 
not cached?
Info info = getInfo(fieldName);
int numTokens = info != null ? info.numTokens : 0;
float n = sim.lengthNorm(fieldName, numTokens);
byte norm = Similarity.encodeNorm(n);
norms = new byte[] {norm};

cachedNorms = norms;
cachedFieldName = fieldName;
cachedSimilarity = sim;
if (DEBUG) System.err.println("MemoryIndexReader.norms: " + 
fieldName + ":" + n + ":" + norm + ":" + numTokens);
			}
			return norms;
		}

The effect can be substantial when measured with the profiler, so it's 
worth it.
Wolfgang.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


contrib: keywordTokenStream

2005-05-03 Thread Wolfgang Hoschek
Here's a convenience add-on method to MemoryIndex. If it turns out that 
this could be of wider use, it could be moved into the core analysis 
package. For the moment the MemoryIndex might be a better home. 
Opinions, anyone?

Wolfgang.
	/**
	 * Convenience method; Creates and returns a token stream that 
generates a
	 * token for each keyword in the given collection, as is, without any
	 * transforming text analysis. The resulting token stream can be fed 
into
	 * {@link #addField(String, TokenStream)}, perhaps wrapped into another
	 * {@link org.apache.lucene.analysis.TokenFilter}, as desired.
	 *
	 * @param keywords
	 *the keywords to generate tokens for
	 * @return the corresponding token stream
	 */
	public TokenStream keywordTokenStream(final Collection keywords) {
		if (keywords == null)
			throw new IllegalArgumentException("keywords must not be null");
		
		return new TokenStream() {
			Iterator iter = keywords.iterator();
			int pos = 0;
			int start = 0;
			public Token next() {
if (!iter.hasNext()) return null;

Object obj = iter.next();
if (obj == null)
	throw new IllegalArgumentException("keyword must not be null");

String term = obj.toString();
Token token = new Token(term, start, start + term.length());
start += term.length() + 1; // separate words by 1 (blank) character
pos++;
return token;
			}
		};
	}

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: contrib: keywordTokenStream

2005-05-03 Thread Wolfgang Hoschek
On May 3, 2005, at 5:26 PM, Erik Hatcher wrote:
Wolfgang,
I've now added this.
Thanks :-)
I'm not seeing how this could be generally useful.  I'm curious how 
you are using it and why it is better suited for what you're doing 
than any other analyzer.

keyword tokenizer is a bit overloaded terminology-wise, though - 
look in the contrib/analyzers/src/java area to see what I mean.

Erik
The difference between this and the KeywordTokenizer from the 
contrib/analyzer is that it

- can operate on multiple keywords rather than just a single one. So 
it's slightly more general.
- Takes a collection (typically of String values) as an input rather 
than a Reader. I can see the java.io.Reader scalability rationale used 
throughout the analysis APIs, but for many use cases (including my own) 
Strings are a lot handier (and more efficient to deal with) - the 
string values are small anyway.

So it's a convenient way to add terms (keywords if you like) that have 
been parsed/massaged into string(s) by some existing external means 
(e.g. grouped regex scanning of legacy formatted text files into 
various fields, etc) into an index as is, without any further 
transforming analysis. Most folks could write such a (non-essential) 
utility themselves but it's handy in a similar way that you have the 
Field.Keyword convenience infrastructure...
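
For illustration, a minimal usage sketch (the field name and keyword values  
are made up for the example):

import java.util.Arrays;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.memory.MemoryIndex;

public class KeywordTokenStreamExample {
    public static void main(String[] args) {
        MemoryIndex index = new MemoryIndex();
        // keywords already parsed/massaged by some external means:
        TokenStream stream = index.keywordTokenStream(
            Arrays.asList(new String[] {"LBL", "HEP", "datagrid"}));
        index.addField("tags", stream); // added as-is, no further transforming analysis
    }
}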

keyword tokenizer is a bit overloaded terminology-wise, though
If you come up with a better name feel free to rename it.
Wolfgang.
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: [Performance] Streaming main memory indexing of single strings

2005-05-02 Thread Wolfgang Hoschek
I'm looking at it right now. The tests pass fine when you put  
lucene-1.4.3.jar instead of the current lucene onto the classpath which  
is what I've been doing so far. Something seems to have changed in the  
scoring calculation. No idea what that might be. I'll see if I can find  
out.

Wolfgang.
The test case is failing (type ant test at the contrib/memory  
working directory) with this:

[junit] Testcase:  
testMany(org.apache.lucene.index.memory.MemoryIndexTest): Caused an  
ERROR
[junit] BUG DETECTED:69 at query=term AND NOT phrase term,  
file=src/java/org/apache/lucene/index/memory/MemoryIndex.java,  
[EMAIL PROTECTED]
[junit] java.lang.IllegalStateException: BUG DETECTED:69 at  
query=term AND NOT phrase term,  
file=src/java/org/apache/lucene/index/memory/MemoryIndex.java,  
[EMAIL PROTECTED]
[junit] at  
org.apache.lucene.index.memory.MemoryIndexTest.run(MemoryIndexTest.java 
:305)
[junit] at  
org.apache.lucene.index.memory.MemoryIndexTest.testMany(MemoryIndexTest 
.java:228)


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: [Performance] Streaming main memory indexing of single strings

2005-05-02 Thread Wolfgang Hoschek
This is what I have as scoring calculation, and it seems to do exactly  
what lucene-1.4.3 does because the tests pass.

		public byte[] norms(String fieldName) {
			if (DEBUG) System.err.println("MemoryIndexReader.norms: " + fieldName);
			Info info = getInfo(fieldName);
			int numTokens = info != null ? info.numTokens : 0;
			byte norm =  
Similarity.encodeNorm(getSimilarity().lengthNorm(fieldName,  
numTokens));
			return new byte[] {norm};
		}
	
		public void norms(String fieldName, byte[] bytes, int offset) {
			if (DEBUG) System.err.println("MemoryIndexReader.norms: " + fieldName + "*");
			byte[] norms = norms(fieldName);
			System.arraycopy(norms, 0, bytes, offset, norms.length);
		}

		private Similarity getSimilarity() {
			return searcher.getSimilarity(); // this is the normal lucene  
IndexSearcher
		}
		

Can anyone see what's wrong with it for lucene current SVN? Should my  
calculation now be done differently? If so, how?
Thanks for any clues into the right direction.
Wolfgang.

On May 2, 2005, at 9:05 AM, Wolfgang Hoschek wrote:
I'm looking at it right now. The tests pass fine when you put  
lucene-1.4.3.jar instead of the current lucene onto the classpath  
which is what I've been doing so far. Something seems to have changed  
in the scoring calculation. No idea what that might be. I'll see if I  
can find out.

Wolfgang.
The test case is failing (type ant test at the contrib/memory  
working directory) with this:

[junit] Testcase:  
testMany(org.apache.lucene.index.memory.MemoryIndexTest): Caused an  
ERROR
[junit] BUG DETECTED:69 at query=term AND NOT phrase term,  
file=src/java/org/apache/lucene/index/memory/MemoryIndex.java,  
[EMAIL PROTECTED]
[junit] java.lang.IllegalStateException: BUG DETECTED:69 at  
query=term AND NOT phrase term,  
file=src/java/org/apache/lucene/index/memory/MemoryIndex.java,  
[EMAIL PROTECTED]
[junit] at  
org.apache.lucene.index.memory.MemoryIndexTest.run(MemoryIndexTest.jav 
a:305)
[junit] at  
org.apache.lucene.index.memory.MemoryIndexTest.testMany(MemoryIndexTes 
t.java:228)


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: [Performance] Streaming main memory indexing of single strings

2005-05-02 Thread Wolfgang Hoschek
Yes, the svn trunk uses skipTo more often than 1.4.3.
However, your implementation of skipTo() needs some improvement.
See the javadoc of skipTo of class Scorer:
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/ 
Scorer.html#skipTo(int)
What's wrong with the version I sent? Remember that there can be at most  
one document in a MemoryIndex. So the target parameter can safely be  
ignored, as far as I can see.

In case the underlying scorers provide skipTo() it's even better to  
use that.

The version I sent returns in O(1), if performance was your concern. Or  
did you mean something else?

Wolfgang.
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: [Performance] Streaming main memory indexing of single strings

2005-05-02 Thread Wolfgang Hoschek
The version I sent returns in O(1), if performance was your concern. 
Or
did you mean something else?
Since 0 is the only document number in the index, a
return target == 0;
might be nice for skipTo(). It doesn't really help performance, though,
and the next() works just as well.
Regards,
Paul Elschot.

It's not just return target == 0. Internally next() switches a 
hasNext flag to false, and that makes it a safer operation...
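
To illustrate the point, a tiny hypothetical sketch (names made up) of such a  
single-document enumerator's next()/skipTo():

/** Sketch only: there is at most one document (doc 0); next() flips a flag so
 *  the document is returned at most once, and skipTo() delegates to it. */
class SingleDocTermDocs {
    private boolean hasNext = true;

    public boolean next() {
        if (!hasNext) return false;
        hasNext = false;   // the single document has now been consumed
        return true;
    }

    public boolean skipTo(int target) {
        // only doc 0 exists, so any target > 0 can never be matched
        return target <= 0 && next();
    }

    public int doc() { return 0; }
}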

BTW, did you give the unit tests a shot? Or even better, run it against 
some of your own queries/test data? That might help to shake out other 
bugs that might potentially be lurking in remote corners...

Cheers,
Wolfgang.
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: [Performance] Streaming main memory indexing of single strings

2005-05-02 Thread Wolfgang Hoschek
Thanks!
Wolfgang.
I've committed this change after it successfully worked for me.
Thanks!
Erik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


[Patch] IndexReader.finalize() performance

2005-04-28 Thread Wolfgang Hoschek
Here is the first and most high-priority patch I've settled on to get  
Lucene to work efficiently for the typical usage scenarios of  
MemoryIndex. More patches are forthcoming if this one is received  
favourably...

There's large overhead involved in forcing all IndexReader impls to  
have a finalize() method.  Remember that allocating and registering  
finalizable objects in a JVM isn't cheap at all when it's done at high  
frequency, which is the case for my single document MemoryIndex usage.  
MemoryIndex.createSearcher() does a new MemoryIndexReader() which is a  
subclass of IndexReader and thus carries what for this case amounts to  
unnecessary IndexReader superclass baggage.

The proposal is to rename IndexReader.finalize() to  
IndexReader.doFinalize(), and for each subclass of IndexReader that  
wants or needs finalization add a method

XYZReader.finalize() { doFinalize(); }
That way subclasses are not forced to be finalizable and incur the  
associated overheads. Note that it would not help to simply have an  
empty finalize() {} method, because that would still incur the  
finalizer JVM registration costs.

[The other option would be to have IndexReader be an interface, but  
that would be a change that's a lot more involved]
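
For illustration, a rough sketch of the proposed pattern (class names are  
made up for the example):

// no finalize() in the base class, so subclasses are not registered with
// the JVM finalizer queue unless they explicitly opt in:
public abstract class IndexReader {

    /** Releases resources; what the old finalize() used to do. */
    protected void doFinalize() {
        // ... release write locks, close files, etc. ...
    }
}

class DiskBackedReader extends IndexReader {
    // a reader that really needs finalization opts in explicitly:
    protected void finalize() { doFinalize(); }
}

class MemoryBackedReader extends IndexReader {
    // nothing to finalize: no finalize() method, hence no per-instance
    // finalizer-registration cost
}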

Here are two test runs without and with the patch:
[grolsch /home/portnoy/u5/hoschek/tmp/tmp/firefish] cat xjames.txt
James is out in the woods
** NOW WITHOUT THE PATCH APPLIED: *
[grolsch /home/portnoy/u5/hoschek/tmp/tmp/firefish] bin/fire-java  
org.apache.lucene.index.memory.MemoryIndexTest 3 100 mem James  
xjames.txt
### iteration=0

*** FILE=xjames.txt
secs = 15.046
queries/sec= 66462.85
MB/sec = 1.6479818
### iteration=1
*** FILE=xjames.txt
secs = 15.507
queries/sec= 64487.008
MB/sec = 1.5989896
### iteration=2
*** FILE=xjames.txt
secs = 15.923
queries/sec= 62802.234
MB/sec = 1.5572149
Done benchmarking (without checking correctness).
Dumping CPU usage by sampling running threads ... done.
[grolsch /home/portnoy/u5/hoschek/tmp/tmp/firefish]

** NOW WITH THE PATCH APPLIED: *
[grolsch /home/portnoy/u5/hoschek/tmp/tmp/firefish] bin/fire-java  
org.apache.lucene.index.memory.MemoryIndexTest 3 100 mem James  
xjames.txt
### iteration=0

*** FILE=xjames.txt
secs = 4.974
queries/sec= 201045.44
MB/sec = 4.9850287
### iteration=1
*** FILE=xjames.txt
secs = 4.495
queries/sec= 222469.42
MB/sec = 5.5162477
### iteration=2
*** FILE=xjames.txt
secs = 4.49
queries/sec= 222717.16
MB/sec = 5.5223904
Done benchmarking (without checking correctness).
Dumping CPU usage by sampling running threads ... done.
 * If you're curious about
 * the whereabouts of bottlenecks, run java 1.5 with the non-perturbing '-server
 * -agentlib:hprof=cpu=samples,depth=10' flags, then study the trace log and
 * correlate its hotspot trailer with its call stack headers (see <a
 * target="_blank"
 * href="http://java.sun.com/developer/technicalArticles/Programming/HPROF.html">
 * hprof tracing</a>).

See the tail of the profiler output below and in particular note the  
following:
CPU SAMPLES BEGIN (total = 918) Thu Apr 28 11:39:14 2005
rank   self  accum   count trace method
   1 57.41% 57.41% 527 300154  
org.apache.lucene.index.memory.MemoryIndex.createSearcher
   2  5.01% 62.42%  46 300152 java.lang.StrictMath.log
   3  3.05% 65.47%  28 300164  
java.lang.ref.Finalizer.invokeFinalizeMethod


cat java.hprof.txt:
JAVA PROFILE 1.0.1, created Thu Apr 28 11:38:27 2005
Header for -agentlib:hprof (or -Xrunhprof) ASCII Output (J2SE 1.5 JVMTI  
based)

@(#)jvm.hprof.txt   1.3 04/02/09
 Copyright (c) 2004 Sun Microsystems, Inc. All  Rights Reserved.
WARNING!  This file format is under development, and is subject to
change without notice.
This file contains the following types of records:
THREAD START
THREAD END  mark the lifetime of Java threads
TRACE   represents a Java stack trace.  Each trace consists
of a series of stack frames.  Other records refer to
TRACEs to identify (1) where object allocations have
taken place, (2) the frames in which GC roots were
found, and (3) frequently executed methods.
HEAP DUMP   is a complete snapshot of all live objects in the Java
heap.  Following distinctions are made:
ROOTroot set as determined by GC
CLS classes
OBJ instances
ARR arrays
SITES   is a sorted list of allocation sites.  This identifies
the most heavily allocated object types, and the TRACE
at which those allocations occurred.
CPU SAMPLES is a statistical profile of program execution.  The VM
periodically samples all running threads, and assigns
a quantum to active TRACEs in those threads.  

Re: [Performance] Streaming main memory indexing of single strings

2005-04-27 Thread Wolfgang Hoschek
Whichever place you settle on is fine with me.
[In case it might make a difference: Just note that MemoryIndex has a 
small auxiliary dependency on PatternAnalyzer in addField() because the 
Analyzer superclass doesn't have a tokenStream(String fieldName, String 
text) method. And PatternAnalyzer requires JDK 1.4 or higher]

Wolfgang.
On Apr 27, 2005, at 9:22 AM, Doug Cutting wrote:
Erik Hatcher wrote:
I'm not quite sure  where to put MemoryIndex - maybe it deserves to 
stand on its own in a  new contrib area?
That sounds good to me.
Or does it make sense to put this into misc (still  in sandbox/misc)? 
 Or where?
Isn't the goal for sandbox/ to go away, replaced with contrib/?
Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: [Performance] Streaming main memory indexing of single strings

2005-04-27 Thread Wolfgang Hoschek
OK. I'll send an update as soon as I get round to it...
Wolfgang.
On Apr 27, 2005, at 12:22 PM, Doug Cutting wrote:
Erik Hatcher wrote:
I'm not quite sure  where to put MemoryIndex - maybe it deserves to 
stand on its own in a  new contrib area?
That sounds good to me.
Ok... once Wolfgang gives me one last round of updates (JUnit tests 
instead of main() and upgrading it to work with trunk) I'll do that.  I 
had put it in miscellaneous but will create its own sub-contrib area 
instead.


Or does it make sense to put this into misc (still  in 
sandbox/misc)?  Or where?
Isn't the goal for sandbox/ to go away, replaced with contrib/?
Yes.  In fact, I moved the last relevant piece 
(sandbox/contributions/miscellaneous) to contrib last night.   I think 
both the parsers and XML-Indexing-Demo found in the sandbox are not 
worth preserving.  Anyone feel that these pieces left in the sandbox 
should be preserved?

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: [Performance] Streaming main memory indexing of single strings

2005-04-26 Thread Wolfgang Hoschek
I've uploaded slightly improved versions of the fast MemoryIndex  
contribution to http://issues.apache.org/bugzilla/show_bug.cgi?id=34585  
along with another contrib - PatternAnalyzer.
 	
For a quick overview without downloading code, there's javadoc for it  
all at  
http://dsd.lbl.gov/nux/api/org/apache/lucene/index/memory/package- 
summary.html

I'm happy to maintain these classes externally as part of the Nux  
project. But from the preliminary discussion on the list some time ago  
I gathered there'd be some wider interest, hence I prepared the  
contribs for the community. What would be the next steps for taking  
this further, if any?

Thanks,
Wolfgang.
/**
 * Efficient Lucene analyzer/tokenizer that preferably operates on a String
 * rather than a {@link java.io.Reader}, that can flexibly separate on a
 * regular expression {@link Pattern}
 * (with behaviour identical to {@link String#split(String)}),
 * and that combines the functionality of
 * {@link org.apache.lucene.analysis.LetterTokenizer},
 * {@link org.apache.lucene.analysis.LowerCaseTokenizer},
 * {@link org.apache.lucene.analysis.WhitespaceTokenizer},
 * {@link org.apache.lucene.analysis.StopFilter} into a single efficient
 * multi-purpose class.
 * <p>
 * If you are unsure how exactly a regular expression should look, consider
 * prototyping by simply trying various expressions on some test texts via
 * {@link String#split(String)}. Once you are satisfied, give that regex to
 * PatternAnalyzer. Also see <a target="_blank"
 * href="http://java.sun.com/docs/books/tutorial/extra/regex/">Java Regular
 * Expression Tutorial</a>.
 * <p>
 * This class can be considerably faster than the normal Lucene tokenizers.
 * It can also serve as a building block in a compound Lucene
 * {@link org.apache.lucene.analysis.TokenFilter} chain. For example as in this
 * stemming example:
 * <pre>
 * PatternAnalyzer pat = ...
 * TokenStream tokenStream = new SnowballFilter(
 *     pat.tokenStream("content", "James is running round in the woods"),
 *     "English"));
 * </pre>


On Apr 22, 2005, at 1:53 PM, Wolfgang Hoschek wrote:
I've now got the contrib code cleaned up, tested and documented into a  
decent state, ready for your review and comments.
Consider this a formal contrib (Apache license is attached).

The relevant files are attached to the following bug ID:
http://issues.apache.org/bugzilla/show_bug.cgi?id=34585
For a quick overview without downloading code, there's some javadoc at  
http://dsd.lbl.gov/nux/api/org/apache/lucene/index/memory/package- 
summary.html

There are several small open issues listed in the javadoc and also  
inside the code. Thoughts? Comments?

I've also got small performance patches for various parts of Lucene  
core (not submitted yet). Taken together they lead to substantially  
improved performance for MemoryIndex, and most likely also for Lucene  
in general. Some of them are more involved than others. I'm now  
figuring out how much performance each of these contributes and how to  
propose potential integration - stay tuned for some follow-ups to  
this.

The code as submitted would certainly benefit a lot from said patches,  
but they are not required for correct operation. It should work out of  
the box (currently only on 1.4.3 or lower). Try running

cd lucene-cvs
java org.apache.lucene.index.memory.MemoryIndexTest
with or without custom arguments to see it in action.
Before turning to a performance patch discussion I'd at this point  
rather be most interested in folks giving it a spin, comments on the  
API, or any other issues.

Cheers,
Wolfgang.
On Apr 20, 2005, at 11:26 AM, Wolfgang Hoschek wrote:
On Apr 20, 2005, at 9:22 AM, Erik Hatcher wrote:
On Apr 20, 2005, at 12:11 PM, Wolfgang Hoschek wrote:
By the way, by now I have a version against 1.4.3 that is 10-100  
times faster (i.e. 3 - 20 index+query steps/sec) than the  
simplistic RAMDirectory approach, depending on the nature of the  
input data and query. From some preliminary testing it returns  
exactly what RAMDirectory returns.
Awesome.  Using the basic StringIndexReader I sent?
Yep, it's loosely based on the empty skeleton you sent.
I've been fiddling with it a bit more to get other query types.   
I'll add it to the contrib area when its a bit more robust.
Perhaps we could merge up once I'm ready and put that into the  
contrib area? My version now supports tokenization with any analyzer  
and it supports any arbitrary Lucene query. I might make the API for  
adding terms a little more general, perhaps allowing arbitrary  
Document objects if that's what other folks really need...


As an aside, is there any work going on to potentially support  
prefix (and infix) wild card queries ala *fish?
WildcardQuery supports wildcard characters anywhere in the string.   
QueryParser itself restricts expressions that have leading

Re: [Performance] Streaming main memory indexing of single strings

2005-04-22 Thread Wolfgang Hoschek
I've now got the contrib code cleaned up, tested and documented into a  
decent state, ready for your review and comments.
Consider this a formal contrib (Apache license is attached).

The relevant files are attached to the following bug ID:
http://issues.apache.org/bugzilla/show_bug.cgi?id=34585
For a quick overview without downloading code, there's some javadoc at  
http://dsd.lbl.gov/nux/api/org/apache/lucene/index/memory/package- 
summary.html

There are several small open issues listed in the javadoc and also  
inside the code. Thoughts? Comments?

I've also got small performance patches for various parts of Lucene  
core (not submitted yet). Taken together they lead to substantially  
improved performance for MemoryIndex, and most likely also for Lucene  
in general. Some of them are more involved than others. I'm now  
figuring out how much performance each of these contributes and how to  
propose potential integration - stay tuned for some follow-ups to this.

The code as submitted would certainly benefit a lot from said patches,  
but they are not required for correct operation. It should work out of  
the box (currently only on 1.4.3 or lower). Try running

cd lucene-cvs
java org.apache.lucene.index.memory.MemoryIndexTest
with or without custom arguments to see it in action.
Before turning to a performance patch discussion I'd at this point  
rather be most interested in folks giving it a spin, comments on the  
API, or any other issues.

Cheers,
Wolfgang.
On Apr 20, 2005, at 11:26 AM, Wolfgang Hoschek wrote:
On Apr 20, 2005, at 9:22 AM, Erik Hatcher wrote:
On Apr 20, 2005, at 12:11 PM, Wolfgang Hoschek wrote:
By the way, by now I have a version against 1.4.3 that is 10-100  
times faster (i.e. 3 - 20 index+query steps/sec) than the  
simplistic RAMDirectory approach, depending on the nature of the  
input data and query. From some preliminary testing it returns  
exactly what RAMDirectory returns.
Awesome.  Using the basic StringIndexReader I sent?
Yep, it's loosely based on the empty skeleton you sent.
I've been fiddling with it a bit more to get other query types.  I'll  
add it to the contrib area when its a bit more robust.
Perhaps we could merge up once I'm ready and put that into the contrib  
area? My version now supports tokenization with any analyzer and it  
supports any arbitrary Lucene query. I might make the API for adding  
terms a little more general, perhaps allowing arbitrary Document  
objects if that's what other folks really need...


As an aside, is there any work going on to potentially support  
prefix (and infix) wild card queries ala *fish?
WildcardQuery supports wildcard characters anywhere in the string.   
QueryParser itself restricts expressions that have leading wildcards  
from being accepted.
Any particular reason for this restriction? Is this simply a current  
parser limitation or something inherent?

QueryParser supports wildcard characters in the middle of strings no  
problem though.  Are you seeing otherwise?
I meant an infix query such as *fish*
Wolfgang.
---
Wolfgang Hoschek  |   email: [EMAIL PROTECTED]
Distributed Systems Department|   phone: (415)-533-7610
Berkeley Laboratory   |   http://dsd.lbl.gov/~hoschek/
---
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: [Performance] Streaming main memory indexing of single strings

2005-04-20 Thread Wolfgang Hoschek
Good point.
By the way, by now I have a version against 1.4.3 that is 10-100 times 
faster (i.e. 3 - 20 index+query steps/sec) than the simplistic 
RAMDirectory approach, depending on the nature of the input data and 
query. From some preliminary testing it returns exactly what 
RAMDirectory returns.

I'll do some cleanup and documentation and then post this to the list 
for review RSN.

As an aside, is there any work going on to potentially support prefix 
(and infix) wild card queries ala *fish?

Wolfgang.
On Apr 20, 2005, at 6:10 AM, Vanlerberghe, Luc wrote:
One reason to choose the 'simplistic IndexReader' approach to this
problem over regex's is that the result should be 'bug-compatible' with
a standard search over all documents.
Differences between the two systems would be difficult to explain to an
end-user (let alone for the developer to debug and find the reason in
the first place!)
Luc
-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Saturday, April 16, 2005 2:09 AM
To: java-dev@lucene.apache.org
Subject: Re: [Performance] Streaming main memory indexing of single
strings
On Apr 15, 2005, at 6:15 PM, Wolfgang Hoschek wrote:
Cool! For my use case it would need to be able to handle arbitrary
queries (previously parsed from a general lucene query string).
Something like:
float match(String Text, Query query)
it's fine with me if it also works for
float[] match(String[] texts, Query query) or
float(Document doc, Query query)
but that isn't required by the use case.
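For concreteness, one possible (hypothetical) realization of that
float match(String text, Query query) signature on top of the MemoryIndex
contribution might look like the following sketch - the field name and
analyzer are illustrative assumptions, not an agreed API:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.Query;

public final class StringMatcher {
    private static final Analyzer ANALYZER = new StandardAnalyzer();

    /** Returns a relevance score greater than 0.0 iff the text matches the query. */
    public static float match(String text, Query query) {
        MemoryIndex index = new MemoryIndex();
        index.addField("content", text, ANALYZER); // "content" is an arbitrary field name
        return index.search(query);
    }
}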
My implementation is nearly that.  The score is available as
hits.score(0).  You would also need an analyzer, I presume, passed to
your proposed match() method if you want the text broken into terms.
My current implementation is passed a String[] where each item is
considered a term for the document.  match() would also need a field
name to be fully accurate - since the analyzer needs a field name and
terms used for searching need a field name.  The Query may contain 
terms
for any number of fields - how should that be handled?  Should only a
single field name be passed in and any terms request for other fields 
be
ignored?  Or should this utility morph to assume any words in the text
is in any field being asked of it?

As for Doug's devil advocate questions - I really don't know what I'd
use it for personally (other than the match this single string against
a bunch of queries), I just thought it was clever that it could be
done.  Clever regex's could come close, but it'd be a lot more effort
than reusing good ol' QueryParser and this simplistic IndexReader, 
along
with an Analyzer.

Erik
Wolfgang.
I am intrigued by this and decided to mock a quick and dirty example
of such an IndexReader.  After a little trial-and-error I got it
working at least for TermQuery and WildcardQuery.  I've pasted my
code below as an example, but there is much room for improvement,
especially in terms of performance and also in keeping track of term
frequency, and also it would be nicer if it handled the analysis
internally.
I think something like this would make a handy addition to our
contrib area at least.  I'd be happy to receive improvements to this
and then add it to a contrib subproject.
Perhaps this would be a handy way to handle situations where users
have queries saved in a system and need to be alerted whenever a new
document arrives matching the saved queries?
Erik


-Original Message-
From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 14, 2005 4:04 PM
To: java-dev@lucene.apache.org
Subject: Re: [Performance] Streaming main memory indexing of single
strings
This seems to be a promising avenue worth exploring. My gutfeeling
is that this could easily be 10-100 times faster.
The drawback is that it requires a fair amount of understanding of
intricate Lucene internals, pulling those pieces together and
adapting them as required for the seemingly simple float
match(String text, Query query).
I might give it a shot but I'm not sure I'll be able to pull this
off!
Is there any similar code I could look at as a starting point?
Wolfgang.
On Apr 14, 2005, at 1:13 PM, Robert Engels wrote:
I think you are not approaching this the correct way.
Pseudo code:
Subclass IndexReader.
Get tokens from String 'document' using Lucene analyzers.
Build simple hash-map based data structures using tokens for terms,

and term positions.
reimplement termDocs() and termPositions() to use the structures
from above.
run searches.
start again with next document.

-Original Message-
From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 14, 2005 2:56 PM
To: java-dev@lucene.apache.org
Subject: Re: [Performance] Streaming main memory indexing of single

strings
Otis, this might be a misunderstanding.
- I'm not calling optimize(). That piece is commented out if you
look again at the code.
- The *streaming* use case requires that for each query I add one
