[jira] [Commented] (SOLR-6907) URLEncode documents directory in MorphlineMapperTest
[ https://issues.apache.org/jira/browse/SOLR-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263599#comment-14263599 ]

wolfgang hoschek commented on SOLR-6907:
----------------------------------------

+1 Looks reasonable to me.

URLEncode documents directory in MorphlineMapperTest
-----------------------------------------------------

Key: SOLR-6907
URL: https://issues.apache.org/jira/browse/SOLR-6907
Project: Solr
Issue Type: Bug
Components: contrib - MapReduce, Tests
Reporter: Ramkumar Aiyengar
Priority: Minor

Currently the test fails if the source is checked out into a directory whose path contains, say, spaces.
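For context, the failure mode is the classic one where a raw filesystem path is pasted into a URL. A minimal sketch of the kind of fix the issue summary suggests - the checkout path below is hypothetical, and the actual patch is not shown in this thread - is to let java.io.File#toURI() do the escaping instead of concatenating the raw path:

{code}
import java.io.File;

public class PathToUriDemo {
  public static void main(String[] args) {
    // Hypothetical checkout path containing spaces:
    File docsDir = new File("/home/jenkins/solr check out/test-documents");

    // Naive concatenation yields an invalid URL when the path has spaces:
    String broken = "file://" + docsDir.getAbsolutePath();

    // File#toURI() percent-encodes special characters (' ' becomes %20):
    String encoded = docsDir.toURI().toString();

    System.out.println(broken);
    System.out.println(encoded);
  }
}
{code}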
[jira] [Commented] (SOLR-4509) Disable HttpClient stale check for performance and fewer spurious connection errors.
[ https://issues.apache.org/jira/browse/SOLR-4509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224815#comment-14224815 ]

wolfgang hoschek commented on SOLR-4509:
----------------------------------------

Would be good to remove that stale check also in solrj.

Disable HttpClient stale check for performance and fewer spurious connection errors.
-------------------------------------------------------------------------------------

Key: SOLR-4509
URL: https://issues.apache.org/jira/browse/SOLR-4509
Project: Solr
Issue Type: Improvement
Components: search
Environment: 5 node SmartOS cluster (all nodes living in same global zone - i.e. same physical machine)
Reporter: Ryan Zezeski
Assignee: Mark Miller
Priority: Minor
Fix For: 5.0, Trunk
Attachments: IsStaleTime.java, SOLR-4509-4_4_0.patch, SOLR-4509.patch, SOLR-4509.patch, SOLR-4509.patch, SOLR-4509.patch, baremetal-stale-nostale-med-latency.dat, baremetal-stale-nostale-med-latency.svg, baremetal-stale-nostale-throughput.dat, baremetal-stale-nostale-throughput.svg

By disabling the Apache HTTP Client stale check I've witnessed a 2-4x increase in throughput and a reduction of over 100ms. This patch was made in the context of a project I'm leading, called Yokozuna, which relies on distributed search.

Here's the patch on Yokozuna: https://github.com/rzezeski/yokozuna/pull/26

Here's a write-up I did on my findings: http://www.zinascii.com/2013/solr-distributed-search-and-the-stale-check.html

I'm happy to answer any questions or make changes to the patch to make it acceptable.

ReviewBoard: https://reviews.apache.org/r/28393/
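For readers unfamiliar with the knob in question: in the HttpClient 4.3-era API the stale check can be switched off per client via RequestConfig. A hedged sketch - this shows the library API, not the exact wiring the Solr patch uses:

{code}
import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class NoStaleCheckClient {
  public static CloseableHttpClient build() {
    // Skip the isStale() probe that otherwise runs before every request
    // served from the connection pool:
    RequestConfig config = RequestConfig.custom()
        .setStaleConnectionCheckEnabled(false)
        .build();
    return HttpClients.custom()
        .setDefaultRequestConfig(config)
        .build();
  }
}
{code}

The trade-off is that a connection the server has already closed may be reused and fail, which is why pools typically pair this setting with evicting idle connections instead of probing them per request.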
[jira] [Commented] (SOLR-6212) upgrade Saxon-HE to 9.5.1-5 and reinstate Morphline tests that were affected under java 8/9 with 9.5.1-4
[ https://issues.apache.org/jira/browse/SOLR-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047223#comment-14047223 ]

wolfgang hoschek commented on SOLR-6212:
----------------------------------------

This is already fixed in the latest stable morphline release per http://kitesdk.org/docs/current/release_notes.html

upgrade Saxon-HE to 9.5.1-5 and reinstate Morphline tests that were affected under java 8/9 with 9.5.1-4
---------------------------------------------------------------------------------------------------------

Key: SOLR-6212
URL: https://issues.apache.org/jira/browse/SOLR-6212
Project: Solr
Issue Type: Bug
Affects Versions: 4.7, 5.0
Reporter: Michael Dodsworth
Assignee: Mark Miller
Priority: Minor

From SOLR-1301: For posterity, there is a thread on the dev list where we are working through an issue with Saxon on Java 8 and IBM's J9. Wolfgang filed https://saxonica.plan.io/issues/1944 upstream. (Saxon is pulled in via cdk-morphlines-saxon.)

Due to this issue, several Morphline tests were made to be 'ignored' on Java 8+. The Saxon issue has been fixed in 9.5.1-5, so we should upgrade and reinstate those tests.
[jira] [Commented] (SOLR-5109) Solr 4.4 will not deploy in Glassfish 4.x
[ https://issues.apache.org/jira/browse/SOLR-5109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047391#comment-14047391 ]

wolfgang hoschek commented on SOLR-5109:
----------------------------------------

FWIW, morphlines currently won't work with guava-16 or guava-17 because of the incompatible guava API changes in the guava Closeables class in those two guava releases. However, there's a fix for this issue that will show up soon in kite-morphlines 0.15.0. See https://github.com/kite-sdk/kite/commit/0ab2795872e4e5721f477d79e5049371a17ab8db

Solr 4.4 will not deploy in Glassfish 4.x
------------------------------------------

Key: SOLR-5109
URL: https://issues.apache.org/jira/browse/SOLR-5109
Project: Solr
Issue Type: Bug
Affects Versions: 4.4
Environment: Glassfish 4.x
Reporter: jamon camisso
Priority: Blocker
Labels: guava
Attachments: LUCENE-5109.patch, guava-15.0-SNAPSHOT.jar

The bundled Guava 14.0.1 JAR blocks deploying Solr 4.4 in Glassfish 4.x. This failure is a known issue with upstream Guava and is described here: https://code.google.com/p/guava-libraries/issues/detail?id=1433

Building Guava guava-15.0-SNAPSHOT.jar from master and bundling it in Solr allows for a successful deployment. Until the Guava developers release version 15, using their HEAD or even an RC tag seems like the only way to resolve this.

This is frustrating since it was proposed that Guava be removed as a dependency before Solr 4.0 was released, and yet it remains and blocks upgrading: https://issues.apache.org/jira/browse/SOLR-3601
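The incompatible change alluded to is of the kind where a widely used static helper disappears - notably, Guava 16 removed Closeables.closeQuietly(Closeable). The actual kite fix is behind the commit link above; one generic way to insulate code from such churn, sketched here as an assumption rather than what kite did, is a tiny local helper:

{code}
import java.io.Closeable;
import java.io.IOException;

public final class CompatCloseables {
  private CompatCloseables() {}

  // Behaves identically no matter which Guava version wins the classpath race:
  public static void closeQuietly(Closeable closeable) {
    if (closeable == null) return;
    try {
      closeable.close();
    } catch (IOException ignored) {
      // deliberately swallowed; intended for cleanup paths only
    }
  }
}
{code}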
[jira] [Comment Edited] (SOLR-5109) Solr 4.4 will not deploy in Glassfish 4.x
[ https://issues.apache.org/jira/browse/SOLR-5109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047394#comment-14047394 ]

wolfgang hoschek edited comment on SOLR-5109 at 6/30/14 5:36 AM:
-----------------------------------------------------------------

Another potential issue is that hadoop ships with guava-11.0.2 on the classpath of the task tracker (the JVM that runs the job). So this old guava version will race with any other guava version that happens to be on the classpath.

was (Author: whoschek):
Another potential issue is that hadoop ships with guava-12.0.1 on the classpath of the task tracker (the JVM that runs the job). So this old guava version will race with any other guava version that happens to be on the classpath.

Solr 4.4 will not deploy in Glassfish 4.x
------------------------------------------

Key: SOLR-5109
URL: https://issues.apache.org/jira/browse/SOLR-5109
Project: Solr
Issue Type: Bug
Affects Versions: 4.4
Environment: Glassfish 4.x
Reporter: jamon camisso
Priority: Blocker
Labels: guava
Attachments: LUCENE-5109.patch, guava-15.0-SNAPSHOT.jar

The bundled Guava 14.0.1 JAR blocks deploying Solr 4.4 in Glassfish 4.x. This failure is a known issue with upstream Guava and is described here: https://code.google.com/p/guava-libraries/issues/detail?id=1433

Building Guava guava-15.0-SNAPSHOT.jar from master and bundling it in Solr allows for a successful deployment. Until the Guava developers release version 15, using their HEAD or even an RC tag seems like the only way to resolve this.

This is frustrating since it was proposed that Guava be removed as a dependency before Solr 4.0 was released, and yet it remains and blocks upgrading: https://issues.apache.org/jira/browse/SOLR-3601
[jira] [Commented] (SOLR-5109) Solr 4.4 will not deploy in Glassfish 4.x
[ https://issues.apache.org/jira/browse/SOLR-5109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047394#comment-14047394 ]

wolfgang hoschek commented on SOLR-5109:
----------------------------------------

Another potential issue is that hadoop ships with guava-12.0.1 on the classpath of the task tracker (the JVM that runs the job). So this old guava version will race with any other guava version that happens to be on the classpath.

Solr 4.4 will not deploy in Glassfish 4.x
------------------------------------------

Key: SOLR-5109
URL: https://issues.apache.org/jira/browse/SOLR-5109
Project: Solr
Issue Type: Bug
Affects Versions: 4.4
Environment: Glassfish 4.x
Reporter: jamon camisso
Priority: Blocker
Labels: guava
Attachments: LUCENE-5109.patch, guava-15.0-SNAPSHOT.jar

The bundled Guava 14.0.1 JAR blocks deploying Solr 4.4 in Glassfish 4.x. This failure is a known issue with upstream Guava and is described here: https://code.google.com/p/guava-libraries/issues/detail?id=1433

Building Guava guava-15.0-SNAPSHOT.jar from master and bundling it in Solr allows for a successful deployment. Until the Guava developers release version 15, using their HEAD or even an RC tag seems like the only way to resolve this.

This is frustrating since it was proposed that Guava be removed as a dependency before Solr 4.0 was released, and yet it remains and blocks upgrading: https://issues.apache.org/jira/browse/SOLR-3601
Re: Adding Morphline support to DIH - worth the effort?
From our perspective we don't really see use cases for DIH anymore. Morphlines was developed primarily with Lucene in mind (even though it doesn't require Lucene). Flume Morphline Solr Sink handles streaming ingestion into Solr in reliable, scalable, flexible and loosely coupled ways, in separate processes. Neither Flume nor Morphlines requires Hadoop. MapReduceIndexerTool uses Morphlines for reliable, scalable and flexible batch ingestion on Hadoop. On Hadoop, even the JDBC/SQL portion of DIH now seems mostly covered by a combination of Sqoop and MapReduceIndexerTool, and perhaps a bit of Hive. I'm not sure what the use cases for DIH still are these days. (I wrote most of the Morphlines framework, Flume Morphline Solr Sink, MapReduceIndexerTool and the hbase-indexer-morphline integration.)

Just my 0.02c,
Wolfgang.

On Jun 11, 2014, at 1:05 PM, Dyer, James james.d...@ingramcontent.com wrote:

Mikhail,

It would be nice if the DIH could be run separately from Solr (SOLR-853 and others). I think a lot of us have already expressed support for this, and at one time I was looking into what it would take to complete. Then again, having watched the solr morphline sink be created for Flume, I realized there are other teams out there possibly building an awesome DIH killer. If that happens, then we just saved ourselves a boatload of work, right? I think if someone out there can create a nice POC that uses a different tool, that would be a great first step.

But there is also SOLR-3671, which was just committed as a follow-on to SOLR-2382. This makes DIH able to send documents to places other than Solr. Turns out someone here is using DIH to import to Mongo. (See SOLR-5981 for details.) So we already have one side of the functionality to generalize DIH.

James Dyer
Ingram Content Group
(615) 213-4311

From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com]
Sent: Wednesday, June 11, 2014 11:56 AM
To: dev@lucene.apache.org
Subject: Re: Adding Morphline support to DIH - worth the effort?

James,
Don't you think that spawning DIH 2.0 as a separate war is a priority?

On Wed, Jun 11, 2014 at 6:39 PM, Dyer, James james.d...@ingramcontent.com wrote:

Alexandre,

I think that writing a new entity processor for DIH is a much less risky thing to commit than, say, SOLR-4799. Entity Processors work as plug-ins and they aren't likely to break anything else. So a Morphline EntityProcessor is much more likely to be evaluated and committed. But like anything else, you're going to need to explain what the need is and what this new e.p. buys the user community. There needs to be unit tests, etc.

Besides this, if you can show how a morphline e.p. can be a step towards migrating away from DIH entirely, then that would be a plus. Perhaps create a new solr example along the lines of the dih solr example that demonstrates to users this new way forward. This would go a long way in convincing the community we have a viable alternative to dih.

James Dyer
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
Sent: Tuesday, June 10, 2014 9:55 PM
To: dev@lucene.apache.org
Subject: Re: Adding Morphline support to DIH - worth the effort?

Ripples in the pond again. Spreading and dying. Understandable, but still somewhat annoying.

So, what would be the minimal viable next step to move this conversation forward? Something for 4.11 as opposed to 5.0? Anyone with commit status has a feeling of what - minimal - deliverable they would put their own weight behind?

Regards,
Alex.

Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency

On Mon, Jun 9, 2014 at 10:50 AM, david.w.smi...@gmail.com wrote:

One of the ideas over DIH discussed earlier is making it standalone.

Yeah; my beef with the DIH is that it's tied to Solr. But I'd rather see something other than the DIH outside Solr; it's not worthy IMO. Why have something Solr specific even? A great pipeline shouldn't tie itself to any end-point.

There are a variety of solutions out there that I tried. There are the big 3 open-source ETLs (Kettle, Clover, Talend), and they aren't quite ideal in one way or another. And Spring-Integration. And some half-baked data pipelines like OpenPipe & OpenPipeline. I never got around to taking a good look at Findwise's open-sourced Hydra, but I learned enough to know, to my surprise, that it was configured in code versus a config file (like all the others), and that's a big turn-off to me.

Today I read through most of the Morphlines docs and a few choice source files and I'm super-impressed. But as you note it's missing a lot of other stuff. I think something great could be built using it as a core piece.

~ David Smiley
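Since the thread assumes familiarity with what a morphline actually looks like: it is an in-memory chain of transformation commands defined in a HOCON config file. A minimal illustrative sketch - the collection name, ZooKeeper address and CSV columns below are made up for the example:

{code}
# Shared Solr connection info (values are placeholders):
SOLR_LOCATOR : {
  collection : collection1
  zkHost : "127.0.0.1:2181/solr"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      # Parse each line of the input CSV stream into a record (hypothetical columns):
      { readCSV { separator : ",", columns : [id, title, text] } }
      # Drop fields the Solr schema does not know about:
      { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }
      # Hand the record to Solr for indexing:
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]
{code}

The loose coupling discussed above comes from the fact that the same config runs unchanged inside Flume, inside MapReduceIndexerTool, or in any plain JVM that embeds the library.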
[jira] [Commented] (SOLR-6126) MapReduce's GoLive script should support replicas
[ https://issues.apache.org/jira/browse/SOLR-6126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015266#comment-14015266 ]

wolfgang hoschek commented on SOLR-6126:
----------------------------------------

[~dsmiley] It uses the --zk-host CLI option to fetch the solr URLs of each replica from zk - see extractShardUrls(). This info gets passed via the Options.shardUrls parameter into the go-live phase. In the go-live phase the segments of each shard are explicitly merged via a separate REST merge request per replica into the corresponding replica. The result is that each input segment is explicitly merged N times where N is the replication factor. Each such merge reads from HDFS and writes to HDFS.

(BTW, I'll be unreachable on a transatlantic flight very soon)

MapReduce's GoLive script should support replicas
--------------------------------------------------

Key: SOLR-6126
URL: https://issues.apache.org/jira/browse/SOLR-6126
Project: Solr
Issue Type: Improvement
Components: contrib - MapReduce
Reporter: David Smiley

The GoLive feature of the MapReduce contrib module is pretty cool. But a comment in there indicates that it doesn't support replicas. Every production SolrCloud setup I've seen has had replicas! I wonder what is needed to support this.

For GoLive to work, it assumes a shared file system (be it HDFS or whatever, like a SAN). If perhaps the replicas in such a system read from the very same network disk location, then all we'd need to do is send a commit() to replicas; right?
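The per-replica REST merge request described above corresponds to Solr's CoreAdmin MERGEINDEXES action. A hedged SolrJ 4.x sketch - the replica URL, core name and HDFS index dir are placeholders, and the real tool drives this from extractShardUrls() rather than hard-coded values:

{code}
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class GoLiveMergeSketch {
  public static void main(String[] args) throws Exception {
    // One such request is issued per replica, so each input segment is
    // merged N times where N is the replication factor:
    HttpSolrServer server = new HttpSolrServer("http://replica1:8983/solr");
    String[] indexDirs = { "hdfs://nn/user/test/outdir/results/part-00000/data/index" };
    CoreAdminRequest.mergeIndexes("collection1_shard1_replica1", indexDirs,
        new String[0], server);
    server.shutdown();
  }
}
{code}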
[jira] [Commented] (SOLR-6126) MapReduce's GoLive script should support replicas
[ https://issues.apache.org/jira/browse/SOLR-6126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015092#comment-14015092 ]

wolfgang hoschek commented on SOLR-6126:
----------------------------------------

The comment in the code is a bit outdated. The code does actually support replicas.

MapReduce's GoLive script should support replicas
--------------------------------------------------

Key: SOLR-6126
URL: https://issues.apache.org/jira/browse/SOLR-6126
Project: Solr
Issue Type: Improvement
Components: contrib - MapReduce
Reporter: David Smiley

The GoLive feature of the MapReduce contrib module is pretty cool. But a comment in there indicates that it doesn't support replicas. Every production SolrCloud setup I've seen has had replicas! I wonder what is needed to support this.

For GoLive to work, it assumes a shared file system (be it HDFS or whatever, like a SAN). If perhaps the replicas in such a system read from the very same network disk location, then all we'd need to do is send a commit() to replicas; right?
[jira] [Commented] (SOLR-5848) Morphlines is not resolving
[ https://issues.apache.org/jira/browse/SOLR-5848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932328#comment-13932328 ]

wolfgang hoschek commented on SOLR-5848:
----------------------------------------

Going forward I'd recommend upgrading to version 0.12.0 rather than dealing with 0.11.0, because 0.12.0 is compatible and there are some nice performance improvements and a couple of new features - http://kitesdk.org/docs/current/release_notes.html

Morphlines is not resolving
---------------------------

Key: SOLR-5848
URL: https://issues.apache.org/jira/browse/SOLR-5848
Project: Solr
Issue Type: Bug
Reporter: Dawid Weiss
Assignee: Mark Miller
Priority: Critical
Fix For: 4.8, 5.0

This version of morphlines does not resolve for me and Grant.

{code}
::::::::::::::::::::::::::::::::::::::::::::::
::          UNRESOLVED DEPENDENCIES         ::
::::::::::::::::::::::::::::::::::::::::::::::
:: org.kitesdk#kite-morphlines-saxon;0.11.0: not found
:: org.kitesdk#kite-morphlines-hadoop-sequencefile;0.11.0: not found
::::::::::::::::::::::::::::::::::::::::::::::
{code}

Has this been deleted from Cloudera's repositories or something? This would be pretty bad -- maven repos should be immutable...
[jira] [Commented] (SOLR-5848) Morphlines is not resolving
[ https://issues.apache.org/jira/browse/SOLR-5848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932378#comment-13932378 ]

wolfgang hoschek commented on SOLR-5848:
----------------------------------------

Sounds good. Thx!

Morphlines is not resolving
---------------------------

Key: SOLR-5848
URL: https://issues.apache.org/jira/browse/SOLR-5848
Project: Solr
Issue Type: Bug
Reporter: Dawid Weiss
Assignee: Mark Miller
Priority: Critical
Fix For: 4.8, 5.0

This version of morphlines does not resolve for me and Grant.

{code}
::::::::::::::::::::::::::::::::::::::::::::::
::          UNRESOLVED DEPENDENCIES         ::
::::::::::::::::::::::::::::::::::::::::::::::
:: org.kitesdk#kite-morphlines-saxon;0.11.0: not found
:: org.kitesdk#kite-morphlines-hadoop-sequencefile;0.11.0: not found
::::::::::::::::::::::::::::::::::::::::::::::
{code}

Has this been deleted from Cloudera's repositories or something? This would be pretty bad -- maven repos should be immutable...
[jira] [Created] (SOLR-5786) MapReduceIndexerTool --help text is missing large parts of the help text
wolfgang hoschek created SOLR-5786:
--------------------------------------

Summary: MapReduceIndexerTool --help text is missing large parts of the help text
Key: SOLR-5786
URL: https://issues.apache.org/jira/browse/SOLR-5786
Project: Solr
Issue Type: Bug
Components: contrib - MapReduce
Affects Versions: 4.7
Reporter: wolfgang hoschek
Assignee: Mark Miller
Fix For: 4.8

As already mentioned repeatedly and at length, this is a regression introduced by the fix in https://issues.apache.org/jira/browse/SOLR-5605

Here is the diff of --help output before SOLR-5605 vs after SOLR-5605:

{code}
130,235c130
lucene segments left in this index. Merging segments involves reading and rewriting all data in all these segment files, potentially multiple times, which is very I/O intensive and time consuming. However, an index with fewer segments can later be merged faster, and it can later be queried faster once deployed to a live Solr serving shard. Set maxSegments to 1 to optimize the index for low query latency. In a nutshell, a small maxSegments value trades indexing latency for subsequently improved query latency. This can be a reasonable trade-off for batch indexing systems. (default: 1)

--fair-scheduler-pool STRING
Optional tuning knob that indicates the name of the fair scheduler pool to submit jobs to. The Fair Scheduler is a pluggable MapReduce scheduler that provides a way to share large clusters. Fair scheduling is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time. When there is a single job running, that job uses the entire cluster. When other jobs are submitted, task slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time. Unlike the default Hadoop scheduler, which forms a queue of jobs, this lets short jobs finish in reasonable time while not starving long jobs. It is also an easy way to share a cluster between multiple users. Fair sharing can also work with job priorities - the priorities are used as weights to determine the fraction of total compute time that each job gets.

--dry-run
Run in local mode and print documents to stdout instead of loading them into Solr. This executes the morphline in the client process (without submitting a job to MR) for quicker turnaround during early trial/debug sessions. (default: false)

--log4j FILE
Relative or absolute path to a log4j.properties config file on the local file system. This file will be uploaded to each MR task. Example: /path/to/log4j.properties

--verbose, -v
Turn on verbose output. (default: false)

--show-non-solr-cloud
Also show options for Non-SolrCloud mode as part of --help. (default: false)

Required arguments:

--output-dir HDFS_URI
HDFS directory to write Solr indexes to. Inside there one output directory per shard will be generated. Example: hdfs://c2202.mycompany.com/user/$USER/test

--morphline-file FILE
Relative or absolute path to a local config file that contains one or more morphlines. The file must be UTF-8 encoded. Example: /path/to/morphline.conf

Cluster arguments:
Arguments that provide information about your Solr cluster.

--zk-host STRING
The address of a ZooKeeper ensemble being used by a SolrCloud cluster. This ZooKeeper ensemble will be examined to determine the number of output
[jira] [Commented] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
[ https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13914549#comment-13914549 ]

wolfgang hoschek commented on SOLR-5605:
----------------------------------------

Correspondingly, I filed https://issues.apache.org/jira/browse/SOLR-5786

Look, as you know, I wrote almost all of the original solr-mapreduce contrib, and I know this code inside out. To be honest, this kind of repetitive ignorance is tiresome at best and completely turns me off.

MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
----------------------------------------------------------------------------------------------------------------

Key: SOLR-5605
URL: https://issues.apache.org/jira/browse/SOLR-5605
Project: Solr
Issue Type: Bug
Reporter: Hoss Man
Assignee: Mark Miller
Fix For: 4.7, 5.0

I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest which is reproducible with any seed -- all that matters is the locale. The problem sounded familiar, and a quick search verified that jenkins has in fact hit this a couple of times in the past -- Uwe commented on the list that this is due to a real problem in one of the third-party dependencies (that does the argument parsing) that will affect usage on some systems.

If working around the bug in the arg parsing lib isn't feasible, MapReduceIndexerTool should fail cleanly if the locale isn't one we know is supported
[jira] [Updated] (SOLR-5786) MapReduceIndexerTool --help output is missing large parts of the help text
[ https://issues.apache.org/jira/browse/SOLR-5786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wolfgang hoschek updated SOLR-5786:
-----------------------------------

Summary: MapReduceIndexerTool --help output is missing large parts of the help text (was: MapReduceIndexerTool --help text is missing large parts of the help text)

MapReduceIndexerTool --help output is missing large parts of the help text
---------------------------------------------------------------------------

Key: SOLR-5786
URL: https://issues.apache.org/jira/browse/SOLR-5786
Project: Solr
Issue Type: Bug
Components: contrib - MapReduce
Affects Versions: 4.7
Reporter: wolfgang hoschek
Assignee: Mark Miller
Fix For: 4.8

As already mentioned repeatedly and at length, this is a regression introduced by the fix in https://issues.apache.org/jira/browse/SOLR-5605

Here is the diff of --help output before SOLR-5605 vs after SOLR-5605:

{code}
130,235c130
lucene segments left in this index. Merging segments involves reading and rewriting all data in all these segment files, potentially multiple times, which is very I/O intensive and time consuming. However, an index with fewer segments can later be merged faster, and it can later be queried faster once deployed to a live Solr serving shard. Set maxSegments to 1 to optimize the index for low query latency. In a nutshell, a small maxSegments value trades indexing latency for subsequently improved query latency. This can be a reasonable trade-off for batch indexing systems. (default: 1)

--fair-scheduler-pool STRING
Optional tuning knob that indicates the name of the fair scheduler pool to submit jobs to. The Fair Scheduler is a pluggable MapReduce scheduler that provides a way to share large clusters. Fair scheduling is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time. When there is a single job running, that job uses the entire cluster. When other jobs are submitted, task slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time. Unlike the default Hadoop scheduler, which forms a queue of jobs, this lets short jobs finish in reasonable time while not starving long jobs. It is also an easy way to share a cluster between multiple users. Fair sharing can also work with job priorities - the priorities are used as weights to determine the fraction of total compute time that each job gets.

--dry-run
Run in local mode and print documents to stdout instead of loading them into Solr. This executes the morphline in the client process (without submitting a job to MR) for quicker turnaround during early trial/debug sessions. (default: false)

--log4j FILE
Relative or absolute path to a log4j.properties config file on the local file system. This file will be uploaded to each MR task. Example: /path/to/log4j.properties

--verbose, -v
Turn on verbose output. (default: false)

--show-non-solr-cloud
Also show options for Non-SolrCloud mode as part of --help. (default: false)

Required arguments:

--output-dir HDFS_URI
HDFS directory to write Solr indexes to. Inside there one output directory per shard will be generated. Example: hdfs://c2202.mycompany.com/user/$USER/test

--morphline-file FILE
Relative or absolute path to a local config file that contains one or more morphlines. The file must be UTF-8
[jira] [Updated] (SOLR-5786) MapReduceIndexerTool --help output is missing large parts of the help text
[ https://issues.apache.org/jira/browse/SOLR-5786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wolfgang hoschek updated SOLR-5786:
-----------------------------------

Description:
As already mentioned repeatedly and at length, this is a regression introduced by the fix in https://issues.apache.org/jira/browse/SOLR-5605

Here is the diff of --help output before SOLR-5605 vs after SOLR-5605:

{code}
130,235c130
lucene segments left in this index. Merging segments involves reading and rewriting all data in all these segment files, potentially multiple times, which is very I/O intensive and time consuming. However, an index with fewer segments can later be merged faster, and it can later be queried faster once deployed to a live Solr serving shard. Set maxSegments to 1 to optimize the index for low query latency. In a nutshell, a small maxSegments value trades indexing latency for subsequently improved query latency. This can be a reasonable trade-off for batch indexing systems. (default: 1)

--fair-scheduler-pool STRING
Optional tuning knob that indicates the name of the fair scheduler pool to submit jobs to. The Fair Scheduler is a pluggable MapReduce scheduler that provides a way to share large clusters. Fair scheduling is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time. When there is a single job running, that job uses the entire cluster. When other jobs are submitted, task slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time. Unlike the default Hadoop scheduler, which forms a queue of jobs, this lets short jobs finish in reasonable time while not starving long jobs. It is also an easy way to share a cluster between multiple users. Fair sharing can also work with job priorities - the priorities are used as weights to determine the fraction of total compute time that each job gets.

--dry-run
Run in local mode and print documents to stdout instead of loading them into Solr. This executes the morphline in the client process (without submitting a job to MR) for quicker turnaround during early trial/debug sessions. (default: false)

--log4j FILE
Relative or absolute path to a log4j.properties config file on the local file system. This file will be uploaded to each MR task. Example: /path/to/log4j.properties

--verbose, -v
Turn on verbose output. (default: false)

--show-non-solr-cloud
Also show options for Non-SolrCloud mode as part of --help. (default: false)

Required arguments:

--output-dir HDFS_URI
HDFS directory to write Solr indexes to. Inside there one output directory per shard will be generated. Example: hdfs://c2202.mycompany.com/user/$USER/test

--morphline-file FILE
Relative or absolute path to a local config file that contains one or more morphlines. The file must be UTF-8 encoded. Example: /path/to/morphline.conf

Cluster arguments:
Arguments that provide information about your Solr cluster.

--zk-host STRING
The address of a ZooKeeper ensemble being used by a SolrCloud cluster. This ZooKeeper ensemble will be examined to determine the number of output shards to create as well as the Solr URLs to merge the output shards into when using the --go-live option. Requires that you also pass the --collection to merge the shards
[jira] [Commented] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
[ https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915037#comment-13915037 ]

wolfgang hoschek commented on SOLR-5605:
----------------------------------------

bq. Are you not a committer? At Apache, those who do decide.

Yes, but you've clearly been assigned to upstream this stuff and I have plenty of other things to attend to these days.

bq. I did not realize Patricks patch did not include the latest code updates from MapReduce.

Might be good to pay more attention, also to CDH-14804?

bq. I had and still have bigger concerns around the usability of this code in Solr than this issue. It is very, very far from easy for someone to get started with this contrib right now.

The usability is fine downstream, where maven automatically builds a job jar that includes the necessary dependency jars inside of the lib dir of the MR job jar. Hence no startup script or extra steps are required downstream, just one (fat) jar. If it's not usable upstream it may be because no corresponding packaging system has been used upstream, for reasons that escape me.

bq. which is why none of these smaller issues concern me very much at this point.

I'm afraid ignorance never helps.

MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
----------------------------------------------------------------------------------------------------------------

Key: SOLR-5605
URL: https://issues.apache.org/jira/browse/SOLR-5605
Project: Solr
Issue Type: Bug
Reporter: Hoss Man
Assignee: Mark Miller
Fix For: 4.7, 5.0

I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest which is reproducible with any seed -- all that matters is the locale. The problem sounded familiar, and a quick search verified that jenkins has in fact hit this a couple of times in the past -- Uwe commented on the list that this is due to a real problem in one of the third-party dependencies (that does the argument parsing) that will affect usage on some systems.

If working around the bug in the arg parsing lib isn't feasible, MapReduceIndexerTool should fail cleanly if the locale isn't one we know is supported
[jira] [Comment Edited] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
[ https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915037#comment-13915037 ]

wolfgang hoschek edited comment on SOLR-5605 at 2/27/14 9:23 PM:
-----------------------------------------------------------------

bq. Are you not a committer? At Apache, those who do decide.

Yes, but you've clearly been assigned to upstream those contribs and I have plenty of other things to attend to these days.

bq. I did not realize Patricks patch did not include the latest code updates from MapReduce.

Might be good to pay more attention, also to CDH-14804?

bq. I had and still have bigger concerns around the usability of this code in Solr than this issue. It is very, very far from easy for someone to get started with this contrib right now.

The usability is fine downstream, where maven automatically builds a job jar that includes the necessary dependency jars inside of the lib dir of the MR job jar. Hence no startup script or extra steps are required downstream, just one (fat) jar. If it's not usable upstream it may be because no corresponding packaging system has been used upstream, for reasons that escape me.

bq. which is why none of these smaller issues concern me very much at this point.

I'm afraid ignorance never helps.

was (Author: whoschek):
bq. Are you not a committer? At Apache, those who do decide.

Yes, but you've clearly been assigned to upstream this stuff and I have plenty of other things to attend to these days.

bq. I did not realize Patricks patch did not include the latest code updates from MapReduce.

Might be good to pay more attention, also to CDH-14804?

bq. I had and still have bigger concerns around the usability of this code in Solr than this issue. It is very, very far from easy for someone to get started with this contrib right now.

The usability is fine downstream, where maven automatically builds a job jar that includes the necessary dependency jars inside of the lib dir of the MR job jar. Hence no startup script or extra steps are required downstream, just one (fat) jar. If it's not usable upstream it may be because no corresponding packaging system has been used upstream, for reasons that escape me.

bq. which is why none of these smaller issues concern me very much at this point.

I'm afraid ignorance never helps.

MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
----------------------------------------------------------------------------------------------------------------

Key: SOLR-5605
URL: https://issues.apache.org/jira/browse/SOLR-5605
Project: Solr
Issue Type: Bug
Reporter: Hoss Man
Assignee: Mark Miller
Fix For: 4.7, 5.0

I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest which is reproducible with any seed -- all that matters is the locale. The problem sounded familiar, and a quick search verified that jenkins has in fact hit this a couple of times in the past -- Uwe commented on the list that this is due to a real problem in one of the third-party dependencies (that does the argument parsing) that will affect usage on some systems.

If working around the bug in the arg parsing lib isn't feasible, MapReduceIndexerTool should fail cleanly if the locale isn't one we know is supported
[jira] [Commented] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
[ https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911744#comment-13911744 ]

wolfgang hoschek commented on SOLR-5605:
----------------------------------------

I have looked, have you? I have fixed this one before. Have you? Pls take the time to diff before vs. after to see that some docs parts are missing while others are present (b/c of the funny missing buffer flush). It is not the same. This is a regression. Thx.

MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
----------------------------------------------------------------------------------------------------------------

Key: SOLR-5605
URL: https://issues.apache.org/jira/browse/SOLR-5605
Project: Solr
Issue Type: Bug
Reporter: Hoss Man
Assignee: Mark Miller
Fix For: 4.7, 5.0

I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest which is reproducible with any seed -- all that matters is the locale. The problem sounded familiar, and a quick search verified that jenkins has in fact hit this a couple of times in the past -- Uwe commented on the list that this is due to a real problem in one of the third-party dependencies (that does the argument parsing) that will affect usage on some systems.

If working around the bug in the arg parsing lib isn't feasible, MapReduceIndexerTool should fail cleanly if the locale isn't one we know is supported
[jira] [Reopened] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
[ https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wolfgang hoschek reopened SOLR-5605:
------------------------------------

Without this the --help text is screwed. https://issues.apache.org/jira/secure/EditComment!default.jspa?id=12687301&commentId=13862272

MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
----------------------------------------------------------------------------------------------------------------

Key: SOLR-5605
URL: https://issues.apache.org/jira/browse/SOLR-5605
Project: Solr
Issue Type: Bug
Reporter: Hoss Man
Assignee: Mark Miller
Fix For: 4.7, 5.0

I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest which is reproducible with any seed -- all that matters is the locale. The problem sounded familiar, and a quick search verified that jenkins has in fact hit this a couple of times in the past -- Uwe commented on the list that this is due to a real problem in one of the third-party dependencies (that does the argument parsing) that will affect usage on some systems.

If working around the bug in the arg parsing lib isn't feasible, MapReduceIndexerTool should fail cleanly if the locale isn't one we know is supported
[jira] [Commented] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
[ https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905806#comment-13905806 ]

wolfgang hoschek commented on SOLR-5605:
----------------------------------------

Yes, as already mentioned, otherwise some of the --help text doesn't show up in the output because there's a change related to buffer flushing in argparse4j-0.4.2.

MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
----------------------------------------------------------------------------------------------------------------

Key: SOLR-5605
URL: https://issues.apache.org/jira/browse/SOLR-5605
Project: Solr
Issue Type: Bug
Reporter: Hoss Man
Assignee: Mark Miller
Fix For: 4.7, 5.0

I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest which is reproducible with any seed -- all that matters is the locale. The problem sounded familiar, and a quick search verified that jenkins has in fact hit this a couple of times in the past -- Uwe commented on the list that this is due to a real problem in one of the third-party dependencies (that does the argument parsing) that will affect usage on some systems.

If working around the bug in the arg parsing lib isn't feasible, MapReduceIndexerTool should fail cleanly if the locale isn't one we know is supported
Re: Welcome Benson Margulies as Lucene/Solr committer!
Welcome on board!

Wolfgang.

On Jan 26, 2014, at 4:32 PM, Erick Erickson wrote:

Good to have you aboard!

Erick

On Sat, Jan 25, 2014 at 10:52 PM, Mark Miller markrmil...@gmail.com wrote:

Welcome!

- Mark

http://about.me/markrmiller

On Jan 25, 2014, at 4:40 PM, Michael McCandless luc...@mikemccandless.com wrote:

I'm pleased to announce that Benson Margulies has accepted to join our ranks as a committer.

Benson has been involved in a number of Lucene/Solr issues over time (see http://jirasearch.mikemccandless.com/search.py?index=jira&chg=dds&a1=allUsers&a2=Benson+Margulies ), most recently on debugging tricky analysis issues.

Benson, it is tradition that you introduce yourself with a brief bio. I know you're heavily involved in other Apache projects already...

Once your account is set up, you should then be able to add yourself to the who we are page on the website as well.

Congratulations and welcome!

Mike McCandless

http://blog.mikemccandless.com
[jira] [Commented] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
[ https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862272#comment-13862272 ]

wolfgang hoschek commented on SOLR-5605:
----------------------------------------

Thanks for getting to the bottom of this! Looks like we'll now be good on upgrade to argparse4j-0.4.3, except we'll also need to apply CDH-16434 to MapReduceIndexerTool.java because there's a change related to flushing in 0.4.2:

-parser.printHelp(new PrintWriter(System.out));
+parser.printHelp();

Otherwise some of the --help text doesn't show up in the output :-(

MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
----------------------------------------------------------------------------------------------------------------

Key: SOLR-5605
URL: https://issues.apache.org/jira/browse/SOLR-5605
Project: Solr
Issue Type: Bug
Reporter: Hoss Man

I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest which is reproducible with any seed -- all that matters is the locale. The problem sounded familiar, and a quick search verified that jenkins has in fact hit this a couple of times in the past -- Uwe commented on the list that this is due to a real problem in one of the third-party dependencies (that does the argument parsing) that will affect usage on some systems.

If working around the bug in the arg parsing lib isn't feasible, MapReduceIndexerTool should fail cleanly if the locale isn't one we know is supported
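The flushing pitfall behind this one-line patch: wrapping System.out in a PrintWriter without autoflush means anything still sitting in the wrapper's buffer at JVM exit is silently dropped, because the JVM flushes System.out itself but not wrapper streams layered on top of it. A small illustration (not from the patch):

{code}
import java.io.PrintWriter;

public class LostOutputDemo {
  public static void main(String[] args) {
    // No autoflush: println() only fills the PrintWriter's internal buffer.
    PrintWriter pw = new PrintWriter(System.out);
    pw.println("help text goes here");
    // Without an explicit pw.flush() before exit, the buffered text never
    // reaches the console - which is exactly how parts of --help vanished.
  }
}
{code}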
[jira] [Comment Edited] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
[ https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862272#comment-13862272 ]

wolfgang hoschek edited comment on SOLR-5605 at 1/4/14 11:42 AM:
-----------------------------------------------------------------

Thanks for getting to the bottom of this! Looks like we'll now be good on upgrade to argparse4j-0.4.3, except we'll also need to apply CDH-16434 to MapReduceIndexerTool.java because there's a change related to flushing in 0.4.2:

{code}
-parser.printHelp(new PrintWriter(System.out));
+parser.printHelp();
{code}

Otherwise some of the --help text doesn't show up in the output :-(

was (Author: whoschek):
Thanks for getting to the bottom of this! Looks like we'll now be good on upgrade to argparse4j-0.4.3, except we'll also need to apply CDH-16434 to MapReduceIndexerTool.java because there's a change related to flushing in 0.4.2:

-parser.printHelp(new PrintWriter(System.out));
+parser.printHelp();

Otherwise some of the --help text doesn't show up in the output :-(

MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
----------------------------------------------------------------------------------------------------------------

Key: SOLR-5605
URL: https://issues.apache.org/jira/browse/SOLR-5605
Project: Solr
Issue Type: Bug
Reporter: Hoss Man

I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest which is reproducible with any seed -- all that matters is the locale. The problem sounded familiar, and a quick search verified that jenkins has in fact hit this a couple of times in the past -- Uwe commented on the list that this is due to a real problem in one of the third-party dependencies (that does the argument parsing) that will affect usage on some systems.

If working around the bug in the arg parsing lib isn't feasible, MapReduceIndexerTool should fail cleanly if the locale isn't one we know is supported
[jira] [Commented] (SOLR-5584) Update to Guava 15.0
[ https://issues.apache.org/jira/browse/SOLR-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862273#comment-13862273 ]

wolfgang hoschek commented on SOLR-5584:
----------------------------------------

As mentioned above, morphlines was designed to run fine with any guava version >= 11.0.2. But the hadoop task tracker always puts guava 11.0.2 on the classpath of any MR job that it executes, so solr-mapreduce would need to figure out how to override or reorder that.

Update to Guava 15.0
--------------------

Key: SOLR-5584
URL: https://issues.apache.org/jira/browse/SOLR-5584
Project: Solr
Issue Type: Improvement
Reporter: Mark Miller
Assignee: Mark Miller
Priority: Minor
Fix For: 5.0, 4.7
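One common way to do that overriding on Hadoop 2, sketched here as an assumption rather than the approach the contrib actually took, is the job-level classpath-precedence switch:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class UserClasspathFirst {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Ask MR2 to place the job's own jars ahead of the cluster's jars
    // (e.g. the task tracker's guava-11.0.2) on each task's classpath:
    conf.setBoolean("mapreduce.job.user.classpath.first", true);
    Job job = Job.getInstance(conf, "mr-indexer");
    // ... set mapper/reducer/input/output as usual ...
  }
}
{code}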
Re: The Old Git Discussion
+1

On Jan 2, 2014, at 10:53 PM, Simon Willnauer wrote:

+1

On Thu, Jan 2, 2014 at 9:51 PM, Mark Miller markrmil...@gmail.com wrote:

bzr is dying; Emacs needs to move
http://lists.gnu.org/archive/html/emacs-devel/2014-01/msg5.html

Interesting thread. For similar reasons, I think that Lucene and Solr should eventually move to Git. It's not GitHub, but it's a lot closer. The new Apache projects I see are all choosing Git. It's the winner's road, I think. I don't know that there is a big hurry right now, but I think it's inevitable that we should switch.

--
- Mark
[jira] [Commented] (SOLR-5584) Update to Guava 15.0
[ https://issues.apache.org/jira/browse/SOLR-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858699#comment-13858699 ]

wolfgang hoschek commented on SOLR-5584:
----------------------------------------

What exactly is failing for you? morphlines was designed to run fine with any guava version >= 11.0.2. At least it did last I checked...

Update to Guava 15.0
--------------------

Key: SOLR-5584
URL: https://issues.apache.org/jira/browse/SOLR-5584
Project: Solr
Issue Type: Improvement
Reporter: Mark Miller
Assignee: Mark Miller
Priority: Minor
Fix For: 5.0, 4.7
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856657#comment-13856657 ]

wolfgang hoschek commented on SOLR-1301:
----------------------------------------

Also see https://issues.cloudera.org/browse/CDK-262

Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
-----------------------------------------------------------------------------------

Key: SOLR-1301
URL: https://issues.apache.org/jira/browse/SOLR-1301
Project: Solr
Issue Type: New Feature
Reporter: Andrzej Bialecki
Assignee: Mark Miller
Fix For: 5.0, 4.7
Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, log4j-1.2.15.jar

This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is twofold:

* provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat
* avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network.

Design
------

Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer, and it also instantiates an implementation of SolrDocumentConverter, which is responsible for turning Hadoop (key, value) into a SolrInputDocument. This data is then added to a batch, which is periodically submitted to EmbeddedSolrServer. When the reduce task completes, and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer.

The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken.

This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-N directories can be used to run N shard servers. Additionally, users can specify the number of reduce tasks, in particular 1 reduce task, in which case the output will consist of a single shard.

An example application is provided that processes large CSV files and uses this API. It uses a custom CSV processing to avoid (de)serialization overhead.

This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue, you should put it in contrib/hadoop/lib.

Note: the development of this patch was sponsored by an anonymous contributor and approved for release under Apache License.
[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848097#comment-13848097 ]

wolfgang hoschek edited comment on SOLR-1301 at 12/16/13 2:27 AM:
------------------------------------------------------------------

Might be best to write a program that generates the list of files and then explicitly provide that file list to the MR job, e.g. via the --input-list option. For example you could use the HDFS version of the Linux file system 'find' command for that (HdfsFindTool doc and code here: https://github.com/cloudera/search/tree/master_1.1.0/search-mr#hdfsfindtool)

was (Author: whoschek):
Might be best to write a program that generates the list of files and then explicitly provide that file list to the MR job, e.g. via the --input-list option. For example you could use the HDFS version of the Linux file system 'find' command for that (HdfsFindTool doc and code here: https://github.com/cloudera/search/tree/master_1.1.0/search-mr)

Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
-----------------------------------------------------------------------------------

Key: SOLR-1301
URL: https://issues.apache.org/jira/browse/SOLR-1301
Project: Solr
Issue Type: New Feature
Reporter: Andrzej Bialecki
Assignee: Mark Miller
Fix For: 5.0, 4.7
Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, log4j-1.2.15.jar

This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is twofold:

* provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat
* avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network.

Design
------

Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer, and it also instantiates an implementation of SolrDocumentConverter, which is responsible for turning Hadoop (key, value) into a SolrInputDocument. This data is then added to a batch, which is periodically submitted to EmbeddedSolrServer. When the reduce task completes, and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer.

The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken.

This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-N directories can be used to run N shard servers. Additionally, users can specify the number of reduce tasks, in particular 1 reduce task, in which case the output will consist of a single shard.

An example application is provided that processes large CSV files and uses this API. It uses a custom CSV processing to avoid (de)serialization overhead.

This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue, you should put it in contrib/hadoop/lib.

Note: the development of this patch was sponsored by an anonymous contributor and approved for release under Apache License.
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13848775#comment-13848775 ] wolfgang hoschek commented on SOLR-1301:

bq. it would be convenient if we could ignore the underscore (_) hidden files in hdfs as well as the . hidden files when reading input files from hdfs.

+1
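For reference, skipping such hidden files is a one-liner with a Hadoop PathFilter; the class name below is made up, and the wiring via FileInputFormat.setInputPathFilter is an assumption about how the contrib could hook it in, not a description of the committed code:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;

    /** Accepts only paths whose names do not start with '_' or '.' (Hadoop's hidden-file convention). */
    public class VisibleFilesOnlyFilter implements PathFilter {
      @Override
      public boolean accept(Path path) {
        String name = path.getName();
        return !name.startsWith("_") && !name.startsWith(".");
      }
    }

    // Hypothetical wiring into a job:
    // org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPathFilter(job, VisibleFilesOnlyFilter.class);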
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13848097#comment-13848097 ] wolfgang hoschek commented on SOLR-1301:

Might be best to write a program that generates the list of files and then explicitly provide that file list to the MR job, e.g. via the --input-list option. For example you could use the HDFS version of the Linux file system 'find' command for that (HdfsFindTool doc and code here: https://github.com/cloudera/search/tree/master_1.1.0/search-mr)
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13843443#comment-13843443 ] wolfgang hoschek commented on SOLR-1301:

I'm not aware of anything needing jersey, except perhaps hadoop pulls that in. The combined dependencies of all morphline modules are here: http://cloudera.github.io/cdk/docs/current/cdk-morphlines/cdk-morphlines-all/dependencies.html The dependencies of each individual morphline module are here: http://cloudera.github.io/cdk/docs/current/cdk-morphlines/cdk-morphlines-all/dependencies.html The source and POMs are here, as usual: https://github.com/cloudera/cdk/tree/master/cdk-morphlines

By the way, a somewhat separate issue is that it seems to me that the ivy dependencies for solr-morphlines-core and solr-morphlines-cell and solr-map-reduce are a bit backwards upstream, in that solr-morphlines-core pulls in a ton of dependencies that it doesn't need; those deps should rather be pulled in by solr-map-reduce (which is essentially an out-of-the-box app). It would be good to organize ivy and mvn upstream in such a way that:

* solr-map-reduce depends on solr-morphlines-cell plus cdk-morphlines-all plus xyz
* solr-morphlines-cell depends on solr-morphlines-core plus xyz
* solr-morphlines-core depends on cdk-morphlines-core plus xyz

More concretely, FWIW, to see what the deps look like in production releases downstream, review the following POMs: https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-solr-core/pom.xml and https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-solr-cell/pom.xml and https://github.com/cloudera/search/blob/master_1.1.0/search-mr/pom.xml
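To make the proposed layering concrete, a hypothetical pom.xml fragment for solr-map-reduce under this scheme might look roughly as follows; the groupIds, version values, and the pom-type dependency on cdk-morphlines-all are assumptions for illustration, not the actual build files:

    <!-- solr-map-reduce: the out-of-the-box app layer that bundles the user-level deps -->
    <dependencies>
      <dependency>
        <groupId>org.apache.solr</groupId>
        <!-- which in turn depends on solr-morphlines-core -->
        <artifactId>solr-morphlines-cell</artifactId>
        <version>${project.version}</version>
      </dependency>
      <dependency>
        <groupId>com.cloudera.cdk</groupId>
        <artifactId>cdk-morphlines-all</artifactId>
        <version>0.9.0</version>
        <type>pom</type> <!-- convenience pom aggregating the individual morphline modules -->
      </dependency>
    </dependencies>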
[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13843443#comment-13843443 ] wolfgang hoschek edited comment on SOLR-1301 at 12/9/13 7:30 PM:

I'm not aware of anything needing jersey, except perhaps hadoop pulls that in. The combined dependencies of all morphline modules are here: http://cloudera.github.io/cdk/docs/current/cdk-morphlines/cdk-morphlines-all/dependencies.html The dependencies of each individual morphline module are here: http://cloudera.github.io/cdk/docs/current/dependencies.html The source and POMs are here, as usual: https://github.com/cloudera/cdk/tree/master/cdk-morphlines

By the way, a somewhat separate issue is that it seems to me that the ivy dependencies for solr-morphlines-core and solr-morphlines-cell and solr-map-reduce are a bit backwards upstream, in that currently solr-morphlines-core pulls in a ton of dependencies that it doesn't need; those deps should rather be pulled in by solr-map-reduce (which is essentially an out-of-the-box app that bundles user level deps). Correspondingly, it would be good to organize ivy and mvn upstream in such a way that:

* solr-map-reduce depends on solr-morphlines-cell plus cdk-morphlines-all minus cdk-morphlines-solr-cell (now upstream) minus cdk-morphlines-solr-core (now upstream) plus xyz
* solr-morphlines-cell depends on solr-morphlines-core plus xyz
* solr-morphlines-core depends on cdk-morphlines-core plus xyz

More concretely, FWIW, to see what the deps look like in production releases downstream, review the following POMs: https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-solr-core/pom.xml and https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-solr-cell/pom.xml and https://github.com/cloudera/search/blob/master_1.1.0/search-mr/pom.xml
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13843523#comment-13843523 ] wolfgang hoschek commented on SOLR-1301:

Apologies for the confusion. We are upstreaming cdk-morphlines-solr-cell into the solr contrib solr-morphlines-cell, as well as cdk-morphlines-solr-core into the solr contrib solr-morphlines-core, as well as search-mr into the solr contrib solr-map-reduce. Once the upstreaming is done these old modules will go away. Next, downstream will be made identical to upstream plus perhaps some critical fixes as necessary, and the upstream/downstream terms will apply in the way folks usually think about them; we are not quite there yet today, but getting there...

cdk-morphlines-all is simply a convenience pom that includes all the other morphline poms, so there's less to type for users who like a bit more auto magic.
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13842034#comment-13842034 ] wolfgang hoschek commented on SOLR-1301:

There are also some important fixes downstream in 0.9.0 of cdk-morphlines-core and cdk-morphlines-solr-cell that would be good to merge upstream (solr locator race, solr cell bug, etc). Also, there are new morphline module jars to add with 0.9.0 and jars to update (plus upstream is also missing some morphline modules from 0.8 as well).
[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13842034#comment-13842034 ] wolfgang hoschek edited comment on SOLR-1301 at 12/7/13 2:57 AM:

There are also some important fixes downstream in 0.9.0 of cdk-morphlines-solr-core and cdk-morphlines-solr-cell that would be good to merge upstream (solr locator race, solr cell bug, etc). Also, there are new morphline module jars to add with 0.9.0 and jars to update (plus upstream is also missing some morphline modules from 0.8 as well).
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13839308#comment-13839308 ] wolfgang hoschek commented on SOLR-1301:

There are also some fixes downstream in cdk-morphlines-core and cdk-morphlines-solr-cell that would be good to push upstream.
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13839311#comment-13839311 ] wolfgang hoschek commented on SOLR-1301:

Minor nit: could remove jobConf.setBoolean(ExtractingParams.IGNORE_TIKA_EXCEPTION, false) in MorphlineBasicMiniMRTest + MorphlineGoLiveMiniMRTest, because such a flag is no longer needed, and removing it drops an unnecessary dependency on tika.
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13839556#comment-13839556 ] wolfgang hoschek commented on SOLR-1301:

FWIW, a current printout of --help showing the CLI options is here: https://github.com/cloudera/search/tree/master_1.0.0/search-mr
[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13839556#comment-13839556 ] wolfgang hoschek edited comment on SOLR-1301 at 12/5/13 12:55 AM:

FWIW, a current printout of --help showing the CLI options is here: https://github.com/cloudera/search/tree/master_1.1.0/search-mr
Re: [JENKINS] Lucene-Solr-trunk-Linux (32bit/jdk1.8.0-ea-b117) - Build # 8549 - Still Failing!
On Dec 3, 2013, at 12:11 AM, Uwe Schindler wrote:

Looks like Java's service loader lookup impl has become more strict in Java8. This issue on Java 8 is kind of unfortunate because morphlines and solr-mr don't actually use JAXP at all. For the time being it might be best to disable testing on Java8 for this contrib, in order to get a stable build and make progress on other issues. A couple of options come to mind for how to deal with this longer term:

1) Remove the dependency on cdk-morphlines-saxon (which pulls in the saxon jar)

What is the effect of this? I would prefer this!

The effect is that the convertHTML, xquery and xslt commands won't be available anymore: http://cloudera.github.io/cdk/docs/current/cdk-morphlines/morphlinesReferenceGuide.html#/cdk-morphlines-saxon

2) Replace all Solr calls to JAXP XPathFactory.newInstance() with a little helper that first tries to use one of a list of well-known XPathFactory subclasses, and only if that fails falls back to the generic XPathFactory.newInstance(). E.g. use something like XPathFactory.newInstance(XPathFactory.DEFAULT_OBJECT_MODEL_URI, com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl, ClassLoader.getSystemClassLoader());

This is a hack; just because of this craziness, I don't want to have non-conformant code in Solr Core!

This is actually quite common practice because the JAXP service loader mechanism is a bit flawed. Also, most XSLT and XPath and StAX implementations have serious bugs in various areas. Thus many XML-intensive apps that require reliability and predictable behavior explicitly choose one of the JAXP implementations that's known to work for them, rather than hoping for the best with some potentially buggy default impl. JAXP pluggability really only exists for simple XPath use cases. The good news is that Solr Config et al seem to fit into that simple pluggable bucket. There are 14 such XPathFactory.newInstance() calls in the Solr codebase.

Definite -1

3) Somehow remove the META-INF/services/javax.xml.xpath.XPathFactory file from the saxon jar (this is what's causing this, and we don't need that file, but it's not clear how to remove it, realistically)

The only correct way to solve this: File a bug in Saxon and apply (1). Saxon violates the standards. And this violation fails in a number of JVMs (not only in Java 8, also IBM J9 is affected).

I'll file a bug with saxon and see what Mike Kay's take is. Meanwhile, we could remove the saxon jar or disable tests on java8 / J9 to be able to move forward on this.

Because of this I don't want to have Saxon in Solr at all (you have to know, I am a fan of XSLT and XPath, but Saxon is the worst implementation I have seen and I avoid it whenever possible - only if you need XPath 2 / XSLT 2 might you want to use it).

All XML libs have bugs, but most XML-intensive apps use saxon in production rather than other impls, at least from what I've seen over the years. Anyway, just my 2 cents.

Wolfgang.

Uwe

On Dec 2, 2013, at 4:41 PM, Mark Miller wrote:

Uwe mentioned this in IRC - I guess Saxon doesn’t play nice with java 8. http://stackoverflow.com/questions/7914915/syntax-error-in-javax-xml-xpath-xpathfactory-provider-configuration-file-of-saxo

- Mark

On Dec 2, 2013, at 7:06 PM, Policeman Jenkins Server jenk...@thetaphi.de wrote:

Build: http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Linux/8549/ Java: 32bit/jdk1.8.0-ea-b117 -server -XX:+UseSerialGC 3 tests failed.
FAILED: junit.framework.TestSuite.org.apache.solr.hadoop.MorphlineReducerTest

Error Message: 1 thread leaked from SUITE scope at org.apache.solr.hadoop.MorphlineReducerTest: 1) Thread[id=17, name=Thread-4, state=TIMED_WAITING, group=TGRP-MorphlineReducerTest] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) at org.apache.solr.hadoop.HeartBeater.run(HeartBeater.java:108)

Stack Trace: com.carrotsearch.randomizedtesting.ThreadLeakError: 1 thread leaked from SUITE scope at org.apache.solr.hadoop.MorphlineReducerTest: 1) Thread[id=17, name=Thread-4, state=TIMED_WAITING, group=TGRP-MorphlineReducerTest] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) at
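For option 2 above, a minimal sketch of such a helper could look like the following; the JDK-internal XPathFactoryImpl class name is taken from the mail, and it is an assumption that it exists on the JVMs in question (it is only present on Oracle/OpenJDK-derived JVMs):

    import javax.xml.xpath.XPathFactory;
    import javax.xml.xpath.XPathFactoryConfigurationException;

    /** Prefers well-known XPathFactory implementations over the plain service-loader lookup. */
    public final class SaneXPathFactory {
      // Candidate implementations in preference order.
      private static final String[] KNOWN_IMPLS = {
          "com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl"
      };

      public static XPathFactory newXPathFactory() {
        for (String impl : KNOWN_IMPLS) {
          try {
            return XPathFactory.newInstance(
                XPathFactory.DEFAULT_OBJECT_MODEL_URI, impl,
                ClassLoader.getSystemClassLoader());
          } catch (XPathFactoryConfigurationException e) {
            // This implementation is not available on this JVM; try the next one.
          }
        }
        // Last resort: the generic service-loader based lookup.
        return XPathFactory.newInstance();
      }

      private SaneXPathFactory() {}
    }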
Re: [JENKINS] Lucene-Solr-trunk-Linux (32bit/jdk1.8.0-ea-b117) - Build # 8549 - Still Failing!
FYI, I filed this saxon ticket: https://saxonica.plan.io/issues/1944
Re: [JENKINS] Lucene-Solr-trunk-Linux (32bit/jdk1.8.0-ea-b117) - Build # 8549 - Still Failing!
Actually, Mike's opinion has changed because now Saxon doesn't need to support Java5 anymore - https://saxonica.plan.io/issues/1944 Wolfgang. On Dec 3, 2013, at 2:07 AM, Dawid Weiss wrote: I'll file a bug with saxon and see what Mike Kay's take is I think Mike has already expressed his opinion on the subject in that stack overflow topic... :) Dawid On Tue, Dec 3, 2013 at 9:52 AM, Wolfgang Hoschek whosc...@cloudera.com wrote: On Dec 3, 2013, at 12:11 AM, Uwe Schindler wrote: Looks like Java's service loader lookup impl has become more strict in Java8. This issue on Java 8 is kind of unfortunate because morphlines and solr-mr doesn't actually use JAXP at all. For the time being might be best to disable testing on Java8 for this contrib, in order to get a stable build and make progress on other issues. A couple of options that come to mind in how to deal with this longer term: 1) Remove the dependency on cdk-morphlines-saxon (which pulls in the saxon jar) What ist he effect of this? I would prefer this! The effect is that the convertHTML, xquery and xslt commands won't be available anymore: http://cloudera.github.io/cdk/docs/current/cdk-morphlines/morphlinesReferenceGuide.html#/cdk-morphlines-saxon 2) Replace all Solr calls to JAXP XPathFactory.newInstance() with a little helper that first tries to use one of a list of well known XPathFactory subclasses, and only if that fails falls back to the generic XPathFactory.newInstance(). E.g. use something like XPathFactory.newInstance(XPathFactory.DEFAULT_OBJECT_MODEL_URI, com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl, ClassLoader.getSystemClassLoader()); This is a hack, just because of this craziness, I don't want to have non conformant code in Solr Core! This is actually quite common practice because the JAXP service loader mechanism is a bit flawed. Also, most XSLT and XPath and StaX implementations have serious bugs in various areas. Thus many XML intensive apps that require reliability and predictable behavior explicitly choose one of the JAXP implementation that's known to work for them, rather than hoping for the best with some potentially buggy default impl. JAXP plug-ability really only exists for simple XPath use cases. The good news is that Solr Config et al seems to fit into that simple pluggable bucket. There are 14 such XPathFactory.newInstance() calls in the Solr codebase. Definite -1 3) Somehow remove the META-INF/services/javax.xml.xpath.XPathFactory file from the saxon jar (this is what's causing this, and we don't need that file, but it's not clear how to remove it, realistically) The only correct way to solve this: File a bug in Jackson and apply (1). Jackson violates the standards. And this violation fails in a number of JVMs (not only in Java 8, also IBM J9 is affected). I'll file a bug with saxon and see what Mike Kay's take is. Meanwhile, we could remove the saxon jar or disable tests on java8 J9 to be able to move forward on this. Because of this I don't want to have Jackson in Solr at all (you have to know, I am a fan of XSLT and XPath, but Jackson is the worst implementation I have seen and I avoid it whenever possible - Only if you need XPath2 / XSLT 2 you may want to use it). All XML libs have bugs but most XML intensive apps use saxon in production rather than other impls, at least from what I've seen over the years. Anyway, just my 2 cents. Wolfgang. Uwe On Dec 2, 2013, at 4:41 PM, Mark Miller wrote: Uwe mentioned this in IRC - I guess Saxon doesn’t play nice with java 8. 
http://stackoverflow.com/questions/7914915/syntax-error-in-javax-xml-xpath-xpathfactory-provider-configuration-file-of-saxo - Mark

On Dec 2, 2013, at 7:06 PM, Policeman Jenkins Server jenk...@thetaphi.de wrote: Build: http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Linux/8549/ Java: 32bit/jdk1.8.0-ea-b117 -server -XX:+UseSerialGC 3 tests failed. FAILED: junit.framework.TestSuite.org.apache.solr.hadoop.MorphlineReducerTest Error Message: 1 thread leaked from SUITE scope at org.apache.solr.hadoop.MorphlineReducerTest: 1) Thread[id=17, name=Thread-4, state=TIMED_WAITING, group=TGRP-MorphlineReducerTest] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) at org.apache.solr.hadoop.HeartBeater.run(HeartBeater.java:108) Stack Trace: com.carrotsearch.randomizedtesting.ThreadLeakError: 1 thread leaked from SUITE scope at org.apache.solr.hadoop.MorphlineReducerTest: 1) Thread
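For illustration, the helper described in option 2) of the thread above could look roughly like the following. This is only a sketch, not code from an actual patch: the class name XPathFactoryHelper and the single fallback entry in the list are illustrative assumptions.

    import javax.xml.xpath.XPathFactory;
    import javax.xml.xpath.XPathFactoryConfigurationException;

    // Sketch of the option 2) helper: try well-known XPathFactory impls first,
    // and fall back to the generic service-loader lookup only if none load.
    public final class XPathFactoryHelper {

      private static final String[] WELL_KNOWN_IMPLS = {
          // the JDK-internal default impl named in the thread
          "com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl"
      };

      private XPathFactoryHelper() {}

      public static XPathFactory newXPathFactory() {
        for (String impl : WELL_KNOWN_IMPLS) {
          try {
            return XPathFactory.newInstance(
                XPathFactory.DEFAULT_OBJECT_MODEL_URI, impl,
                ClassLoader.getSystemClassLoader());
          } catch (XPathFactoryConfigurationException e) {
            // impl not available in this JVM; try the next candidate
          }
        }
        // last resort: the plain service-loader based lookup
        return XPathFactory.newInstance();
      }
    }

Under this approach, each of the 14 XPathFactory.newInstance() call sites mentioned in the thread would call XPathFactoryHelper.newXPathFactory() instead.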
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13837976#comment-13837976 ] wolfgang hoschek commented on SOLR-1301: bq. module/dir names I propose morphlines-solr-core and morphlines-solr-cell as names. Thoughts?

Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce. - Key: SOLR-1301 URL: https://issues.apache.org/jira/browse/SOLR-1301 Project: Solr Issue Type: New Feature Reporter: Andrzej Bialecki Assignee: Mark Miller Fix For: 5.0, 4.7 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, log4j-1.2.15.jar

This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is twofold: * provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat * avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network.

Design -- Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer, and it also instantiates an implementation of SolrDocumentConverter, which is responsible for turning Hadoop (key, value) into a SolrInputDocument. This data is then added to a batch, which is periodically submitted to EmbeddedSolrServer. When a reduce task completes, and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer.

The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken. This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-N directories can be used to run N shard servers. Additionally, users can specify the number of reduce tasks, in particular 1 reduce task, in which case the output will consist of a single shard.

An example application is provided that processes large CSV files and uses this API. It uses custom CSV processing to avoid (de)serialization overhead. This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue, you should put it in contrib/hadoop/lib. Note: the development of this patch was sponsored by an anonymous contributor and approved for release under Apache License.

-- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
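To make the converter contract in the design section above concrete, here is a hedged sketch of what an implementation could look like. The exact SolrDocumentConverter signature is an assumption for illustration (one SolrInputDocument per (key, value) pair); the attached patch may define the interface differently.

{code}
import org.apache.hadoop.io.Text;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical converter: turns one Hadoop (key, value) pair into a
// SolrInputDocument, as described in the design section above.
public class CsvDocumentConverter {

  public SolrInputDocument convert(Text key, Text value) {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", key.toString());        // row key becomes the unique id
    String[] cols = value.toString().split(",");
    for (int i = 0; i < cols.length; i++) {
      doc.addField("col" + i + "_s", cols[i]); // illustrative dynamic fields
    }
    return doc;
  }
}
{code}

SolrRecordWriter would invoke such a converter for every reduced pair, batch the resulting documents, and periodically submit the batch to the EmbeddedSolrServer.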
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13837979#comment-13837979 ] wolfgang hoschek commented on SOLR-1301: +1 to map-reduce-indexer module name/dir. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [JENKINS] Lucene-Solr-trunk-Linux (32bit/jdk1.8.0-ea-b117) - Build # 8549 - Still Failing!
Hi Uwe, There is no need for the saxon jar to be in the WAR. The mr contrib module is intended to be run in a separate process. The saxon jar should only be pulled in by the MR contrib module aka map-reduce-indexer contrib module. If that's not the case that's a packaging bug that we should fix. For some more background, here is how the morphline dependency graph looks downstream: http://cloudera.github.io/cdk/docs/current/cdk-morphlines/cdk-morphlines-all/dependencies.html Wolfgang.

On Dec 3, 2013, at 5:14 AM, Uwe Schindler wrote: Wolfgang, does this problem affect all the hadoop modules (because the saxon jar is in all the modules classpath)? If yes, I have to disable all of them with IBM J9 and Oracle Java 8. My biggest problem is the fact that this could also affect the release of Solr. If the saxon.jar is in the WAR file of Solr, then it breaks whole of Solr. But as it is a module, it should be loaded by the SolrResourceLoader from the core's lib folder, so all should be fine, if installed. I hope the huge Hadoop stuff is not in the WAR (not only because of this issue) and needs to be installed by the user in the instance's lib folder!!! Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de

-Original Message- From: dawid.we...@gmail.com [mailto:dawid.we...@gmail.com] On Behalf Of Dawid Weiss Sent: Tuesday, December 03, 2013 12:10 PM To: dev@lucene.apache.org Subject: Re: [JENKINS] Lucene-Solr-trunk-Linux (32bit/jdk1.8.0-ea-b117) - Build # 8549 - Still Failing! Ha! Thanks for filing the issue, Wolfgang. D.

On Tue, Dec 3, 2013 at 12:01 PM, Wolfgang Hoschek whosc...@cloudera.com wrote: Actually, Mike's opinion has changed because now Saxon doesn't need to support Java 5 anymore - https://saxonica.plan.io/issues/1944 Wolfgang. (The rest of the quoted thread is unchanged from the message above.)
[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13837976#comment-13837976 ] wolfgang hoschek edited comment on SOLR-1301 at 12/3/13 6:40 PM: - bq. module/dir names I propose morphlines-solr-core and morphlines-solr-cell as names. This avoids confusion by fitting nicely with the existing naming pattern, which is cdk-morphlines-solr-core and cdk-morphlines-solr-cell. (https://github.com/cloudera/cdk/tree/master/cdk-morphlines). Thoughts? was (Author: whoschek): bq. module/dir names I propose morphlines-solr-core and morphlines-solr-cell as names. Thoughts? -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13838054#comment-13838054 ] wolfgang hoschek commented on SOLR-1301: bq. The problem with these two names is that the artifact names will have solr- prepended, and then solr will occur twice in their names: solr-morphlines-solr-core-4.7.0.jar, solr-morphlines-solr-cell-4.7.0.jar. Yuck. Ah, argh. In this light, what Mark suggested seems good to me as well.
-- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13838064#comment-13838064 ] wolfgang hoschek commented on SOLR-1301: +1 on Steve's suggestion as well. Thanks for helping out! -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13838305#comment-13838305 ] wolfgang hoschek edited comment on SOLR-1301 at 12/3/13 11:11 PM: -- Upon a bit more reflection it might be better to call the contrib map-reduce and the artifact solr-map-reduce. This keeps the door open to potentially later add things like a Hadoop SolrInputFormat, i.e. read from solr via MR, rather than just write to solr via MR. was (Author: whoschek): Upon a bit more reflection might be better to call the contrib map-reduce and the artifact solr-map-reduce. This keeps the door upon to potentially later add things like a Hadoop SolrInputFormat, i.e. read from solr via MR, rather than just write to solr via MR. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13838305#comment-13838305 ] wolfgang hoschek commented on SOLR-1301: Upon a bit more reflection it might be better to call the contrib map-reduce and the artifact solr-map-reduce. This keeps the door open to potentially later add things like a Hadoop SolrInputFormat, i.e. read from solr via MR, rather than just write to solr via MR.
-- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13837068#comment-13837068 ] wolfgang hoschek commented on SOLR-1301: There is also a known issue in that Morphlines don't work on Windows because the Guava Classpath utility doesn't work with Windows path conventions. For example, see http://mail-archives.apache.org/mod_mbox/flume-dev/201310.mbox/%3c5acffcd9-4ad7-4e6e-8365-ceadfac78...@cloudera.com%3E
-- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [JENKINS] Lucene-Solr-trunk-Linux (32bit/jdk1.8.0-ea-b117) - Build # 8549 - Still Failing!
Looks like Java's service loader lookup impl has become more strict in Java 8. This issue on Java 8 is kind of unfortunate because morphlines and solr-mr don't actually use JAXP at all. For the time being it might be best to disable testing on Java 8 for this contrib, in order to get a stable build and make progress on other issues. A couple of options that come to mind for how to deal with this longer term:

1) Remove the dependency on cdk-morphlines-saxon (which pulls in the saxon jar) or

2) Replace all Solr calls to JAXP XPathFactory.newInstance() with a little helper that first tries to use one of a list of well-known XPathFactory subclasses, and only if that fails falls back to the generic XPathFactory.newInstance(). E.g. use something like XPathFactory.newInstance(XPathFactory.DEFAULT_OBJECT_MODEL_URI, com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl, ClassLoader.getSystemClassLoader()); There are 14 such XPathFactory.newInstance() calls in the Solr codebase. or

3) Somehow remove the META-INF/services/javax.xml.xpath.XPathFactory file from the saxon jar (this is what's causing this, and we don't need that file, but it's not clear how to remove it, realistically)

Approach 2) might be best. Thoughts? Wolfgang.

On Dec 2, 2013, at 4:41 PM, Mark Miller wrote: Uwe mentioned this in IRC - I guess Saxon doesn’t play nice with java 8. http://stackoverflow.com/questions/7914915/syntax-error-in-javax-xml-xpath-xpathfactory-provider-configuration-file-of-saxo - Mark

On Dec 2, 2013, at 7:06 PM, Policeman Jenkins Server jenk...@thetaphi.de wrote: Build: http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Linux/8549/ Java: 32bit/jdk1.8.0-ea-b117 -server -XX:+UseSerialGC 3 tests failed. FAILED: junit.framework.TestSuite.org.apache.solr.hadoop.MorphlineReducerTest Error Message: 1 thread leaked from SUITE scope at org.apache.solr.hadoop.MorphlineReducerTest: 1) Thread[id=17, name=Thread-4, state=TIMED_WAITING, group=TGRP-MorphlineReducerTest] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) at org.apache.solr.hadoop.HeartBeater.run(HeartBeater.java:108) Stack Trace: com.carrotsearch.randomizedtesting.ThreadLeakError: 1 thread leaked from SUITE scope at org.apache.solr.hadoop.MorphlineReducerTest: 1) Thread[id=17, name=Thread-4, state=TIMED_WAITING, group=TGRP-MorphlineReducerTest] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) at org.apache.solr.hadoop.HeartBeater.run(HeartBeater.java:108) at __randomizedtesting.SeedInfo.seed([FA8A1D94A2BB2925]:0) FAILED: junit.framework.TestSuite.org.apache.solr.hadoop.MorphlineReducerTest Error Message: There are still zombie threads that couldn't be terminated: 1) Thread[id=17, name=Thread-4, state=TIMED_WAITING, group=TGRP-MorphlineReducerTest] at sun.misc.Unsafe.park(Native Method) at
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) at org.apache.solr.hadoop.HeartBeater.run(HeartBeater.java:108) Stack Trace: com.carrotsearch.randomizedtesting.ThreadLeakError: There are still zombie threads that couldn't be terminated: 1) Thread[id=17, name=Thread-4, state=TIMED_WAITING, group=TGRP-MorphlineReducerTest] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) at org.apache.solr.hadoop.HeartBeater.run(HeartBeater.java:108) at
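As a side note on option 3) from the thread above: stripping the service file is mechanically simple if one is willing to repackage a local copy of the jar. Below is a sketch using the JDK 7 zip filesystem; the class name and the jar-path command-line argument are illustrative, and this is a build-time workaround rather than a proper fix.

    import java.net.URI;
    import java.nio.file.FileSystem;
    import java.nio.file.FileSystems;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Collections;

    // Deletes the JAXP service registration file from a jar in place,
    // so the service loader no longer discovers Saxon's XPathFactory.
    public class StripXPathServiceFile {
      public static void main(String[] args) throws Exception {
        URI jarUri = URI.create("jar:" + Paths.get(args[0]).toUri());
        try (FileSystem zipFs = FileSystems.newFileSystem(
            jarUri, Collections.<String, Object>emptyMap())) {
          Files.deleteIfExists(
              zipFs.getPath("META-INF/services/javax.xml.xpath.XPathFactory"));
        }
      }
    }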
Re: Welcome Joel Bernstein
Welcome Joel! Wolfgang. On Oct 3, 2013, at 9:56 AM, Erick Erickson wrote: Welcome Joel! On Thu, Oct 3, 2013 at 9:33 AM, Martijn v Groningen martijn.v.gronin...@gmail.com wrote: Welcome Joel! On 3 October 2013 15:45, Shawn Heisey s...@elyograg.org wrote: On 10/2/2013 11:24 PM, Grant Ingersoll wrote: The Lucene PMC is happy to welcome Joel Bernstein as a committer on the Lucene and Solr project. Joel has been working on a number of issues on the project and we look forward to his continued contributions going forward. Welcome to the project! Best of luck to you! -- Kind regards, Martijn van Groningen - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Welcome back, Wolfgang Hoschek!
Thanks to all! Looking forward to more contributions. Wolfgang. On Sep 26, 2013, at 3:21 AM, Uwe Schindler wrote: Hi, I'm pleased to announce that after a long abstinence, Wolfgang Hoschek rejoined the Lucene/Solr committer team. He is now working at Cloudera and plans to help with the integration of Solr and Hadoop. Wolfgang originally wrote the MemoryIndex, which is used by the classical Lucene highlighter and ElasticSearch's percolator module. Looking forward to new contributions. Welcome back to heavy committing! :-) Uwe P.S.: Wolfgang, as soon as you have set up your subversion access, you should add yourself back to the committers list on the website as well. - Uwe Schindler uschind...@apache.org Apache Lucene PMC Chair / Committer Bremen, Germany http://lucene.apache.org/ - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13768629#comment-13768629 ] wolfgang hoschek commented on SOLR-1301: cdk-morphlines-solr-core and cdk-morphlines-solr-cell should remain separate and be available through separate maven modules so that clients such as Flume Solr Sink and HBase Indexer can continue to choose to depend (or not depend) on them. For example, not everyone wants Tika and its dependency chain. -- This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13768662#comment-13768662 ] wolfgang hoschek commented on SOLR-1301: Seems like the patch still misses tika-xmp. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13763618#comment-13763618 ] wolfgang hoschek commented on SOLR-1301: FYI, one thing that's definitely off in that adhoc ivy.xml above is that it should use com.typesafe rather than org.skife.com.typesafe.config. Use version 1.0.2 of it. See http://search.maven.org/#search%7Cga%7C1%7Ctypesafe-config Maybe best to wait for Mark to post our full ivy.xml, though. (Moving all our solr-mr dependencies from Cloudera Search maven to ivy was a bit of a beast). -- This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
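For anyone wiring this up, the artifact in question is the HOCON config library whose packages live under com.typesafe.config, which is why the com.typesafe group id matters. A minimal hedged usage sketch follows; the file name morphline.conf is illustrative.

{code}
import java.io.File;
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

// Parse a HOCON file with typesafe-config 1.0.x, the format that
// morphline .conf files are written in. The file name is illustrative.
public class ConfigSmokeTest {
  public static void main(String[] args) {
    Config config = ConfigFactory.parseFile(new File("morphline.conf"));
    System.out.println(config.root().render()); // dump the parsed tree
  }
}
{code}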
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13763636#comment-13763636 ] wolfgang hoschek commented on SOLR-1301: By the way, docs and the downstream code for our solr-mr contrib submission is here: https://github.com/cloudera/search/tree/master/search-mr -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13763644#comment-13763644 ] wolfgang hoschek commented on SOLR-1301: This new solr-mr contrib uses morphlines for ETL from MapReduce into Solr. To get started, here are some pointers for morphlines background material and code: code: https://github.com/cloudera/cdk/tree/master/cdk-morphlines blog post: http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/ reference guide: http://cloudera.github.io/cdk/docs/current/cdk-morphlines/morphlinesReferenceGuide.html slides: http://www.slideshare.net/cloudera/using-morphlines-for-onthefly-etl talk recording: http://www.youtube.com/watch?v=iR48cRSbW6A Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce. - Key: SOLR-1301 URL: https://issues.apache.org/jira/browse/SOLR-1301 Project: Solr Issue Type: New Feature Reporter: Andrzej Bialecki Assignee: Mark Miller Fix For: 4.5, 5.0 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4661) Reduce default maxMerge/ThreadCount for ConcurrentMergeScheduler
[ https://issues.apache.org/jira/browse/LUCENE-4661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13547367#comment-13547367 ] wolfgang hoschek commented on LUCENE-4661: -- Might be good to experiment with Linux block device read-ahead settings (/sbin/blockdev --setra) and to ensure you are using a file system that does write-behind (e.g. ext4 or xfs). Larger buffer sizes typically allow for more concurrent sequential streams even on spindles. Reduce default maxMerge/ThreadCount for ConcurrentMergeScheduler Key: LUCENE-4661 URL: https://issues.apache.org/jira/browse/LUCENE-4661 Project: Lucene - Core Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.1, 5.0 I think our current defaults (maxThreadCount=#cores/2, maxMergeCount=maxThreadCount+2) are too high ... I've frequently found merges falling behind and then slowing each other down when I index on a spinning-magnets drive. As a test, I indexed all of English Wikipedia with term-vectors (= heavy on merging), using 6 threads ... at the defaults (maxThreadCount=3, maxMergeCount=5, for my machine) it took 5288 sec to index + wait for merges + commit. When I changed to maxThreadCount=1, maxMergeCount=2, indexing time dropped to 2902 seconds (45% faster). This is on a spinning-magnets disk... basically spinning-magnets disks don't handle the concurrent IO well. Then I tested an OCZ Vertex 3 SSD: at the current defaults it took 1494 seconds and at maxThreadCount=1, maxMergeCount=2 it took 1795 sec (20% slower). Net/net the SSD can handle merge concurrency just fine. I think we should change the defaults: spinning magnet drives are hurt by the current defaults more than SSDs are helped ... apps that know their IO system is fast can always increase the merge concurrency. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
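For illustration, a minimal sketch of the "apps can always increase the merge concurrency" point, assuming the Lucene 4.x API (ConcurrentMergeScheduler.setMaxMergesAndThreads); the class and method names here are hypothetical, not part of the issue's patch:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

public class MergeTuning {
  // Sketch: an IndexWriterConfig tuned for an SSD, allowing more
  // concurrent merges than would be sensible on a spinning disk.
  public static IndexWriterConfig ssdConfig() {
    ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
    cms.setMaxMergesAndThreads(5, 3); // (maxMergeCount, maxThreadCount)
    return new IndexWriterConfig(Version.LUCENE_41,
        new StandardAnalyzer(Version.LUCENE_41)).setMergeScheduler(cms);
  }
}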
Re: [jira] Field constructor, avoiding String.intern()
On Feb 23, 2007, at 10:28 AM, James Kennedy wrote: True. However, in the case where you are processing Documents one at a time and discarding them (e.g. we use a HitCollector to process all documents from a search), or memory is not an issue, it would be nice to have the ability to disable the interning for performance's sake. I don't know how much it would increase overall throughput in a variety of use cases, but one approach could be to add a copy-like-this factory method like Field.createField(Reader) to Field.java, analogous to the method Term.createTerm(String text) that was added to Term.java some time ago for a similar reason. This would guarantee that the name continues to be interned yet allows avoiding the interning overhead in use cases where a field with the same parametrization (yet a different content String/Reader) is constructed many times, which is probably the most common case where intern() overhead might matter. For example, something like Field f1 = ... Field f2 = f1.createSimilarField(Reader);

/**
 * Optimized construction of new Terms by reusing same field as this Term
 * - avoids field.intern() overhead
 * @param text The text of the new term (field is implicitly same as this Term instance)
 * @return A new Term
 */
public Term createTerm(String text) {
  return new Term(field, text, false);
}

Wolfgang. Robert Engels wrote: I don't think it is just the performance gain of equals() where intern() matters. It also reduces memory consumption dramatically when working with large collections of documents in memory - although this could also be done with constants, there is nothing in Java to enforce it (thus the use of intern()). On Feb 23, 2007, at 12:02 PM, James Kennedy wrote: In our case, we're trying to optimize document() retrieval and we found that disabling the String interning in the Field constructor improved performance dramatically. I agree that interning should be an option on the constructor. For document retrieval, at least for a small amount of fields, the performance gain of using equals() on interned strings is no match for the performance loss of interning the field name of each field. Wolfgang Hoschek-2 wrote: I noticed that, too, but in my case the difference was often much more extreme: it was one of the primary bottlenecks on indexing. This is the primary reason why MemoryIndex.addField(...) navigates around the problem by taking a parameter of type String fieldName instead of type Field:

public void addField(String fieldName, TokenStream stream) {
  /*
   * Note that this method signature avoids having a user call new
   * o.a.l.d.Field(...) which would be much too expensive due to the
   * String.intern() usage of that class.
   */

Wolfgang. On Feb 14, 2006, at 1:42 PM, Tatu Saloranta wrote: After profiling in-memory indexing, I noticed that calls to String.intern() showed up surprisingly high; especially the one from the Field() constructor. This is understandable due to the overhead String.intern() has (being a native, synchronized method; overhead incurred even if the String is already interned), and the fact this essentially gets called once per document+field combination. Now, it would be quite easy to improve things a bit (in theory), such that most intern() calls could be avoided, transparently to the calling app; for example, for each IndexWriter one could use a simple HashMap for caching interned Strings. This approach is more than twice as fast as directly calling intern(). One could also use a per-thread cache, or a global one; all of which would probably be faster.
However, the Field constructor hard-codes the call to intern(), so it would be necessary to add a new constructor that indicates that the field name is known to be interned. And there would also need to be a way to invoke the new optional functionality. Has anyone tried this approach to see if the speedup is worth the hassle (in my case it'd probably be something like 2-3%, assuming the profiler's 5% for intern() is accurate)? -+ Tatu +- - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
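For illustration, a minimal sketch of the per-writer cache idea Tatu describes, as a hypothetical helper class (not part of Lucene; Java 1.4-style, no generics):

import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-IndexWriter cache that avoids most String.intern() calls. */
public class InternCache {
  private final Map cache = new HashMap(); // String -> interned String

  public synchronized String intern(String s) {
    String interned = (String) cache.get(s);
    if (interned == null) {          // first sighting: pay intern() once
      interned = s.intern();
      cache.put(interned, interned); // later lookups are a cheap map hit
    }
    return interned;
  }
}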
Re: [jira] Commented: (LUCENE-794) Beginnings of a span based highlighter
I need to read the TokenStream at least twice. I used the horribly hacky but quick-for-me method of adding a method to MemoryIndex that accepts a List of Tokens. Any ideas? I'm not sure about modifying MemoryIndex. It should be easy enough to create a subclass of TokenStream - (CachedTokenStream perhaps?) which takes a real TokenStream in its constructor and delegates all next calls to it (and also records them in a List) for the first use. This can then be rewound and re-used to run through the same set of tokens held in the list from the first run. Yes, as Mark points out this can be done without API change via the existing MemoryIndex.addField(String fieldName, TokenStream stream). The TokenStream could be constructed along similar lines as done in MemoryIndex.keywordTokenStream(Collection) or perhaps as in org.apache.lucene.index.memory.AnalyzerUtil.getTokenCachingAnalyzer(Analyzer). If needed, an IndexReader can be created from a MemoryIndex via MemoryIndex.createSearcher().getIndexReader(), again without API change. Wolfgang. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
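For illustration, a minimal sketch of such a CachedTokenStream - a hypothetical class, written against the pre-2.9 TokenStream API where next() returns a Token:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class CachedTokenStream extends TokenStream {
  private final TokenStream input;
  private final List cache = new ArrayList();
  private int pos = -1; // -1 means first pass (reading from input)

  public CachedTokenStream(TokenStream input) {
    this.input = input;
  }

  public Token next() throws IOException {
    if (pos >= 0) { // replay mode: serve tokens recorded on the first pass
      return pos < cache.size() ? (Token) cache.get(pos++) : null;
    }
    Token t = input.next(); // first pass: delegate and record
    if (t != null) cache.add(t);
    return t;
  }

  /** Rewind to replay the recorded tokens from the beginning. */
  public void rewind() {
    pos = 0;
  }
}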
[jira] Commented: (LUCENE-129) Finalizers are non-canonical
[ https://issues.apache.org/jira/browse/LUCENE-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462579 ] wolfgang hoschek commented on LUCENE-129: - Just to clarify: The empty finalize() method body in MemoryIndex measurably improves performance of this class and it does not harm correctness because MemoryIndex does not require the superclass semantics wrt. concurrency. Finalizers are non-canonical Key: LUCENE-129 URL: https://issues.apache.org/jira/browse/LUCENE-129 Project: Lucene - Java Issue Type: Bug Components: Other Affects Versions: unspecified Environment: Operating System: other Platform: All Reporter: Esmond Pitt Assigned To: Michael McCandless Priority: Minor Fix For: 2.1 The canonical form of a Java finalizer is:

protected void finalize() throws Throwable {
  try {
    // ... local code to finalize this class
  } catch (Throwable t) {
  }
  super.finalize(); // finalize base class.
}

The finalizers in IndexReader, IndexWriter, and FSDirectory don't conform. This is probably minor or null in effect, but the principle is important. As a matter of fact FSDirectory.finalize() is entirely redundant and could be removed, as it doesn't do anything that RandomAccessFile.finalize wouldn't do automatically. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
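For illustration, a sketch of the empty-finalizer trick the comment above alludes to (class names hypothetical): HotSpot typically treats a trivial finalize() body as if no finalizer were declared, so such instances skip finalizer registration. This is only safe when the superclass cleanup semantics are genuinely not needed.

public class FinalizerDemo {
  static class ReaderWithFinalizer {
    protected void finalize() throws Throwable {
      try {
        // costly cleanup here
      } finally {
        super.finalize();
      }
    }
  }

  /** Subclass that opts out: an empty finalize() body lets the JVM skip
   * finalizer registration, avoiding per-object finalization overhead. */
  static class UnfinalizedReader extends ReaderWithFinalizer {
    protected void finalize() {
    }
  }
}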
[jira] Commented: (LUCENE-550) InstanciatedIndex - faster but memory consuming index
[ http://issues.apache.org/jira/browse/LUCENE-550?page=comments#action_12451817 ] wolfgang hoschek commented on LUCENE-550: - All Lucene unit tests have been adapted to work with my alternate index. Everything but proximity queries passes. Sounds like you're almost there :-) Regarding indexing performance with MemoryIndex: Performance is more than good enough. I've observed and measured that often the bottleneck is not the MemoryIndex itself, but rather the Analyzer type (e.g. StandardAnalyzer) or the I/O for the input files or term lower casing (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265809) or something else entirely. Regarding query performance with MemoryIndex: Some queries are more efficient than others. For example, fuzzy queries are much less efficient than wild card queries, which in turn are much less efficient than simple term queries. Such effects seem partly inherent due to the nature of the query type, partly a function of the chosen data structure (RAMDirectory, MemoryIndex, II, ...), and partly a consequence of the overall Lucene API design. The query mix found in testqueries.txt is more intended for correctness testing than benchmarking. Therein, certain query types dominate over others, and thus, conclusions about the performance of individual aspects cannot easily be drawn. Wolfgang. InstanciatedIndex - faster but memory consuming index - Key: LUCENE-550 URL: http://issues.apache.org/jira/browse/LUCENE-550 Project: Lucene - Java Issue Type: New Feature Components: Store Affects Versions: 1.9 Reporter: Karl Wettin Attachments: class_diagram.png, class_diagram.png, instanciated_20060527.tar, InstanciatedIndexTermEnum.java, lucene.1.9-karl1.jpg, lucene2-karl_20060722.tar.gz, lucene2-karl_20060723.tar.gz After fixing the bugs, it's now 4.5 - 5 times the speed. This is true both at index and query time. Sorry if I got your hopes up too much. There are still things to be done though. Might not have time to do anything with this until next month, so here is the code if anyone wants a peek. Not good enough for Jira yet, but if someone wants to fool around with it, here it is. The implementation passes a TermEnum - TermDocs - Fields - TermVector comparison against the same data in a Directory. When it comes to features, offsets don't exist and positions are stored ugly and have bugs. You might notice that norms are float[] and not byte[]. That is me who refactored it to see if it would do any good. Bit shifting doesn't take many ticks, so I might just revert that. I believe the code is quite self explaining. InstanciatedIndex ii = .. ii.new InstanciatedIndexReader(); ii.addDocument(s).. replace IndexWriter for now. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-550) InstanciatedIndex - faster but memory consuming index
[ http://issues.apache.org/jira/browse/LUCENE-550?page=comments#action_12451768 ] wolfgang hoschek commented on LUCENE-550: - Ok. That means a basic test passes. For some more exhaustive tests, run all the queries in src/test/org/apache/lucene/index/memory/testqueries.txt against matching files such as

String[] files = listFiles(new String[] {
  "*.txt", "*.html", "*.xml", "xdocs/*.xml",
  "src/java/test/org/apache/lucene/queryParser/*.java",
  "src/java/org/apache/lucene/index/memory/*.java",
});

See testMany() for details. Repeat for various analyzer, stopword, and toLowerCase settings, such as

boolean toLowerCase = true;
//boolean toLowerCase = false;
//Set stopWords = null;
Set stopWords = StopFilter.makeStopSet(StopAnalyzer.ENGLISH_STOP_WORDS);
Analyzer[] analyzers = new Analyzer[] {
  //new SimpleAnalyzer(),
  //new StopAnalyzer(),
  //new StandardAnalyzer(),
  PatternAnalyzer.DEFAULT_ANALYZER,
  //new WhitespaceAnalyzer(),
  //new PatternAnalyzer(PatternAnalyzer.NON_WORD_PATTERN, false, null),
  //new PatternAnalyzer(PatternAnalyzer.NON_WORD_PATTERN, true, stopWords),
  //new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS),
};

InstanciatedIndex - faster but memory consuming index - Key: LUCENE-550 URL: http://issues.apache.org/jira/browse/LUCENE-550 Project: Lucene - Java Issue Type: New Feature Components: Store Affects Versions: 1.9 Reporter: Karl Wettin -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-550) InstanciatedIndex - faster but memory consuming index
[ http://issues.apache.org/jira/browse/LUCENE-550?page=comments#action_12451731 ] wolfgang hoschek commented on LUCENE-550: - Other question: when running the driver in test mode (checking for equality of query results against RAMDirectory) does InstantiatedIndex pass all tests? That would be great! InstanciatedIndex - faster but memory consuming index - Key: LUCENE-550 URL: http://issues.apache.org/jira/browse/LUCENE-550 Project: Lucene - Java Issue Type: New Feature Components: Store Affects Versions: 1.9 Reporter: Karl Wettin -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-550) InstanciatedIndex - faster but memory consuming index
[ http://issues.apache.org/jira/browse/LUCENE-550?page=comments#action_12451730 ] wolfgang hoschek commented on LUCENE-550: - What's the benchmark configuration? For example, is throughput bounded by indexing or querying? Measuring N queries against a single preindexed document vs. 1 precompiled query against N documents? See the line boolean measureIndexing = false; // toggle this to measure query performance in my driver. If measuring indexing, what kind of analyzer / token filter chain is used? If measuring queries, what kind of query types are in the mix, with which relative frequencies? You may want to experiment with modifying/commenting/uncommenting various parts of the driver setup, for any given target scenario. Would it be possible to post the benchmark code, test data, queries for analysis? InstanciatedIndex - faster but memory consuming index - Key: LUCENE-550 URL: http://issues.apache.org/jira/browse/LUCENE-550 Project: Lucene - Java Issue Type: New Feature Components: Store Affects Versions: 1.9 Reporter: Karl Wettin -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: MemoryIndex
MemoryIndex was designed to maximize performance for a specific use case: pure in-memory data structure, at most one document per MemoryIndex instance, any number of fields, high frequency reads, high frequency index writes, no thread-safety required, optional support for storing offsets. I briefly considered extending it to the multi-document case, but eventually refrained from doing so, because I didn't really need such functionality myself (no itch). Here are some issues to consider when attempting such an extension:
- The internal data structure would probably look quite different
- Data structure/algorithmic trade-offs regarding time vs space, read vs. write frequency, common vs. less common use cases
- Hence, it may well turn out that there's not much to reuse.
- A priori, it isn't clear whether a new solution would be significantly faster than normal RAMDirectory usage. Thus...
- Need benchmark suite to evaluate the chosen trade-offs.
- Need tests to ensure correctness (in practice, meaning it behaves just like the existing alternative).
I'd say it's a non-trivial undertaking. For example, right now, I don't have time for such an effort. That doesn't mean it's impossible or shouldn't be done, of course. If someone would like to run with it that would be great, but in light of the above issues, I'd suggest doing it in a new class (say MultiMemoryIndex or similar). I believe Mark has done some initial work in that direction, based on an independent (and different) implementation strategy. Wolfgang. On May 2, 2006, at 12:25 AM, Robert Engels wrote: Along the lines of Lucene-550, what about having a MemoryIndex that accepts multiple documents, then writes the index once at the end in the Lucene file format (so it could be merged) during close. When adding documents using an IndexWriter, a new segment is created for each document, and then the segments are periodically merged in memory, and/or with disk segments. It seems that when constructing an Index or updating a lot of documents in an existing index, the write, read, merge cycle is inefficient, and if the documents/field information were maintained in order (TreeMaps) greater efficiency would be realized. With a memory index, the memory needed during update will increase dramatically, but this could still be bounded, and a disk based index segment written when too many documents are in the memory index (max buffered documents). Does this sound like an improvement? Has anyone else tried something like this? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Optimizing/minimizing memory usage of memory-based indexes
Initially it might, but probably eventually not. I was thinking Lucene formats might also be a bit more compact than vanilla hash maps, but I guess that depends on many factors. But I will probably want to play with actual queries later on, based on frequencies. OK. In the latter case, are you using org.apache.lucene.store.RAMDirectory or org.apache.lucene.index.memory.MemoryIndex? I'm using RAMDirectory. Should I be using MemoryIndex maybe instead (I'll check it out)? The main constraint is that a MemoryIndex instance can only hold *one* lucene document (though it can have any number of fields). MemoryIndex is designed to be a transient throw-away data structure, for streaming/publish-subscribe use cases. If it's applicable, MemoryIndex has better performance but worse memory consumption than RAMDirectory. I can't tell whether that may or may not be an issue for your case. Wolfgang. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
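For illustration, a minimal sketch of the single-document MemoryIndex usage pattern (field names and text are made up; assumes the contrib MemoryIndex API of that era):

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class MemoryIndexSketch {
  public static void main(String[] args) throws Exception {
    MemoryIndex index = new MemoryIndex(); // holds exactly one document
    index.addField("title", "Salmon fishing manual", new SimpleAnalyzer());
    index.addField("abstract", "How to catch salmon in rivers", new SimpleAnalyzer());
    Query query = new QueryParser("abstract", new SimpleAnalyzer()).parse("+salmon +catch");
    float score = index.search(query); // > 0.0 means the document matches
    System.out.println("score=" + score);
  }
}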
Re: Advanced query language
On Dec 17, 2005, at 2:36 PM, Paul Elschot wrote: Gentlemen, While maintaining my bookmarks I ran into this: Case Study: Enabling Low-Cost XML-Aware Searching Capable of Complex Querying: http://www.idealliance.org/papers/xmle02/dx_xmle02/papers/03-02-08/03-02-08.html Some loose thoughts: In the system described there a Lucene document is used for each low level xml construct, even when it contains very few characters of text. The resulting Lucene indexes are at least 2.5 times the size of the original document, which is not a surprise given this document structure. Normal index size is about one third of the indexed text. I don't know about the XQuery standard, but I was wondering whether this unusual document structure and the non-straightforward fit between Lucene queries and XQuery queries are related. Seems that a lot of metadata beyond the actual text is stored. For example, node type, ancestors, parent, number of children, etc., for each element and attribute. If the fulltext is relatively small, as is often the case in quite structured XML such as the Shakespeare collection, that should significantly increase storage space. For example, Romeo and Juliet goes along the following lines:

<SPEECH>
  <SPEAKER>FRIAR LAURENCE</SPEAKER>
  <LINE>Not in a grave,</LINE>
  <LINE>To lay one in, another out to have.</LINE>
</SPEECH>
<SPEECH>
  <SPEAKER>ROMEO</SPEAKER>
  <LINE>I pray thee, chide not; she whom I love now</LINE>
  <LINE>Doth grace for grace and love for love allow;</LINE>
  <LINE>The other did not so.</LINE>
</SPEECH>
<SPEECH>
  <SPEAKER>FRIAR LAURENCE</SPEAKER>
  <LINE>O, she knew well</LINE>
  <LINE>Thy love did read by rote and could not spell.</LINE>
  <LINE>But come, young waverer, come, go with me,</LINE>
  <LINE>In one respect I'll thy assistant be;</LINE>
  <LINE>For this alliance may so happy prove,</LINE>
  <LINE>To turn your households' rancour to pure love.</LINE>
</SPEECH>

As for the joins and iterations over items from the stream of XML results: iteration over matching XML constructs should be no problem in Lucene. Joins in Lucene are normally done via boolean filters, so I was wondering how XQuery joins fit these. Similar to SQL. The engine constructs a logical execution plan for the query, and rewrites it into an optimized physical plan as deemed appropriate, perhaps guided by statistics, using a nested loop, hash join, or any other more sophisticated strategy. Wolfgang. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Advanced query language
over matching XML constructs should be no problem in Lucene. Joins in Lucene are normally done via boolean filters, so I was wondering how XQuery joins fit these. The case study above has a note at the end of par 5.3: The Search Result list that comes back could then be organized by document id to group together all the results for a single XML document. This is not provided by default, but has been done with extensions to this code. Regards, Paul Elschot On Friday 16 December 2005 03:45, Wolfgang Hoschek wrote: I think implementing an XQuery Full-Text engine is far beyond the scope of Lucene. Implementing a building block for the fulltext aspect of it would be more manageable. Unfortunately the W3C fulltext drafts indiscriminately mix and mingle two completely different languages into a single language, without clear boundaries. That's why most practical folks implement XQuery fulltext search via extension functions rather than within XQuery itself. This also allows for much more detailed tokenization, configuration and extensibility than what would be possible with the W3C draft. Wolfgang. On Dec 15, 2005, at 4:20 PM, [EMAIL PROTECTED] wrote: Mark, This is very cool. When I was at TripleHop we did something very similar where both query and results conformed to an XML Schema and we used XML over HTTP as our main vehicle to do remote/federated searches with quick rendering with stylesheets. That however is the first piece of the puzzle. If you really want to go beyond search (in the traditional sense) and be able to perform more complex operations such as joins and iterations over items from the stream of XML results you are getting you should consider implementing an XQuery Full-Text engine with Lucene adopting the now standard XQuery language. Here is the pointer to the working draft on the W3C working draft on XQuery 1.0 and XPath 2.0 Full-Text: http://www.w3.org/TR/xquery-full-text/ Now I'm part of the task force editing this draft so your comments are very much welcomed. -- J.D. http://www.inperspective.com/lucene/LXQueryV0_1.zip I've implemented just a few queries (Boolean, Term, FilteredQuery, BoostingQuery ...) but other queries are fairly trivial to add. At this stage I am more interested in feedback on parser design/approach rather than trying to achieve complete coverage of all the Lucene Query types or debating the choice of tag names. Please see the readme.txt in the package for more details. Cheers Mark - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Advanced query language
I think implementing an XQuery Full-Text engine is far beyond the scope of Lucene. Implementing a building block for the fulltext aspect of it would be more manageable. Unfortunately the W3C fulltext drafts indiscriminately mix and mingle two completely different languages into a single language, without clear boundaries. That's why most practical folks implement XQuery fulltext search via extension functions rather than within XQuery itself. This also allows for much more detailed tokenization, configuration and extensibility than what would be possible with the W3C draft. Wolfgang. On Dec 15, 2005, at 4:20 PM, [EMAIL PROTECTED] wrote: Mark, This is very cool. When I was at TripleHop we did something very similar where both query and results conformed to an XML Schema and we used XML over HTTP as our main vehicle to do remote/federated searches with quick rendering with stylesheets. That however is the first piece of the puzzle. If you really want to go beyond search (in the traditional sense) and be able to perform more complex operations such as joins and iterations over items from the stream of XML results you are getting you should consider implementing an XQuery Full-Text engine with Lucene adopting the now standard XQuery language. Here is the pointer to the working draft on the W3C working draft on XQuery 1.0 and XPath 2.0 Full-Text: http://www.w3.org/TR/xquery-full-text/ Now I'm part of the task force editing this draft so your comments are very much welcomed. -- J.D. http://www.inperspective.com/lucene/LXQueryV0_1.zip I've implemented just a few queries (Boolean, Term, FilteredQuery, BoostingQuery ...) but other queries are fairly trivial to add. At this stage I am more interested in feedback on parser design/approach rather than trying to achieve complete coverage of all the Lucene Query types or debating the choice of tag names. Please see the readme.txt in the package for more details. Cheers Mark - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Advanced query language
Right now the Sun STAX impl is decidedly buggy compared to xerces SAX (and it's not faster either). The most complete, reliable and efficient STAX impl seems to be woodstox. Wolfgang. On Dec 15, 2005, at 7:22 PM, Yonik Seeley wrote: Agreed, that is a significant downside. StAX is included in Java 6, but that doesn't help too much given the Java 1.4 req. -Yonik On 12/15/05, Wolfgang Hoschek [EMAIL PROTECTED] wrote: STAX would probably make coding easier, but unfortunately complicates the packaging side: one must ship at least two additional external jars (stax interfaces and impl) for it to become usable. Plus, STAX is quite underspecified (I wrote a STAX parser + serializer impl lately), so there's room for runtime surprises with different impls. The primary advantage of SAX is that everything is included in JDK >= 1.4, and that impls tend to be more mature. SAX bottom line: more hassle early on, less hassle later. Wolfgang. On Dec 15, 2005, at 5:47 PM, Yonik Seeley wrote: On 12/15/05, markharw00d [EMAIL PROTECTED] wrote: At this stage I am more interested in feedback on parser design/approach Excellent idea. While SAX is fast, I've found callback interfaces more difficult to deal with while generating nested object graphs... it normally requires one to maintain state in stack(s). Have you considered a pull-parser like StAX or XPP? They are as fast as SAX, and allow you to ask for the next XML event you are interested in, eliminating the need to keep track of where you are by other means (the place in your own code and normal variables do that). It normally turns into much more natural code. -Yonik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
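For illustration, a minimal sketch of the pull style Yonik describes, using the javax.xml.stream API (the XML snippet is made up):

import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class PullExample {
  public static void main(String[] args) throws Exception {
    XMLStreamReader r = XMLInputFactory.newInstance()
        .createXMLStreamReader(new StringReader("<q><term f='a'>x</term></q>"));
    // the caller pulls events in normal control flow, instead of keeping
    // state across SAX callbacks
    while (r.hasNext()) {
      if (r.next() == XMLStreamConstants.START_ELEMENT) {
        System.out.println("element: " + r.getLocalName());
      }
    }
  }
}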
Re: Advanced query language
That's basically what I'm implementing with Nux, except that the syntax and calling conventions are a bit different, and that Lucene analyzers can optionally be specified, which makes it a lot more powerful (but also a bit more complicated). Wolfgang. On Dec 6, 2005, at 10:48 AM, Incze Lajos wrote: Maybe, I'm a bit late with this, but. There is an ongoing effort at the W3C to define a fulltext search language that could extend their XPath and XQuery languages (which clearly makes sense). These are the current documents on the topic: http://www.w3.org/TR/2005/WD-xquery-full-text-20051103/ http://www.w3.org/TR/2005/WD-xmlquery-full-text-use-cases-20051103/ incze (In this case, the query language itself is not XML, as it has to serve as a selection criterion in an XPath or XQuery expression, but it is XML-conformant, so it may be embedded in any XML doc.) - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Advanced query language
Hopefully that makes sense to someone besides just me. It's certainly a lot more complexity than a simple one-to-one mapping, but it seems to me like the flexibility is worth spending the extra time to design/build it. Makes perfect sense to me, and it doesn't seem any more complex than what's been proposed before. Actually, this may be a quite straightforward, compact and extensible way of doing it all. Though, I'd be careful with proposing a variety of equivalent syntaxes as it may easily lead to more confusion than good. Let's start with one canonical syntax. If desired, other (more pleasant) syntaxes may then be converted to that as part of a preprocessing step. Wolfgang. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Advanced query language
I should add that I'd love to see a powerful, extensible yet easy to read XML based query syntax, and make that available to users of XQuery fulltext search. Here is an example fulltext XQuery that finds all books authored by James that have something to do with 'salmon fishing manuals', sorted by relevance:

declare namespace lucene = "java:nux.xom.pool.FullTextUtil";
declare variable $query := "+salmon~ +fish* manual~"; (: any arbitrary Lucene query can go here :)
(: declare variable $query as xs:string external; :)
for $book in /books/book[author="James" and lucene:match(abstract, $query) > 0.0]
let $score := lucene:match($book/abstract, $query)
order by $score descending
return $book

Now, instead of handing a quite limited lucene query string to lucene:match($query), as above, I'd love to pass it an XML query blurb that makes all of lucene's power accessible without the user having to construct query objects himself. Consider it an additional use case beyond what Erik and others brought up so far... Wolfgang. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: open source YourKit licence
Yonik, I haven't been terribly active lately, but I've been voted in as committer as well... :-) http://marc.theaimsgroup.com/?l=lucene-dev&w=2&r=1&s=hoschek+committer&q=b Cheers, Wolfgang. On Dec 2, 2005, at 2:53 PM, Yonik Seeley wrote: ~yonik/yourkit/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene does NOT use UTF-8.
On Aug 30, 2005, at 12:47 PM, Doug Cutting wrote: Yonik Seeley wrote: I've been looking around... do you have a pointer to the source where just the suffix is converted from UTF-8? I understand the index format, but I'm not sure I understand the problem that would be posed by the prefix length being a byte count. TermBuffer.java:66 Things could work fine if the prefix length were a byte count. A byte buffer could easily be constructed that contains the full byte sequence (prefix + suffix), and then this could be converted to a String. The inefficiency would be if prefix were re-converted from UTF-8 for each term, e.g., in order to compare it to the target. Prefixes are frequently longer than suffixes, so this could be significant. Does that make sense? I don't know whether it would actually be significant, although TermBuffer.java was added recently as a measurable performance enhancement, so this is performance critical code. We need to stop discussing this in the abstract and start coding alternatives and benchmarking them. Is java.nio.charset.CharsetEncoder fast enough? Will moving things through CharBuffer and ByteBuffer be too slow? Should Lucene keep maintaining its own UTF-8 implementation for performance? I don't know, only some experiments will tell. Doug I don't know if it matters for Lucene usage. But if using CharsetEncoder/CharBuffer/ByteBuffer should turn out to be a significant problem, it's probably due to startup/init time of these methods for individually converting many small strings, not inherently due to UTF-8 usage. I'm confident that a custom UTF-8 implementation can almost completely eliminate these issues. I've done this before for binary XML with great success, and it could certainly be done for lucene just as well. Bottom line: It's probably an issue that can be dealt with via proper impl; it probably shouldn't dictate design directions. Wolfgang. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
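As an illustration of the CharsetEncoder route Doug mentions, a minimal sketch of a reusable encoder (a hypothetical helper class, not Lucene code):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

/** Hypothetical helper: one reusable UTF-8 encoder per thread/instance. */
public class Utf8Encoder {
  // created once, reused for many small strings (CharsetEncoder is not
  // thread-safe, so share only within a single thread)
  private final CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();

  public ByteBuffer encode(String s) throws CharacterCodingException {
    // encode() resets the encoder internally, so repeated calls amortize
    // the charset lookup and encoder setup across many small strings
    return encoder.encode(CharBuffer.wrap(s));
  }
}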
[ANN] Nux-1.3 released
The Nux-1.3 release has been uploaded to http://dsd.lbl.gov/nux/ Nux is an open-source Java toolkit making efficient and powerful XML processing easy. Changelog:
•Upgraded to saxonb-8.5 (saxon-8.4 and 8.3 should continue to work as well).
•Upgraded to xom-1.1-rc1 (with compatible performance patches). Plain xom-1.0 should continue to work as well, albeit less efficiently.
•Numerous bnux Binary XML performance enhancements for serialization and deserialization (UTF-8 character encoding, buffer management, symbol table, pack sorting, cache locality, etc.). Overall, bnux is now about twice as fast, and, perhaps more importantly, has a much more uniform performance profile, no matter what kind of document flavour is thrown at it. It routinely delivers 50-100 MB/sec deserialization performance, and 30-70 MB/sec serialization performance (commodity PC 2004). It is roughly 5-10 times faster than xom-1.1 with xerces-2.7.1 (which, in turn, is faster than saxonb-8.5, dom4j-1.6.1 and xerces-2.7.1 DOM). Further, preliminary measurements indicate bnux deserialization and serialization to be consistently 2-3 times faster than Sun's FastInfoSet implementation, using XOM. Saxon's PTree could not be tested as it is only available in the commercial version. The only remaining area with substantial potential for performance improvement seems to be complex namespace handling. This might be addressed by slightly restructuring private XOM internals in a future version.
•BinaryXMLTest now also has command line support for testing and benchmarking Saxon, DOM and FastInfoSet (besides bnux and XOM).
•Rewrote XQueryCommand. The new nux/bin/fire-xquery is a more powerful, flexible and reliable command line test tool that runs a given XQuery against a set of files and prints the result sequence. In addition, it supports schema validation, XInclude (via XOM), an XQuery update facility, malformed HTML parsing (via TagSoup) and much more. It's available for Unix and Windows, and works like any other decent Unix command line tool.
•Removed ValidationCommand (made obsolete by the fire-xquery functionality).
•Added experimental XQuery in-place update functionality. Comments on the usefulness of the current behaviour are especially welcome, as are suggestions for potential improvements.
•Added nux.xom.xquery.ResultSequenceSerializer, which serializes an XQuery/XPath2 result sequence onto a given output stream, using various configurable serialization options such as encoding and indentation. Implements the W3C XQuery/XSLT2 Serialization Draft Spec. Also implements an alternative wrapping algorithm that ensures that any arbitrary result sequence can always be output as a well-formed XML document.
•Added XQueryFactory.createXQuery(File file, URI baseURI) and XQueryPool.getXQuery(File file, URI baseURI) to allow for separation of the location of the query file and input XML files.
•The default XQuery DocumentURIResolver now recognizes the .bnux file extension as binary XML, and parses it accordingly. For example, a query can be 'doc("samples/data/articles.xml.bnux")/articles/*'
•Added FileUtil.listFiles(). Returns the URIs of all files whose path matches at least one of the given inclusion wildcard or regular expressions but none of the given exclusion wildcard or regular expressions; starting from the given directory, optionally with recursive directory traversal, insensitive to underlying operating system conventions.
•XOMUtil.Normalizer now uses the XML whitespace definition rather than the Java whitespace definition.
•Added XOMUtil.Normalizer.STRIP, which removes Texts that consist of whitespace only (boundary whitespace), retaining other strings unchanged.
•Added AnalyzerUtil.getPorterStemmerAnalyzer() for English language stemming on full text search.
•Added XOMUtil.toDocument(String xml) convenience method to parse a string.
•Moved XOMUtil.toByteArray() and XOMUtil.toString() into class FileUtil. The old methods remain available but have been deprecated.
•Added jar-bnux ant target to optionally build a minimal jar file (20 KB) for binary XML only.
•Added more test documents to the samples/data directory.
•Updated license blurbs to 2005. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Analyzer as an Interface?
On Jul 19, 2005, at 12:58 PM, Daniel Naber wrote: Hi, currently Analyzer is an abstract class. Shouldn't we make it an Interface? Currently that's not possible, but it will be as soon as the deprecated method is removed (i.e. after Lucene 1.9). Regards Daniel Daniel, what's the use case that would make this a significant improvement over extending and overriding the single abstract method? Classes that implement multiple interfaces? For consistency, similar thoughts would apply to TokenStream, IndexReader/Writer, etc. Also note that once it's become an interface the API is effectively frozen forever. With abstract classes the option remains open to later add methods with a default impl. (e.g. tokenStream(String fieldName, String text) or whatever). Thanks, Wolfgang. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
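For illustration, a sketch of the flexibility argument (hypothetical future method; the actual Analyzer API may differ): an abstract class can later grow a convenience method with a default implementation without breaking existing subclasses, which an interface could not offer in pre-Java-8 terms.

import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;

public abstract class Analyzer {
  public abstract TokenStream tokenStream(String fieldName, Reader reader);

  // hypothetical later addition with a default impl; existing subclasses
  // keep compiling unchanged
  public TokenStream tokenStream(String fieldName, String text) {
    return tokenStream(fieldName, new StringReader(text));
  }
}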
Re: Lucene vs. Ruby/Odeum
poor java startup time For the ones really keen on reducing startup time the Jolt Java VM daemon may perhaps be of some interest: http://www.dystance.net/software/jolt/index.html I played with it a year ago when I was curious to see what could be done about startup time in the context of simple unix-scriptable command line XML webservice clients (the ones that require tons of jars as dependencies and take ages to initialize). Startup time went from 3-5 secs to zero. Feels like ls - you hit ENTER and the program completes *instantly*. Of course there's a catch. It requires some more work, and it's not a general solution wrt. isolation, security, reliability, etc. but for a simple command line lucene query tool it might just do fine, FWIW. Long-term Sun's MVM might be a more comprehensive solution, with some luck. Wolfgang. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene vs. Ruby/Odeum
As an aside, in my performance testing of Lucene using JProfiler, it seems to me that the only way to improve Lucene's performance greatly can come from 2 areas 1. optimizing the JVM array/looping/JIT constructs/capabilities to avoid bounds checking/improve performance 2. improve function call overhead Other than that, other changes will require a significant change in the code structure (manually unrolling loops), at the sacrifice of readability/maintainability. Just curious: are you more happy with JProfiler than with the JDK 1.5 profiler? I haven't used JProfiler in quite a while but my impression back then was that its overheads tend to significantly perturb measurement results. When I switched to the low-level JDK 1.5 profiler CPU tuning efforts got a lot more targeted and meaningful. So, in my experience, the least perturbing and most accurate profiler is the one built into JDK 1.5. Run java with the '-server -agentlib:hprof=cpu=samples,depth=10' flags for long enough to collect enough samples to be statistically meaningful, then study the trace log and correlate its hotspot trailer with its call stack headers (grep is your friend, a GUI isn't really needed). For a background article on hprof see http://java.sun.com/developer/technicalArticles/Programming/HPROF.html Wolfgang. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: contrib/queryParsers/surround
Cool stuff. Once this has stabilized and settled down I might start exposing the surround language from XQuery/XPath as an experimental match facility. Wolfgang. On May 28, 2005, at 10:07 AM, Paul Elschot wrote: On Saturday 28 May 2005 17:06, Erik Hatcher wrote: On May 28, 2005, at 10:04 AM, Paul Elschot wrote: Dear readers, I've started moving the surround query language http://issues.apache.org/bugzilla/show_bug.cgi?id=34331 into the directory named by the title in my working copy of the lucene trunk. When the tests pass I'll repost it there. In case someone needs this earlier, please holler. As for naming conventions and where this should live in contrib, consider that a user will only want a single query parser and more than that would be unneeded bloat in her application. The contrib pieces are all packaged as a separate JAR per directory under contrib. My recommendation would be to put your wonderful surround parser and supporting infrastructure under contrib/surround. I'm very much looking forward to having this available! Meanwhile the tests pass again with some expected standard output. A little bit of deprecation is left in the CharStream (getLine and getColumn) in the parser. Would you have any idea how to deal with that? I'll leave the build.xml stand-alone with constants for the environment. It was derived from a lucene build.xml of a few eons ago, so I hope someone can still integrate it... Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[ANN] nux-1.2 release
The nux-1.2 release has been uploaded to http://dsd.lbl.gov/nux/ Nux is an open-source Java XML toolset geared towards embedded use in high-throughput XML messaging middleware such as large-scale Peer-to-Peer infrastructures, message queues, publish-subscribe and matchmaking systems for Blogs/newsfeeds, text chat, data acquisition and distribution systems, application level routers, firewalls, classifiers, etc. It is not an XML database, and does not attempt to be one. Changelog: XQuery/XPath: Added optional fulltext search via the Apache Lucene engine. Similar to Google search, it is easy to use, powerful, efficient and goes far beyond what can be done with standard XPath regular expressions and string manipulation functions. It is similar in intent but not directly related to preliminary W3C fulltext search drafts. Rather than targeting fulltext search of infrequent queries over huge persistent data archives (historic search), Nux targets fulltext search of huge numbers of queries over comparatively small transient realtime data (prospective search). See FullTextUtil and MemoryIndex. Example fulltext XQuery that finds all books authored by James that have something to do with 'salmon fishing manuals', sorted by relevance:

declare namespace lucene = "java:nux.xom.pool.FullTextUtil";
declare variable $query := "+salmon~ +fish* manual~"; (: any arbitrary Lucene query can go here :)
(: declare variable $query as xs:string external; :)
for $book in /books/book[author="James" and lucene:match(abstract, $query) > 0.0]
let $score := lucene:match($book/abstract, $query)
order by $score descending
return $book

Example fulltext XQuery that matches on extracted sentences:

declare namespace lucene = "java:nux.xom.pool.FullTextUtil";
for $book in /books/book
for $s in lucene:sentences($book/abstract, 0)
return
  if (lucene:match($s, "+salmon~ +fish* manual~") > 0.0)
  then normalize-space($s)
  else ()

It is designed to enable maximum efficiency for on-the-fly matchmaking combining structured and fuzzy fulltext search in realtime streaming applications such as XQuery based XML message queues, publish-subscribe systems for Blogs/newsfeeds, text chat, data acquisition and distribution systems, application level routers, firewalls, classifiers, etc. Arbitrary Lucene fulltext queries can be run from Java or from XQuery/XPath/XSLT via a simple extension function. The former approach is more flexible whereas the latter is more convenient. Lucene analyzers can split on whitespace, normalize to lower case for case insensitivity, ignore common terms with little discriminatory value such as "he", "in", "and" (stop words), reduce the terms to their natural linguistic root form such as "fishing" being reduced to "fish" (stemming), resolve synonyms/inflexions/thesauri (upon indexing and/or querying), etc. Also see Lucene Query Syntax as well as Query Parser Rules. Background: The first prototype was put together over the weekend. The functionality worked just fine, except that it took ages to index and search text in a high-frequency environment. Subsequently I wrote a complete reimplementation of the Lucene interfaces and contributed that back to Lucene (the bits in org.apache.lucene.index.memory.*). Next, I placed a smart cache in front of it (the bits in nux.xom.pool.FullTextUtil / FullTextPool). The net effect is that fulltext queries over realtime data now run some three orders of magnitude faster while preserving the same general functionality (e.g. 10-50 queries/sec ballpark).
In fact, you'll probably notice little or no overhead when adding fulltext search to your streaming apps. See MemoryIndexBenchmark and XQueryBenchmark. Explore and enjoy, perhaps using the queries and sample data from the samples/fulltext directory as a starting point. Wolfgang.
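For readers who prefer the plain Java route mentioned above, here is a minimal hedged sketch of the same index-one-string-then-query pattern, assuming the 1.4.3-era contrib MemoryIndex API referenced throughout these mails (addField plus createSearcher); field name, text and query are invented for illustration:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class FullTextMatchSketch {
      public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // index a single transient "document" entirely in main memory
        MemoryIndex index = new MemoryIndex();
        index.addField("abstract", "salmon fishing manuals for beginners", analyzer);

        // any arbitrary Lucene query can go here, as in the XQuery examples
        Query query = new QueryParser("abstract", analyzer)
            .parse("+salmon~ +fish* manual~");

        // one-document index: relevance is hits.score(0), or 0 for no match
        IndexSearcher searcher = index.createSearcher();
        Hits hits = searcher.search(query);
        float score = hits.length() > 0 ? hits.score(0) : 0.0f;
        System.out.println("relevance: " + score);
      }
    }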
Add Term.createTerm to avoid 99% of String.intern() calls
For the MemoryIndex, I'm seeing large performance overheads due to repetitive temporary string interning of o.a.l.index.Term. For example, consider a FuzzyTermQuery or similar, scanning all terms via TermEnum in the index: 40% of the time is spent in String.intern() of new Term(). [Allocating temporary memory and FuzzyTermEnum.termCompare are less of a problem according to profiling]. Note that the field name would only need to be interned once, not time and again for each term. But the non-interning Term constructor is private and hence not accessible from o.a.l.index.memory.*. TermBuffer isn't what I'm looking for, and it's private anyway.

The best solution I came up with is to have an additional safe public method in Term.java:

    /** Constructs a term with the given text and the same interned field name as
     * this term (minimizes interning overhead). */
    public Term createTerm(String txt) { // WH
      return new Term(field, txt, false);
    }

Besides dramatically improving performance, this has the benefit of keeping the non-interning constructor private. Comments/opinions, anyone? Here's a sketch of how it can be used:

    public Term term() {
      ...
      if (cachedTerm == null)
        cachedTerm = new Term((String) sortedFields[j].getKey(), "");
      return cachedTerm.createTerm((String) info.sortedTerms[i].getKey());
    }

    public boolean next() {
      ...
      if (...) cachedTerm = null;
    }

I'll send the full patch for MemoryIndex if this is accepted. Wolfgang.
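To make the win concrete, a hedged before/after fragment (the "content" field name is invented; only the allocation pattern matters):

    import org.apache.lucene.index.Term;

    // before: every Term allocation re-interns the field name on the hot path
    Term t1 = new Term("content", termText);      // pays for "content".intern() each call

    // after: intern once via a template term, then reuse its interned field
    Term template = new Term("content", "");      // field name interned exactly once
    Term t2 = template.createTerm(termText);      // proposed method: no interning per term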
Re: Lucene vs. Ruby/Odeum
Right. One doesn't need to run those benchmarks to immediately see that most time is spent in VM startup, class loading, and hotspot compilation rather than anything Lucene related. Even a simple System.out.println("hello") typically takes some 0.3 secs on a fast box and JVM. Wolfgang.

On May 17, 2005, at 7:33 AM, Scott Ganyo wrote: Interesting, but questionable. I can imagine three problems with the write-up just off-hand: 1) JVM startup time. As the author noted, this can be an issue with short-running Java applications. 2) JVM warm-up time. The HotSpot VM is designed to optimize itself and become faster over time rather than being the fastest right out of the blocks. 3) Data access patterns. It is possible (I don't know) that Odeum is designed for quick one-time search on the data without reading and caching the index like Lucene does for subsequent queries. In each case, there is a common theme: Lucene and Java are designed to perform better for longer-running applications... not start, lookup, and terminate utilities. S

On May 16, 2005, at 9:41 PM, Otis Gospodnetic wrote: Some interesting stuff... http://www.zedshaw.com/projects/ruby_odeum/performance.html http://blog.innerewut.de/articles/2005/05/16/ruby-odeum-vs-apache-lucene
Re: [Performance] Streaming main memory indexing of single strings
Here's a performance patch for MemoryIndex.MemoryIndexReader that caches the norms for a given field, avoiding repeated recomputation of the norms. Recall that, depending on the query, norms() can be called over and over again with mostly the same parameters. Thus, replace public byte[] norms(String fieldName) with the following code:

    /** performance hack: cache norms to avoid repeated expensive calculations */
    private byte[] cachedNorms;
    private String cachedFieldName;
    private Similarity cachedSimilarity;

    public byte[] norms(String fieldName) {
      byte[] norms = cachedNorms;
      Similarity sim = getSimilarity();
      if (fieldName != cachedFieldName || sim != cachedSimilarity) { // not cached?
        Info info = getInfo(fieldName);
        int numTokens = info != null ? info.numTokens : 0;
        float n = sim.lengthNorm(fieldName, numTokens);
        byte norm = Similarity.encodeNorm(n);
        norms = new byte[] {norm};
        cachedNorms = norms;
        cachedFieldName = fieldName;
        cachedSimilarity = sim;
        if (DEBUG) System.err.println("MemoryIndexReader.norms: " + fieldName
            + ":" + n + ":" + norm + ":" + numTokens);
      }
      return norms;
    }

The effect can be substantial when measured with the profiler, so it's worth it. Wolfgang.
contrib: keywordTokenStream
Here's a convenience add-on method to MemoryIndex. If it turns out that this could be of wider use, it could be moved into the core analysis package. For the moment the MemoryIndex might be a better home. Opinions, anyone? Wolfgang.

    /**
     * Convenience method; Creates and returns a token stream that generates a
     * token for each keyword in the given collection, "as is", without any
     * transforming text analysis. The resulting token stream can be fed into
     * {@link #addField(String, TokenStream)}, perhaps wrapped into another
     * {@link org.apache.lucene.analysis.TokenFilter}, as desired.
     *
     * @param keywords
     *            the keywords to generate tokens for
     * @return the corresponding token stream
     */
    public TokenStream keywordTokenStream(final Collection keywords) {
      if (keywords == null)
        throw new IllegalArgumentException("keywords must not be null");

      return new TokenStream() {
        Iterator iter = keywords.iterator();
        int pos = 0;
        int start = 0;
        public Token next() {
          if (!iter.hasNext()) return null;

          Object obj = iter.next();
          if (obj == null)
            throw new IllegalArgumentException("keyword must not be null");

          String term = obj.toString();
          Token token = new Token(term, start, start + term.length());
          start += term.length() + 1; // separate words by 1 (blank) character
          pos++;
          return token;
        }
      };
    }
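A hedged usage sketch of the method above, feeding the resulting stream into the addField(String, TokenStream) mentioned in the javadoc (field name and keywords invented for illustration):

    import java.util.Arrays;

    MemoryIndex index = new MemoryIndex();
    index.addField("keywords",
        index.keywordTokenStream(Arrays.asList(new String[] {"lucene", "memory", "index"})));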
Re: contrib: keywordTokenStream
On May 3, 2005, at 5:26 PM, Erik Hatcher wrote: Wolfgang, I've now added this. Thanks :-) I'm not seeing how this could be generally useful. I'm curious how you are using it and why it is better suited for what you're doing than any other analyzer. "keyword tokenizer" is a bit overloaded terminology-wise, though - look in the contrib/analyzers/src/java area to see what I mean. Erik

The difference between this and the KeywordTokenizer from the contrib/analyzers is that it

- can operate on multiple keywords rather than just a single one. So it's slightly more general.

- takes a collection (typically of String values) as input rather than a Reader. I can see the java.io.Reader scalability rationale used throughout the analysis APIs, but for many use cases (including my own) Strings are a lot handier (and more efficient to deal with) - the string values are small anyway.

So it's a convenient way to add terms (keywords if you like) that have been parsed/massaged into string(s) by some existing external means (e.g. grouped regex scanning of legacy formatted text files into various fields, etc) into an index "as is", without any further transforming analysis. Most folks could write such a (non-essential) utility themselves, but it's handy in a similar way that you have the Field.Keyword convenience infrastructure...

"keyword tokenizer" is a bit overloaded terminology-wise, though: If you come up with a better name feel free to rename it. Wolfgang.
Re: [Performance] Streaming main memory indexing of single strings
I'm looking at it right now. The tests pass fine when you put lucene-1.4.3.jar instead of the current lucene onto the classpath, which is what I've been doing so far. Something seems to have changed in the scoring calculation. No idea what that might be. I'll see if I can find out. Wolfgang.

The test case is failing (type "ant test" at the contrib/memory working directory) with this:

    [junit] Testcase: testMany(org.apache.lucene.index.memory.MemoryIndexTest): Caused an ERROR
    [junit] BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, [EMAIL PROTECTED]
    [junit] java.lang.IllegalStateException: BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, [EMAIL PROTECTED]
    [junit] at org.apache.lucene.index.memory.MemoryIndexTest.run(MemoryIndexTest.java:305)
    [junit] at org.apache.lucene.index.memory.MemoryIndexTest.testMany(MemoryIndexTest.java:228)
Re: [Performance] Streaming main memory indexing of single strings
This is what I have as scoring calculation, and it seems to do exactly what lucene-1.4.3 does because the tests pass.

    public byte[] norms(String fieldName) {
      if (DEBUG) System.err.println("MemoryIndexReader.norms: " + fieldName);
      Info info = getInfo(fieldName);
      int numTokens = info != null ? info.numTokens : 0;
      byte norm = Similarity.encodeNorm(getSimilarity().lengthNorm(fieldName, numTokens));
      return new byte[] {norm};
    }

    public void norms(String fieldName, byte[] bytes, int offset) {
      if (DEBUG) System.err.println("MemoryIndexReader.norms: " + fieldName + "*");
      byte[] norms = norms(fieldName);
      System.arraycopy(norms, 0, bytes, offset, norms.length);
    }

    private Similarity getSimilarity() {
      return searcher.getSimilarity(); // this is the normal lucene IndexSearcher
    }

Can anyone see what's wrong with it for lucene current SVN? Should my calculation now be done differently? If so, how? Thanks for any clues into the right direction. Wolfgang.

On May 2, 2005, at 9:05 AM, Wolfgang Hoschek wrote: I'm looking at it right now. The tests pass fine when you put lucene-1.4.3.jar instead of the current lucene onto the classpath, which is what I've been doing so far. Something seems to have changed in the scoring calculation. No idea what that might be. I'll see if I can find out. Wolfgang.

The test case is failing (type "ant test" at the contrib/memory working directory) with this:

    [junit] Testcase: testMany(org.apache.lucene.index.memory.MemoryIndexTest): Caused an ERROR
    [junit] BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, [EMAIL PROTECTED]
    [junit] java.lang.IllegalStateException: BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, [EMAIL PROTECTED]
    [junit] at org.apache.lucene.index.memory.MemoryIndexTest.run(MemoryIndexTest.java:305)
    [junit] at org.apache.lucene.index.memory.MemoryIndexTest.testMany(MemoryIndexTest.java:228)
Re: [Performance] Streaming main memory indexing of single strings
Yes, the svn trunk uses skipTo more often than 1.4.3. However, your implementation of skipTo() needs some improvement. See the javadoc of skipTo of class Scorer: http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Scorer.html#skipTo(int)

What's wrong with the version I sent? Remember that there can be at most one document in a MemoryIndex, so the target parameter can safely be ignored, as far as I can see. In case the underlying scorers provide skipTo() it's even better to use that. The version I sent returns in O(1), if performance was your concern. Or did you mean something else? Wolfgang.
Re: [Performance] Streaming main memory indexing of single strings
The version I sent returns in O(1), if performance was your concern. Or did you mean something else? Since 0 is the only document number in the index, a "return target == 0;" might be nice for skipTo(). It doesn't really help performance, though, and the next() works just as well. Regards, Paul Elschot.

It's not just "return target == 0;". Internally next() switches a hasNext flag to false, and that makes it a safer operation... BTW, did you give the unit tests a shot? Or even better, run it against some of your own queries/test data? That might help to shake out other bugs that might potentially be lurking in remote corners... Cheers, Wolfgang.
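A hedged sketch (not the committed code) of what next()/skipTo() look like for a scorer over a single-document index, where document 0 is the only candidate and both calls answer in O(1):

    private boolean hasNext = true; // becomes false once doc 0 has been consumed

    public boolean next() {
      boolean more = hasNext;
      hasNext = false; // flip the flag: after one document the scorer is exhausted
      return more;
    }

    public boolean skipTo(int target) {
      // only doc 0 exists, so target can be ignored; delegating to next()
      // keeps the exhausted-state handling in one place, as argued above
      return next();
    }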
Re: [Performance] Streaming main memory indexing of single strings
Thanks! Wolfgang. I've committed this change after it successfully worked for me. Thanks! Erik
[Patch] IndexReader.finalize() performance
Here is the first and most high-priority patch I've settled on to get Lucene to work efficiently for the typical usage scenarios of MemoryIndex. More patches are forthcoming if this one is received favourably...

There's large overhead involved in forcing all IndexReader impls to have a finalize() method. Remember that allocating and registering finalizable objects in a JVM isn't cheap at all when it's done at high frequency, which is the case for my single document MemoryIndex usage. MemoryIndex.createSearcher() does a new MemoryIndexReader() which is a subclass of IndexReader and thus carries what for this case amounts to unnecessary IndexReader superclass baggage.

The proposal is to rename IndexReader.finalize() to IndexReader.doFinalize(), and for each subclass of IndexReader that wants or needs finalization add a method

    XYZReader.finalize() { doFinalize(); }

That way subclasses are not forced to be finalizable and incur the associated overheads. Note that it would not help to simply have an empty finalize() {} method, because that would still incur the finalizer JVM registration costs. [The other option would be to have IndexReader be an interface, but that would be a change that's a lot more involved]

Here are two test runs without and with the patch:

    [grolsch /home/portnoy/u5/hoschek/tmp/tmp/firefish] cat xjames.txt
    James is out in the woods

** NOW WITHOUT THE PATCH APPLIED: **

    [grolsch /home/portnoy/u5/hoschek/tmp/tmp/firefish] bin/fire-java org.apache.lucene.index.memory.MemoryIndexTest 3 100 mem James xjames.txt
    ### iteration=0
    *** FILE=xjames.txt
    secs = 15.046  queries/sec= 66462.85   MB/sec = 1.6479818
    ### iteration=1
    *** FILE=xjames.txt
    secs = 15.507  queries/sec= 64487.008  MB/sec = 1.5989896
    ### iteration=2
    *** FILE=xjames.txt
    secs = 15.923  queries/sec= 62802.234  MB/sec = 1.5572149
    Done benchmarking (without checking correctness).
    Dumping CPU usage by sampling running threads ... done.

** NOW WITH THE PATCH APPLIED: **

    [grolsch /home/portnoy/u5/hoschek/tmp/tmp/firefish] bin/fire-java org.apache.lucene.index.memory.MemoryIndexTest 3 100 mem James xjames.txt
    ### iteration=0
    *** FILE=xjames.txt
    secs = 4.974   queries/sec= 201045.44  MB/sec = 4.9850287
    ### iteration=1
    *** FILE=xjames.txt
    secs = 4.495   queries/sec= 222469.42  MB/sec = 5.5162477
    ### iteration=2
    *** FILE=xjames.txt
    secs = 4.49    queries/sec= 222717.16  MB/sec = 5.5223904
    Done benchmarking (without checking correctness).
    Dumping CPU usage by sampling running threads ... done.

If you're curious about the whereabouts of bottlenecks, run java 1.5 with the non-perturbing '-server -agentlib:hprof=cpu=samples,depth=10' flags, then study the trace log and correlate its hotspot trailer with its call stack headers (see hprof tracing: http://java.sun.com/developer/technicalArticles/Programming/HPROF.html). See the tail of the profiler output below and in particular note the following:

    CPU SAMPLES BEGIN (total = 918) Thu Apr 28 11:39:14 2005
    rank   self  accum   count trace method
       1 57.41% 57.41%     527 300154 org.apache.lucene.index.memory.MemoryIndex.createSearcher
       2  5.01% 62.42%      46 300152 java.lang.StrictMath.log
       3  3.05% 65.47%      28 300164 java.lang.ref.Finalizer.invokeFinalizeMethod

cat java.hprof.txt:

    JAVA PROFILE 1.0.1, created Thu Apr 28 11:38:27 2005
    Header for -agentlib:hprof (or -Xrunhprof) ASCII Output (J2SE 1.5 JVMTI based)
    @(#)jvm.hprof.txt 1.3 04/02/09
    Copyright (c) 2004 Sun Microsystems, Inc. All Rights Reserved.
    WARNING!
    This file format is under development, and is subject to change without notice.
    This file contains the following types of records:

    THREAD START
    THREAD END    mark the lifetime of Java threads
    TRACE         represents a Java stack trace. Each trace consists of a series of
                  stack frames. Other records refer to TRACEs to identify (1) where
                  object allocations have taken place, (2) the frames in which GC
                  roots were found, and (3) frequently executed methods.
    HEAP DUMP     is a complete snapshot of all live objects in the Java heap.
                  Following distinctions are made: ROOT (root set as determined by
                  GC), CLS (classes), OBJ (instances), ARR (arrays).
    SITES         is a sorted list of allocation sites. This identifies the most
                  heavily allocated object types, and the TRACE at which those
                  allocations occurred.
    CPU SAMPLES   is a statistical profile of program execution. The VM periodically
                  samples all running threads, and assigns a quantum to active
                  TRACEs in those threads.
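A hedged sketch of the proposed reshaping (XYZReader is the placeholder name from the proposal above, not a real class; the checked exception is assumed):

    // base class no longer declares finalize(), so ordinary subclasses such as
    // MemoryIndexReader never register a finalizer with the JVM at all
    public abstract class IndexReader {
      protected void doFinalize() throws java.io.IOException {
        // release the write lock, close streams - whatever finalize() used to do
      }
    }

    // only subclasses that genuinely need finalization opt back in explicitly:
    public class XYZReader extends IndexReader {
      protected void finalize() throws Throwable {
        try {
          doFinalize();
        } finally {
          super.finalize();
        }
      }
    }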
Re: [Performance] Streaming main memory indexing of single strings
Whichever place you settle on is fine with me. [In case it might make a difference: Just note that MemoryIndex has a small auxiliary dependency on PatternAnalyzer in addField() because the Analyzer superclass doesn't have a tokenStream(String fieldName, String text) method. And PatternAnalyzer requires JDK 1.4 or higher] Wolfgang.

On Apr 27, 2005, at 9:22 AM, Doug Cutting wrote: Erik Hatcher wrote: I'm not quite sure where to put MemoryIndex - maybe it deserves to stand on its own in a new contrib area? That sounds good to me. Or does it make sense to put this into misc (still in sandbox/misc)? Or where? Isn't the goal for sandbox/ to go away, replaced with contrib/? Doug
Re: [Performance] Streaming main memory indexing of single strings
OK. I'll send an update as soon as I get round to it... Wolfgang.

On Apr 27, 2005, at 12:22 PM, Doug Cutting wrote: Erik Hatcher wrote: I'm not quite sure where to put MemoryIndex - maybe it deserves to stand on its own in a new contrib area? That sounds good to me. Ok... once Wolfgang gives me one last round of updates (JUnit tests instead of main() and upgrading it to work with trunk) I'll do that. I had put it in miscellaneous but will create its own sub-contrib area instead. Or does it make sense to put this into misc (still in sandbox/misc)? Or where? Isn't the goal for sandbox/ to go away, replaced with contrib/? Yes. In fact, I moved the last relevant piece (sandbox/contributions/miscellaneous) to contrib last night. I think both the parsers and XML-Indexing-Demo found in the sandbox are not worth preserving. Anyone feel that these pieces left in the sandbox should be preserved? Erik
Re: [Performance] Streaming main memory indexing of single strings
I've uploaded slightly improved versions of the fast MemoryIndex contribution to http://issues.apache.org/bugzilla/show_bug.cgi?id=34585 along with another contrib - PatternAnalyzer. For a quick overview without downloading code, there's javadoc for it all at http://dsd.lbl.gov/nux/api/org/apache/lucene/index/memory/package-summary.html

I'm happy to maintain these classes externally as part of the Nux project. But from the preliminary discussion on the list some time ago I gathered there'd be some wider interest, hence I prepared the contribs for the community. What would be the next steps for taking this further, if any? Thanks, Wolfgang.

    /**
     * Efficient Lucene analyzer/tokenizer that preferably operates on a String rather than a
     * {@link java.io.Reader}, that can flexibly separate on a regular expression {@link Pattern}
     * (with behaviour identical to {@link String#split(String)}),
     * and that combines the functionality of
     * {@link org.apache.lucene.analysis.LetterTokenizer},
     * {@link org.apache.lucene.analysis.LowerCaseTokenizer},
     * {@link org.apache.lucene.analysis.WhitespaceTokenizer},
     * {@link org.apache.lucene.analysis.StopFilter} into a single efficient
     * multi-purpose class.
     * <p>
     * If you are unsure how exactly a regular expression should look like, consider
     * prototyping by simply trying various expressions on some test texts via
     * {@link String#split(String)}. Once you are satisfied, give that regex to
     * PatternAnalyzer. Also see <a target="_blank"
     * href="http://java.sun.com/docs/books/tutorial/extra/regex/">Java Regular Expression Tutorial</a>.
     * <p>
     * This class can be considerably faster than the "normal" Lucene tokenizers.
     * It can also serve as a building block in a compound Lucene
     * {@link org.apache.lucene.analysis.TokenFilter} chain. For example as in this
     * stemming example:
     * <pre>
     * PatternAnalyzer pat = ...
     * TokenStream tokenStream = new SnowballFilter(
     *     pat.tokenStream("content", "James is running round in the woods"),
     *     "English");
     * </pre>
     */

On Apr 22, 2005, at 1:53 PM, Wolfgang Hoschek wrote: I've now got the contrib code cleaned up, tested and documented into a decent state, ready for your review and comments. Consider this a formal contrib (Apache license is attached). The relevant files are attached to the following bug ID: http://issues.apache.org/bugzilla/show_bug.cgi?id=34585

For a quick overview without downloading code, there's some javadoc at http://dsd.lbl.gov/nux/api/org/apache/lucene/index/memory/package-summary.html

There are several small open issues listed in the javadoc and also inside the code. Thoughts? Comments?

I've also got small performance patches for various parts of Lucene core (not submitted yet). Taken together they lead to substantially improved performance for MemoryIndex, and most likely also for Lucene in general. Some of them are more involved than others. I'm now figuring out how much performance each of these contributes and how to propose potential integration - stay tuned for some follow-ups to this. The code as submitted would certainly benefit a lot from said patches, but they are not required for correct operation. It should work out of the box (currently only on 1.4.3 or lower). Try running

    cd lucene-cvs
    java org.apache.lucene.index.memory.MemoryIndexTest

with or without custom arguments to see it in action.
Before turning to a performance patch discussion I'd at this point rather be most interested in folks giving it a spin, comments on the API, or any other issues. Cheers, Wolfgang.

On Apr 20, 2005, at 11:26 AM, Wolfgang Hoschek wrote: On Apr 20, 2005, at 9:22 AM, Erik Hatcher wrote: On Apr 20, 2005, at 12:11 PM, Wolfgang Hoschek wrote: By the way, by now I have a version against 1.4.3 that is 10-100 times faster (i.e. 3 - 20 index+query steps/sec) than the simplistic RAMDirectory approach, depending on the nature of the input data and query. From some preliminary testing it returns exactly what RAMDirectory returns.

Awesome. Using the basic StringIndexReader I sent?

Yep, it's loosely based on the empty skeleton you sent. I've been fiddling with it a bit more to get other query types. I'll add it to the contrib area when it's a bit more robust. Perhaps we could merge up once I'm ready and put that into the contrib area? My version now supports tokenization with any analyzer and it supports any arbitrary Lucene query. I might make the API for adding terms a little more general, perhaps allowing arbitrary Document objects if that's what other folks really need...

As an aside, is there any work going on to potentially support prefix (and infix) wild card queries a la *fish?

WildcardQuery supports wildcard characters anywhere in the string. QueryParser itself restricts expressions that have leading
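A hedged usage sketch of PatternAnalyzer as documented in the javadoc above (the three-argument constructor - split pattern, lower-casing flag, stop-word set - is assumed from that description, not confirmed here):

    import java.util.regex.Pattern;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.index.memory.PatternAnalyzer;

    // split on whitespace, lower-case all tokens, no stop words
    PatternAnalyzer analyzer =
        new PatternAnalyzer(Pattern.compile("\\s+"), true, null);
    TokenStream stream =
        analyzer.tokenStream("content", "James is running round in the woods");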
Re: [Performance] Streaming main memory indexing of single strings
I've now got the contrib code cleaned up, tested and documented into a decent state, ready for your review and comments. Consider this a formal contrib (Apache license is attached). The relevant files are attached to the following bug ID: http://issues.apache.org/bugzilla/show_bug.cgi?id=34585

For a quick overview without downloading code, there's some javadoc at http://dsd.lbl.gov/nux/api/org/apache/lucene/index/memory/package-summary.html

There are several small open issues listed in the javadoc and also inside the code. Thoughts? Comments?

I've also got small performance patches for various parts of Lucene core (not submitted yet). Taken together they lead to substantially improved performance for MemoryIndex, and most likely also for Lucene in general. Some of them are more involved than others. I'm now figuring out how much performance each of these contributes and how to propose potential integration - stay tuned for some follow-ups to this. The code as submitted would certainly benefit a lot from said patches, but they are not required for correct operation. It should work out of the box (currently only on 1.4.3 or lower). Try running

    cd lucene-cvs
    java org.apache.lucene.index.memory.MemoryIndexTest

with or without custom arguments to see it in action.

Before turning to a performance patch discussion I'd at this point rather be most interested in folks giving it a spin, comments on the API, or any other issues. Cheers, Wolfgang.

On Apr 20, 2005, at 11:26 AM, Wolfgang Hoschek wrote: On Apr 20, 2005, at 9:22 AM, Erik Hatcher wrote: On Apr 20, 2005, at 12:11 PM, Wolfgang Hoschek wrote: By the way, by now I have a version against 1.4.3 that is 10-100 times faster (i.e. 3 - 20 index+query steps/sec) than the simplistic RAMDirectory approach, depending on the nature of the input data and query. From some preliminary testing it returns exactly what RAMDirectory returns.

Awesome. Using the basic StringIndexReader I sent?

Yep, it's loosely based on the empty skeleton you sent. I've been fiddling with it a bit more to get other query types. I'll add it to the contrib area when it's a bit more robust. Perhaps we could merge up once I'm ready and put that into the contrib area? My version now supports tokenization with any analyzer and it supports any arbitrary Lucene query. I might make the API for adding terms a little more general, perhaps allowing arbitrary Document objects if that's what other folks really need...

As an aside, is there any work going on to potentially support prefix (and infix) wild card queries a la *fish?

WildcardQuery supports wildcard characters anywhere in the string. QueryParser itself restricts expressions that have leading wildcards from being accepted. Any particular reason for this restriction? Is this simply a current parser limitation or something inherent? QueryParser supports wildcard characters in the middle of strings no problem though. Are you seeing otherwise?

I meant an infix query such as *fish* Wolfgang.

---
Wolfgang Hoschek                  | email: [EMAIL PROTECTED]
Distributed Systems Department    | phone: (415)-533-7610
Berkeley Laboratory               | http://dsd.lbl.gov/~hoschek/
---
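For reference, a hedged sketch of constructing such an infix query programmatically, bypassing QueryParser's leading-wildcard restriction (the field name is invented):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.WildcardQuery;

    // WildcardQuery itself accepts a leading wildcard; only QueryParser rejects it
    Query infix = new WildcardQuery(new Term("abstract", "*fish*"));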
Re: [Performance] Streaming main memory indexing of single strings
Good point. By the way, by now I have a version against 1.4.3 that is 10-100 times faster (i.e. 3 - 20 index+query steps/sec) than the simplistic RAMDirectory approach, depending on the nature of the input data and query. From some preliminary testing it returns exactly what RAMDirectory returns. I'll do some cleanup and documentation and then post this to the list for review RSN. As an aside, is there any work going on to potentially support prefix (and infix) wild card queries a la *fish? Wolfgang.

On Apr 20, 2005, at 6:10 AM, Vanlerberghe, Luc wrote: One reason to choose the 'simplistic IndexReader' approach to this problem over regexes is that the result should be 'bug-compatible' with a standard search over all documents. Differences between the two systems would be difficult to explain to an end-user (let alone for the developer to debug and find the reason in the first place!) Luc

-Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Saturday, April 16, 2005 2:09 AM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings

On Apr 15, 2005, at 6:15 PM, Wolfgang Hoschek wrote: Cool! For my use case it would need to be able to handle arbitrary queries (previously parsed from a general lucene query string). Something like:

    float match(String text, Query query)

It's fine with me if it also works for

    float[] match(String[] texts, Query query)

or

    float match(Document doc, Query query)

but that isn't required by the use case.

My implementation is nearly that. The score is available as hits.score(0). You would also need an analyzer, I presume, passed to your proposed match() method if you want the text broken into terms. My current implementation is passed a String[] where each item is considered a term for the document. match() would also need a field name to be fully accurate - since the analyzer needs a field name and terms used for searching need a field name. The Query may contain terms for any number of fields - how should that be handled? Should only a single field name be passed in and any terms requested for other fields be ignored? Or should this utility morph to assume any words in the text are in any field being asked of it?

As for Doug's devil's advocate questions - I really don't know what I'd use it for personally (other than the "match this single string against a bunch of queries" case), I just thought it was clever that it could be done. Clever regexes could come close, but it'd be a lot more effort than reusing good ol' QueryParser and this simplistic IndexReader, along with an Analyzer. Erik

Wolfgang.

I am intrigued by this and decided to mock a quick and dirty example of such an IndexReader. After a little trial-and-error I got it working at least for TermQuery and WildcardQuery. I've pasted my code below as an example, but there is much room for improvement, especially in terms of performance and also in keeping track of term frequency, and also it would be nicer if it handled the analysis internally. I think something like this would make a handy addition to our contrib area at least. I'd be happy to receive improvements to this and then add it to a contrib subproject. Perhaps this would be a handy way to handle situations where users have queries saved in a system and need to be alerted whenever a new document arrives matching the saved queries?
Erik

-Original Message- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 4:04 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings

This seems to be a promising avenue worth exploring. My gut feeling is that this could easily be 10-100 times faster. The drawback is that it requires a fair amount of understanding of intricate Lucene internals, pulling those pieces together and adapting them as required for the seemingly simple float match(String text, Query query). I might give it a shot but I'm not sure I'll be able to pull this off! Is there any similar code I could look at as a starting point? Wolfgang.

On Apr 14, 2005, at 1:13 PM, Robert Engels wrote: I think you are not approaching this the correct way. Pseudo code (a sketch of the tokenization steps follows at the end of this message):

1. Subclass IndexReader.
2. Get tokens from String 'document' using Lucene analyzers.
3. Build simple hash-map based data structures using tokens for terms, and term positions.
4. Reimplement termDocs() and termPositions() to use the structures from above.
5. Run searches.
6. Start again with next document.

-Original Message- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 2:56 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings

Otis, this might be a misunderstanding. - I'm not calling optimize(). That piece is commented out if you look again at the code. - The *streaming* use case requires that for each query I add one
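As promised above, a hedged sketch of steps 2-3 of Robert's pseudo code against the 1.4.3-era analysis API: tokenize one String 'document' and build the term-to-positions map that reimplemented termDocs()/termPositions() could serve from (field name and variable names invented for illustration):

    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.HashMap;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    // term text -> ArrayList of Integer positions, for a single in-memory document
    HashMap termPositions = new HashMap();
    TokenStream stream = analyzer.tokenStream("content", new StringReader(text));
    int pos = -1;
    for (Token t = stream.next(); t != null; t = stream.next()) {
      pos += t.getPositionIncrement(); // honor position increments (stop words etc.)
      ArrayList positions = (ArrayList) termPositions.get(t.termText());
      if (positions == null) {
        positions = new ArrayList();
        termPositions.put(t.termText(), positions);
      }
      positions.add(new Integer(pos));
    }
    stream.close();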