[jira] [Commented] (SOLR-6907) URLEncode documents directory in MorphlineMapperTest
[ https://issues.apache.org/jira/browse/SOLR-6907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263599#comment-14263599 ]

wolfgang hoschek commented on SOLR-6907:
----------------------------------------

+1 Looks reasonable to me.

URLEncode documents directory in MorphlineMapperTest
-----------------------------------------------------

Key: SOLR-6907
URL: https://issues.apache.org/jira/browse/SOLR-6907
Project: Solr
Issue Type: Bug
Components: contrib - MapReduce, Tests
Reporter: Ramkumar Aiyengar
Priority: Minor

Currently the test fails if the source is checked out into a directory whose path contains, say, spaces.
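For context, the failure mode is the classic one where a raw filesystem path is pasted into a URL. A minimal sketch of the kind of fix the issue summary suggests - the checkout path below is hypothetical, and the actual patch is not shown in this thread - is to let java.io.File#toURI() do the escaping instead of concatenating the raw path:

{code}
import java.io.File;

public class PathToUriDemo {
  public static void main(String[] args) {
    // Hypothetical checkout path containing spaces:
    File docsDir = new File("/home/jenkins/solr check out/test-documents");

    // Naive concatenation yields an invalid URL when the path has spaces:
    String broken = "file://" + docsDir.getAbsolutePath();

    // File#toURI() percent-encodes special characters (' ' becomes %20):
    String encoded = docsDir.toURI().toString();

    System.out.println(broken);
    System.out.println(encoded);
  }
}
{code}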
[jira] [Commented] (SOLR-4509) Disable HttpClient stale check for performance and fewer spurious connection errors.
[ https://issues.apache.org/jira/browse/SOLR-4509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224815#comment-14224815 ]

wolfgang hoschek commented on SOLR-4509:
----------------------------------------

Would be good to remove that stale check also in solrj.

Disable HttpClient stale check for performance and fewer spurious connection errors.
-------------------------------------------------------------------------------------

Key: SOLR-4509
URL: https://issues.apache.org/jira/browse/SOLR-4509
Project: Solr
Issue Type: Improvement
Components: search
Environment: 5 node SmartOS cluster (all nodes living in same global zone - i.e. same physical machine)
Reporter: Ryan Zezeski
Assignee: Mark Miller
Priority: Minor
Fix For: 5.0, Trunk
Attachments: IsStaleTime.java, SOLR-4509-4_4_0.patch, SOLR-4509.patch, SOLR-4509.patch, SOLR-4509.patch, SOLR-4509.patch, baremetal-stale-nostale-med-latency.dat, baremetal-stale-nostale-med-latency.svg, baremetal-stale-nostale-throughput.dat, baremetal-stale-nostale-throughput.svg

By disabling the Apache HTTP Client stale check I've witnessed a 2-4x increase in throughput and a reduction of over 100ms. This patch was made in the context of a project I'm leading, called Yokozuna, which relies on distributed search.

Here's the patch on Yokozuna: https://github.com/rzezeski/yokozuna/pull/26

Here's a write-up I did on my findings: http://www.zinascii.com/2013/solr-distributed-search-and-the-stale-check.html

I'm happy to answer any questions or make changes to the patch to make it acceptable.

ReviewBoard: https://reviews.apache.org/r/28393/
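For readers unfamiliar with the knob in question: in the HttpClient 4.3-era API the stale check can be switched off per client via RequestConfig. A hedged sketch - this shows the library API, not the exact wiring the Solr patch uses:

{code}
import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class NoStaleCheckClient {
  public static CloseableHttpClient build() {
    // Skip the isStale() probe that otherwise runs before every request
    // served from the connection pool:
    RequestConfig config = RequestConfig.custom()
        .setStaleConnectionCheckEnabled(false)
        .build();
    return HttpClients.custom()
        .setDefaultRequestConfig(config)
        .build();
  }
}
{code}

The trade-off is that a connection the server has already closed may be reused and fail, which is why pools typically pair this setting with evicting idle connections instead of probing them per request.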
[jira] [Commented] (SOLR-6212) upgrade Saxon-HE to 9.5.1-5 and reinstate Morphline tests that were affected under java 8/9 with 9.5.1-4
[ https://issues.apache.org/jira/browse/SOLR-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047223#comment-14047223 ]

wolfgang hoschek commented on SOLR-6212:
----------------------------------------

This is already fixed in the latest stable morphline release per http://kitesdk.org/docs/current/release_notes.html

upgrade Saxon-HE to 9.5.1-5 and reinstate Morphline tests that were affected under java 8/9 with 9.5.1-4
---------------------------------------------------------------------------------------------------------

Key: SOLR-6212
URL: https://issues.apache.org/jira/browse/SOLR-6212
Project: Solr
Issue Type: Bug
Affects Versions: 4.7, 5.0
Reporter: Michael Dodsworth
Assignee: Mark Miller
Priority: Minor

From SOLR-1301: For posterity, there is a thread on the dev list where we are working through an issue with Saxon on Java 8 and IBM's J9. Wolfgang filed https://saxonica.plan.io/issues/1944 upstream. (Saxon is pulled in via cdk-morphlines-saxon.)

Due to this issue, several Morphline tests were made to be 'ignored' on Java 8+. The Saxon issue has been fixed in 9.5.1-5, so we should upgrade and reinstate those tests.
[jira] [Commented] (SOLR-5109) Solr 4.4 will not deploy in Glassfish 4.x
[ https://issues.apache.org/jira/browse/SOLR-5109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047391#comment-14047391 ]

wolfgang hoschek commented on SOLR-5109:
----------------------------------------

FWIW, morphlines currently won't work with guava-16 or guava-17 because of the incompatible guava API changes in the guava Closeables class in those two guava releases. However, there's a fix for this issue that will show up soon in kite-morphlines 0.15.0. See https://github.com/kite-sdk/kite/commit/0ab2795872e4e5721f477d79e5049371a17ab8db

Solr 4.4 will not deploy in Glassfish 4.x
------------------------------------------

Key: SOLR-5109
URL: https://issues.apache.org/jira/browse/SOLR-5109
Project: Solr
Issue Type: Bug
Affects Versions: 4.4
Environment: Glassfish 4.x
Reporter: jamon camisso
Priority: Blocker
Labels: guava
Attachments: LUCENE-5109.patch, guava-15.0-SNAPSHOT.jar

The bundled Guava 14.0.1 JAR blocks deploying Solr 4.4 in Glassfish 4.x. This failure is a known issue with upstream Guava and is described here: https://code.google.com/p/guava-libraries/issues/detail?id=1433

Building Guava guava-15.0-SNAPSHOT.jar from master and bundling it in Solr allows for a successful deployment. Until the Guava developers release version 15, using their HEAD or even an RC tag seems like the only way to resolve this.

This is frustrating since it was proposed that Guava be removed as a dependency before Solr 4.0 was released, and yet it remains and blocks upgrading: https://issues.apache.org/jira/browse/SOLR-3601
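The incompatible change alluded to is of the kind where a widely used static helper disappears - notably, Guava 16 removed Closeables.closeQuietly(Closeable). The actual kite fix is behind the commit link above; one generic way to insulate code from such churn, sketched here as an assumption rather than what kite did, is a tiny local helper:

{code}
import java.io.Closeable;
import java.io.IOException;

public final class CompatCloseables {
  private CompatCloseables() {}

  // Behaves identically no matter which Guava version wins the classpath race:
  public static void closeQuietly(Closeable closeable) {
    if (closeable == null) return;
    try {
      closeable.close();
    } catch (IOException ignored) {
      // deliberately swallowed; intended for cleanup paths only
    }
  }
}
{code}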
[jira] [Comment Edited] (SOLR-5109) Solr 4.4 will not deploy in Glassfish 4.x
[ https://issues.apache.org/jira/browse/SOLR-5109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047394#comment-14047394 ]

wolfgang hoschek edited comment on SOLR-5109 at 6/30/14 5:36 AM:
-----------------------------------------------------------------

Another potential issue is that hadoop ships with guava-11.0.2 on the classpath of the task tracker (the JVM that runs the job). So this old guava version will race with any other guava version that happens to be on the classpath.

was (Author: whoschek):
Another potential issue is that hadoop ships with guava-12.0.1 on the classpath of the task tracker (the JVM that runs the job). So this old guava version will race with any other guava version that happens to be on the classpath.

Solr 4.4 will not deploy in Glassfish 4.x
------------------------------------------

Key: SOLR-5109
URL: https://issues.apache.org/jira/browse/SOLR-5109
Project: Solr
Issue Type: Bug
Affects Versions: 4.4
Environment: Glassfish 4.x
Reporter: jamon camisso
Priority: Blocker
Labels: guava
Attachments: LUCENE-5109.patch, guava-15.0-SNAPSHOT.jar

The bundled Guava 14.0.1 JAR blocks deploying Solr 4.4 in Glassfish 4.x. This failure is a known issue with upstream Guava and is described here: https://code.google.com/p/guava-libraries/issues/detail?id=1433

Building Guava guava-15.0-SNAPSHOT.jar from master and bundling it in Solr allows for a successful deployment. Until the Guava developers release version 15, using their HEAD or even an RC tag seems like the only way to resolve this.

This is frustrating since it was proposed that Guava be removed as a dependency before Solr 4.0 was released, and yet it remains and blocks upgrading: https://issues.apache.org/jira/browse/SOLR-3601
[jira] [Commented] (SOLR-5109) Solr 4.4 will not deploy in Glassfish 4.x
[ https://issues.apache.org/jira/browse/SOLR-5109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047394#comment-14047394 ]

wolfgang hoschek commented on SOLR-5109:
----------------------------------------

Another potential issue is that hadoop ships with guava-12.0.1 on the classpath of the task tracker (the JVM that runs the job). So this old guava version will race with any other guava version that happens to be on the classpath.

Solr 4.4 will not deploy in Glassfish 4.x
------------------------------------------

Key: SOLR-5109
URL: https://issues.apache.org/jira/browse/SOLR-5109
Project: Solr
Issue Type: Bug
Affects Versions: 4.4
Environment: Glassfish 4.x
Reporter: jamon camisso
Priority: Blocker
Labels: guava
Attachments: LUCENE-5109.patch, guava-15.0-SNAPSHOT.jar

The bundled Guava 14.0.1 JAR blocks deploying Solr 4.4 in Glassfish 4.x. This failure is a known issue with upstream Guava and is described here: https://code.google.com/p/guava-libraries/issues/detail?id=1433

Building Guava guava-15.0-SNAPSHOT.jar from master and bundling it in Solr allows for a successful deployment. Until the Guava developers release version 15, using their HEAD or even an RC tag seems like the only way to resolve this.

This is frustrating since it was proposed that Guava be removed as a dependency before Solr 4.0 was released, and yet it remains and blocks upgrading: https://issues.apache.org/jira/browse/SOLR-3601
Re: Adding Morphline support to DIH - worth the effort?
From our perspective we don't really see use cases for DIH anymore. Morphlines was developed primarily with Lucene in mind (even though it doesn't require Lucene). Flume Morphline Solr Sink handles streaming ingestion into Solr in reliable, scalable, flexible and loosely coupled ways, in separate processes. Neither Flume nor Morphlines requires Hadoop. MapReduceIndexerTool uses Morphlines for reliable, scalable and flexible batch ingestion on Hadoop. On Hadoop, even the JDBC/SQL portion of DIH now seems mostly covered by a combination of Sqoop and MapReduceIndexerTool, and perhaps a bit of Hive. I'm not sure what the use cases for DIH still are these days. (I wrote most of the Morphlines framework, Flume Morphline Solr Sink, MapReduceIndexerTool and the hbase-indexer-morphline integration.)

Just my 0.02c,
Wolfgang.

On Jun 11, 2014, at 1:05 PM, Dyer, James james.d...@ingramcontent.com wrote:

Mikhail,

It would be nice if the DIH could be run separately from Solr (SOLR-853 and others). I think a lot of us have already expressed support for this, and at one time I was looking into what it would take to complete. Then again, having watched the solr morphline sink be created for Flume, I realized there are other teams out there possibly building an awesome DIH killer. If that happens, then we just saved ourselves a boatload of work, right? I think if someone out there can create a nice POC that uses a different tool, that would be a great first step.

But there is also SOLR-3671, which was just committed as a follow-on to SOLR-2382. This makes DIH able to send documents to places other than Solr. Turns out someone here is using DIH to import to Mongo. (See SOLR-5981 for details.) So we already have one side of the functionality to generalize DIH.

James Dyer
Ingram Content Group
(615) 213-4311

From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com]
Sent: Wednesday, June 11, 2014 11:56 AM
To: dev@lucene.apache.org
Subject: Re: Adding Morphline support to DIH - worth the effort?

James,
Don't you think that spawning DIH 2.0 as a separate war is a priority?

On Wed, Jun 11, 2014 at 6:39 PM, Dyer, James james.d...@ingramcontent.com wrote:

Alexandre,

I think that writing a new entity processor for DIH is a much less risky thing to commit than, say, SOLR-4799. Entity Processors work as plug-ins and they aren't likely to break anything else. So a Morphline EntityProcessor is much more likely to be evaluated and committed. But like anything else, you're going to need to explain what the need is and what this new e.p. buys the user community. There needs to be unit tests, etc.

Besides this, if you can show how a morphline e.p. can be a step towards migrating away from DIH entirely, then that would be a plus. Perhaps create a new solr example along the lines of the dih solr example that demonstrates to users this new way forward. This would go a long way in convincing the community we have a viable alternative to dih.

James Dyer
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
Sent: Tuesday, June 10, 2014 9:55 PM
To: dev@lucene.apache.org
Subject: Re: Adding Morphline support to DIH - worth the effort?

Ripples in the pond again. Spreading and dying. Understandable, but still somewhat annoying.

So, what would be the minimal viable next step to move this conversation forward? Something for 4.11 as opposed to 5.0? Anyone with commit status has a feeling of what - minimal - deliverable they would put their own weight behind?

Regards,
Alex.

Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency

On Mon, Jun 9, 2014 at 10:50 AM, david.w.smi...@gmail.com wrote:

One of the ideas over DIH discussed earlier is making it standalone.

Yeah; my beef with the DIH is that it's tied to Solr. But I'd rather see something other than the DIH outside Solr; it's not worthy IMO. Why have something Solr specific even? A great pipeline shouldn't tie itself to any end-point.

There are a variety of solutions out there that I tried. There are the big 3 open-source ETLs (Kettle, Clover, Talend), and they aren't quite ideal in one way or another. And Spring-Integration. And some half-baked data pipelines like OpenPipe & OpenPipeline. I never got around to taking a good look at Findwise's open-sourced Hydra, but I learned enough to know, to my surprise, that it was configured in code versus a config file (like all the others), and that's a big turn-off to me.

Today I read through most of the Morphlines docs and a few choice source files and I'm super-impressed. But as you note it's missing a lot of other stuff. I think something great could be built using it as a core piece.

~ David Smiley
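Since the thread assumes familiarity with what a morphline actually looks like: it is an in-memory chain of transformation commands defined in a HOCON config file. A minimal illustrative sketch - the collection name, ZooKeeper address and CSV columns below are made up for the example:

{code}
# Shared Solr connection info (values are placeholders):
SOLR_LOCATOR : {
  collection : collection1
  zkHost : "127.0.0.1:2181/solr"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      # Parse each line of the input CSV stream into a record (hypothetical columns):
      { readCSV { separator : ",", columns : [id, title, text] } }
      # Drop fields the Solr schema does not know about:
      { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }
      # Hand the record to Solr for indexing:
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]
{code}

The loose coupling discussed above comes from the fact that the same config runs unchanged inside Flume, inside MapReduceIndexerTool, or in any plain JVM that embeds the library.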
[jira] [Commented] (SOLR-6126) MapReduce's GoLive script should support replicas
[ https://issues.apache.org/jira/browse/SOLR-6126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015266#comment-14015266 ]

wolfgang hoschek commented on SOLR-6126:
----------------------------------------

[~dsmiley] It uses the --zk-host CLI option to fetch the solr URLs of each replica from zk - see extractShardUrls(). This info gets passed via the Options.shardUrls parameter into the go-live phase. In the go-live phase the segments of each shard are explicitly merged via a separate REST merge request per replica into the corresponding replica. The result is that each input segment is explicitly merged N times where N is the replication factor. Each such merge reads from HDFS and writes to HDFS.

(BTW, I'll be unreachable on a transatlantic flight very soon)

MapReduce's GoLive script should support replicas
--------------------------------------------------

Key: SOLR-6126
URL: https://issues.apache.org/jira/browse/SOLR-6126
Project: Solr
Issue Type: Improvement
Components: contrib - MapReduce
Reporter: David Smiley

The GoLive feature of the MapReduce contrib module is pretty cool. But a comment in there indicates that it doesn't support replicas. Every production SolrCloud setup I've seen has had replicas! I wonder what is needed to support this.

For GoLive to work, it assumes a shared file system (be it HDFS or whatever, like a SAN). If perhaps the replicas in such a system read from the very same network disk location, then all we'd need to do is send a commit() to replicas; right?
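The per-replica REST merge request described above corresponds to Solr's CoreAdmin MERGEINDEXES action. A hedged SolrJ 4.x sketch - the replica URL, core name and HDFS index dir are placeholders, and the real tool drives this from extractShardUrls() rather than hard-coded values:

{code}
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class GoLiveMergeSketch {
  public static void main(String[] args) throws Exception {
    // One such request is issued per replica, so each input segment is
    // merged N times where N is the replication factor:
    HttpSolrServer server = new HttpSolrServer("http://replica1:8983/solr");
    String[] indexDirs = { "hdfs://nn/user/test/outdir/results/part-00000/data/index" };
    CoreAdminRequest.mergeIndexes("collection1_shard1_replica1", indexDirs,
        new String[0], server);
    server.shutdown();
  }
}
{code}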
[jira] [Commented] (SOLR-6126) MapReduce's GoLive script should support replicas
[ https://issues.apache.org/jira/browse/SOLR-6126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14015092#comment-14015092 ]

wolfgang hoschek commented on SOLR-6126:
----------------------------------------

The comment in the code is a bit outdated. The code does actually support replicas.

MapReduce's GoLive script should support replicas
--------------------------------------------------

Key: SOLR-6126
URL: https://issues.apache.org/jira/browse/SOLR-6126
Project: Solr
Issue Type: Improvement
Components: contrib - MapReduce
Reporter: David Smiley

The GoLive feature of the MapReduce contrib module is pretty cool. But a comment in there indicates that it doesn't support replicas. Every production SolrCloud setup I've seen has had replicas! I wonder what is needed to support this.

For GoLive to work, it assumes a shared file system (be it HDFS or whatever, like a SAN). If perhaps the replicas in such a system read from the very same network disk location, then all we'd need to do is send a commit() to replicas; right?
[jira] [Commented] (SOLR-5848) Morphlines is not resolving
[ https://issues.apache.org/jira/browse/SOLR-5848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932328#comment-13932328 ]

wolfgang hoschek commented on SOLR-5848:
----------------------------------------

Going forward I'd recommend upgrading to version 0.12.0 rather than dealing with 0.11.0, because 0.12.0 is compatible and there are some nice performance improvements and a couple of new features - http://kitesdk.org/docs/current/release_notes.html

Morphlines is not resolving
---------------------------

Key: SOLR-5848
URL: https://issues.apache.org/jira/browse/SOLR-5848
Project: Solr
Issue Type: Bug
Reporter: Dawid Weiss
Assignee: Mark Miller
Priority: Critical
Fix For: 4.8, 5.0

This version of morphlines does not resolve for me and Grant.

{code}
::::::::::::::::::::::::::::::::::::::::::::::
::          UNRESOLVED DEPENDENCIES         ::
::::::::::::::::::::::::::::::::::::::::::::::
:: org.kitesdk#kite-morphlines-saxon;0.11.0: not found
:: org.kitesdk#kite-morphlines-hadoop-sequencefile;0.11.0: not found
::::::::::::::::::::::::::::::::::::::::::::::
{code}

Has this been deleted from Cloudera's repositories or something? This would be pretty bad -- maven repos should be immutable...
[jira] [Commented] (SOLR-5848) Morphlines is not resolving
[ https://issues.apache.org/jira/browse/SOLR-5848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13932378#comment-13932378 ]

wolfgang hoschek commented on SOLR-5848:
----------------------------------------

Sounds good. Thx!

Morphlines is not resolving
---------------------------

Key: SOLR-5848
URL: https://issues.apache.org/jira/browse/SOLR-5848
Project: Solr
Issue Type: Bug
Reporter: Dawid Weiss
Assignee: Mark Miller
Priority: Critical
Fix For: 4.8, 5.0

This version of morphlines does not resolve for me and Grant.

{code}
::::::::::::::::::::::::::::::::::::::::::::::
::          UNRESOLVED DEPENDENCIES         ::
::::::::::::::::::::::::::::::::::::::::::::::
:: org.kitesdk#kite-morphlines-saxon;0.11.0: not found
:: org.kitesdk#kite-morphlines-hadoop-sequencefile;0.11.0: not found
::::::::::::::::::::::::::::::::::::::::::::::
{code}

Has this been deleted from Cloudera's repositories or something? This would be pretty bad -- maven repos should be immutable...
[jira] [Created] (SOLR-5786) MapReduceIndexerTool --help text is missing large parts of the help text
wolfgang hoschek created SOLR-5786:
--------------------------------------

Summary: MapReduceIndexerTool --help text is missing large parts of the help text
Key: SOLR-5786
URL: https://issues.apache.org/jira/browse/SOLR-5786
Project: Solr
Issue Type: Bug
Components: contrib - MapReduce
Affects Versions: 4.7
Reporter: wolfgang hoschek
Assignee: Mark Miller
Fix For: 4.8

As already mentioned repeatedly and at length, this is a regression introduced by the fix in https://issues.apache.org/jira/browse/SOLR-5605

Here is the diff of --help output before SOLR-5605 vs after SOLR-5605:

{code}
130,235c130
lucene segments left in this index. Merging segments involves reading and rewriting all data in all these segment files, potentially multiple times, which is very I/O intensive and time consuming. However, an index with fewer segments can later be merged faster, and it can later be queried faster once deployed to a live Solr serving shard. Set maxSegments to 1 to optimize the index for low query latency. In a nutshell, a small maxSegments value trades indexing latency for subsequently improved query latency. This can be a reasonable trade-off for batch indexing systems. (default: 1)

--fair-scheduler-pool STRING
Optional tuning knob that indicates the name of the fair scheduler pool to submit jobs to. The Fair Scheduler is a pluggable MapReduce scheduler that provides a way to share large clusters. Fair scheduling is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time. When there is a single job running, that job uses the entire cluster. When other jobs are submitted, task slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time. Unlike the default Hadoop scheduler, which forms a queue of jobs, this lets short jobs finish in reasonable time while not starving long jobs. It is also an easy way to share a cluster between multiple users. Fair sharing can also work with job priorities - the priorities are used as weights to determine the fraction of total compute time that each job gets.

--dry-run
Run in local mode and print documents to stdout instead of loading them into Solr. This executes the morphline in the client process (without submitting a job to MR) for quicker turnaround during early trial/debug sessions. (default: false)

--log4j FILE
Relative or absolute path to a log4j.properties config file on the local file system. This file will be uploaded to each MR task. Example: /path/to/log4j.properties

--verbose, -v
Turn on verbose output. (default: false)

--show-non-solr-cloud
Also show options for Non-SolrCloud mode as part of --help. (default: false)

Required arguments:

--output-dir HDFS_URI
HDFS directory to write Solr indexes to. Inside there one output directory per shard will be generated. Example: hdfs://c2202.mycompany.com/user/$USER/test

--morphline-file FILE
Relative or absolute path to a local config file that contains one or more morphlines. The file must be UTF-8 encoded. Example: /path/to/morphline.conf

Cluster arguments:
Arguments that provide information about your Solr cluster.

--zk-host STRING
The address of a ZooKeeper ensemble being used by a SolrCloud cluster. This ZooKeeper ensemble will be examined to determine the number of output
[jira] [Commented] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
[ https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13914549#comment-13914549 ]

wolfgang hoschek commented on SOLR-5605:
----------------------------------------

Correspondingly, I filed https://issues.apache.org/jira/browse/SOLR-5786

Look, as you know, I wrote almost all of the original solr-mapreduce contrib, and I know this code inside out. To be honest, this kind of repetitive ignorance is tiresome at best and completely turns me off.

MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
----------------------------------------------------------------------------------------------------------------

Key: SOLR-5605
URL: https://issues.apache.org/jira/browse/SOLR-5605
Project: Solr
Issue Type: Bug
Reporter: Hoss Man
Assignee: Mark Miller
Fix For: 4.7, 5.0

I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest which is reproducible with any seed -- all that matters is the locale. The problem sounded familiar, and a quick search verified that jenkins has in fact hit this a couple of times in the past -- Uwe commented on the list that this is due to a real problem in one of the third-party dependencies (that does the argument parsing) that will affect usage on some systems.

If working around the bug in the arg parsing lib isn't feasible, MapReduceIndexerTool should fail cleanly if the locale isn't one we know is supported
[jira] [Updated] (SOLR-5786) MapReduceIndexerTool --help output is missing large parts of the help text
[ https://issues.apache.org/jira/browse/SOLR-5786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wolfgang hoschek updated SOLR-5786:
-----------------------------------

Summary: MapReduceIndexerTool --help output is missing large parts of the help text (was: MapReduceIndexerTool --help text is missing large parts of the help text)

MapReduceIndexerTool --help output is missing large parts of the help text
---------------------------------------------------------------------------

Key: SOLR-5786
URL: https://issues.apache.org/jira/browse/SOLR-5786
Project: Solr
Issue Type: Bug
Components: contrib - MapReduce
Affects Versions: 4.7
Reporter: wolfgang hoschek
Assignee: Mark Miller
Fix For: 4.8

As already mentioned repeatedly and at length, this is a regression introduced by the fix in https://issues.apache.org/jira/browse/SOLR-5605

Here is the diff of --help output before SOLR-5605 vs after SOLR-5605:

{code}
130,235c130
lucene segments left in this index. Merging segments involves reading and rewriting all data in all these segment files, potentially multiple times, which is very I/O intensive and time consuming. However, an index with fewer segments can later be merged faster, and it can later be queried faster once deployed to a live Solr serving shard. Set maxSegments to 1 to optimize the index for low query latency. In a nutshell, a small maxSegments value trades indexing latency for subsequently improved query latency. This can be a reasonable trade-off for batch indexing systems. (default: 1)

--fair-scheduler-pool STRING
Optional tuning knob that indicates the name of the fair scheduler pool to submit jobs to. The Fair Scheduler is a pluggable MapReduce scheduler that provides a way to share large clusters. Fair scheduling is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time. When there is a single job running, that job uses the entire cluster. When other jobs are submitted, task slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time. Unlike the default Hadoop scheduler, which forms a queue of jobs, this lets short jobs finish in reasonable time while not starving long jobs. It is also an easy way to share a cluster between multiple users. Fair sharing can also work with job priorities - the priorities are used as weights to determine the fraction of total compute time that each job gets.

--dry-run
Run in local mode and print documents to stdout instead of loading them into Solr. This executes the morphline in the client process (without submitting a job to MR) for quicker turnaround during early trial/debug sessions. (default: false)

--log4j FILE
Relative or absolute path to a log4j.properties config file on the local file system. This file will be uploaded to each MR task. Example: /path/to/log4j.properties

--verbose, -v
Turn on verbose output. (default: false)

--show-non-solr-cloud
Also show options for Non-SolrCloud mode as part of --help. (default: false)

Required arguments:

--output-dir HDFS_URI
HDFS directory to write Solr indexes to. Inside there one output directory per shard will be generated. Example: hdfs://c2202.mycompany.com/user/$USER/test

--morphline-file FILE
Relative or absolute path to a local config file that contains one or more morphlines. The file must be UTF-8
[jira] [Updated] (SOLR-5786) MapReduceIndexerTool --help output is missing large parts of the help text
[ https://issues.apache.org/jira/browse/SOLR-5786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wolfgang hoschek updated SOLR-5786:
-----------------------------------

Description:
As already mentioned repeatedly and at length, this is a regression introduced by the fix in https://issues.apache.org/jira/browse/SOLR-5605

Here is the diff of --help output before SOLR-5605 vs after SOLR-5605:

{code}
130,235c130
lucene segments left in this index. Merging segments involves reading and rewriting all data in all these segment files, potentially multiple times, which is very I/O intensive and time consuming. However, an index with fewer segments can later be merged faster, and it can later be queried faster once deployed to a live Solr serving shard. Set maxSegments to 1 to optimize the index for low query latency. In a nutshell, a small maxSegments value trades indexing latency for subsequently improved query latency. This can be a reasonable trade-off for batch indexing systems. (default: 1)

--fair-scheduler-pool STRING
Optional tuning knob that indicates the name of the fair scheduler pool to submit jobs to. The Fair Scheduler is a pluggable MapReduce scheduler that provides a way to share large clusters. Fair scheduling is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time. When there is a single job running, that job uses the entire cluster. When other jobs are submitted, task slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time. Unlike the default Hadoop scheduler, which forms a queue of jobs, this lets short jobs finish in reasonable time while not starving long jobs. It is also an easy way to share a cluster between multiple users. Fair sharing can also work with job priorities - the priorities are used as weights to determine the fraction of total compute time that each job gets.

--dry-run
Run in local mode and print documents to stdout instead of loading them into Solr. This executes the morphline in the client process (without submitting a job to MR) for quicker turnaround during early trial/debug sessions. (default: false)

--log4j FILE
Relative or absolute path to a log4j.properties config file on the local file system. This file will be uploaded to each MR task. Example: /path/to/log4j.properties

--verbose, -v
Turn on verbose output. (default: false)

--show-non-solr-cloud
Also show options for Non-SolrCloud mode as part of --help. (default: false)

Required arguments:

--output-dir HDFS_URI
HDFS directory to write Solr indexes to. Inside there one output directory per shard will be generated. Example: hdfs://c2202.mycompany.com/user/$USER/test

--morphline-file FILE
Relative or absolute path to a local config file that contains one or more morphlines. The file must be UTF-8 encoded. Example: /path/to/morphline.conf

Cluster arguments:
Arguments that provide information about your Solr cluster.

--zk-host STRING
The address of a ZooKeeper ensemble being used by a SolrCloud cluster. This ZooKeeper ensemble will be examined to determine the number of output shards to create as well as the Solr URLs to merge the output shards into when using the --go-live option. Requires that you also pass the --collection to merge the shards
[jira] [Commented] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
[ https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915037#comment-13915037 ]

wolfgang hoschek commented on SOLR-5605:
----------------------------------------

bq. Are you not a committer? At Apache, those who do decide.

Yes, but you've clearly been assigned to upstream this stuff and I have plenty of other things to attend to these days.

bq. I did not realize Patricks patch did not include the latest code updates from MapReduce.

Might be good to pay more attention, also to CDH-14804?

bq. I had and still have bigger concerns around the usability of this code in Solr than this issue. It is very, very far from easy for someone to get started with this contrib right now.

The usability is fine downstream, where maven automatically builds a job jar that includes the necessary dependency jars inside of the lib dir of the MR job jar. Hence no startup script or extra steps are required downstream, just one (fat) jar. If it's not usable upstream it may be because no corresponding packaging system has been used upstream, for reasons that escape me.

bq. which is why none of these smaller issues concern me very much at this point.

I'm afraid ignorance never helps.

MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
----------------------------------------------------------------------------------------------------------------

Key: SOLR-5605
URL: https://issues.apache.org/jira/browse/SOLR-5605
Project: Solr
Issue Type: Bug
Reporter: Hoss Man
Assignee: Mark Miller
Fix For: 4.7, 5.0

I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest which is reproducible with any seed -- all that matters is the locale. The problem sounded familiar, and a quick search verified that jenkins has in fact hit this a couple of times in the past -- Uwe commented on the list that this is due to a real problem in one of the third-party dependencies (that does the argument parsing) that will affect usage on some systems.

If working around the bug in the arg parsing lib isn't feasible, MapReduceIndexerTool should fail cleanly if the locale isn't one we know is supported
[jira] [Comment Edited] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
[ https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915037#comment-13915037 ]

wolfgang hoschek edited comment on SOLR-5605 at 2/27/14 9:23 PM:
-----------------------------------------------------------------

bq. Are you not a committer? At Apache, those who do decide.

Yes, but you've clearly been assigned to upstream those contribs and I have plenty of other things to attend to these days.

bq. I did not realize Patricks patch did not include the latest code updates from MapReduce.

Might be good to pay more attention, also to CDH-14804?

bq. I had and still have bigger concerns around the usability of this code in Solr than this issue. It is very, very far from easy for someone to get started with this contrib right now.

The usability is fine downstream, where maven automatically builds a job jar that includes the necessary dependency jars inside of the lib dir of the MR job jar. Hence no startup script or extra steps are required downstream, just one (fat) jar. If it's not usable upstream it may be because no corresponding packaging system has been used upstream, for reasons that escape me.

bq. which is why none of these smaller issues concern me very much at this point.

I'm afraid ignorance never helps.

was (Author: whoschek):
bq. Are you not a committer? At Apache, those who do decide.

Yes, but you've clearly been assigned to upstream this stuff and I have plenty of other things to attend to these days.

bq. I did not realize Patricks patch did not include the latest code updates from MapReduce.

Might be good to pay more attention, also to CDH-14804?

bq. I had and still have bigger concerns around the usability of this code in Solr than this issue. It is very, very far from easy for someone to get started with this contrib right now.

The usability is fine downstream, where maven automatically builds a job jar that includes the necessary dependency jars inside of the lib dir of the MR job jar. Hence no startup script or extra steps are required downstream, just one (fat) jar. If it's not usable upstream it may be because no corresponding packaging system has been used upstream, for reasons that escape me.

bq. which is why none of these smaller issues concern me very much at this point.

I'm afraid ignorance never helps.

MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
----------------------------------------------------------------------------------------------------------------

Key: SOLR-5605
URL: https://issues.apache.org/jira/browse/SOLR-5605
Project: Solr
Issue Type: Bug
Reporter: Hoss Man
Assignee: Mark Miller
Fix For: 4.7, 5.0

I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest which is reproducible with any seed -- all that matters is the locale. The problem sounded familiar, and a quick search verified that jenkins has in fact hit this a couple of times in the past -- Uwe commented on the list that this is due to a real problem in one of the third-party dependencies (that does the argument parsing) that will affect usage on some systems.

If working around the bug in the arg parsing lib isn't feasible, MapReduceIndexerTool should fail cleanly if the locale isn't one we know is supported
[jira] [Commented] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
[ https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911744#comment-13911744 ]

wolfgang hoschek commented on SOLR-5605:
----------------------------------------

I have looked, have you? I have fixed this one before. Have you? Pls take the time to diff before vs. after to see that some docs parts are missing while others are present (b/c of the funny missing buffer flush). It is not the same. This is a regression. Thx.

MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
----------------------------------------------------------------------------------------------------------------

Key: SOLR-5605
URL: https://issues.apache.org/jira/browse/SOLR-5605
Project: Solr
Issue Type: Bug
Reporter: Hoss Man
Assignee: Mark Miller
Fix For: 4.7, 5.0

I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest which is reproducible with any seed -- all that matters is the locale. The problem sounded familiar, and a quick search verified that jenkins has in fact hit this a couple of times in the past -- Uwe commented on the list that this is due to a real problem in one of the third-party dependencies (that does the argument parsing) that will affect usage on some systems.

If working around the bug in the arg parsing lib isn't feasible, MapReduceIndexerTool should fail cleanly if the locale isn't one we know is supported
[jira] [Reopened] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
[ https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wolfgang hoschek reopened SOLR-5605:
------------------------------------

Without this the --help text is screwed. https://issues.apache.org/jira/secure/EditComment!default.jspa?id=12687301&commentId=13862272

MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
----------------------------------------------------------------------------------------------------------------

Key: SOLR-5605
URL: https://issues.apache.org/jira/browse/SOLR-5605
Project: Solr
Issue Type: Bug
Reporter: Hoss Man
Assignee: Mark Miller
Fix For: 4.7, 5.0

I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest which is reproducible with any seed -- all that matters is the locale. The problem sounded familiar, and a quick search verified that jenkins has in fact hit this a couple of times in the past -- Uwe commented on the list that this is due to a real problem in one of the third-party dependencies (that does the argument parsing) that will affect usage on some systems.

If working around the bug in the arg parsing lib isn't feasible, MapReduceIndexerTool should fail cleanly if the locale isn't one we know is supported
[jira] [Commented] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
[ https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905806#comment-13905806 ]

wolfgang hoschek commented on SOLR-5605:
----------------------------------------

Yes, as already mentioned, otherwise some of the --help text doesn't show up in the output because there's a change related to buffer flushing in argparse4j-0.4.2.

MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
----------------------------------------------------------------------------------------------------------------

Key: SOLR-5605
URL: https://issues.apache.org/jira/browse/SOLR-5605
Project: Solr
Issue Type: Bug
Reporter: Hoss Man
Assignee: Mark Miller
Fix For: 4.7, 5.0

I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest which is reproducible with any seed -- all that matters is the locale. The problem sounded familiar, and a quick search verified that jenkins has in fact hit this a couple of times in the past -- Uwe commented on the list that this is due to a real problem in one of the third-party dependencies (that does the argument parsing) that will affect usage on some systems.

If working around the bug in the arg parsing lib isn't feasible, MapReduceIndexerTool should fail cleanly if the locale isn't one we know is supported
Re: Welcome Benson Margulies as Lucene/Solr committer!
Welcome on board!

Wolfgang.

On Jan 26, 2014, at 4:32 PM, Erick Erickson wrote:

Good to have you aboard!

Erick

On Sat, Jan 25, 2014 at 10:52 PM, Mark Miller markrmil...@gmail.com wrote:

Welcome!

- Mark

http://about.me/markrmiller

On Jan 25, 2014, at 4:40 PM, Michael McCandless luc...@mikemccandless.com wrote:

I'm pleased to announce that Benson Margulies has accepted to join our ranks as a committer.

Benson has been involved in a number of Lucene/Solr issues over time (see http://jirasearch.mikemccandless.com/search.py?index=jira&chg=dds&a1=allUsers&a2=Benson+Margulies ), most recently on debugging tricky analysis issues.

Benson, it is tradition that you introduce yourself with a brief bio. I know you're heavily involved in other Apache projects already...

Once your account is set up, you should then be able to add yourself to the who we are page on the website as well.

Congratulations and welcome!

Mike McCandless

http://blog.mikemccandless.com
[jira] [Commented] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
[ https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862272#comment-13862272 ]

wolfgang hoschek commented on SOLR-5605:
----------------------------------------

Thanks for getting to the bottom of this! Looks like we'll now be good on upgrade to argparse4j-0.4.3, except we'll also need to apply CDH-16434 to MapReduceIndexerTool.java because there's a change related to flushing in 0.4.2:

-parser.printHelp(new PrintWriter(System.out));
+parser.printHelp();

Otherwise some of the --help text doesn't show up in the output :-(

MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
----------------------------------------------------------------------------------------------------------------

Key: SOLR-5605
URL: https://issues.apache.org/jira/browse/SOLR-5605
Project: Solr
Issue Type: Bug
Reporter: Hoss Man

I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest which is reproducible with any seed -- all that matters is the locale. The problem sounded familiar, and a quick search verified that jenkins has in fact hit this a couple of times in the past -- Uwe commented on the list that this is due to a real problem in one of the third-party dependencies (that does the argument parsing) that will affect usage on some systems.

If working around the bug in the arg parsing lib isn't feasible, MapReduceIndexerTool should fail cleanly if the locale isn't one we know is supported
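The flushing pitfall behind this one-line patch: wrapping System.out in a PrintWriter without autoflush means anything still sitting in the wrapper's buffer at JVM exit is silently dropped, because the JVM flushes System.out itself but not wrapper streams layered on top of it. A small illustration (not from the patch):

{code}
import java.io.PrintWriter;

public class LostOutputDemo {
  public static void main(String[] args) {
    // No autoflush: println() only fills the PrintWriter's internal buffer.
    PrintWriter pw = new PrintWriter(System.out);
    pw.println("help text goes here");
    // Without an explicit pw.flush() before exit, the buffered text never
    // reaches the console - which is exactly how parts of --help vanished.
  }
}
{code}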
[jira] [Comment Edited] (SOLR-5605) MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
[ https://issues.apache.org/jira/browse/SOLR-5605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862272#comment-13862272 ]

wolfgang hoschek edited comment on SOLR-5605 at 1/4/14 11:42 AM:
-----------------------------------------------------------------

Thanks for getting to the bottom of this! Looks like we'll now be good on upgrade to argparse4j-0.4.3, except we'll also need to apply CDH-16434 to MapReduceIndexerTool.java because there's a change related to flushing in 0.4.2:

{code}
-parser.printHelp(new PrintWriter(System.out));
+parser.printHelp();
{code}

Otherwise some of the --help text doesn't show up in the output :-(

was (Author: whoschek):
Thanks for getting to the bottom of this! Looks like we'll now be good on upgrade to argparse4j-0.4.3, except we'll also need to apply CDH-16434 to MapReduceIndexerTool.java because there's a change related to flushing in 0.4.2:

-parser.printHelp(new PrintWriter(System.out));
+parser.printHelp();

Otherwise some of the --help text doesn't show up in the output :-(

MapReduceIndexerTool fails in some locales -- seen in random failures of MapReduceIndexerToolArgumentParserTest
----------------------------------------------------------------------------------------------------------------

Key: SOLR-5605
URL: https://issues.apache.org/jira/browse/SOLR-5605
Project: Solr
Issue Type: Bug
Reporter: Hoss Man

I noticed a randomized failure in MapReduceIndexerToolArgumentParserTest which is reproducible with any seed -- all that matters is the locale. The problem sounded familiar, and a quick search verified that jenkins has in fact hit this a couple of times in the past -- Uwe commented on the list that this is due to a real problem in one of the third-party dependencies (that does the argument parsing) that will affect usage on some systems.

If working around the bug in the arg parsing lib isn't feasible, MapReduceIndexerTool should fail cleanly if the locale isn't one we know is supported
[jira] [Commented] (SOLR-5584) Update to Guava 15.0
[ https://issues.apache.org/jira/browse/SOLR-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862273#comment-13862273 ]

wolfgang hoschek commented on SOLR-5584:
----------------------------------------

As mentioned above, morphlines was designed to run fine with any guava version >= 11.0.2. But the hadoop task tracker always puts guava 11.0.2 on the classpath of any MR job that it executes, so solr-mapreduce would need to figure out how to override or reorder that.

Update to Guava 15.0
--------------------

Key: SOLR-5584
URL: https://issues.apache.org/jira/browse/SOLR-5584
Project: Solr
Issue Type: Improvement
Reporter: Mark Miller
Assignee: Mark Miller
Priority: Minor
Fix For: 5.0, 4.7
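One common way to do that overriding on Hadoop 2, sketched here as an assumption rather than the approach the contrib actually took, is the job-level classpath-precedence switch:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class UserClasspathFirst {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Ask MR2 to place the job's own jars ahead of the cluster's jars
    // (e.g. the task tracker's guava-11.0.2) on each task's classpath:
    conf.setBoolean("mapreduce.job.user.classpath.first", true);
    Job job = Job.getInstance(conf, "mr-indexer");
    // ... set mapper/reducer/input/output as usual ...
  }
}
{code}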
Re: The Old Git Discussion
+1

On Jan 2, 2014, at 10:53 PM, Simon Willnauer wrote:

+1

On Thu, Jan 2, 2014 at 9:51 PM, Mark Miller markrmil...@gmail.com wrote:

bzr is dying; Emacs needs to move
http://lists.gnu.org/archive/html/emacs-devel/2014-01/msg5.html

Interesting thread. For similar reasons, I think that Lucene and Solr should eventually move to Git. It's not GitHub, but it's a lot closer. The new Apache projects I see are all choosing Git. It's the winner's road, I think. I don't know that there is a big hurry right now, but I think it's inevitable that we should switch.

--
- Mark
[jira] [Commented] (SOLR-5584) Update to Guava 15.0
[ https://issues.apache.org/jira/browse/SOLR-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858699#comment-13858699 ]

wolfgang hoschek commented on SOLR-5584:
----------------------------------------

What exactly is failing for you? morphlines was designed to run fine with any guava version >= 11.0.2. At least it did last I checked...

Update to Guava 15.0
--------------------

Key: SOLR-5584
URL: https://issues.apache.org/jira/browse/SOLR-5584
Project: Solr
Issue Type: Improvement
Reporter: Mark Miller
Assignee: Mark Miller
Priority: Minor
Fix For: 5.0, 4.7
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856657#comment-13856657 ]

wolfgang hoschek commented on SOLR-1301:
----------------------------------------

Also see https://issues.cloudera.org/browse/CDK-262

Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
-----------------------------------------------------------------------------------

Key: SOLR-1301
URL: https://issues.apache.org/jira/browse/SOLR-1301
Project: Solr
Issue Type: New Feature
Reporter: Andrzej Bialecki
Assignee: Mark Miller
Fix For: 5.0, 4.7
Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, log4j-1.2.15.jar

This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is twofold:

* provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat
* avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network.

Design
------

Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer, and it also instantiates an implementation of SolrDocumentConverter, which is responsible for turning Hadoop (key, value) into a SolrInputDocument. This data is then added to a batch, which is periodically submitted to EmbeddedSolrServer. When the reduce task completes, and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer.

The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken.

This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-N directories can be used to run N shard servers. Additionally, users can specify the number of reduce tasks, in particular 1 reduce task, in which case the output will consist of a single shard.

An example application is provided that processes large CSV files and uses this API. It uses a custom CSV processing to avoid (de)serialization overhead.

This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue, you should put it in contrib/hadoop/lib.

Note: the development of this patch was sponsored by an anonymous contributor and approved for release under Apache License.
[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848097#comment-13848097 ]

wolfgang hoschek edited comment on SOLR-1301 at 12/16/13 2:27 AM:
------------------------------------------------------------------

Might be best to write a program that generates the list of files and then explicitly provide that file list to the MR job, e.g. via the --input-list option. For example you could use the HDFS version of the Linux file system 'find' command for that (HdfsFindTool doc and code here: https://github.com/cloudera/search/tree/master_1.1.0/search-mr#hdfsfindtool)

was (Author: whoschek):
Might be best to write a program that generates the list of files and then explicitly provide that file list to the MR job, e.g. via the --input-list option. For example you could use the HDFS version of the Linux file system 'find' command for that (HdfsFindTool doc and code here: https://github.com/cloudera/search/tree/master_1.1.0/search-mr)

Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
-----------------------------------------------------------------------------------

Key: SOLR-1301
URL: https://issues.apache.org/jira/browse/SOLR-1301
Project: Solr
Issue Type: New Feature
Reporter: Andrzej Bialecki
Assignee: Mark Miller
Fix For: 5.0, 4.7
Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, log4j-1.2.15.jar

This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is twofold:

* provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat
* avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network.

Design
------

Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer, and it also instantiates an implementation of SolrDocumentConverter, which is responsible for turning Hadoop (key, value) into a SolrInputDocument. This data is then added to a batch, which is periodically submitted to EmbeddedSolrServer. When the reduce task completes, and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer.

The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken.

This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-N directories can be used to run N shard servers. Additionally, users can specify the number of reduce tasks, in particular 1 reduce task, in which case the output will consist of a single shard.

An example application is provided that processes large CSV files and uses this API. It uses a custom CSV processing to avoid (de)serialization overhead.

This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue, you should put it in contrib/hadoop/lib.

Note: the development of this patch was sponsored by an anonymous contributor and approved for release under Apache License.
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13848775#comment-13848775 ] wolfgang hoschek commented on SOLR-1301:

bq. it would be convenient if we could ignore the underscore (_) hidden files in hdfs as well as the . hidden files when reading input files from hdfs.

+1
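For reference, skipping such hidden files is a one-liner with a Hadoop PathFilter; the class name below is made up, and the wiring via FileInputFormat.setInputPathFilter is an assumption about how the contrib could hook it in, not a description of the committed code:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;

    /** Accepts only paths whose names do not start with '_' or '.' (Hadoop's hidden-file convention). */
    public class VisibleFilesOnlyFilter implements PathFilter {
      @Override
      public boolean accept(Path path) {
        String name = path.getName();
        return !name.startsWith("_") && !name.startsWith(".");
      }
    }

    // Hypothetical wiring into a job:
    // org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPathFilter(job, VisibleFilesOnlyFilter.class);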
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13848097#comment-13848097 ] wolfgang hoschek commented on SOLR-1301:

Might be best to write a program that generates the list of files and then explicitly provide that file list to the MR job, e.g. via the --input-list option. For example you could use the HDFS version of the Linux file system 'find' command for that (HdfsFindTool doc and code here: https://github.com/cloudera/search/tree/master_1.1.0/search-mr)
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13843443#comment-13843443 ] wolfgang hoschek commented on SOLR-1301:

I'm not aware of anything needing jersey, except perhaps hadoop pulls that in. The combined dependencies of all morphline modules are here: http://cloudera.github.io/cdk/docs/current/cdk-morphlines/cdk-morphlines-all/dependencies.html The dependencies of each individual morphline module are here: http://cloudera.github.io/cdk/docs/current/cdk-morphlines/cdk-morphlines-all/dependencies.html The source and POMs are here, as usual: https://github.com/cloudera/cdk/tree/master/cdk-morphlines

By the way, a somewhat separate issue is that it seems to me that the ivy dependencies for solr-morphlines-core and solr-morphlines-cell and solr-map-reduce are a bit backwards upstream, in that solr-morphlines-core pulls in a ton of dependencies that it doesn't need; those deps should rather be pulled in by solr-map-reduce (which is essentially an out-of-the-box app). It would be good to organize ivy and mvn upstream in such a way that:

* solr-map-reduce depends on solr-morphlines-cell plus cdk-morphlines-all plus xyz
* solr-morphlines-cell depends on solr-morphlines-core plus xyz
* solr-morphlines-core depends on cdk-morphlines-core plus xyz

More concretely, FWIW, to see what the deps look like in production releases downstream, review the following POMs: https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-solr-core/pom.xml and https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-solr-cell/pom.xml and https://github.com/cloudera/search/blob/master_1.1.0/search-mr/pom.xml
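To make the proposed layering concrete, a hypothetical pom.xml fragment for solr-map-reduce under this scheme might look roughly as follows; the groupIds, version values, and the pom-type dependency on cdk-morphlines-all are assumptions for illustration, not the actual build files:

    <!-- solr-map-reduce: the out-of-the-box app layer that bundles the user-level deps -->
    <dependencies>
      <dependency>
        <groupId>org.apache.solr</groupId>
        <!-- which in turn depends on solr-morphlines-core -->
        <artifactId>solr-morphlines-cell</artifactId>
        <version>${project.version}</version>
      </dependency>
      <dependency>
        <groupId>com.cloudera.cdk</groupId>
        <artifactId>cdk-morphlines-all</artifactId>
        <version>0.9.0</version>
        <type>pom</type> <!-- convenience pom aggregating the individual morphline modules -->
      </dependency>
    </dependencies>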
[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13843443#comment-13843443 ] wolfgang hoschek edited comment on SOLR-1301 at 12/9/13 7:30 PM:

I'm not aware of anything needing jersey, except perhaps hadoop pulls that in. The combined dependencies of all morphline modules are here: http://cloudera.github.io/cdk/docs/current/cdk-morphlines/cdk-morphlines-all/dependencies.html The dependencies of each individual morphline module are here: http://cloudera.github.io/cdk/docs/current/dependencies.html The source and POMs are here, as usual: https://github.com/cloudera/cdk/tree/master/cdk-morphlines

By the way, a somewhat separate issue is that it seems to me that the ivy dependencies for solr-morphlines-core and solr-morphlines-cell and solr-map-reduce are a bit backwards upstream, in that currently solr-morphlines-core pulls in a ton of dependencies that it doesn't need; those deps should rather be pulled in by solr-map-reduce (which is essentially an out-of-the-box app that bundles user level deps). Correspondingly, it would be good to organize ivy and mvn upstream in such a way that:

* solr-map-reduce depends on solr-morphlines-cell plus cdk-morphlines-all minus cdk-morphlines-solr-cell (now upstream) minus cdk-morphlines-solr-core (now upstream) plus xyz
* solr-morphlines-cell depends on solr-morphlines-core plus xyz
* solr-morphlines-core depends on cdk-morphlines-core plus xyz

More concretely, FWIW, to see what the deps look like in production releases downstream, review the following POMs: https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-solr-core/pom.xml and https://github.com/cloudera/cdk/blob/master/cdk-morphlines/cdk-morphlines-solr-cell/pom.xml and https://github.com/cloudera/search/blob/master_1.1.0/search-mr/pom.xml
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13843523#comment-13843523 ] wolfgang hoschek commented on SOLR-1301:

Apologies for the confusion. We are upstreaming cdk-morphlines-solr-cell into the solr contrib solr-morphlines-cell, as well as cdk-morphlines-solr-core into the solr contrib solr-morphlines-core, as well as search-mr into the solr contrib solr-map-reduce. Once the upstreaming is done these old modules will go away. Next, downstream will be made identical to upstream plus perhaps some critical fixes as necessary, and the upstream/downstream terms will apply in the way folks usually think about them; we are not quite there yet today, but getting there...

cdk-morphlines-all is simply a convenience pom that includes all the other morphline poms, so there's less to type for users who like a bit more auto magic.
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13842034#comment-13842034 ] wolfgang hoschek commented on SOLR-1301:

There are also some important fixes downstream in 0.9.0 of cdk-morphlines-core and cdk-morphlines-solr-cell that would be good to merge upstream (solr locator race, solr cell bug, etc). Also, there are new morphline module jars to add with 0.9.0 and jars to update (plus upstream is also missing some morphline modules from 0.8 as well).
[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13842034#comment-13842034 ] wolfgang hoschek edited comment on SOLR-1301 at 12/7/13 2:57 AM:

There are also some important fixes downstream in 0.9.0 of cdk-morphlines-solr-core and cdk-morphlines-solr-cell that would be good to merge upstream (solr locator race, solr cell bug, etc). Also, there are new morphline module jars to add with 0.9.0 and jars to update (plus upstream is also missing some morphline modules from 0.8 as well).
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13839308#comment-13839308 ] wolfgang hoschek commented on SOLR-1301:

There are also some fixes downstream in cdk-morphlines-core and cdk-morphlines-solr-cell that would be good to push upstream.
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13839311#comment-13839311 ] wolfgang hoschek commented on SOLR-1301:

Minor nit: could remove jobConf.setBoolean(ExtractingParams.IGNORE_TIKA_EXCEPTION, false) in MorphlineBasicMiniMRTest + MorphlineGoLiveMiniMRTest, because such a flag is no longer needed, and removing it drops an unnecessary dependency on tika.
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13839556#comment-13839556 ] wolfgang hoschek commented on SOLR-1301:

FWIW, a current printout of --help showing the CLI options is here: https://github.com/cloudera/search/tree/master_1.0.0/search-mr
[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13839556#comment-13839556 ] wolfgang hoschek edited comment on SOLR-1301 at 12/5/13 12:55 AM:

FWIW, a current printout of --help showing the CLI options is here: https://github.com/cloudera/search/tree/master_1.1.0/search-mr
Re: [JENKINS] Lucene-Solr-trunk-Linux (32bit/jdk1.8.0-ea-b117) - Build # 8549 - Still Failing!
On Dec 3, 2013, at 12:11 AM, Uwe Schindler wrote:

Looks like Java's service loader lookup impl has become more strict in Java8. This issue on Java 8 is kind of unfortunate because morphlines and solr-mr don't actually use JAXP at all. For the time being it might be best to disable testing on Java8 for this contrib, in order to get a stable build and make progress on other issues. A couple of options come to mind for how to deal with this longer term:

1) Remove the dependency on cdk-morphlines-saxon (which pulls in the saxon jar)

What is the effect of this? I would prefer this!

The effect is that the convertHTML, xquery and xslt commands won't be available anymore: http://cloudera.github.io/cdk/docs/current/cdk-morphlines/morphlinesReferenceGuide.html#/cdk-morphlines-saxon

2) Replace all Solr calls to JAXP XPathFactory.newInstance() with a little helper that first tries to use one of a list of well-known XPathFactory subclasses, and only if that fails falls back to the generic XPathFactory.newInstance(). E.g. use something like XPathFactory.newInstance(XPathFactory.DEFAULT_OBJECT_MODEL_URI, com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl, ClassLoader.getSystemClassLoader());

This is a hack; just because of this craziness, I don't want to have non-conformant code in Solr Core!

This is actually quite common practice because the JAXP service loader mechanism is a bit flawed. Also, most XSLT and XPath and StAX implementations have serious bugs in various areas. Thus many XML-intensive apps that require reliability and predictable behavior explicitly choose one of the JAXP implementations that's known to work for them, rather than hoping for the best with some potentially buggy default impl. JAXP pluggability really only exists for simple XPath use cases. The good news is that Solr Config et al seem to fit into that simple pluggable bucket. There are 14 such XPathFactory.newInstance() calls in the Solr codebase.

Definite -1

3) Somehow remove the META-INF/services/javax.xml.xpath.XPathFactory file from the saxon jar (this is what's causing this, and we don't need that file, but it's not clear how to remove it, realistically)

The only correct way to solve this: File a bug in Saxon and apply (1). Saxon violates the standards. And this violation fails in a number of JVMs (not only in Java 8, also IBM J9 is affected).

I'll file a bug with saxon and see what Mike Kay's take is. Meanwhile, we could remove the saxon jar or disable tests on java8 / J9 to be able to move forward on this.

Because of this I don't want to have Saxon in Solr at all (you have to know, I am a fan of XSLT and XPath, but Saxon is the worst implementation I have seen and I avoid it whenever possible - only if you need XPath 2 / XSLT 2 might you want to use it).

All XML libs have bugs, but most XML-intensive apps use saxon in production rather than other impls, at least from what I've seen over the years. Anyway, just my 2 cents.

Wolfgang.

Uwe

On Dec 2, 2013, at 4:41 PM, Mark Miller wrote:

Uwe mentioned this in IRC - I guess Saxon doesn’t play nice with java 8. http://stackoverflow.com/questions/7914915/syntax-error-in-javax-xml-xpath-xpathfactory-provider-configuration-file-of-saxo

- Mark

On Dec 2, 2013, at 7:06 PM, Policeman Jenkins Server jenk...@thetaphi.de wrote:

Build: http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Linux/8549/ Java: 32bit/jdk1.8.0-ea-b117 -server -XX:+UseSerialGC 3 tests failed.
FAILED: junit.framework.TestSuite.org.apache.solr.hadoop.MorphlineReducerTest

Error Message: 1 thread leaked from SUITE scope at org.apache.solr.hadoop.MorphlineReducerTest: 1) Thread[id=17, name=Thread-4, state=TIMED_WAITING, group=TGRP-MorphlineReducerTest] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) at org.apache.solr.hadoop.HeartBeater.run(HeartBeater.java:108)

Stack Trace: com.carrotsearch.randomizedtesting.ThreadLeakError: 1 thread leaked from SUITE scope at org.apache.solr.hadoop.MorphlineReducerTest: 1) Thread[id=17, name=Thread-4, state=TIMED_WAITING, group=TGRP-MorphlineReducerTest] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) at
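For option 2 above, a minimal sketch of such a helper could look like the following; the JDK-internal XPathFactoryImpl class name is taken from the mail, and it is an assumption that it exists on the JVMs in question (it is only present on Oracle/OpenJDK-derived JVMs):

    import javax.xml.xpath.XPathFactory;
    import javax.xml.xpath.XPathFactoryConfigurationException;

    /** Prefers well-known XPathFactory implementations over the plain service-loader lookup. */
    public final class SaneXPathFactory {
      // Candidate implementations in preference order.
      private static final String[] KNOWN_IMPLS = {
          "com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl"
      };

      public static XPathFactory newXPathFactory() {
        for (String impl : KNOWN_IMPLS) {
          try {
            return XPathFactory.newInstance(
                XPathFactory.DEFAULT_OBJECT_MODEL_URI, impl,
                ClassLoader.getSystemClassLoader());
          } catch (XPathFactoryConfigurationException e) {
            // This implementation is not available on this JVM; try the next one.
          }
        }
        // Last resort: the generic service-loader based lookup.
        return XPathFactory.newInstance();
      }

      private SaneXPathFactory() {}
    }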
Re: [JENKINS] Lucene-Solr-trunk-Linux (32bit/jdk1.8.0-ea-b117) - Build # 8549 - Still Failing!
FYI, I filed this saxon ticket: https://saxonica.plan.io/issues/1944
Re: [JENKINS] Lucene-Solr-trunk-Linux (32bit/jdk1.8.0-ea-b117) - Build # 8549 - Still Failing!
Actually, Mike's opinion has changed because now Saxon doesn't need to support Java5 anymore - https://saxonica.plan.io/issues/1944 Wolfgang. On Dec 3, 2013, at 2:07 AM, Dawid Weiss wrote: I'll file a bug with saxon and see what Mike Kay's take is I think Mike has already expressed his opinion on the subject in that stack overflow topic... :) Dawid On Tue, Dec 3, 2013 at 9:52 AM, Wolfgang Hoschek whosc...@cloudera.com wrote: On Dec 3, 2013, at 12:11 AM, Uwe Schindler wrote: Looks like Java's service loader lookup impl has become more strict in Java8. This issue on Java 8 is kind of unfortunate because morphlines and solr-mr doesn't actually use JAXP at all. For the time being might be best to disable testing on Java8 for this contrib, in order to get a stable build and make progress on other issues. A couple of options that come to mind in how to deal with this longer term: 1) Remove the dependency on cdk-morphlines-saxon (which pulls in the saxon jar) What ist he effect of this? I would prefer this! The effect is that the convertHTML, xquery and xslt commands won't be available anymore: http://cloudera.github.io/cdk/docs/current/cdk-morphlines/morphlinesReferenceGuide.html#/cdk-morphlines-saxon 2) Replace all Solr calls to JAXP XPathFactory.newInstance() with a little helper that first tries to use one of a list of well known XPathFactory subclasses, and only if that fails falls back to the generic XPathFactory.newInstance(). E.g. use something like XPathFactory.newInstance(XPathFactory.DEFAULT_OBJECT_MODEL_URI, com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl, ClassLoader.getSystemClassLoader()); This is a hack, just because of this craziness, I don't want to have non conformant code in Solr Core! This is actually quite common practice because the JAXP service loader mechanism is a bit flawed. Also, most XSLT and XPath and StaX implementations have serious bugs in various areas. Thus many XML intensive apps that require reliability and predictable behavior explicitly choose one of the JAXP implementation that's known to work for them, rather than hoping for the best with some potentially buggy default impl. JAXP plug-ability really only exists for simple XPath use cases. The good news is that Solr Config et al seems to fit into that simple pluggable bucket. There are 14 such XPathFactory.newInstance() calls in the Solr codebase. Definite -1 3) Somehow remove the META-INF/services/javax.xml.xpath.XPathFactory file from the saxon jar (this is what's causing this, and we don't need that file, but it's not clear how to remove it, realistically) The only correct way to solve this: File a bug in Jackson and apply (1). Jackson violates the standards. And this violation fails in a number of JVMs (not only in Java 8, also IBM J9 is affected). I'll file a bug with saxon and see what Mike Kay's take is. Meanwhile, we could remove the saxon jar or disable tests on java8 J9 to be able to move forward on this. Because of this I don't want to have Jackson in Solr at all (you have to know, I am a fan of XSLT and XPath, but Jackson is the worst implementation I have seen and I avoid it whenever possible - Only if you need XPath2 / XSLT 2 you may want to use it). All XML libs have bugs but most XML intensive apps use saxon in production rather than other impls, at least from what I've seen over the years. Anyway, just my 2 cents. Wolfgang. Uwe On Dec 2, 2013, at 4:41 PM, Mark Miller wrote: Uwe mentioned this in IRC - I guess Saxon doesn’t play nice with java 8. 
http://stackoverflow.com/questions/7914915/syntax-error-in-javax-xml-xpath-xpathfactory-provider-configuration-file-of-saxo - Mark

On Dec 2, 2013, at 7:06 PM, Policeman Jenkins Server jenk...@thetaphi.de wrote: Build: http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Linux/8549/ Java: 32bit/jdk1.8.0-ea-b117 -server -XX:+UseSerialGC 3 tests failed. FAILED: junit.framework.TestSuite.org.apache.solr.hadoop.MorphlineReducerTest Error Message: 1 thread leaked from SUITE scope at org.apache.solr.hadoop.MorphlineReducerTest: 1) Thread[id=17, name=Thread-4, state=TIMED_WAITING, group=TGRP-MorphlineReducerTest] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) at org.apache.solr.hadoop.HeartBeater.run(HeartBeater.java:108) Stack Trace: com.carrotsearch.randomizedtesting.ThreadLeakError: 1 thread leaked from SUITE scope at org.apache.solr.hadoop.MorphlineReducerTest: 1) Thread
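For illustration, the helper described in option 2) of the thread above could look roughly like the following. This is only a sketch, not code from an actual patch: the class name XPathFactoryHelper and the single fallback entry in the list are illustrative assumptions.

    import javax.xml.xpath.XPathFactory;
    import javax.xml.xpath.XPathFactoryConfigurationException;

    // Sketch of the option 2) helper: try well-known XPathFactory impls first,
    // and fall back to the generic service-loader lookup only if none load.
    public final class XPathFactoryHelper {

      private static final String[] WELL_KNOWN_IMPLS = {
          // the JDK-internal default impl named in the thread
          "com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl"
      };

      private XPathFactoryHelper() {}

      public static XPathFactory newXPathFactory() {
        for (String impl : WELL_KNOWN_IMPLS) {
          try {
            return XPathFactory.newInstance(
                XPathFactory.DEFAULT_OBJECT_MODEL_URI, impl,
                ClassLoader.getSystemClassLoader());
          } catch (XPathFactoryConfigurationException e) {
            // impl not available in this JVM; try the next candidate
          }
        }
        // last resort: the plain service-loader based lookup
        return XPathFactory.newInstance();
      }
    }

Under this approach, each of the 14 XPathFactory.newInstance() call sites mentioned in the thread would call XPathFactoryHelper.newXPathFactory() instead.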
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13837976#comment-13837976 ] wolfgang hoschek commented on SOLR-1301: bq. module/dir names I propose morphlines-solr-core and morphlines-solr-cell as names. Thoughts?

Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce. - Key: SOLR-1301 URL: https://issues.apache.org/jira/browse/SOLR-1301 Project: Solr Issue Type: New Feature Reporter: Andrzej Bialecki Assignee: Mark Miller Fix For: 5.0, 4.7 Attachments: README.txt, SOLR-1301-hadoop-0-20.patch, SOLR-1301-hadoop-0-20.patch, SOLR-1301-maven-intellij.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java, commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop-0.20.1-core.jar, hadoop-core-0.20.2-cdh3u3.jar, hadoop.patch, log4j-1.2.15.jar

This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is twofold: * provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat * avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network.

Design -- Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer, and it also instantiates an implementation of SolrDocumentConverter, which is responsible for turning Hadoop (key, value) into a SolrInputDocument. This data is then added to a batch, which is periodically submitted to EmbeddedSolrServer. When a reduce task completes, and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer.

The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken. This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-N directories can be used to run N shard servers. Additionally, users can specify the number of reduce tasks, in particular 1 reduce task, in which case the output will consist of a single shard.

An example application is provided that processes large CSV files and uses this API. It uses custom CSV processing to avoid (de)serialization overhead. This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue, you should put it in contrib/hadoop/lib. Note: the development of this patch was sponsored by an anonymous contributor and approved for release under Apache License.

-- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
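To make the converter contract in the design section above concrete, here is a hedged sketch of what an implementation could look like. The exact SolrDocumentConverter signature is an assumption for illustration (one SolrInputDocument per (key, value) pair); the attached patch may define the interface differently.

{code}
import org.apache.hadoop.io.Text;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical converter: turns one Hadoop (key, value) pair into a
// SolrInputDocument, as described in the design section above.
public class CsvDocumentConverter {

  public SolrInputDocument convert(Text key, Text value) {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", key.toString());        // row key becomes the unique id
    String[] cols = value.toString().split(",");
    for (int i = 0; i < cols.length; i++) {
      doc.addField("col" + i + "_s", cols[i]); // illustrative dynamic fields
    }
    return doc;
  }
}
{code}

SolrRecordWriter would invoke such a converter for every reduced pair, batch the resulting documents, and periodically submit the batch to the EmbeddedSolrServer.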
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13837979#comment-13837979 ] wolfgang hoschek commented on SOLR-1301: +1 to map-reduce-indexer module name/dir. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [JENKINS] Lucene-Solr-trunk-Linux (32bit/jdk1.8.0-ea-b117) - Build # 8549 - Still Failing!
Hi Uwe, There is no need for the saxon jar to be in the WAR. The mr contrib module is intended to be run in a separate process. The saxon jar should only be pulled in by the MR contrib module aka map-reduce-indexer contrib module. If that's not the case that's a packaging bug that we should fix. For some more background, here is how the morphline dependency graph looks downstream: http://cloudera.github.io/cdk/docs/current/cdk-morphlines/cdk-morphlines-all/dependencies.html Wolfgang.

On Dec 3, 2013, at 5:14 AM, Uwe Schindler wrote: Wolfgang, does this problem affect all the hadoop modules (because the saxon jar is in all the modules classpath)? If yes, I have to disable all of them with IBM J9 and Oracle Java 8. My biggest problem is the fact that this could also affect the release of Solr. If the saxon.jar is in the WAR file of Solr, then it breaks whole of Solr. But as it is a module, it should be loaded by the SolrResourceLoader from the core's lib folder, so all should be fine, if installed. I hope the huge Hadoop stuff is not in the WAR (not only because of this issue) and needs to be installed by the user in the instance's lib folder!!! Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de

-Original Message- From: dawid.we...@gmail.com [mailto:dawid.we...@gmail.com] On Behalf Of Dawid Weiss Sent: Tuesday, December 03, 2013 12:10 PM To: dev@lucene.apache.org Subject: Re: [JENKINS] Lucene-Solr-trunk-Linux (32bit/jdk1.8.0-ea-b117) - Build # 8549 - Still Failing! Ha! Thanks for filing the issue, Wolfgang. D.

On Tue, Dec 3, 2013 at 12:01 PM, Wolfgang Hoschek whosc...@cloudera.com wrote: Actually, Mike's opinion has changed because now Saxon doesn't need to support Java 5 anymore - https://saxonica.plan.io/issues/1944 Wolfgang. (The rest of the quoted thread is unchanged from the message above.)
[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13837976#comment-13837976 ] wolfgang hoschek edited comment on SOLR-1301 at 12/3/13 6:40 PM: - bq. module/dir names I propose morphlines-solr-core and morphlines-solr-cell as names. This avoids confusion by fitting nicely with the existing naming pattern, which is cdk-morphlines-solr-core and cdk-morphlines-solr-cell. (https://github.com/cloudera/cdk/tree/master/cdk-morphlines). Thoughts? was (Author: whoschek): bq. module/dir names I propose morphlines-solr-core and morphlines-solr-cell as names. Thoughts? -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13838054#comment-13838054 ] wolfgang hoschek commented on SOLR-1301: bq. The problem with these two names is that the artifact names will have solr- prepended, and then solr will occur twice in their names: solr-morphlines-solr-core-4.7.0.jar, solr-morphlines-solr-cell-4.7.0.jar. Yuck. Ah, argh. In this light, what Mark suggested seems good to me as well.
-- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13838064#comment-13838064 ] wolfgang hoschek commented on SOLR-1301: +1 on Steve's suggestion as well. Thanks for helping out! -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13838305#comment-13838305 ] wolfgang hoschek edited comment on SOLR-1301 at 12/3/13 11:11 PM: -- Upon a bit more reflection it might be better to call the contrib map-reduce and the artifact solr-map-reduce. This keeps the door open to potentially later add things like a Hadoop SolrInputFormat, i.e. read from solr via MR, rather than just write to solr via MR. was (Author: whoschek): Upon a bit more reflection might be better to call the contrib map-reduce and the artifact solr-map-reduce. This keeps the door upon to potentially later add things like a Hadoop SolrInputFormat, i.e. read from solr via MR, rather than just write to solr via MR. -- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13838305#comment-13838305 ] wolfgang hoschek commented on SOLR-1301: Upon a bit more reflection it might be better to call the contrib map-reduce and the artifact solr-map-reduce. This keeps the door open to potentially later add things like a Hadoop SolrInputFormat, i.e. read from solr via MR, rather than just write to solr via MR.
-- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13837068#comment-13837068 ] wolfgang hoschek commented on SOLR-1301: There is also a known issue in that Morphlines don't work on Windows because the Guava Classpath utility doesn't work with Windows path conventions. For example, see http://mail-archives.apache.org/mod_mbox/flume-dev/201310.mbox/%3c5acffcd9-4ad7-4e6e-8365-ceadfac78...@cloudera.com%3E
-- This message was sent by Atlassian JIRA (v6.1#6144) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [JENKINS] Lucene-Solr-trunk-Linux (32bit/jdk1.8.0-ea-b117) - Build # 8549 - Still Failing!
Looks like Java's service loader lookup impl has become more strict in Java 8. This issue on Java 8 is kind of unfortunate because morphlines and solr-mr don't actually use JAXP at all. For the time being it might be best to disable testing on Java 8 for this contrib, in order to get a stable build and make progress on other issues. A couple of options that come to mind for how to deal with this longer term:

1) Remove the dependency on cdk-morphlines-saxon (which pulls in the saxon jar) or

2) Replace all Solr calls to JAXP XPathFactory.newInstance() with a little helper that first tries to use one of a list of well-known XPathFactory subclasses, and only if that fails falls back to the generic XPathFactory.newInstance(). E.g. use something like XPathFactory.newInstance(XPathFactory.DEFAULT_OBJECT_MODEL_URI, com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl, ClassLoader.getSystemClassLoader()); There are 14 such XPathFactory.newInstance() calls in the Solr codebase. or

3) Somehow remove the META-INF/services/javax.xml.xpath.XPathFactory file from the saxon jar (this is what's causing this, and we don't need that file, but it's not clear how to remove it, realistically)

Approach 2) might be best. Thoughts? Wolfgang.

On Dec 2, 2013, at 4:41 PM, Mark Miller wrote: Uwe mentioned this in IRC - I guess Saxon doesn’t play nice with java 8. http://stackoverflow.com/questions/7914915/syntax-error-in-javax-xml-xpath-xpathfactory-provider-configuration-file-of-saxo - Mark

On Dec 2, 2013, at 7:06 PM, Policeman Jenkins Server jenk...@thetaphi.de wrote: Build: http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Linux/8549/ Java: 32bit/jdk1.8.0-ea-b117 -server -XX:+UseSerialGC 3 tests failed. FAILED: junit.framework.TestSuite.org.apache.solr.hadoop.MorphlineReducerTest Error Message: 1 thread leaked from SUITE scope at org.apache.solr.hadoop.MorphlineReducerTest: 1) Thread[id=17, name=Thread-4, state=TIMED_WAITING, group=TGRP-MorphlineReducerTest] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) at org.apache.solr.hadoop.HeartBeater.run(HeartBeater.java:108) Stack Trace: com.carrotsearch.randomizedtesting.ThreadLeakError: 1 thread leaked from SUITE scope at org.apache.solr.hadoop.MorphlineReducerTest: 1) Thread[id=17, name=Thread-4, state=TIMED_WAITING, group=TGRP-MorphlineReducerTest] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) at org.apache.solr.hadoop.HeartBeater.run(HeartBeater.java:108) at __randomizedtesting.SeedInfo.seed([FA8A1D94A2BB2925]:0) FAILED: junit.framework.TestSuite.org.apache.solr.hadoop.MorphlineReducerTest Error Message: There are still zombie threads that couldn't be terminated: 1) Thread[id=17, name=Thread-4, state=TIMED_WAITING, group=TGRP-MorphlineReducerTest] at sun.misc.Unsafe.park(Native Method) at
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) at org.apache.solr.hadoop.HeartBeater.run(HeartBeater.java:108) Stack Trace: com.carrotsearch.randomizedtesting.ThreadLeakError: There are still zombie threads that couldn't be terminated: 1) Thread[id=17, name=Thread-4, state=TIMED_WAITING, group=TGRP-MorphlineReducerTest] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277) at org.apache.solr.hadoop.HeartBeater.run(HeartBeater.java:108) at
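As a side note on option 3) from the thread above: stripping the service file is mechanically simple if one is willing to repackage a local copy of the jar. Below is a sketch using the JDK 7 zip filesystem; the class name and the jar-path command-line argument are illustrative, and this is a build-time workaround rather than a proper fix.

    import java.net.URI;
    import java.nio.file.FileSystem;
    import java.nio.file.FileSystems;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Collections;

    // Deletes the JAXP service registration file from a jar in place,
    // so the service loader no longer discovers Saxon's XPathFactory.
    public class StripXPathServiceFile {
      public static void main(String[] args) throws Exception {
        URI jarUri = URI.create("jar:" + Paths.get(args[0]).toUri());
        try (FileSystem zipFs = FileSystems.newFileSystem(
            jarUri, Collections.<String, Object>emptyMap())) {
          Files.deleteIfExists(
              zipFs.getPath("META-INF/services/javax.xml.xpath.XPathFactory"));
        }
      }
    }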
Re: Welcome Joel Bernstein
Welcome Joel! Wolfgang. On Oct 3, 2013, at 9:56 AM, Erick Erickson wrote: Welcome Joel! On Thu, Oct 3, 2013 at 9:33 AM, Martijn v Groningen martijn.v.gronin...@gmail.com wrote: Welcome Joel! On 3 October 2013 15:45, Shawn Heisey s...@elyograg.org wrote: On 10/2/2013 11:24 PM, Grant Ingersoll wrote: The Lucene PMC is happy to welcome Joel Bernstein as a committer on the Lucene and Solr project. Joel has been working on a number of issues on the project and we look forward to his continued contributions going forward. Welcome to the project! Best of luck to you! -- Kind regards, Martijn van Groningen - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Welcome back, Wolfgang Hoschek!
Thanks to all! Looking forward to more contributions. Wolfgang. On Sep 26, 2013, at 3:21 AM, Uwe Schindler wrote: Hi, I'm pleased to announce that after a long abstinence, Wolfgang Hoschek rejoined the Lucene/Solr committer team. He is now working at Cloudera and plans to help with the integration of Solr and Hadoop. Wolfgang originally wrote the MemoryIndex, which is used by the classical Lucene highlighter and ElasticSearch's percolator module. Looking forward to new contributions. Welcome back to heavy committing! :-) Uwe P.S.: Wolfgang, as soon as you have set up your subversion access, you should add yourself back to the committers list on the website as well. - Uwe Schindler uschind...@apache.org Apache Lucene PMC Chair / Committer Bremen, Germany http://lucene.apache.org/ - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13768629#comment-13768629 ] wolfgang hoschek commented on SOLR-1301: cdk-morphlines-solr-core and cdk-morphlines-solr-cell should remain separate and be available through separate maven modules so that clients such as Flume Solr Sink and HBase Indexer can continue to choose to depend (or not depend) on them. For example, not everyone wants Tika and its dependency chain. -- This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13768662#comment-13768662 ] wolfgang hoschek commented on SOLR-1301: Seems like the patch still misses tika-xmp. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13763618#comment-13763618 ] wolfgang hoschek commented on SOLR-1301: FYI, one thing that's definitely off in that adhoc ivy.xml above is that it should use com.typesafe rather than org.skife.com.typesafe.config. Use version 1.0.2 of it. See http://search.maven.org/#search%7Cga%7C1%7Ctypesafe-config Maybe best to wait for Mark to post our full ivy.xml, though. (Moving all our solr-mr dependencies from Cloudera Search maven to ivy was a bit of a beast). -- This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
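For anyone wiring this up, the artifact in question is the HOCON config library whose packages live under com.typesafe.config, which is why the com.typesafe group id matters. A minimal hedged usage sketch follows; the file name morphline.conf is illustrative.

{code}
import java.io.File;
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

// Parse a HOCON file with typesafe-config 1.0.x, the format that
// morphline .conf files are written in. The file name is illustrative.
public class ConfigSmokeTest {
  public static void main(String[] args) {
    Config config = ConfigFactory.parseFile(new File("morphline.conf"));
    System.out.println(config.root().render()); // dump the parsed tree
  }
}
{code}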
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13763636#comment-13763636 ] wolfgang hoschek commented on SOLR-1301: By the way, docs and the downstream code for our solr-mr contrib submission is here: https://github.com/cloudera/search/tree/master/search-mr -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-1301) Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce.
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13763644#comment-13763644 ] wolfgang hoschek commented on SOLR-1301: This new solr-mr contrib uses morphlines for ETL from MapReduce into Solr. To get started, here are some pointers for morphlines background material and code: code: https://github.com/cloudera/cdk/tree/master/cdk-morphlines blog post: http://blog.cloudera.com/blog/2013/07/morphlines-the-easy-way-to-build-and-integrate-etl-apps-for-apache-hadoop/ reference guide: http://cloudera.github.io/cdk/docs/current/cdk-morphlines/morphlinesReferenceGuide.html slides: http://www.slideshare.net/cloudera/using-morphlines-for-onthefly-etl talk recording: http://www.youtube.com/watch?v=iR48cRSbW6A Add a Solr contrib that allows for building Solr indexes via Hadoop's Map-Reduce. - Key: SOLR-1301 URL: https://issues.apache.org/jira/browse/SOLR-1301 Project: Solr Issue Type: New Feature Reporter: Andrzej Bialecki Assignee: Mark Miller Fix For: 4.5, 5.0 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4661) Reduce default maxMerge/ThreadCount for ConcurrentMergeScheduler
[ https://issues.apache.org/jira/browse/LUCENE-4661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13547367#comment-13547367 ] wolfgang hoschek commented on LUCENE-4661: -- Might be good to experiment with Linux block device read-ahead settings (/sbin/blockdev --setra) and to ensure you are using a file system that does write-behind (e.g. ext4 or xfs). Larger buffer sizes typically allow for more concurrent sequential streams even on spindles. Reduce default maxMerge/ThreadCount for ConcurrentMergeScheduler Key: LUCENE-4661 URL: https://issues.apache.org/jira/browse/LUCENE-4661 Project: Lucene - Core Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.1, 5.0 I think our current defaults (maxThreadCount=#cores/2, maxMergeCount=maxThreadCount+2) are too high ... I've frequently found merges falling behind and then slowing each other down when I index on a spinning-magnets drive. As a test, I indexed all of English Wikipedia with term-vectors (= heavy on merging), using 6 threads ... at the defaults (maxThreadCount=3, maxMergeCount=5, for my machine) it took 5288 sec to index + wait for merges + commit. When I changed to maxThreadCount=1, maxMergeCount=2, indexing time dropped to 2902 seconds (45% faster). This is on a spinning-magnets disk... basically spinning-magnets disks don't handle the concurrent IO well. Then I tested an OCZ Vertex 3 SSD: at the current defaults it took 1494 seconds and at maxThreadCount=1, maxMergeCount=2 it took 1795 sec (20% slower). Net/net the SSD can handle merge concurrency just fine. I think we should change the defaults: spinning magnet drives are hurt by the current defaults more than SSDs are helped ... apps that know their IO system is fast can always increase the merge concurrency. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
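For illustration, a minimal sketch of the "apps can always increase the merge concurrency" point, assuming the Lucene 4.x API (ConcurrentMergeScheduler.setMaxMergesAndThreads); the class and method names here are hypothetical, not part of the issue's patch:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

public class MergeTuning {
  // Sketch: an IndexWriterConfig tuned for an SSD, allowing more
  // concurrent merges than would be sensible on a spinning disk.
  public static IndexWriterConfig ssdConfig() {
    ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
    cms.setMaxMergesAndThreads(5, 3); // (maxMergeCount, maxThreadCount)
    return new IndexWriterConfig(Version.LUCENE_41,
        new StandardAnalyzer(Version.LUCENE_41)).setMergeScheduler(cms);
  }
}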
Re: [jira] Field constructor, avoiding String.intern()
On Feb 23, 2007, at 10:28 AM, James Kennedy wrote: True. However, in the case where you are processing Documents one at a time and discarding them (e.g. we use a HitCollector to process all documents from a search), or memory is not an issue, it would be nice to have the ability to disable the interning for performance's sake. I don't know how much it would increase overall throughput in a variety of use cases, but one approach could be to add a copy-like-this factory method like Field.createField(Reader) to Field.java, analogous to the method Term.createTerm(String text) that was added to Term.java some time ago for a similar reason. This would guarantee that the name continues to be interned yet allows avoiding the interning overhead in use cases where a field with the same parametrization (yet a different content String/Reader) is constructed many times, which is probably the most common case where intern() overhead might matter. For example, something like Field f1 = ... Field f2 = f1.createSimilarField(Reader);

/**
 * Optimized construction of new Terms by reusing same field as this Term
 * - avoids field.intern() overhead
 * @param text The text of the new term (field is implicitly same as this Term instance)
 * @return A new Term
 */
public Term createTerm(String text) {
  return new Term(field, text, false);
}

Wolfgang. Robert Engels wrote: I don't think it is just the performance gain of equals() where intern() matters. It also reduces memory consumption dramatically when working with large collections of documents in memory - although this could also be done with constants, there is nothing in Java to enforce it (thus the use of intern()). On Feb 23, 2007, at 12:02 PM, James Kennedy wrote: In our case, we're trying to optimize document() retrieval and we found that disabling the String interning in the Field constructor improved performance dramatically. I agree that interning should be an option on the constructor. For document retrieval, at least for a small amount of fields, the performance gain of using equals() on interned strings is no match for the performance loss of interning the field name of each field. Wolfgang Hoschek-2 wrote: I noticed that, too, but in my case the difference was often much more extreme: it was one of the primary bottlenecks on indexing. This is the primary reason why MemoryIndex.addField(...) navigates around the problem by taking a parameter of type String fieldName instead of type Field:

public void addField(String fieldName, TokenStream stream) {
  /*
   * Note that this method signature avoids having a user call new
   * o.a.l.d.Field(...) which would be much too expensive due to the
   * String.intern() usage of that class.
   */

Wolfgang. On Feb 14, 2006, at 1:42 PM, Tatu Saloranta wrote: After profiling in-memory indexing, I noticed that calls to String.intern() showed up surprisingly high; especially the one from the Field() constructor. This is understandable due to the overhead String.intern() has (being a native, synchronized method; overhead incurred even if the String is already interned), and the fact this essentially gets called once per document+field combination. Now, it would be quite easy to improve things a bit (in theory), such that most intern() calls could be avoided, transparently to the calling app; for example, for each IndexWriter one could use a simple HashMap for caching interned Strings. This approach is more than twice as fast as directly calling intern(). One could also use a per-thread cache, or a global one; all of which would probably be faster.
However, the Field constructor hard-codes the call to intern(), so it would be necessary to add a new constructor that indicates that the field name is known to be interned. And there would also need to be a way to invoke the new optional functionality. Has anyone tried this approach to see if the speedup is worth the hassle (in my case it'd probably be something like 2-3%, assuming the profiler's 5% for intern() is accurate)? -+ Tatu +- - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
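For illustration, a minimal sketch of the per-writer cache idea Tatu describes, as a hypothetical helper class (not part of Lucene; Java 1.4-style, no generics):

import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-IndexWriter cache that avoids most String.intern() calls. */
public class InternCache {
  private final Map cache = new HashMap(); // String -> interned String

  public synchronized String intern(String s) {
    String interned = (String) cache.get(s);
    if (interned == null) {          // first sighting: pay intern() once
      interned = s.intern();
      cache.put(interned, interned); // later lookups are a cheap map hit
    }
    return interned;
  }
}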
Re: [jira] Commented: (LUCENE-794) Beginnings of a span based highlighter
I need to read the TokenStream at least twice. I used the horribly hacky but quick-for-me method of adding a method to MemoryIndex that accepts a List of Tokens. Any ideas? I'm not sure about modifying MemoryIndex. It should be easy enough to create a subclass of TokenStream - (CachedTokenStream perhaps?) which takes a real TokenStream in its constructor and delegates all next calls to it (and also records them in a List) for the first use. This can then be rewound and re-used to run through the same set of tokens held in the list from the first run. Yes, as Mark points out this can be done without API change via the existing MemoryIndex.addField(String fieldName, TokenStream stream). The TokenStream could be constructed along similar lines as done in MemoryIndex.keywordTokenStream(Collection) or perhaps as in org.apache.lucene.index.memory.AnalyzerUtil.getTokenCachingAnalyzer(Analyzer). If needed, an IndexReader can be created from a MemoryIndex via MemoryIndex.createSearcher().getIndexReader(), again without API change. Wolfgang. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
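For illustration, a minimal sketch of such a CachedTokenStream - a hypothetical class, written against the pre-2.9 TokenStream API where next() returns a Token:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class CachedTokenStream extends TokenStream {
  private final TokenStream input;
  private final List cache = new ArrayList();
  private int pos = -1; // -1 means first pass (reading from input)

  public CachedTokenStream(TokenStream input) {
    this.input = input;
  }

  public Token next() throws IOException {
    if (pos >= 0) { // replay mode: serve tokens recorded on the first pass
      return pos < cache.size() ? (Token) cache.get(pos++) : null;
    }
    Token t = input.next(); // first pass: delegate and record
    if (t != null) cache.add(t);
    return t;
  }

  /** Rewind to replay the recorded tokens from the beginning. */
  public void rewind() {
    pos = 0;
  }
}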
[jira] Commented: (LUCENE-129) Finalizers are non-canonical
[ https://issues.apache.org/jira/browse/LUCENE-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462579 ] wolfgang hoschek commented on LUCENE-129: - Just to clarify: The empty finalize() method body in MemoryIndex measurably improves performance of this class and it does not harm correctness because MemoryIndex does not require the superclass semantics wrt. concurrency. Finalizers are non-canonical Key: LUCENE-129 URL: https://issues.apache.org/jira/browse/LUCENE-129 Project: Lucene - Java Issue Type: Bug Components: Other Affects Versions: unspecified Environment: Operating System: other Platform: All Reporter: Esmond Pitt Assigned To: Michael McCandless Priority: Minor Fix For: 2.1 The canonical form of a Java finalizer is:

protected void finalize() throws Throwable {
  try {
    // ... local code to finalize this class
  } catch (Throwable t) {
  }
  super.finalize(); // finalize base class.
}

The finalizers in IndexReader, IndexWriter, and FSDirectory don't conform. This is probably minor or null in effect, but the principle is important. As a matter of fact FSDirectory.finalize() is entirely redundant and could be removed, as it doesn't do anything that RandomAccessFile.finalize wouldn't do automatically. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
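For illustration, a sketch of the empty-finalizer trick the comment above alludes to (class names hypothetical): HotSpot typically treats a trivial finalize() body as if no finalizer were declared, so such instances skip finalizer registration. This is only safe when the superclass cleanup semantics are genuinely not needed.

public class FinalizerDemo {
  static class ReaderWithFinalizer {
    protected void finalize() throws Throwable {
      try {
        // costly cleanup here
      } finally {
        super.finalize();
      }
    }
  }

  /** Subclass that opts out: an empty finalize() body lets the JVM skip
   * finalizer registration, avoiding per-object finalization overhead. */
  static class UnfinalizedReader extends ReaderWithFinalizer {
    protected void finalize() {
    }
  }
}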
[jira] Commented: (LUCENE-550) InstanciatedIndex - faster but memory consuming index
[ http://issues.apache.org/jira/browse/LUCENE-550?page=comments#action_12451817 ] wolfgang hoschek commented on LUCENE-550: - All Lucene unit tests have been adapted to work with my alternate index. Everything but proximity queries passes. Sounds like you're almost there :-) Regarding indexing performance with MemoryIndex: Performance is more than good enough. I've observed and measured that often the bottleneck is not the MemoryIndex itself, but rather the Analyzer type (e.g. StandardAnalyzer) or the I/O for the input files or term lower casing (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265809) or something else entirely. Regarding query performance with MemoryIndex: Some queries are more efficient than others. For example, fuzzy queries are much less efficient than wild card queries, which in turn are much less efficient than simple term queries. Such effects seem partly inherent due to the nature of the query type, partly a function of the chosen data structure (RAMDirectory, MemoryIndex, II, ...), and partly a consequence of the overall Lucene API design. The query mix found in testqueries.txt is more intended for correctness testing than benchmarking. Therein, certain query types dominate over others, and thus, conclusions about the performance of individual aspects cannot easily be drawn. Wolfgang. InstanciatedIndex - faster but memory consuming index - Key: LUCENE-550 URL: http://issues.apache.org/jira/browse/LUCENE-550 Project: Lucene - Java Issue Type: New Feature Components: Store Affects Versions: 1.9 Reporter: Karl Wettin Attachments: class_diagram.png, class_diagram.png, instanciated_20060527.tar, InstanciatedIndexTermEnum.java, lucene.1.9-karl1.jpg, lucene2-karl_20060722.tar.gz, lucene2-karl_20060723.tar.gz After fixing the bugs, it's now 4.5 - 5 times the speed. This is true both at index and query time. Sorry if I got your hopes up too much. There are still things to be done though. Might not have time to do anything with this until next month, so here is the code if anyone wants a peek. Not good enough for Jira yet, but if someone wants to fool around with it, here it is. The implementation passes a TermEnum - TermDocs - Fields - TermVector comparison against the same data in a Directory. When it comes to features, offsets don't exist and positions are stored ugly and have bugs. You might notice that norms are float[] and not byte[]. That is me who refactored it to see if it would do any good. Bit shifting doesn't take many ticks, so I might just revert that. I believe the code is quite self explaining. InstanciatedIndex ii = .. ii.new InstanciatedIndexReader(); ii.addDocument(s).. replace IndexWriter for now. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-550) InstanciatedIndex - faster but memory consuming index
[ http://issues.apache.org/jira/browse/LUCENE-550?page=comments#action_12451768 ] wolfgang hoschek commented on LUCENE-550: - Ok. That means a basic test passes. For some more exhaustive tests, run all the queries in src/test/org/apache/lucene/index/memory/testqueries.txt against matching files such as

String[] files = listFiles(new String[] {
  "*.txt", "*.html", "*.xml", "xdocs/*.xml",
  "src/java/test/org/apache/lucene/queryParser/*.java",
  "src/java/org/apache/lucene/index/memory/*.java",
});

See testMany() for details. Repeat for various analyzer, stopword, and toLowerCase settings, such as

boolean toLowerCase = true;
//boolean toLowerCase = false;
//Set stopWords = null;
Set stopWords = StopFilter.makeStopSet(StopAnalyzer.ENGLISH_STOP_WORDS);
Analyzer[] analyzers = new Analyzer[] {
  //new SimpleAnalyzer(),
  //new StopAnalyzer(),
  //new StandardAnalyzer(),
  PatternAnalyzer.DEFAULT_ANALYZER,
  //new WhitespaceAnalyzer(),
  //new PatternAnalyzer(PatternAnalyzer.NON_WORD_PATTERN, false, null),
  //new PatternAnalyzer(PatternAnalyzer.NON_WORD_PATTERN, true, stopWords),
  //new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS),
};

InstanciatedIndex - faster but memory consuming index - Key: LUCENE-550 URL: http://issues.apache.org/jira/browse/LUCENE-550 Project: Lucene - Java Issue Type: New Feature Components: Store Affects Versions: 1.9 Reporter: Karl Wettin -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-550) InstanciatedIndex - faster but memory consuming index
[ http://issues.apache.org/jira/browse/LUCENE-550?page=comments#action_12451731 ] wolfgang hoschek commented on LUCENE-550: - Other question: when running the driver in test mode (checking for equality of query results against RAMDirectory) does InstantiatedIndex pass all tests? That would be great! InstanciatedIndex - faster but memory consuming index - Key: LUCENE-550 URL: http://issues.apache.org/jira/browse/LUCENE-550 Project: Lucene - Java Issue Type: New Feature Components: Store Affects Versions: 1.9 Reporter: Karl Wettin -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-550) InstanciatedIndex - faster but memory consuming index
[ http://issues.apache.org/jira/browse/LUCENE-550?page=comments#action_12451730 ] wolfgang hoschek commented on LUCENE-550: - What's the benchmark configuration? For example, is throughput bounded by indexing or querying? Measuring N queries against a single preindexed document vs. 1 precompiled query against N documents? See the line boolean measureIndexing = false; // toggle this to measure query performance in my driver. If measuring indexing, what kind of analyzer / token filter chain is used? If measuring queries, what kind of query types are in the mix, with which relative frequencies? You may want to experiment with modifying/commenting/uncommenting various parts of the driver setup, for any given target scenario. Would it be possible to post the benchmark code, test data, queries for analysis? InstanciatedIndex - faster but memory consuming index - Key: LUCENE-550 URL: http://issues.apache.org/jira/browse/LUCENE-550 Project: Lucene - Java Issue Type: New Feature Components: Store Affects Versions: 1.9 Reporter: Karl Wettin -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: MemoryIndex
MemoryIndex was designed to maximize performance for a specific use case: pure in-memory data structure, at most one document per MemoryIndex instance, any number of fields, high frequency reads, high frequency index writes, no thread-safety required, optional support for storing offsets. I briefly considered extending it to the multi-document case, but eventually refrained from doing so, because I didn't really need such functionality myself (no itch). Here are some issues to consider when attempting such an extension:
- The internal data structure would probably look quite different
- Data structure/algorithmic trade-offs regarding time vs space, read vs. write frequency, common vs. less common use cases
- Hence, it may well turn out that there's not much to reuse.
- A priori, it isn't clear whether a new solution would be significantly faster than normal RAMDirectory usage. Thus...
- Need benchmark suite to evaluate the chosen trade-offs.
- Need tests to ensure correctness (in practice, meaning it behaves just like the existing alternative).
I'd say it's a non-trivial undertaking. For example, right now, I don't have time for such an effort. That doesn't mean it's impossible or shouldn't be done, of course. If someone would like to run with it that would be great, but in light of the above issues, I'd suggest doing it in a new class (say MultiMemoryIndex or similar). I believe Mark has done some initial work in that direction, based on an independent (and different) implementation strategy. Wolfgang. On May 2, 2006, at 12:25 AM, Robert Engels wrote: Along the lines of Lucene-550, what about having a MemoryIndex that accepts multiple documents, then writes the index once at the end in the Lucene file format (so it could be merged) during close. When adding documents using an IndexWriter, a new segment is created for each document, and then the segments are periodically merged in memory, and/or with disk segments. It seems that when constructing an Index or updating a lot of documents in an existing index, the write, read, merge cycle is inefficient, and if the documents/field information were maintained in order (TreeMaps) greater efficiency would be realized. With a memory index, the memory needed during update will increase dramatically, but this could still be bounded, and a disk based index segment written when too many documents are in the memory index (max buffered documents). Does this sound like an improvement? Has anyone else tried something like this? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Optimizing/minimizing memory usage of memory-based indexes
Initially it might, but probably eventually not. I was thinking Lucene formats might also be a bit more compact than vanilla hash maps, but I guess that depends on many factors. But I will probably want to play with actual queries later on, based on frequencies. OK. In the latter case, are you using org.apache.lucene.store.RAMDirectory or org.apache.lucene.index.memory.MemoryIndex? I'm using RAMDirectory. Should I be using MemoryIndex maybe instead (I'll check it out)? The main constraint is that a MemoryIndex instance can only hold *one* lucene document (though it can have any number of fields). MemoryIndex is designed to be a transient throw-away data structure, for streaming/publish-subscribe use cases. If it's applicable, MemoryIndex has better performance but worse memory consumption than RAMDirectory. I can't tell whether that may or may not be an issue for your case. Wolfgang. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
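For illustration, a minimal sketch of the single-document MemoryIndex usage pattern (field names and text are made up; assumes the contrib MemoryIndex API of that era):

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class MemoryIndexSketch {
  public static void main(String[] args) throws Exception {
    MemoryIndex index = new MemoryIndex(); // holds exactly one document
    index.addField("title", "Salmon fishing manual", new SimpleAnalyzer());
    index.addField("abstract", "How to catch salmon in rivers", new SimpleAnalyzer());
    Query query = new QueryParser("abstract", new SimpleAnalyzer()).parse("+salmon +catch");
    float score = index.search(query); // > 0.0 means the document matches
    System.out.println("score=" + score);
  }
}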
Re: Advanced query language
On Dec 17, 2005, at 2:36 PM, Paul Elschot wrote: Gentlemen, While maintaining my bookmarks I ran into this: Case Study: Enabling Low-Cost XML-Aware Searching Capable of Complex Querying: http://www.idealliance.org/papers/xmle02/dx_xmle02/papers/03-02-08/03-02-08.html Some loose thoughts: In the system described there a Lucene document is used for each low level xml construct, even when it contains very few characters of text. The resulting Lucene indexes are at least 2.5 times the size of the original document, which is not a surprise given this document structure. Normal index size is about one third of the indexed text. I don't know about the XQuery standard, but I was wondering whether this unusual document structure and the non-straightforward fit between Lucene queries and XQuery queries are related. Seems that a lot of metadata beyond the actual text is stored. For example, node type, ancestors, parent, number of children, etc., for each element and attribute. If the fulltext is relatively small, as is often the case in quite structured XML such as the Shakespeare collection, that should significantly increase storage space. For example, Romeo and Juliet goes along the following lines:

<SPEECH>
  <SPEAKER>FRIAR LAURENCE</SPEAKER>
  <LINE>Not in a grave,</LINE>
  <LINE>To lay one in, another out to have.</LINE>
</SPEECH>
<SPEECH>
  <SPEAKER>ROMEO</SPEAKER>
  <LINE>I pray thee, chide not; she whom I love now</LINE>
  <LINE>Doth grace for grace and love for love allow;</LINE>
  <LINE>The other did not so.</LINE>
</SPEECH>
<SPEECH>
  <SPEAKER>FRIAR LAURENCE</SPEAKER>
  <LINE>O, she knew well</LINE>
  <LINE>Thy love did read by rote and could not spell.</LINE>
  <LINE>But come, young waverer, come, go with me,</LINE>
  <LINE>In one respect I'll thy assistant be;</LINE>
  <LINE>For this alliance may so happy prove,</LINE>
  <LINE>To turn your households' rancour to pure love.</LINE>
</SPEECH>

As for the joins and iterations over items from the stream of XML results: iteration over matching XML constructs should be no problem in Lucene. Joins in Lucene are normally done via boolean filters, so I was wondering how XQuery joins fit these. Similar to SQL. The engine constructs a logical execution plan for the query, and rewrites it into an optimized physical plan as deemed appropriate, perhaps guided by statistics, using a nested loop, hash join, or any other more sophisticated strategy. Wolfgang. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Advanced query language
over matching XML constructs should be no problem in Lucene. Joins in Lucene are normally done via boolean filters, so I was wondering how XQuery joins fit these. The case study above has a note at the end of par 5.3: The Search Result list that comes back could then be organized by document id to group together all the results for a single XML document. This is not provided by default, but has been done with extensions to this code. Regards, Paul Elschot On Friday 16 December 2005 03:45, Wolfgang Hoschek wrote: I think implementing an XQuery Full-Text engine is far beyond the scope of Lucene. Implementing a building block for the fulltext aspect of it would be more manageable. Unfortunately the W3C fulltext drafts indiscriminately mix and mingle two completely different languages into a single language, without clear boundaries. That's why most practical folks implement XQuery fulltext search via extension functions rather than within XQuery itself. This also allows for much more detailed tokenization, configuration and extensibility than what would be possible with the W3C draft. Wolfgang. On Dec 15, 2005, at 4:20 PM, [EMAIL PROTECTED] wrote: Mark, This is very cool. When I was at TripleHop we did something very similar where both query and results conformed to an XML Schema and we used XML over HTTP as our main vehicle to do remote/federated searches with quick rendering with stylesheets. That however is the first piece of the puzzle. If you really want to go beyond search (in the traditional sense) and be able to perform more complex operations such as joins and iterations over items from the stream of XML results you are getting you should consider implementing an XQuery Full-Text engine with Lucene adopting the now standard XQuery language. Here is the pointer to the working draft on the W3C working draft on XQuery 1.0 and XPath 2.0 Full-Text: http://www.w3.org/TR/xquery-full-text/ Now I'm part of the task force editing this draft so your comments are very much welcomed. -- J.D. http://www.inperspective.com/lucene/LXQueryV0_1.zip I've implemented just a few queries (Boolean, Term, FilteredQuery, BoostingQuery ...) but other queries are fairly trivial to add. At this stage I am more interested in feedback on parser design/approach rather than trying to achieve complete coverage of all the Lucene Query types or debating the choice of tag names. Please see the readme.txt in the package for more details. Cheers Mark - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Advanced query language
I think implementing an XQuery Full-Text engine is far beyond the scope of Lucene. Implementing a building block for the fulltext aspect of it would be more manageable. Unfortunately the W3C fulltext drafts indiscriminately mix and mingle two completely different languages into a single language, without clear boundaries. That's why most practical folks implement XQuery fulltext search via extension functions rather than within XQuery itself. This also allows for much more detailed tokenization, configuration and extensibility than what would be possible with the W3C draft. Wolfgang. On Dec 15, 2005, at 4:20 PM, [EMAIL PROTECTED] wrote: Mark, This is very cool. When I was at TripleHop we did something very similar where both query and results conformed to an XML Schema and we used XML over HTTP as our main vehicle to do remote/federated searches with quick rendering with stylesheets. That however is the first piece of the puzzle. If you really want to go beyond search (in the traditional sense) and be able to perform more complex operations such as joins and iterations over items from the stream of XML results you are getting you should consider implementing an XQuery Full-Text engine with Lucene adopting the now standard XQuery language. Here is the pointer to the working draft on the W3C working draft on XQuery 1.0 and XPath 2.0 Full-Text: http://www.w3.org/TR/xquery-full-text/ Now I'm part of the task force editing this draft so your comments are very much welcomed. -- J.D. http://www.inperspective.com/lucene/LXQueryV0_1.zip I've implemented just a few queries (Boolean, Term, FilteredQuery, BoostingQuery ...) but other queries are fairly trivial to add. At this stage I am more interested in feedback on parser design/approach rather than trying to achieve complete coverage of all the Lucene Query types or debating the choice of tag names. Please see the readme.txt in the package for more details. Cheers Mark - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Advanced query language
Right now the Sun STAX impl is decidedly buggy compared to xerces SAX (and it's not faster either). The most complete, reliable and efficient STAX impl seems to be woodstox. Wolfgang. On Dec 15, 2005, at 7:22 PM, Yonik Seeley wrote: Agreed, that is a significant downside. StAX is included in Java 6, but that doesn't help too much given the Java 1.4 req. -Yonik On 12/15/05, Wolfgang Hoschek [EMAIL PROTECTED] wrote: STAX would probably make coding easier, but unfortunately complicates the packaging side: one must ship at least two additional external jars (stax interfaces and impl) for it to become usable. Plus, STAX is quite underspecified (I wrote a STAX parser + serializer impl lately), so there's room for runtime surprises with different impls. The primary advantage of SAX is that everything is included in JDK >= 1.4, and that impls tend to be more mature. SAX bottom line: more hassle early on, less hassle later. Wolfgang. On Dec 15, 2005, at 5:47 PM, Yonik Seeley wrote: On 12/15/05, markharw00d [EMAIL PROTECTED] wrote: At this stage I am more interested in feedback on parser design/approach Excellent idea. While SAX is fast, I've found callback interfaces more difficult to deal with while generating nested object graphs... it normally requires one to maintain state in stack(s). Have you considered a pull-parser like StAX or XPP? They are as fast as SAX, and allow you to ask for the next XML event you are interested in, eliminating the need to keep track of where you are by other means (the place in your own code and normal variables do that). It normally turns into much more natural code. -Yonik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
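For illustration, a minimal sketch of the pull style Yonik describes, using the javax.xml.stream API (the XML snippet is made up):

import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class PullExample {
  public static void main(String[] args) throws Exception {
    XMLStreamReader r = XMLInputFactory.newInstance()
        .createXMLStreamReader(new StringReader("<q><term f='a'>x</term></q>"));
    // the caller pulls events in normal control flow, instead of keeping
    // state across SAX callbacks
    while (r.hasNext()) {
      if (r.next() == XMLStreamConstants.START_ELEMENT) {
        System.out.println("element: " + r.getLocalName());
      }
    }
  }
}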
Re: Advanced query language
That's basically what I'm implementing with Nux, except that the syntax and calling conventions are a bit different, and that Lucene analyzers can optionally be specified, which makes it a lot more powerful (but also a bit more complicated). Wolfgang. On Dec 6, 2005, at 10:48 AM, Incze Lajos wrote: Maybe, I'm a bit late with this, but. There is an ongoing effort at the W3C to define a fulltext search language that could extend their XPath and XQuery languages (which clearly makes sense). These are the current documents on the topic: http://www.w3.org/TR/2005/WD-xquery-full-text-20051103/ http://www.w3.org/TR/2005/WD-xmlquery-full-text-use-cases-20051103/ incze (In this case, the query language itself is not XML, as it has to serve as a selection criterion in an XPath or XQuery expression, but it is XML-conformant, so it may be embedded in any XML doc.) - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Advanced query language
Hopefully that makes sense to someone besides just me. It's certainly a lot more complexity than a simple one-to-one mapping, but it seems to me like the flexibility is worth spending the extra time to design/build it. Makes perfect sense to me, and it doesn't seem any more complex than what's been proposed before. Actually, this may be a quite straightforward, compact and extensible way of doing it all. Though, I'd be careful with proposing a variety of equivalent syntaxes as it may easily lead to more confusion than good. Let's start with one canonical syntax. If desired, other (more pleasant) syntaxes may then be converted to that as part of a preprocessing step. Wolfgang. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Advanced query language
I should add that I'd love to see a powerful, extensible yet easy to read XML based query syntax, and make that available to users of XQuery fulltext search. Here is an example fulltext XQuery that finds all books authored by James that have something to do with 'salmon fishing manuals', sorted by relevance:

declare namespace lucene = "java:nux.xom.pool.FullTextUtil";
declare variable $query := "+salmon~ +fish* manual~"; (: any arbitrary Lucene query can go here :)
(: declare variable $query as xs:string external; :)
for $book in /books/book[author="James" and lucene:match(abstract, $query) > 0.0]
let $score := lucene:match($book/abstract, $query)
order by $score descending
return $book

Now, instead of handing a quite limited lucene query string to lucene:match($query), as above, I'd love to pass it an XML query blurb that makes all of lucene's power accessible without the user having to construct query objects himself. Consider it an additional use case beyond what Erik and others brought up so far... Wolfgang. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: open source YourKit licence
Yonik, I haven't been terribly active lately, but I've been voted in as committer as well... :-) http://marc.theaimsgroup.com/?l=lucene-dev&w=2&r=1&s=hoschek+committer&q=b Cheers, Wolfgang. On Dec 2, 2005, at 2:53 PM, Yonik Seeley wrote: ~yonik/yourkit/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene does NOT use UTF-8.
On Aug 30, 2005, at 12:47 PM, Doug Cutting wrote: Yonik Seeley wrote: I've been looking around... do you have a pointer to the source where just the suffix is converted from UTF-8? I understand the index format, but I'm not sure I understand the problem that would be posed by the prefix length being a byte count. TermBuffer.java:66 Things could work fine if the prefix length were a byte count. A byte buffer could easily be constructed that contains the full byte sequence (prefix + suffix), and then this could be converted to a String. The inefficiency would be if prefix were re-converted from UTF-8 for each term, e.g., in order to compare it to the target. Prefixes are frequently longer than suffixes, so this could be significant. Does that make sense? I don't know whether it would actually be significant, although TermBuffer.java was added recently as a measurable performance enhancement, so this is performance critical code. We need to stop discussing this in the abstract and start coding alternatives and benchmarking them. Is java.nio.charset.CharsetEncoder fast enough? Will moving things through CharBuffer and ByteBuffer be too slow? Should Lucene keep maintaining its own UTF-8 implementation for performance? I don't know, only some experiments will tell. Doug I don't know if it matters for Lucene usage. But if using CharsetEncoder/CharBuffer/ByteBuffer should turn out to be a significant problem, it's probably due to startup/init time of these methods for individually converting many small strings, not inherently due to UTF-8 usage. I'm confident that a custom UTF-8 implementation can almost completely eliminate these issues. I've done this before for binary XML with great success, and it could certainly be done for lucene just as well. Bottom line: It's probably an issue that can be dealt with via proper impl; it probably shouldn't dictate design directions. Wolfgang. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
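As an illustration of the CharsetEncoder route Doug mentions, a minimal sketch of a reusable encoder (a hypothetical helper class, not Lucene code):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

/** Hypothetical helper: one reusable UTF-8 encoder per thread/instance. */
public class Utf8Encoder {
  // created once, reused for many small strings (CharsetEncoder is not
  // thread-safe, so share only within a single thread)
  private final CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();

  public ByteBuffer encode(String s) throws CharacterCodingException {
    // encode() resets the encoder internally, so repeated calls amortize
    // the charset lookup and encoder setup across many small strings
    return encoder.encode(CharBuffer.wrap(s));
  }
}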
[ANN] Nux-1.3 released
The Nux-1.3 release has been uploaded to http://dsd.lbl.gov/nux/ Nux is an open-source Java toolkit making efficient and powerful XML processing easy. Changelog:
•Upgraded to saxonb-8.5 (saxon-8.4 and 8.3 should continue to work as well).
•Upgraded to xom-1.1-rc1 (with compatible performance patches). Plain xom-1.0 should continue to work as well, albeit less efficiently.
•Numerous bnux Binary XML performance enhancements for serialization and deserialization (UTF-8 character encoding, buffer management, symbol table, pack sorting, cache locality, etc.). Overall, bnux is now about twice as fast, and, perhaps more importantly, has a much more uniform performance profile, no matter what kind of document flavour is thrown at it. It routinely delivers 50-100 MB/sec deserialization performance, and 30-70 MB/sec serialization performance (commodity PC 2004). It is roughly 5-10 times faster than xom-1.1 with xerces-2.7.1 (which, in turn, is faster than saxonb-8.5, dom4j-1.6.1 and xerces-2.7.1 DOM). Further, preliminary measurements indicate bnux deserialization and serialization to be consistently 2-3 times faster than Sun's FastInfoSet implementation, using XOM. Saxon's PTree could not be tested as it is only available in the commercial version. The only remaining area with substantial potential for performance improvement seems to be complex namespace handling. This might be addressed by slightly restructuring private XOM internals in a future version.
•BinaryXMLTest now also has command line support for testing and benchmarking Saxon, DOM and FastInfoSet (besides bnux and XOM).
•Rewrote XQueryCommand. The new nux/bin/fire-xquery is a more powerful, flexible and reliable command line test tool that runs a given XQuery against a set of files and prints the result sequence. In addition, it supports schema validation, XInclude (via XOM), an XQuery update facility, malformed HTML parsing (via TagSoup) and much more. It's available for Unix and Windows, and works like any other decent Unix command line tool.
•Removed ValidationCommand (made obsolete by the fire-xquery functionality).
•Added experimental XQuery in-place update functionality. Comments on the usefulness of the current behaviour are especially welcome, as are suggestions for potential improvements.
•Added nux.xom.xquery.ResultSequenceSerializer, which serializes an XQuery/XPath2 result sequence onto a given output stream, using various configurable serialization options such as encoding and indentation. Implements the W3C XQuery/XSLT2 Serialization Draft Spec. Also implements an alternative wrapping algorithm that ensures that any arbitrary result sequence can always be output as a well-formed XML document.
•Added XQueryFactory.createXQuery(File file, URI baseURI) and XQueryPool.getXQuery(File file, URI baseURI) to allow for separation of the location of the query file and input XML files.
•The default XQuery DocumentURIResolver now recognizes the .bnux file extension as binary XML, and parses it accordingly. For example, a query can be 'doc("samples/data/articles.xml.bnux")/articles/*'
•Added FileUtil.listFiles(). Returns the URIs of all files whose path matches at least one of the given inclusion wildcard or regular expressions but none of the given exclusion wildcard or regular expressions; starting from the given directory, optionally with recursive directory traversal, insensitive to underlying operating system conventions.
•XOMUtil.Normalizer now uses the XML whitespace definition rather than the Java whitespace definition.
•Added XOMUtil.Normalizer.STRIP, which removes Texts that consist of whitespace only (boundary whitespace), retaining other strings unchanged.
•Added AnalyzerUtil.getPorterStemmerAnalyzer() for English language stemming on full text search.
•Added XOMUtil.toDocument(String xml) convenience method to parse a string.
•Moved XOMUtil.toByteArray() and XOMUtil.toString() into class FileUtil. The old methods remain available but have been deprecated.
•Added jar-bnux ant target to optionally build a minimal jar file (20 KB) for binary XML only.
•Added more test documents to the samples/data directory.
•Updated license blurbs to 2005. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Analyzer as an Interface?
On Jul 19, 2005, at 12:58 PM, Daniel Naber wrote: Hi, currently Analyzer is an abstract class. Shouldn't we make it an Interface? Currently that's not possible, but it will be as soon as the deprecated method is removed (i.e. after Lucene 1.9). Regards Daniel Daniel, what's the use case that would make this a significant improvement over extending and overriding the single abstract method? Classes that implement multiple interfaces? For consistency, similar thoughts would apply to TokenStream, IndexReader/Writer, etc. Also note that once it's become an interface the API is effectively frozen forever. With abstract classes the option remains open to later add methods with a default impl. (e.g. tokenStream(String fieldName, String text) or whatever). Thanks, Wolfgang. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
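For illustration, a sketch of the flexibility argument (hypothetical future method; the actual Analyzer API may differ): an abstract class can later grow a convenience method with a default implementation without breaking existing subclasses, which an interface could not offer in pre-Java-8 terms.

import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;

public abstract class Analyzer {
  public abstract TokenStream tokenStream(String fieldName, Reader reader);

  // hypothetical later addition with a default impl; existing subclasses
  // keep compiling unchanged
  public TokenStream tokenStream(String fieldName, String text) {
    return tokenStream(fieldName, new StringReader(text));
  }
}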
Re: Lucene vs. Ruby/Odeum
poor java startup time For the ones really keen on reducing startup time the Jolt Java VM daemon may perhaps be of some interest: http://www.dystance.net/software/jolt/index.html I played with it a year ago when I was curious to see what could be done about startup time in the context of simple unix-scriptable command line XML webservice clients (the ones that require tons of jars as dependencies and take ages to initialize). Startup time went from 3-5 secs to zero. Feels like ls - you hit ENTER and the program completes *instantly*. Of course there's a catch. It requires some more work, and it's not a general solution wrt. isolation, security, reliability, etc. but for a simple command line lucene query tool it might just do fine, FWIW. Long-term Sun's MVM might be a more comprehensive solution, with some luck. Wolfgang. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene vs. Ruby/Odeum
As an aside, in my performance testing of Lucene using JProfiler, it seems to me that the only way to improve Lucene's performance greatly can come from 2 areas 1. optimizing the JVM array/looping/JIT constructs/capabilities to avoid bounds checking/improve performance 2. improve function call overhead Other than that, other changes will require a significant change in the code structure (manually unrolling loops), at the sacrifice of readability/maintainability. Just curious: are you more happy with JProfiler than with the JDK 1.5 profiler? I haven't used JProfiler in quite a while but my impression back then was that its overheads tend to significantly perturb measurement results. When I switched to the low-level JDK 1.5 profiler CPU tuning efforts got a lot more targeted and meaningful. So, in my experience, the least perturbing and most accurate profiler is the one built into JDK 1.5. Run java with the '-server -agentlib:hprof=cpu=samples,depth=10' flags for long enough to collect enough samples to be statistically meaningful, then study the trace log and correlate its hotspot trailer with its call stack headers (grep is your friend, a GUI isn't really needed). For a background article on hprof see http://java.sun.com/developer/technicalArticles/Programming/HPROF.html Wolfgang. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: contrib/queryParsers/surround
Cool stuff. Once this has stabilized and settled down I might start exposing the surround language from XQuery/XPath as an experimental match facility. Wolfgang. On May 28, 2005, at 10:07 AM, Paul Elschot wrote: On Saturday 28 May 2005 17:06, Erik Hatcher wrote: On May 28, 2005, at 10:04 AM, Paul Elschot wrote: Dear readers, I've started moving the surround query language http://issues.apache.org/bugzilla/show_bug.cgi?id=34331 into the directory named by the title in my working copy of the lucene trunk. When the tests pass I'll repost it there. In case someone needs this earlier, please holler. As for naming conventions and where this should live in contrib, consider that a user will only want a single query parser and more than that would be unneeded bloat in her application. The contrib pieces are all packaged as a separate JAR per directory under contrib. My recommendation would be to put your wonderful surround parser and supporting infrastructure under contrib/surround. I'm very much looking forward to having this available! Meanwhile the tests pass again with some expected standard output. A little bit of deprecation is left in the CharStream (getLine and getColumn) in the parser. Would you have any idea how to deal with that? I'll leave the build.xml stand-alone with constants for the environment. It was derived from a lucene build.xml of a few eons ago, so I hope someone can still integrate it... Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[ANN] nux-1.2 release
The nux-1.2 release has been uploaded to http://dsd.lbl.gov/nux/ Nux is an open-source Java XML toolset geared towards embedded use in high-throughput XML messaging middleware such as large-scale Peer-to-Peer infrastructures, message queues, publish-subscribe and matchmaking systems for Blogs/newsfeeds, text chat, data acquisition and distribution systems, application level routers, firewalls, classifiers, etc. It is not an XML database, and does not attempt to be one. Changelog: XQuery/XPath: Added optional fulltext search via the Apache Lucene engine. Similar to Google search, it is easy to use, powerful, efficient and goes far beyond what can be done with standard XPath regular expressions and string manipulation functions. It is similar in intent but not directly related to preliminary W3C fulltext search drafts. Rather than targeting fulltext search of infrequent queries over huge persistent data archives (historic search), Nux targets fulltext search of huge numbers of queries over comparatively small transient realtime data (prospective search). See FullTextUtil and MemoryIndex. Example fulltext XQuery that finds all books authored by James that have something to do with 'salmon fishing manuals', sorted by relevance:

declare namespace lucene = "java:nux.xom.pool.FullTextUtil";
declare variable $query := "+salmon~ +fish* manual~"; (: any arbitrary Lucene query can go here :)
(: declare variable $query as xs:string external; :)
for $book in /books/book[author="James" and lucene:match(abstract, $query) > 0.0]
let $score := lucene:match($book/abstract, $query)
order by $score descending
return $book

Example fulltext XQuery that matches on extracted sentences:

declare namespace lucene = "java:nux.xom.pool.FullTextUtil";
for $book in /books/book
for $s in lucene:sentences($book/abstract, 0)
return
  if (lucene:match($s, "+salmon~ +fish* manual~") > 0.0)
  then normalize-space($s)
  else ()

It is designed to enable maximum efficiency for on-the-fly matchmaking combining structured and fuzzy fulltext search in realtime streaming applications such as XQuery based XML message queues, publish-subscribe systems for Blogs/newsfeeds, text chat, data acquisition and distribution systems, application level routers, firewalls, classifiers, etc. Arbitrary Lucene fulltext queries can be run from Java or from XQuery/XPath/XSLT via a simple extension function. The former approach is more flexible whereas the latter is more convenient. Lucene analyzers can split on whitespace, normalize to lower case for case insensitivity, ignore common terms with little discriminatory value such as "he", "in", "and" (stop words), reduce the terms to their natural linguistic root form such as "fishing" being reduced to "fish" (stemming), resolve synonyms/inflexions/thesauri (upon indexing and/or querying), etc. Also see Lucene Query Syntax as well as Query Parser Rules. Background: The first prototype was put together over the weekend. The functionality worked just fine, except that it took ages to index and search text in a high-frequency environment. Subsequently I wrote a complete reimplementation of the Lucene interfaces and contributed that back to Lucene (the bits in org.apache.lucene.index.memory.*). Next, I placed a smart cache in front of it (the bits in nux.xom.pool.FullTextUtil / FullTextPool). The net effect is that fulltext queries over realtime data now run some three orders of magnitude faster while preserving the same general functionality (e.g. 10-50 queries/sec ballpark).
In fact, you'll probably notice little or no overhead when adding fulltext search to your streaming apps. See MemoryIndexBenchmark and XQueryBenchmark. Explore and enjoy, perhaps using the queries and sample data from the samples/fulltext directory as a starting point. Wolfgang.
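For readers who prefer the plain Java route mentioned above, here is a minimal hedged sketch of the same index-one-string-then-query pattern, assuming the 1.4.3-era contrib MemoryIndex API referenced throughout these mails (addField plus createSearcher); field name, text and query are invented for illustration:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class FullTextMatchSketch {
      public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // index a single transient "document" entirely in main memory
        MemoryIndex index = new MemoryIndex();
        index.addField("abstract", "salmon fishing manuals for beginners", analyzer);

        // any arbitrary Lucene query can go here, as in the XQuery examples
        Query query = new QueryParser("abstract", analyzer)
            .parse("+salmon~ +fish* manual~");

        // one-document index: relevance is hits.score(0), or 0 for no match
        IndexSearcher searcher = index.createSearcher();
        Hits hits = searcher.search(query);
        float score = hits.length() > 0 ? hits.score(0) : 0.0f;
        System.out.println("relevance: " + score);
      }
    }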
Add Term.createTerm to avoid 99% of String.intern() calls
For the MemoryIndex, I'm seeing large performance overheads due to repetitive temporary string interning of o.a.l.index.Term. For example, consider a FuzzyTermQuery or similar, scanning all terms via TermEnum in the index: 40% of the time is spent in String.intern() of new Term(). [Allocating temporary memory and FuzzyTermEnum.termCompare are less of a problem according to profiling]. Note that the field name would only need to be interned once, not time and again for each term. But the non-interning Term constructor is private and hence not accessible from o.a.l.index.memory.*. TermBuffer isn't what I'm looking for, and it's private anyway.

The best solution I came up with is to have an additional safe public method in Term.java:

    /** Constructs a term with the given text and the same interned field name as
     * this term (minimizes interning overhead). */
    public Term createTerm(String txt) { // WH
      return new Term(field, txt, false);
    }

Besides dramatically improving performance, this has the benefit of keeping the non-interning constructor private. Comments/opinions, anyone? Here's a sketch of how it can be used:

    public Term term() {
      ...
      if (cachedTerm == null)
        cachedTerm = new Term((String) sortedFields[j].getKey(), "");
      return cachedTerm.createTerm((String) info.sortedTerms[i].getKey());
    }

    public boolean next() {
      ...
      if (...) cachedTerm = null;
    }

I'll send the full patch for MemoryIndex if this is accepted. Wolfgang.
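To make the win concrete, a hedged before/after fragment (the "content" field name is invented; only the allocation pattern matters):

    import org.apache.lucene.index.Term;

    // before: every Term allocation re-interns the field name on the hot path
    Term t1 = new Term("content", termText);      // pays for "content".intern() each call

    // after: intern once via a template term, then reuse its interned field
    Term template = new Term("content", "");      // field name interned exactly once
    Term t2 = template.createTerm(termText);      // proposed method: no interning per term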
Re: Lucene vs. Ruby/Odeum
Right. One doesn't need to run those benchmarks to immediately see that most time is spent in VM startup, class loading, and hotspot compilation rather than anything Lucene related. Even a simple System.out.println("hello") typically takes some 0.3 secs on a fast box and JVM. Wolfgang.

On May 17, 2005, at 7:33 AM, Scott Ganyo wrote: Interesting, but questionable. I can imagine three problems with the write-up just off-hand: 1) JVM startup time. As the author noted, this can be an issue with short-running Java applications. 2) JVM warm-up time. The HotSpot VM is designed to optimize itself and become faster over time rather than being the fastest right out of the blocks. 3) Data access patterns. It is possible (I don't know) that Odeum is designed for quick one-time search on the data without reading and caching the index like Lucene does for subsequent queries. In each case, there is a common theme: Lucene and Java are designed to perform better for longer-running applications... not start, lookup, and terminate utilities. S

On May 16, 2005, at 9:41 PM, Otis Gospodnetic wrote: Some interesting stuff... http://www.zedshaw.com/projects/ruby_odeum/performance.html http://blog.innerewut.de/articles/2005/05/16/ruby-odeum-vs-apache-lucene
Re: [Performance] Streaming main memory indexing of single strings
Here's a performance patch for MemoryIndex.MemoryIndexReader that caches the norms for a given field, avoiding repeated recomputation of the norms. Recall that, depending on the query, norms() can be called over and over again with mostly the same parameters. Thus, replace public byte[] norms(String fieldName) with the following code:

    /** performance hack: cache norms to avoid repeated expensive calculations */
    private byte[] cachedNorms;
    private String cachedFieldName;
    private Similarity cachedSimilarity;

    public byte[] norms(String fieldName) {
      byte[] norms = cachedNorms;
      Similarity sim = getSimilarity();
      if (fieldName != cachedFieldName || sim != cachedSimilarity) { // not cached?
        Info info = getInfo(fieldName);
        int numTokens = info != null ? info.numTokens : 0;
        float n = sim.lengthNorm(fieldName, numTokens);
        byte norm = Similarity.encodeNorm(n);
        norms = new byte[] {norm};
        cachedNorms = norms;
        cachedFieldName = fieldName;
        cachedSimilarity = sim;
        if (DEBUG) System.err.println("MemoryIndexReader.norms: " + fieldName
            + ":" + n + ":" + norm + ":" + numTokens);
      }
      return norms;
    }

The effect can be substantial when measured with the profiler, so it's worth it. Wolfgang.
contrib: keywordTokenStream
Here's a convenience add-on method to MemoryIndex. If it turns out that this could be of wider use, it could be moved into the core analysis package. For the moment the MemoryIndex might be a better home. Opinions, anyone? Wolfgang.

    /**
     * Convenience method; Creates and returns a token stream that generates a
     * token for each keyword in the given collection, "as is", without any
     * transforming text analysis. The resulting token stream can be fed into
     * {@link #addField(String, TokenStream)}, perhaps wrapped into another
     * {@link org.apache.lucene.analysis.TokenFilter}, as desired.
     *
     * @param keywords
     *            the keywords to generate tokens for
     * @return the corresponding token stream
     */
    public TokenStream keywordTokenStream(final Collection keywords) {
      if (keywords == null)
        throw new IllegalArgumentException("keywords must not be null");

      return new TokenStream() {
        Iterator iter = keywords.iterator();
        int pos = 0;
        int start = 0;
        public Token next() {
          if (!iter.hasNext()) return null;

          Object obj = iter.next();
          if (obj == null)
            throw new IllegalArgumentException("keyword must not be null");

          String term = obj.toString();
          Token token = new Token(term, start, start + term.length());
          start += term.length() + 1; // separate words by 1 (blank) character
          pos++;
          return token;
        }
      };
    }
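A hedged usage sketch of the method above, feeding the resulting stream into the addField(String, TokenStream) mentioned in the javadoc (field name and keywords invented for illustration):

    import java.util.Arrays;

    MemoryIndex index = new MemoryIndex();
    index.addField("keywords",
        index.keywordTokenStream(Arrays.asList(new String[] {"lucene", "memory", "index"})));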
Re: contrib: keywordTokenStream
On May 3, 2005, at 5:26 PM, Erik Hatcher wrote: Wolfgang, I've now added this. Thanks :-) I'm not seeing how this could be generally useful. I'm curious how you are using it and why it is better suited for what you're doing than any other analyzer. "keyword tokenizer" is a bit overloaded terminology-wise, though - look in the contrib/analyzers/src/java area to see what I mean. Erik

The difference between this and the KeywordTokenizer from the contrib/analyzers is that it

- can operate on multiple keywords rather than just a single one. So it's slightly more general.

- takes a collection (typically of String values) as input rather than a Reader. I can see the java.io.Reader scalability rationale used throughout the analysis APIs, but for many use cases (including my own) Strings are a lot handier (and more efficient to deal with) - the string values are small anyway.

So it's a convenient way to add terms (keywords if you like) that have been parsed/massaged into string(s) by some existing external means (e.g. grouped regex scanning of legacy formatted text files into various fields, etc) into an index "as is", without any further transforming analysis. Most folks could write such a (non-essential) utility themselves, but it's handy in a similar way that you have the Field.Keyword convenience infrastructure...

"keyword tokenizer" is a bit overloaded terminology-wise, though: If you come up with a better name feel free to rename it. Wolfgang.
Re: [Performance] Streaming main memory indexing of single strings
I'm looking at it right now. The tests pass fine when you put lucene-1.4.3.jar instead of the current lucene onto the classpath, which is what I've been doing so far. Something seems to have changed in the scoring calculation. No idea what that might be. I'll see if I can find out. Wolfgang.

The test case is failing (type "ant test" at the contrib/memory working directory) with this:

    [junit] Testcase: testMany(org.apache.lucene.index.memory.MemoryIndexTest): Caused an ERROR
    [junit] BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, [EMAIL PROTECTED]
    [junit] java.lang.IllegalStateException: BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, [EMAIL PROTECTED]
    [junit] at org.apache.lucene.index.memory.MemoryIndexTest.run(MemoryIndexTest.java:305)
    [junit] at org.apache.lucene.index.memory.MemoryIndexTest.testMany(MemoryIndexTest.java:228)
Re: [Performance] Streaming main memory indexing of single strings
This is what I have as scoring calculation, and it seems to do exactly what lucene-1.4.3 does because the tests pass.

    public byte[] norms(String fieldName) {
      if (DEBUG) System.err.println("MemoryIndexReader.norms: " + fieldName);
      Info info = getInfo(fieldName);
      int numTokens = info != null ? info.numTokens : 0;
      byte norm = Similarity.encodeNorm(getSimilarity().lengthNorm(fieldName, numTokens));
      return new byte[] {norm};
    }

    public void norms(String fieldName, byte[] bytes, int offset) {
      if (DEBUG) System.err.println("MemoryIndexReader.norms: " + fieldName + "*");
      byte[] norms = norms(fieldName);
      System.arraycopy(norms, 0, bytes, offset, norms.length);
    }

    private Similarity getSimilarity() {
      return searcher.getSimilarity(); // this is the normal lucene IndexSearcher
    }

Can anyone see what's wrong with it for lucene current SVN? Should my calculation now be done differently? If so, how? Thanks for any clues into the right direction. Wolfgang.

On May 2, 2005, at 9:05 AM, Wolfgang Hoschek wrote: I'm looking at it right now. The tests pass fine when you put lucene-1.4.3.jar instead of the current lucene onto the classpath, which is what I've been doing so far. Something seems to have changed in the scoring calculation. No idea what that might be. I'll see if I can find out. Wolfgang.

The test case is failing (type "ant test" at the contrib/memory working directory) with this:

    [junit] Testcase: testMany(org.apache.lucene.index.memory.MemoryIndexTest): Caused an ERROR
    [junit] BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, [EMAIL PROTECTED]
    [junit] java.lang.IllegalStateException: BUG DETECTED:69 at query=term AND NOT phrase term, file=src/java/org/apache/lucene/index/memory/MemoryIndex.java, [EMAIL PROTECTED]
    [junit] at org.apache.lucene.index.memory.MemoryIndexTest.run(MemoryIndexTest.java:305)
    [junit] at org.apache.lucene.index.memory.MemoryIndexTest.testMany(MemoryIndexTest.java:228)
Re: [Performance] Streaming main memory indexing of single strings
Yes, the svn trunk uses skipTo more often than 1.4.3. However, your implementation of skipTo() needs some improvement. See the javadoc of skipTo of class Scorer: http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Scorer.html#skipTo(int)

What's wrong with the version I sent? Remember that there can be at most one document in a MemoryIndex, so the target parameter can safely be ignored, as far as I can see. In case the underlying scorers provide skipTo() it's even better to use that. The version I sent returns in O(1), if performance was your concern. Or did you mean something else? Wolfgang.
Re: [Performance] Streaming main memory indexing of single strings
The version I sent returns in O(1), if performance was your concern. Or did you mean something else? Since 0 is the only document number in the index, a "return target == 0;" might be nice for skipTo(). It doesn't really help performance, though, and the next() works just as well. Regards, Paul Elschot.

It's not just "return target == 0;". Internally next() switches a hasNext flag to false, and that makes it a safer operation... BTW, did you give the unit tests a shot? Or even better, run it against some of your own queries/test data? That might help to shake out other bugs that might potentially be lurking in remote corners... Cheers, Wolfgang.
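A hedged sketch (not the committed code) of what next()/skipTo() look like for a scorer over a single-document index, where document 0 is the only candidate and both calls answer in O(1):

    private boolean hasNext = true; // becomes false once doc 0 has been consumed

    public boolean next() {
      boolean more = hasNext;
      hasNext = false; // flip the flag: after one document the scorer is exhausted
      return more;
    }

    public boolean skipTo(int target) {
      // only doc 0 exists, so target can be ignored; delegating to next()
      // keeps the exhausted-state handling in one place, as argued above
      return next();
    }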
Re: [Performance] Streaming main memory indexing of single strings
Thanks! Wolfgang. I've committed this change after it successfully worked for me. Thanks! Erik
[Patch] IndexReader.finalize() performance
Here is the first and most high-priority patch I've settled on to get Lucene to work efficiently for the typical usage scenarios of MemoryIndex. More patches are forthcoming if this one is received favourably...

There's large overhead involved in forcing all IndexReader impls to have a finalize() method. Remember that allocating and registering finalizable objects in a JVM isn't cheap at all when it's done at high frequency, which is the case for my single document MemoryIndex usage. MemoryIndex.createSearcher() does a new MemoryIndexReader() which is a subclass of IndexReader and thus carries what for this case amounts to unnecessary IndexReader superclass baggage.

The proposal is to rename IndexReader.finalize() to IndexReader.doFinalize(), and for each subclass of IndexReader that wants or needs finalization add a method

    XYZReader.finalize() { doFinalize(); }

That way subclasses are not forced to be finalizable and incur the associated overheads. Note that it would not help to simply have an empty finalize() {} method, because that would still incur the finalizer JVM registration costs. [The other option would be to have IndexReader be an interface, but that would be a change that's a lot more involved]

Here are two test runs without and with the patch:

    [grolsch /home/portnoy/u5/hoschek/tmp/tmp/firefish] cat xjames.txt
    James is out in the woods

** NOW WITHOUT THE PATCH APPLIED: **

    [grolsch /home/portnoy/u5/hoschek/tmp/tmp/firefish] bin/fire-java org.apache.lucene.index.memory.MemoryIndexTest 3 100 mem James xjames.txt
    ### iteration=0
    *** FILE=xjames.txt
    secs = 15.046  queries/sec= 66462.85   MB/sec = 1.6479818
    ### iteration=1
    *** FILE=xjames.txt
    secs = 15.507  queries/sec= 64487.008  MB/sec = 1.5989896
    ### iteration=2
    *** FILE=xjames.txt
    secs = 15.923  queries/sec= 62802.234  MB/sec = 1.5572149
    Done benchmarking (without checking correctness).
    Dumping CPU usage by sampling running threads ... done.

** NOW WITH THE PATCH APPLIED: **

    [grolsch /home/portnoy/u5/hoschek/tmp/tmp/firefish] bin/fire-java org.apache.lucene.index.memory.MemoryIndexTest 3 100 mem James xjames.txt
    ### iteration=0
    *** FILE=xjames.txt
    secs = 4.974   queries/sec= 201045.44  MB/sec = 4.9850287
    ### iteration=1
    *** FILE=xjames.txt
    secs = 4.495   queries/sec= 222469.42  MB/sec = 5.5162477
    ### iteration=2
    *** FILE=xjames.txt
    secs = 4.49    queries/sec= 222717.16  MB/sec = 5.5223904
    Done benchmarking (without checking correctness).
    Dumping CPU usage by sampling running threads ... done.

If you're curious about the whereabouts of bottlenecks, run java 1.5 with the non-perturbing '-server -agentlib:hprof=cpu=samples,depth=10' flags, then study the trace log and correlate its hotspot trailer with its call stack headers (see hprof tracing: http://java.sun.com/developer/technicalArticles/Programming/HPROF.html). See the tail of the profiler output below and in particular note the following:

    CPU SAMPLES BEGIN (total = 918) Thu Apr 28 11:39:14 2005
    rank   self  accum   count trace method
       1 57.41% 57.41%     527 300154 org.apache.lucene.index.memory.MemoryIndex.createSearcher
       2  5.01% 62.42%      46 300152 java.lang.StrictMath.log
       3  3.05% 65.47%      28 300164 java.lang.ref.Finalizer.invokeFinalizeMethod

cat java.hprof.txt:

    JAVA PROFILE 1.0.1, created Thu Apr 28 11:38:27 2005
    Header for -agentlib:hprof (or -Xrunhprof) ASCII Output (J2SE 1.5 JVMTI based)
    @(#)jvm.hprof.txt 1.3 04/02/09
    Copyright (c) 2004 Sun Microsystems, Inc. All Rights Reserved.
    WARNING!
    This file format is under development, and is subject to change without notice.
    This file contains the following types of records:

    THREAD START
    THREAD END    mark the lifetime of Java threads
    TRACE         represents a Java stack trace. Each trace consists of a series of
                  stack frames. Other records refer to TRACEs to identify (1) where
                  object allocations have taken place, (2) the frames in which GC
                  roots were found, and (3) frequently executed methods.
    HEAP DUMP     is a complete snapshot of all live objects in the Java heap.
                  Following distinctions are made: ROOT (root set as determined by
                  GC), CLS (classes), OBJ (instances), ARR (arrays).
    SITES         is a sorted list of allocation sites. This identifies the most
                  heavily allocated object types, and the TRACE at which those
                  allocations occurred.
    CPU SAMPLES   is a statistical profile of program execution. The VM periodically
                  samples all running threads, and assigns a quantum to active
                  TRACEs in those threads.
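A hedged sketch of the proposed reshaping (XYZReader is the placeholder name from the proposal above, not a real class; the checked exception is assumed):

    // base class no longer declares finalize(), so ordinary subclasses such as
    // MemoryIndexReader never register a finalizer with the JVM at all
    public abstract class IndexReader {
      protected void doFinalize() throws java.io.IOException {
        // release the write lock, close streams - whatever finalize() used to do
      }
    }

    // only subclasses that genuinely need finalization opt back in explicitly:
    public class XYZReader extends IndexReader {
      protected void finalize() throws Throwable {
        try {
          doFinalize();
        } finally {
          super.finalize();
        }
      }
    }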
Re: [Performance] Streaming main memory indexing of single strings
Whichever place you settle on is fine with me. [In case it might make a difference: Just note that MemoryIndex has a small auxiliary dependency on PatternAnalyzer in addField() because the Analyzer superclass doesn't have a tokenStream(String fieldName, String text) method. And PatternAnalyzer requires JDK 1.4 or higher] Wolfgang.

On Apr 27, 2005, at 9:22 AM, Doug Cutting wrote: Erik Hatcher wrote: I'm not quite sure where to put MemoryIndex - maybe it deserves to stand on its own in a new contrib area? That sounds good to me. Or does it make sense to put this into misc (still in sandbox/misc)? Or where? Isn't the goal for sandbox/ to go away, replaced with contrib/? Doug
Re: [Performance] Streaming main memory indexing of single strings
OK. I'll send an update as soon as I get round to it... Wolfgang.

On Apr 27, 2005, at 12:22 PM, Doug Cutting wrote: Erik Hatcher wrote: I'm not quite sure where to put MemoryIndex - maybe it deserves to stand on its own in a new contrib area? That sounds good to me. Ok... once Wolfgang gives me one last round of updates (JUnit tests instead of main() and upgrading it to work with trunk) I'll do that. I had put it in miscellaneous but will create its own sub-contrib area instead. Or does it make sense to put this into misc (still in sandbox/misc)? Or where? Isn't the goal for sandbox/ to go away, replaced with contrib/? Yes. In fact, I moved the last relevant piece (sandbox/contributions/miscellaneous) to contrib last night. I think both the parsers and XML-Indexing-Demo found in the sandbox are not worth preserving. Anyone feel that these pieces left in the sandbox should be preserved? Erik
Re: [Performance] Streaming main memory indexing of single strings
I've uploaded slightly improved versions of the fast MemoryIndex contribution to http://issues.apache.org/bugzilla/show_bug.cgi?id=34585 along with another contrib - PatternAnalyzer. For a quick overview without downloading code, there's javadoc for it all at http://dsd.lbl.gov/nux/api/org/apache/lucene/index/memory/package-summary.html

I'm happy to maintain these classes externally as part of the Nux project. But from the preliminary discussion on the list some time ago I gathered there'd be some wider interest, hence I prepared the contribs for the community. What would be the next steps for taking this further, if any? Thanks, Wolfgang.

    /**
     * Efficient Lucene analyzer/tokenizer that preferably operates on a String rather than a
     * {@link java.io.Reader}, that can flexibly separate on a regular expression {@link Pattern}
     * (with behaviour identical to {@link String#split(String)}),
     * and that combines the functionality of
     * {@link org.apache.lucene.analysis.LetterTokenizer},
     * {@link org.apache.lucene.analysis.LowerCaseTokenizer},
     * {@link org.apache.lucene.analysis.WhitespaceTokenizer},
     * {@link org.apache.lucene.analysis.StopFilter} into a single efficient
     * multi-purpose class.
     * <p>
     * If you are unsure how exactly a regular expression should look like, consider
     * prototyping by simply trying various expressions on some test texts via
     * {@link String#split(String)}. Once you are satisfied, give that regex to
     * PatternAnalyzer. Also see <a target="_blank"
     * href="http://java.sun.com/docs/books/tutorial/extra/regex/">Java Regular Expression Tutorial</a>.
     * <p>
     * This class can be considerably faster than the "normal" Lucene tokenizers.
     * It can also serve as a building block in a compound Lucene
     * {@link org.apache.lucene.analysis.TokenFilter} chain. For example as in this
     * stemming example:
     * <pre>
     * PatternAnalyzer pat = ...
     * TokenStream tokenStream = new SnowballFilter(
     *     pat.tokenStream("content", "James is running round in the woods"),
     *     "English");
     * </pre>
     */

On Apr 22, 2005, at 1:53 PM, Wolfgang Hoschek wrote: I've now got the contrib code cleaned up, tested and documented into a decent state, ready for your review and comments. Consider this a formal contrib (Apache license is attached). The relevant files are attached to the following bug ID: http://issues.apache.org/bugzilla/show_bug.cgi?id=34585

For a quick overview without downloading code, there's some javadoc at http://dsd.lbl.gov/nux/api/org/apache/lucene/index/memory/package-summary.html

There are several small open issues listed in the javadoc and also inside the code. Thoughts? Comments?

I've also got small performance patches for various parts of Lucene core (not submitted yet). Taken together they lead to substantially improved performance for MemoryIndex, and most likely also for Lucene in general. Some of them are more involved than others. I'm now figuring out how much performance each of these contributes and how to propose potential integration - stay tuned for some follow-ups to this. The code as submitted would certainly benefit a lot from said patches, but they are not required for correct operation. It should work out of the box (currently only on 1.4.3 or lower). Try running

    cd lucene-cvs
    java org.apache.lucene.index.memory.MemoryIndexTest

with or without custom arguments to see it in action.
Before turning to a performance patch discussion I'd at this point rather be most interested in folks giving it a spin, comments on the API, or any other issues. Cheers, Wolfgang.

On Apr 20, 2005, at 11:26 AM, Wolfgang Hoschek wrote: On Apr 20, 2005, at 9:22 AM, Erik Hatcher wrote: On Apr 20, 2005, at 12:11 PM, Wolfgang Hoschek wrote: By the way, by now I have a version against 1.4.3 that is 10-100 times faster (i.e. 3 - 20 index+query steps/sec) than the simplistic RAMDirectory approach, depending on the nature of the input data and query. From some preliminary testing it returns exactly what RAMDirectory returns.

Awesome. Using the basic StringIndexReader I sent?

Yep, it's loosely based on the empty skeleton you sent. I've been fiddling with it a bit more to get other query types. I'll add it to the contrib area when it's a bit more robust. Perhaps we could merge up once I'm ready and put that into the contrib area? My version now supports tokenization with any analyzer and it supports any arbitrary Lucene query. I might make the API for adding terms a little more general, perhaps allowing arbitrary Document objects if that's what other folks really need...

As an aside, is there any work going on to potentially support prefix (and infix) wild card queries a la *fish?

WildcardQuery supports wildcard characters anywhere in the string. QueryParser itself restricts expressions that have leading
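A hedged usage sketch of PatternAnalyzer as documented in the javadoc above (the three-argument constructor - split pattern, lower-casing flag, stop-word set - is assumed from that description, not confirmed here):

    import java.util.regex.Pattern;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.index.memory.PatternAnalyzer;

    // split on whitespace, lower-case all tokens, no stop words
    PatternAnalyzer analyzer =
        new PatternAnalyzer(Pattern.compile("\\s+"), true, null);
    TokenStream stream =
        analyzer.tokenStream("content", "James is running round in the woods");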
Re: [Performance] Streaming main memory indexing of single strings
I've now got the contrib code cleaned up, tested and documented into a decent state, ready for your review and comments. Consider this a formal contrib (Apache license is attached). The relevant files are attached to the following bug ID: http://issues.apache.org/bugzilla/show_bug.cgi?id=34585

For a quick overview without downloading code, there's some javadoc at http://dsd.lbl.gov/nux/api/org/apache/lucene/index/memory/package-summary.html

There are several small open issues listed in the javadoc and also inside the code. Thoughts? Comments?

I've also got small performance patches for various parts of Lucene core (not submitted yet). Taken together they lead to substantially improved performance for MemoryIndex, and most likely also for Lucene in general. Some of them are more involved than others. I'm now figuring out how much performance each of these contributes and how to propose potential integration - stay tuned for some follow-ups to this. The code as submitted would certainly benefit a lot from said patches, but they are not required for correct operation. It should work out of the box (currently only on 1.4.3 or lower). Try running

    cd lucene-cvs
    java org.apache.lucene.index.memory.MemoryIndexTest

with or without custom arguments to see it in action.

Before turning to a performance patch discussion I'd at this point rather be most interested in folks giving it a spin, comments on the API, or any other issues. Cheers, Wolfgang.

On Apr 20, 2005, at 11:26 AM, Wolfgang Hoschek wrote: On Apr 20, 2005, at 9:22 AM, Erik Hatcher wrote: On Apr 20, 2005, at 12:11 PM, Wolfgang Hoschek wrote: By the way, by now I have a version against 1.4.3 that is 10-100 times faster (i.e. 3 - 20 index+query steps/sec) than the simplistic RAMDirectory approach, depending on the nature of the input data and query. From some preliminary testing it returns exactly what RAMDirectory returns.

Awesome. Using the basic StringIndexReader I sent?

Yep, it's loosely based on the empty skeleton you sent. I've been fiddling with it a bit more to get other query types. I'll add it to the contrib area when it's a bit more robust. Perhaps we could merge up once I'm ready and put that into the contrib area? My version now supports tokenization with any analyzer and it supports any arbitrary Lucene query. I might make the API for adding terms a little more general, perhaps allowing arbitrary Document objects if that's what other folks really need...

As an aside, is there any work going on to potentially support prefix (and infix) wild card queries a la *fish?

WildcardQuery supports wildcard characters anywhere in the string. QueryParser itself restricts expressions that have leading wildcards from being accepted. Any particular reason for this restriction? Is this simply a current parser limitation or something inherent? QueryParser supports wildcard characters in the middle of strings no problem though. Are you seeing otherwise?

I meant an infix query such as *fish* Wolfgang.

---
Wolfgang Hoschek                  | email: [EMAIL PROTECTED]
Distributed Systems Department    | phone: (415)-533-7610
Berkeley Laboratory               | http://dsd.lbl.gov/~hoschek/
---
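For reference, a hedged sketch of constructing such an infix query programmatically, bypassing QueryParser's leading-wildcard restriction (the field name is invented):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.WildcardQuery;

    // WildcardQuery itself accepts a leading wildcard; only QueryParser rejects it
    Query infix = new WildcardQuery(new Term("abstract", "*fish*"));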
Re: [Performance] Streaming main memory indexing of single strings
Good point. By the way, by now I have a version against 1.4.3 that is 10-100 times faster (i.e. 3 - 20 index+query steps/sec) than the simplistic RAMDirectory approach, depending on the nature of the input data and query. From some preliminary testing it returns exactly what RAMDirectory returns. I'll do some cleanup and documentation and then post this to the list for review RSN. As an aside, is there any work going on to potentially support prefix (and infix) wild card queries a la *fish? Wolfgang.

On Apr 20, 2005, at 6:10 AM, Vanlerberghe, Luc wrote: One reason to choose the 'simplistic IndexReader' approach to this problem over regexes is that the result should be 'bug-compatible' with a standard search over all documents. Differences between the two systems would be difficult to explain to an end-user (let alone for the developer to debug and find the reason in the first place!) Luc

-Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Saturday, April 16, 2005 2:09 AM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings

On Apr 15, 2005, at 6:15 PM, Wolfgang Hoschek wrote: Cool! For my use case it would need to be able to handle arbitrary queries (previously parsed from a general lucene query string). Something like:

    float match(String text, Query query)

It's fine with me if it also works for

    float[] match(String[] texts, Query query)

or

    float match(Document doc, Query query)

but that isn't required by the use case.

My implementation is nearly that. The score is available as hits.score(0). You would also need an analyzer, I presume, passed to your proposed match() method if you want the text broken into terms. My current implementation is passed a String[] where each item is considered a term for the document. match() would also need a field name to be fully accurate - since the analyzer needs a field name and terms used for searching need a field name. The Query may contain terms for any number of fields - how should that be handled? Should only a single field name be passed in and any terms requested for other fields be ignored? Or should this utility morph to assume any words in the text are in any field being asked of it?

As for Doug's devil's advocate questions - I really don't know what I'd use it for personally (other than the "match this single string against a bunch of queries" case), I just thought it was clever that it could be done. Clever regexes could come close, but it'd be a lot more effort than reusing good ol' QueryParser and this simplistic IndexReader, along with an Analyzer. Erik

Wolfgang.

I am intrigued by this and decided to mock a quick and dirty example of such an IndexReader. After a little trial-and-error I got it working at least for TermQuery and WildcardQuery. I've pasted my code below as an example, but there is much room for improvement, especially in terms of performance and also in keeping track of term frequency, and also it would be nicer if it handled the analysis internally. I think something like this would make a handy addition to our contrib area at least. I'd be happy to receive improvements to this and then add it to a contrib subproject. Perhaps this would be a handy way to handle situations where users have queries saved in a system and need to be alerted whenever a new document arrives matching the saved queries?
Erik

-Original Message- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 4:04 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings

This seems to be a promising avenue worth exploring. My gut feeling is that this could easily be 10-100 times faster. The drawback is that it requires a fair amount of understanding of intricate Lucene internals, pulling those pieces together and adapting them as required for the seemingly simple float match(String text, Query query). I might give it a shot but I'm not sure I'll be able to pull this off! Is there any similar code I could look at as a starting point? Wolfgang.

On Apr 14, 2005, at 1:13 PM, Robert Engels wrote: I think you are not approaching this the correct way. Pseudo code (a sketch of the tokenization steps follows at the end of this message):

1. Subclass IndexReader.
2. Get tokens from String 'document' using Lucene analyzers.
3. Build simple hash-map based data structures using tokens for terms, and term positions.
4. Reimplement termDocs() and termPositions() to use the structures from above.
5. Run searches.
6. Start again with next document.

-Original Message- From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED] Sent: Thursday, April 14, 2005 2:56 PM To: java-dev@lucene.apache.org Subject: Re: [Performance] Streaming main memory indexing of single strings

Otis, this might be a misunderstanding. - I'm not calling optimize(). That piece is commented out if you look again at the code. - The *streaming* use case requires that for each query I add one
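As promised above, a hedged sketch of steps 2-3 of Robert's pseudo code against the 1.4.3-era analysis API: tokenize one String 'document' and build the term-to-positions map that reimplemented termDocs()/termPositions() could serve from (field name and variable names invented for illustration):

    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.HashMap;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    // term text -> ArrayList of Integer positions, for a single in-memory document
    HashMap termPositions = new HashMap();
    TokenStream stream = analyzer.tokenStream("content", new StringReader(text));
    int pos = -1;
    for (Token t = stream.next(); t != null; t = stream.next()) {
      pos += t.getPositionIncrement(); // honor position increments (stop words etc.)
      ArrayList positions = (ArrayList) termPositions.get(t.termText());
      if (positions == null) {
        positions = new ArrayList();
        termPositions.put(t.termText(), positions);
      }
      positions.add(new Integer(pos));
    }
    stream.close();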