Hi Abe-san,

Actually, for the Tika transformation connector, there are TWO different mime types. One mime type represents what the connector generates. The other represents what the connector can accept. This is true of all transformation connectors.
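A minimal sketch of that split, written in plain Java rather than against the actual ManifoldCF API. The names `TransformationSketch`, `OUTPUT_MIME_TYPE`, `acceptsInput`, and `isIndexable` are hypothetical and exist only for illustration. The sketch assumes, as discussed in this thread, that the stage always emits `text/plain;charset=utf-8` and currently accepts every incoming mime type:

```java
import java.util.function.Predicate;

// Hypothetical sketch; these names are NOT part of the ManifoldCF API.
final class TransformationSketch {
    // What this stage generates: always extracted text in UTF-8.
    static final String OUTPUT_MIME_TYPE = "text/plain;charset=utf-8";

    // What this stage can accept. Per the thread, everything is accepted
    // for now because the list of Tika-extractable types is not known.
    static boolean acceptsInput(String incomingMimeType) {
        return true;
    }

    // A document is worth processing only if this stage accepts its type
    // AND the downstream stage accepts the type this stage will emit.
    static boolean isIndexable(String incomingMimeType, Predicate<String> downstreamAccepts) {
        return acceptsInput(incomingMimeType) && downstreamAccepts.test(OUTPUT_MIME_TYPE);
    }

    public static void main(String[] args) {
        // Two made-up downstream stages: one takes plain text, one takes only PDF.
        Predicate<String> plainTextSink = OUTPUT_MIME_TYPE::equals;
        Predicate<String> pdfOnlySink = "application/pdf"::equals;
        System.out.println(isIndexable("application/pdf", plainTextSink));
        System.out.println(isIndexable("application/pdf", pdfOnlySink));
    }
}
```

The point of the sketch is that the incoming type (`application/pdf` here) is never what gets offered downstream; only the generated type is.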
Hope that helps.

Karl

On Tue, Aug 12, 2014 at 12:59 PM, Shinichiro Abe <shinichiro.ab...@gmail.com> wrote:

> Ok, I understand we specify 'text/plain;charset=utf-8' string temporarily
> so that we accept all kinds of mime types.
>
> Thanks,
> Shinichiro Abe
>
> 2014-08-13 1:25 GMT+09:00 Karl Wright <daddy...@gmail.com>:
>
> > bq. I have a question.
> > What is this? -> hard-coded mime type checks, "text/plain;charset=utf-8".
> > For what? This seems to be unnecessary.
> >
> > http://svn.apache.org/viewvc/manifoldcf/trunk/connectors/tika/connector/src/main/java/org/apache/manifoldcf/agents/transformation/tika/TikaExtractor.java?view=markup&#l156
> > http://svn.apache.org/viewvc/manifoldcf/trunk/connectors/tika/connector/src/main/java/org/apache/manifoldcf/agents/transformation/tika/TikaExtractor.java?view=markup&#l99
> >
> > Hi Abe-san,
> >
> > The idea is that the Tika extractor always confirms that the downstream
> > pipeline accepts text/plain;charset=utf-8 because that is what it always
> > outputs. On the upstream side, we should technically only accept documents
> > that Tika knows how to extract. Right now, we accept all kinds, because I
> > don't know what that list is.
> >
> > Karl
> >
> > On Tue, Aug 12, 2014 at 12:20 PM, Shinichiro Abe <shinichiro.ab...@gmail.com> wrote:
> >
> > > Hi Karl,
> > >
> > > I also confirmed that using a SJIS file attached on CONNECTORS-613,
> > > then the file was not garbled and could extract content and metadata
> > > properly by tika connector.
> > > Therefore currently we don't need to respin RC.
> > >
> > > I have a question.
> > > What is this? -> hard-coded mime type checks, "text/plain;charset=utf-8".
> > > For what? This seems to be unnecessary.
> > > http://svn.apache.org/viewvc/manifoldcf/trunk/connectors/tika/connector/src/main/java/org/apache/manifoldcf/agents/transformation/tika/TikaExtractor.java?view=markup&#l156
> > > http://svn.apache.org/viewvc/manifoldcf/trunk/connectors/tika/connector/src/main/java/org/apache/manifoldcf/agents/transformation/tika/TikaExtractor.java?view=markup&#l99
> > >
> > > Thanks,
> > > Shinichiro Abe
> > >
> > > 2014-08-13 1:09 GMT+09:00 Karl Wright <daddy...@gmail.com>:
> > >
> > > > Ok, I closed the ticket.
> > > >
> > > > So thanks, I think I'm now ready to vote +1.
> > > >
> > > > Karl
> > > >
> > > > On Tue, Aug 12, 2014 at 11:38 AM, Shinichiro Abe <shinichiro.ab...@gmail.com> wrote:
> > > >
> > > > > I apologize for the mistake, I forgot to configure tika connector in the
> > > > > job. I configured documentFilter and Metadata adjuster only.
> > > > > It works by adding tika connector, there is no problem. English pdf,
> > > > > Japanese pdf/xls are not garbled!
> > > > > I'm sorry! So we don't have to fix CONNECTORS-1008.
> > > > >
> > > > > Shinichiro Abe
> > > > >
> > > > > 2014-08-13 0:24 GMT+09:00 Karl Wright <daddy...@gmail.com>:
> > > > >
> > > > > > Ok, I've done some more experimentation, and confirmed that there is really
> > > > > > only ONE problem: in SolrJ or Solr. ManifoldCF is working perfectly.
> > > > > >
> > > > > > The ticket I created, CONNECTORS-1008, will therefore be postponed to MCF
> > > > > > 2.0. The workaround is to use the extracting update handler even when the
> > > > > > content has already been extracted on the MCF side. So we should open a
> > > > > > SOLR ticket, but there is no reason to respin the MCF release.
> > > > > >
> > > > > > Karl
> > > > > >
> > > > > > On Tue, Aug 12, 2014 at 10:18 AM, Karl Wright <daddy...@gmail.com> wrote:
> > > > > >
> > > > > > > So there are two problems. One problem is that the Tika Extractor is not
> > > > > > > doing the right thing (I think). The second problem is that valid
> > > > > > > characters are not being sent to Solr when SolrInputDocument is used.
> > > > > > >
> > > > > > > Karl
> > > > > > >
> > > > > > > On Tue, Aug 12, 2014 at 10:15 AM, Shinichiro Abe <shinichiro.ab...@gmail.com> wrote:
> > > > > > >
> > > > > > >> Thanks Karl,
> > > > > > >>
> > > > > > >> When posting MCF's end-user-documentation.pdf(English) via standard update handler,
> > > > > > >> Solr throws an exception, this is a problem, I'm not sure why.
> > > > > > >> It works by leaving my pipeline to include Tika and using the extracting
> > > > > > >> update handler.
> > > > > > >> Solr's Tika version matches MCF's Tika one(1.5).
> > > > > > >>
> > > > > > >> 2014-08-12 23:10 GMT+09:00 Karl Wright <daddy...@gmail.com>:
> > > > > > >>
> > > > > > >> > It looks like the Tika content extraction is not actually producing valid
> > > > > > >> > utf-8. I'm not sure what it is producing, but that is the underlying
> > > > > > >> > problem.
> > > > > > >> >
> > > > > > >> > I'll create a ticket and look into it.
> > > > > > >> >
> > > > > > >> > Karl
> > > > > > >> >
> > > > > > >> > On Tue, Aug 12, 2014 at 9:52 AM, Karl Wright <daddy...@gmail.com> wrote:
> > > > > > >> >
> > > > > > >> > > Hi Abe-san,
> > > > > > >> > >
> > > > > > >> > > It looks to me like SolrJ when it uses SolrInputDocument cannot correctly
> > > > > > >> > > post some kinds of characters.
> > > > > > >> > > The exception is coming from inside Solr
> > > > > > >> > > itself -- not SolrJ. So I think a Solr ticket would be the right thing to
> > > > > > >> > > do here.
> > > > > > >> > >
> > > > > > >> > > Can you try leaving your pipeline to include Tika, but changing your Solr
> > > > > > >> > > connection to go back to using the extracting update handler? If that
> > > > > > >> > > works, then I think we have correctly diagnosed the problem.
> > > > > > >> > >
> > > > > > >> > > Thanks,
> > > > > > >> > > Karl
> > > > > > >> > >
> > > > > > >> > > On Tue, Aug 12, 2014 at 9:43 AM, Shinichiro Abe <shinichiro.ab...@gmail.com> wrote:
> > > > > > >> > >
> > > > > > >> > >> Hi Karl,
> > > > > > >> > >>
> > > > > > >> > >> The content field was garbled via /update and tika connector.
> > > > > > >> > >> Sample Docs: http://www.rondhuit.com/download.html#whitepaper
> > > > > > >> > >> My mcf-job was from filesystem:Japanese PDF,XLS to Solr.
> > > > > > >> > >>
> > > > > > >> > >> I was surprised that Solr threw an exception when
> > > > > > >> > >> en_US end-user-documentation.pdf
> > > > > > >> > >> was posted via tika connector. Files posted via /update/extract were not
> > > > > > >> > >> garbled and did not throw exceptions.
> > > > > > >> > >> Could you reproduce this?
> > > > > > >> > >>
> > > > > > >> > >> 2268394 [qtp1224864813-14] ERROR org.apache.solr.servlet.SolrDispatchFilter – null:java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #112515, byte #184319)
> > > > > > >> > >> at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
> > > > > > >> > >> at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
> > > > > > >> > >> at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
> > > > > > >> > >> at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
> > > > > > >> > >> at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:395)
> > > > > > >> > >> at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
> > > > > > >> > >> at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
> > > > > > >> > >> at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> > > > > > >> > >> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> > > > > > >> > >> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> > > > > > >> > >> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
> > > > > > >> > >> ...
> > > > > > >> > >> Caused by: java.io.CharConversionException: Invalid UTF-8 character 0xffff at char #112515, byte #184319)
> > > > > > >> > >> at com.ctc.wstx.io.UTF8Reader.reportInvalid(UTF8Reader.java:335)
> > > > > > >> > >> at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:249)
> > > > > > >> > >> at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
> > > > > > >> > >> at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
> > > > > > >> > >> at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
> > > > > > >> > >> at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
> > > > > > >> > >> at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628)
> > > > > > >> > >> at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
> > > > > > >> > >> at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
> > > > > > >> > >> at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
> > > > > > >> > >> ... 36 more
> > > > > > >> > >>
> > > > > > >> > >> Thanks,
> > > > > > >> > >> Shinichiro Abe
> > > > > > >> > >>
> > > > > > >> > >> 2014-08-12 22:24 GMT+09:00 Karl Wright <daddy...@gmail.com>:
> > > > > > >> > >>
> > > > > > >> > >> > I ran "ant rat-sources", and inspected the packages. All looks good. The
> > > > > > >> > >> > only comment is that the connector-lib area has grown by about 18MB this
> > > > > > >> > >> > cycle, and of course all the images for the Chinese documentation add
> > > > > > >> > >> > another 5MB, so our binary packages are now just about 200MB. I don't
> > > > > > >> > >> > think this is something we can do a lot about, though, except maybe by
> > > > > > >> > >> > repackaging so we release connectors independently of the framework.
> > > > > > >> > >> >
> > > > > > >> > >> > I'll give a final vote after I hear more back from Erlend and Abe-san.
> > > > > > >> > >> >
> > > > > > >> > >> > Thanks,
> > > > > > >> > >> > Karl
> > > > > > >> > >> >
> > > > > > >> > >> > On Tue, Aug 12, 2014 at 2:23 AM, Karl Wright <daddy...@gmail.com> wrote:
> > > > > > >> > >> >
> > > > > > >> > >> > > I request that the vote be left open at least until 8/21/2014, since 1.7
> > > > > > >> > >> > > is a major release and we want as many people to try it out as possible
> > > > > > >> > >> > > before declaring it complete. Thanks!
> > > > > > >> > >> > >
> > > > > > >> > >> > > Karl
> > > > > > >> > >> > >
> > > > > > >> > >> > > On Tue, Aug 12, 2014 at 12:44 AM, Shinichiro Abe <shinichiro.ab...@gmail.com> wrote:
> > > > > > >> > >> > >
> > > > > > >> > >> > >> Hi,
> > > > > > >> > >> > >>
> > > > > > >> > >> > >> +1 from me.
> > > > > > >> > >> > >>
> > > > > > >> > >> > >> -Checked SIGS, checksum by running check_signatures.sh.
> > > > > > >> > >> > >> -Checked that the code signing Key of Mingchun is available online.
> > > > > > >> > >> > >>
> > > > > > >> > >> > >> Shinichiro Abe
> > > > > > >> > >> > >>
> > > > > > >> > >> > >> On 2014/08/12, at 12:13, Mingchun Zhao <mingchun.zha...@gmail.com> wrote:
> > > > > > >> > >> > >>
> > > > > > >> > >> > >> > Hi all,
> > > > > > >> > >> > >> >
> > > > > > >> > >> > >> > Please vote on whether to release the ManifoldCF, version 1.7, RC0.
> > > > > > >> > >> > >> >
> > > > > > >> > >> > >> > You can find the artifact at:
> > > > > > >> > >> > >> >
> > > > > > >> > >> > >> > http://people.apache.org/~mingchun/apache-manifoldcf-1.7-RC0
> > > > > > >> > >> > >> >
> > > > > > >> > >> > >> > There is also a tag at:
> > > > > > >> > >> > >> >
> > > > > > >> > >> > >> > https://svn.apache.org/repos/asf/manifoldcf/tags/release-1.7-RC0
> > > > > > >> > >> > >> >
> > > > > > >> > >> > >> > Vote will remain open at least 72 hours.
> > > > > > >> > >> > >> >
> > > > > > >> > >> > >> > Thanks!
> > > > > > >> > >> > >> > Mingchun Zhao
> > > > > > >> > >>
> > > > > > >> > >> --
> > > > > > >> > >> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> > > > > > >> > >> Shinichiro Abe
> > > > > > >> > >> 阿部 慎一朗
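
A side note on the stack trace quoted earlier in this thread, where the XML update path reported `Invalid UTF-8 character 0xffff`: XML 1.0 does not allow U+FFFF (or U+FFFE, unpaired surrogates, and most control characters) anywhere in a document. The sketch below is plain Java with hypothetical names (`XmlCharCheck`, `isXml10Char`, `firstIllegalIndex`); it is not taken from ManifoldCF, SolrJ, or Solr. It only shows one way to check a string for such characters before posting it as XML. Whether this check would have helped here is an assumption, since the thread concludes that the job was simply missing the Tika connector.

```java
// Hypothetical helper; not part of ManifoldCF, SolrJ, or Solr.
final class XmlCharCheck {
    // True if the code point is allowed by the XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXml10Char(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Returns the index (in UTF-16 code units) of the first character that
    // XML 1.0 does not allow, or -1 if the whole string is acceptable.
    static int firstIllegalIndex(String text) {
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            if (!isXml10Char(codePoint)) {
                return i;
            }
            i += Character.charCount(codePoint);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstIllegalIndex("plain text"));
        System.out.println(firstIllegalIndex("ab\uFFFFcd"));
    }
}
```

A result other than -1 would suggest the content is not clean extracted text, for example raw binary that never went through an extractor.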