Re: Command Line Indexer
On 9/18/2018 2:21 PM, Christopher Schultz wrote:
> AIUI, Solr doesn't support updating a single field in a document. The
> document is replaced no matter how hard to try to be surgical about
> updating a single field.

Solr does have Atomic Update functionality. For this to work, the index must be appropriately configured. Many indexes do not qualify.

Atomic Updates let the user send a request that is basically an update for individual fields rather than the full document. Solr will read the existing index data and translate that request internally to a full document update. The user thinks they are just updating a portion of the document, but Solr still indexes the whole thing.

There is also the In-Place Update feature, which is a lot closer to localized surgery, as it involves rewriting a portion of the docValues file for a segment, not indexing a new document. The field definition requirements for this are pretty extreme -- docValues ONLY. Depending on the size of the segment containing the document, this might be slower than simply indexing the full document again.

Thanks,
Shawn
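To make the Atomic Update behaviour described above a bit more concrete, here is a minimal Python sketch. It is only a model of the idea, not Solr's code: the function name apply_atomic_update and the sample document are made up for illustration, and only the set, add, and inc modifiers are modelled. The update dict has the same shape as the JSON that would be posted to a collection's /update handler.

```python
from copy import deepcopy


def apply_atomic_update(stored_doc, update, key="id"):
    """Model of an atomic update: merge field modifiers into the stored
    document and return the FULL document that would then be reindexed.

    Hypothetical helper for illustration only, not Solr's implementation.
    """
    if stored_doc[key] != update[key]:
        raise ValueError("update targets a different document")

    merged = deepcopy(stored_doc)
    for field, modifiers in update.items():
        if field == key:
            continue
        if not isinstance(modifiers, dict):
            raise ValueError(f"field {field!r} needs a modifier such as 'set'")
        for op, operand in modifiers.items():
            if op == "set":
                # Setting a field to null removes it.
                if operand is None:
                    merged.pop(field, None)
                else:
                    merged[field] = operand
            elif op == "add":
                current = merged.get(field, [])
                if not isinstance(current, list):
                    current = [current]
                extra = operand if isinstance(operand, list) else [operand]
                merged[field] = current + extra
            elif op == "inc":
                # A missing field is treated as 0 in this sketch.
                merged[field] = merged.get(field, 0) + operand
            else:
                raise ValueError(f"modifier {op!r} is not modelled here")
    return merged


stored = {"id": "book1", "title": "Old title", "tags": ["a"], "copies": 3}
update = {
    "id": "book1",
    "title": {"set": "New title"},
    "tags": {"add": "b"},
    "copies": {"inc": 2},
}
full_doc = apply_atomic_update(stored, update)
```

The request names three fields, but the result is a complete document. That is the point made above: the whole document is indexed again even though the client only sent the changes.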
Re: Command Line Indexer
Yup, thanks for the clarification. I see now that some of the items I list in 2 are moot.

On Tue, Sep 18, 2018 at 4:16 PM Alexandre Rafalovitch wrote:
> Uhm, inline:
> [...]
Re: Command Line Indexer
Oops, premature send.

But basically, nearly all the items in your list are either things the CSV update handler can already do, things an UpdateRequestProcessor (URP) can already do, or things for which a URP would be a good place to inject a plugin. E.g.
http://lucene.apache.org/solr/guide/7_4/update-request-processors.html#templateupdateprocessorfactory

Not that I am saying your project has no place to exist. I am just saying that it would benefit from a higher-level explanation that clearly differentiates it from what Solr already does.

Regards,
Alex.

On 18 September 2018 at 17:16, Alexandre Rafalovitch wrote:
> Uhm, inline:
> [...]
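For the full_name example, a rough Python sketch of what the linked TemplateUpdateProcessorFactory page describes. The request parameters (processor=template and template.field=dest:pattern) follow the form shown in the reference guide. The helper names and the local expand_template function are made up for illustration. That function only imitates the placeholder substitution, and replacing a missing field with an empty string is an assumption of the sketch, not documented Solr behaviour.

```python
import re
from urllib.parse import urlencode


def template_params(dest_field, template):
    """Request parameters asking Solr to fill dest_field from a template."""
    return [
        ("processor", "template"),
        ("template.field", f"{dest_field}:{template}"),
    ]


def expand_template(template, doc):
    """Local imitation of placeholder substitution (illustration only).

    Missing fields become empty strings here; that is an assumption.
    """
    return re.sub(r"\{(\w+)\}", lambda m: str(doc.get(m.group(1), "")), template)


template = "{f_name} {m_name} {l_name}"
params = template_params("full_name", template)
query_string = urlencode(params)  # appended to the update request URL

doc = {"f_name": "Ada", "m_name": "M", "l_name": "Lovelace"}
full_name = expand_template(template, doc)
```

The query string would be appended to an ordinary update request, so the combination happens on the Solr side and needs no custom Java plugin.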
Re: Command Line Indexer
Uhm, inline:

On 18 September 2018 at 17:05, Dan Brown wrote:
> 1. Thank you.
>
> 2. I think this is what you're looking for. You'd be able to be more
> specific than with bin/post. For instance:
> a. specify the CSV delimiter, CSV quote character, and multivalued field
> delimiter

http://lucene.apache.org/solr/guide/7_4/uploading-data-with-index-handlers.html
separator - (global and field local for multivalued)
encapsulator - for CSV quote characters

> b. the dynamic-fields feature let's you write plugins in Java to define
> values (very simple example: combine field values f_name, m_name, l_name to
> populate a full_name field)

UpdateRequestProcessors. Your example specifically:

> c. specify field order for mapping onto SOLR fields, data types, date
> formats of source data; perhaps your CSV headers/JSON keys don't cleanly
> map to SOLR field names
> d. flag whether the first row of a CSV is the header and should not be
> indexed
> e. use literal values - e.g., instead of having to alter the source data to
> have a column whose value is "foo" you can configure a field to always have
> the same literal value for all documents
> f. set the number of times to retry when there is an error and the amount
> of time between retries (e.g., sometimes zk was not consistently responsive)
> g. skip fields - e.g., your data have 10 columns but you only want to index
> columns 1, 3, 5, and 9
> h. send soft commits after a specified number of batches
> i. combine fields to generate the uniqueKey value
>
> 3. Yes, atomic updates. For instance, index data using DIH then use this
> index to provide additional values to fields in those documents (e.g.,
> maybe the extra data come from a different data source like BigQuery).
> [...]
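As a rough illustration of the CSV handler parameters mentioned above, here is a small Python sketch that builds an update URL. The parameter names (separator, encapsulator, header, f.<field>.split, f.<field>.separator, skip, literal.<field>) are taken from the linked reference guide page. The helper name csv_update_url, the base URL, the collection name and the field names are all placeholders invented for the example.

```python
from urllib.parse import urlencode


def csv_update_url(base_url, collection, *, separator=",", encapsulator='"',
                   header=True, split_fields=None, skip=None, literals=None):
    """Build a URL for posting CSV to a collection's /update handler.

    Hypothetical helper; only the parameter names come from the Solr guide.
    """
    params = [
        ("separator", separator),        # CSV delimiter
        ("encapsulator", encapsulator),  # CSV quote character
        ("header", "true" if header else "false"),  # first row holds field names
    ]
    # Per-field splitting for multivalued fields.
    for field, field_separator in (split_fields or {}).items():
        params.append((f"f.{field}.split", "true"))
        params.append((f"f.{field}.separator", field_separator))
    if skip:
        params.append(("skip", ",".join(skip)))  # columns not to index
    # Constant value added to every document.
    for field, value in (literals or {}).items():
        params.append((f"literal.{field}", value))
    return f"{base_url}/{collection}/update?{urlencode(params)}"


url = csv_update_url(
    "http://localhost:8983/solr",  # assumed local Solr
    "books",                       # assumed collection name
    separator="|",
    split_fields={"tags": ";"},
    skip=["notes"],
    literals={"source": "foo"},
)
```

The CSV file itself would then be sent as the request body with a CSV content type, which is roughly what bin/post does when given a .csv file.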
Re: Command Line Indexer
1. Thank you.

2. I think this is what you're looking for. You'd be able to be more specific than with bin/post. For instance:
   a. specify the CSV delimiter, CSV quote character, and multivalued field delimiter
   b. the dynamic-fields feature lets you write plugins in Java to define values (very simple example: combine field values f_name, m_name, l_name to populate a full_name field)
   c. specify field order for mapping onto Solr fields, data types, date formats of source data; perhaps your CSV headers/JSON keys don't cleanly map to Solr field names
   d. flag whether the first row of a CSV is the header and should not be indexed
   e. use literal values - e.g., instead of having to alter the source data to have a column whose value is "foo" you can configure a field to always have the same literal value for all documents
   f. set the number of times to retry when there is an error and the amount of time between retries (e.g., sometimes ZooKeeper was not consistently responsive)
   g. skip fields - e.g., your data have 10 columns but you only want to index columns 1, 3, 5, and 9
   h. send soft commits after a specified number of batches
   i. combine fields to generate the uniqueKey value

3. Yes, atomic updates. For instance, index data using DIH, then use this indexer to provide additional values to fields in those documents (e.g., maybe the extra data come from a different data source like BigQuery).

I hope this brings more clarity to this tool's features and answers all your questions. Please ask if anyone has more.

Dan

On Tue, Sep 18, 2018 at 3:21 PM Christopher Schultz <ch...@christopherschultz.net> wrote:
> Dan,
> [...]
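A few of the items in the list above (skipping columns, literal values, a combined uniqueKey, retries) can be sketched in a few lines of Python. This is not how solr-indexer implements them; the function names, arguments and sample row are hypothetical and only meant to show the kind of client-side preparation being described.

```python
import time


def row_to_document(row, *, keep, literals, key_fields, key_name="id", key_sep="-"):
    """Turn one parsed CSV row (a dict) into a document to send to Solr."""
    doc = {name: row[name] for name in keep}  # skip every column not listed
    doc.update(literals)                      # same literal value on every doc
    # Combine several source fields into the uniqueKey value.
    doc[key_name] = key_sep.join(str(row[name]) for name in key_fields)
    return doc


def with_retries(action, *, retries, wait_seconds, sleep=time.sleep):
    """Call action(); on failure wait and try again, up to `retries` extra times."""
    attempt = 0
    while True:
        try:
            return action()
        except Exception:
            if attempt >= retries:
                raise
            attempt += 1
            sleep(wait_seconds)


row = {"isbn": "123", "edition": 2, "title": "Solr", "notes": "internal"}
doc = row_to_document(
    row,
    keep=["title"],
    literals={"source": "foo"},
    key_fields=["isbn", "edition"],
)
```

In a real indexer the action passed to with_retries would be the call that sends a batch to Solr. Here it is left abstract so the sketch does not assume any particular client library.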
Re: Command Line Indexer
Dan,

On 9/18/18 2:51 PM, Dan Brown wrote:
> I've been working on this for a while and it's finally in a state
> where it's ready for public consumption.
>
> This is a command line indexer that will index CSV or JSON
> documents: https://github.com/likethecolor/solr-indexer
>
> There are quite a few parameters/options that can be set.
>
> One thing to note is that it will update individual fields. That
> is, unlike the Data Import Handler, it does not replace entire
> documents.
>
> Please check it out and let me know what you think.

How is this different from the bin/post tool that ships with Solr?

Or is that what you meant when you said "this is unlike the Data Import Handler"?

AIUI, Solr doesn't support updating a single field in a document. The document is replaced no matter how hard you try to be surgical about updating a single field.

-chris
Re: Command Line Indexer
1. Congrats!

2. How is this different from bin/post? CSV and JSON are both supported formats. I am sure it is very clear to you, but to a visitor - not so much.

3. What is the significance of "replace just the field"? Is that an atomic update? Similar to AtomicUpdateProcessorFactory? What is the use-case?

Basically, what is the business/use-case for the tool, as opposed to all the technical parameters, one by one?

Regards,
Alex.

On 18 September 2018 at 14:51, Dan Brown wrote:
> I've been working on this for a while and it's finally in a state where
> it's ready for public consumption.
>
> This is a command line indexer that will index CSV or JSON documents:
> https://github.com/likethecolor/solr-indexer
>
> There are quite a few parameters/options that can be set.
>
> One thing to note is that it will update individual fields. That is,
> unlike the Data Import Handler, it does not replace entire documents.
>
> Please check it out and let me know what you think.
>
> Dan