I'm using the CSV indexer to write Nutch data, but the nutch.csv file contains only the last thirteen lines. It seems the indexer is overwriting the file. I've read the Nutch CSV indexer documentation, but I haven't found any configuration related to this situation. Could someone help me get all the lines extracted by the parser? This is the log output and the index-writers.xml configuration:
org.apache.nutch.plugin.PluginManifestParser 2022-11-18 07:48:02,323 INFO o.a.n.p.PluginManifestParser [main] Plugins: looking in: /home/paulesco/Downloads/apache-nutch-1.19/plugins org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,753 INFO o.a.n.p.PluginRepository [main] Plugin Auto-activation mode: [true] org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,754 INFO o.a.n.p.PluginRepository [main] Registered Plugins: org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,755 INFO o.a.n.p.PluginRepository [main] Regex URL Filter (urlfilter-regex) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,755 INFO o.a.n.p.PluginRepository [main] Html Parse Plug-in (parse-html) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,755 INFO o.a.n.p.PluginRepository [main] HTTP Framework (lib-http) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,756 INFO o.a.n.p.PluginRepository [main] the nutch core extension points (nutch-extensionpoints) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,756 INFO o.a.n.p.PluginRepository [main] Basic Indexing Filter (index-basic) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,757 INFO o.a.n.p.PluginRepository [main] Anchor Indexing Filter (index-anchor) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,757 INFO o.a.n.p.PluginRepository [main] Tika Parser Plug-in (parse-tika) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,758 INFO o.a.n.p.PluginRepository [main] Extractor based XML/HTML Parser/Indexing Filter (extractor) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,758 INFO o.a.n.p.PluginRepository [main] Basic URL Normalizer (urlnormalizer-basic) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,759 INFO o.a.n.p.PluginRepository [main] Regex URL Filter Framework (lib-regex-filter) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,760 INFO o.a.n.p.PluginRepository [main] Regex URL Normalizer 
(urlnormalizer-regex) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,760 INFO o.a.n.p.PluginRepository [main] CyberNeko HTML Parser (lib-nekohtml) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,761 INFO o.a.n.p.PluginRepository [main] URL Validator (urlfilter-validator) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,761 INFO o.a.n.p.PluginRepository [main] OPIC Scoring Plug-in (scoring-opic) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,762 INFO o.a.n.p.PluginRepository [main] Pass-through URL Normalizer (urlnormalizer-pass) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,762 INFO o.a.n.p.PluginRepository [main] Http Protocol Plug-in (protocol-http) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,763 INFO o.a.n.p.PluginRepository [main] CSVIndexWriter (indexer-csv) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,763 INFO o.a.n.p.PluginRepository [main] Registered Extension-Points: org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,764 INFO o.a.n.p.PluginRepository [main] (Nutch Content Parser) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,764 INFO o.a.n.p.PluginRepository [main] (Nutch URL Filter) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,765 INFO o.a.n.p.PluginRepository [main] (HTML Parse Filter) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,765 INFO o.a.n.p.PluginRepository [main] (Nutch Scoring) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,766 INFO o.a.n.p.PluginRepository [main] (Nutch URL Normalizer) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,766 INFO o.a.n.p.PluginRepository [main] (Nutch Publisher) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,767 INFO o.a.n.p.PluginRepository [main] (Nutch Exchange) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,767 INFO o.a.n.p.PluginRepository [main] (Nutch Protocol) org.apache.nutch.plugin.PluginRepository 
2022-11-18 07:48:02,768 INFO o.a.n.p.PluginRepository [main] (Nutch URL Ignore Exemption Filter) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,768 INFO o.a.n.p.PluginRepository [main] (Nutch Index Writer) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,769 INFO o.a.n.p.PluginRepository [main] (Nutch Segment Merge Filter) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:02,769 INFO o.a.n.p.PluginRepository [main] (Nutch Indexing Filter) org.apache.nutch.crawl.DeduplicationJob 2022-11-18 07:48:02,778 INFO o.a.n.c.DeduplicationJob [main] DeduplicationJob: starting at 2022-11-18 07:48:02 org.apache.nutch.crawl.DeduplicationJob 2022-11-18 07:48:05,628 INFO o.a.n.c.DeduplicationJob [main] Deduplication: 0 documents marked as duplicates org.apache.nutch.crawl.DeduplicationJob 2022-11-18 07:48:05,629 INFO o.a.n.c.DeduplicationJob [main] Deduplication: Updating status of duplicate urls into crawl db. org.apache.nutch.crawl.DeduplicationJob 2022-11-18 07:48:06,996 INFO o.a.n.c.DeduplicationJob [main] Deduplication finished at 2022-11-18 07:48:06, elapsed: 00:00:04 Indexing 20221118074241 to index /home/paulesco/Downloads/apache-nutch-1.19/bin/nutch index -Dmapreduce.job.reduces=2 -Dmapreduce.reduce.speculative=false -Dmapreduce.map.speculative=false -Dmapreduce.map.output.compress=true /home/paulesco/Downloads/apache-nutch-1.19/crawl/crawldb -linkdb /home/paulesco/Downloads/apache-nutch-1.19/crawl/linkdb /home/paulesco/Downloads/apache-nutch-1.19/crawl/segments/20221118074241 -deleteGone SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/home/paulesco/Downloads/apache-nutch-1.19/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/home/paulesco/Downloads/apache-nutch-1.19/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. 
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory] org.apache.nutch.plugin.PluginManifestParser 2022-11-18 07:48:09,623 INFO o.a.n.p.PluginManifestParser [main] Plugins: looking in: /home/paulesco/Downloads/apache-nutch-1.19/plugins org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,111 INFO o.a.n.p.PluginRepository [main] Plugin Auto-activation mode: [true] org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,113 INFO o.a.n.p.PluginRepository [main] Registered Plugins: org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,114 INFO o.a.n.p.PluginRepository [main] Regex URL Filter (urlfilter-regex) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,114 INFO o.a.n.p.PluginRepository [main] Html Parse Plug-in (parse-html) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,115 INFO o.a.n.p.PluginRepository [main] HTTP Framework (lib-http) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,115 INFO o.a.n.p.PluginRepository [main] the nutch core extension points (nutch-extensionpoints) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,116 INFO o.a.n.p.PluginRepository [main] Basic Indexing Filter (index-basic) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,116 INFO o.a.n.p.PluginRepository [main] Anchor Indexing Filter (index-anchor) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,117 INFO o.a.n.p.PluginRepository [main] Tika Parser Plug-in (parse-tika) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,118 INFO o.a.n.p.PluginRepository [main] Extractor based XML/HTML Parser/Indexing Filter (extractor) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,118 INFO o.a.n.p.PluginRepository [main] Basic URL Normalizer (urlnormalizer-basic) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,119 INFO o.a.n.p.PluginRepository [main] Regex URL Filter Framework (lib-regex-filter) org.apache.nutch.plugin.PluginRepository 2022-11-18 
07:48:10,119 INFO o.a.n.p.PluginRepository [main] Regex URL Normalizer (urlnormalizer-regex) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,120 INFO o.a.n.p.PluginRepository [main] CyberNeko HTML Parser (lib-nekohtml) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,120 INFO o.a.n.p.PluginRepository [main] URL Validator (urlfilter-validator) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,121 INFO o.a.n.p.PluginRepository [main] OPIC Scoring Plug-in (scoring-opic) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,122 INFO o.a.n.p.PluginRepository [main] Pass-through URL Normalizer (urlnormalizer-pass) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,122 INFO o.a.n.p.PluginRepository [main] Http Protocol Plug-in (protocol-http) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,123 INFO o.a.n.p.PluginRepository [main] CSVIndexWriter (indexer-csv) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,123 INFO o.a.n.p.PluginRepository [main] Registered Extension-Points: org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,124 INFO o.a.n.p.PluginRepository [main] (Nutch Content Parser) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,124 INFO o.a.n.p.PluginRepository [main] (Nutch URL Filter) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,125 INFO o.a.n.p.PluginRepository [main] (HTML Parse Filter) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,125 INFO o.a.n.p.PluginRepository [main] (Nutch Scoring) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,126 INFO o.a.n.p.PluginRepository [main] (Nutch URL Normalizer) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,126 INFO o.a.n.p.PluginRepository [main] (Nutch Publisher) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,127 INFO o.a.n.p.PluginRepository [main] (Nutch Exchange) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,127 INFO 
o.a.n.p.PluginRepository [main] (Nutch Protocol) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,128 INFO o.a.n.p.PluginRepository [main] (Nutch URL Ignore Exemption Filter) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,128 INFO o.a.n.p.PluginRepository [main] (Nutch Index Writer) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,129 INFO o.a.n.p.PluginRepository [main] (Nutch Segment Merge Filter) org.apache.nutch.plugin.PluginRepository 2022-11-18 07:48:10,129 INFO o.a.n.p.PluginRepository [main] (Nutch Indexing Filter) org.apache.nutch.segment.SegmentChecker 2022-11-18 07:48:10,617 INFO o.a.n.s.SegmentChecker [main] Segment dir is complete: /home/paulesco/Downloads/apache-nutch-1.19/crawl/segments/20221118074241. org.apache.nutch.indexer.IndexingJob 2022-11-18 07:48:10,620 INFO o.a.n.i.IndexingJob [main] Indexer: starting at 2022-11-18 07:48:10 org.apache.nutch.indexer.IndexingJob 2022-11-18 07:48:10,634 INFO o.a.n.i.IndexingJob [main] Indexer: deleting gone documents: true org.apache.nutch.indexer.IndexingJob 2022-11-18 07:48:10,634 INFO o.a.n.i.IndexingJob [main] Indexer: URL filtering: false org.apache.nutch.indexer.IndexingJob 2022-11-18 07:48:10,635 INFO o.a.n.i.IndexingJob [main] Indexer: URL normalizing: false org.apache.nutch.indexer.IndexerMapReduce 2022-11-18 07:48:10,637 INFO o.a.n.i.IndexerMapReduce [main] IndexerMapReduce: crawldb: /home/paulesco/Downloads/apache-nutch-1.19/crawl/crawldb org.apache.nutch.indexer.IndexerMapReduce 2022-11-18 07:48:10,642 INFO o.a.n.i.IndexerMapReduce [main] IndexerMapReduces: adding segment: /home/paulesco/Downloads/apache-nutch-1.19/crawl/segments/20221118074241 org.apache.nutch.indexer.IndexerMapReduce 2022-11-18 07:48:10,644 INFO o.a.n.i.IndexerMapReduce [main] IndexerMapReduce: linkdb: /home/paulesco/Downloads/apache-nutch-1.19/crawl/linkdb org.apache.nutch.indexer.IndexWriters 2022-11-18 07:48:13,788 INFO o.a.n.i.IndexWriters [pool-5-thread-1] Index writer 
org.apache.nutch.indexwriter.csv.CSVIndexWriter identified. org.apache.nutch.exchange.Exchanges 2022-11-18 07:48:13,845 WARN o.a.n.e.Exchanges [pool-5-thread-1] No exchange was configured. The documents will be routed to all index writers. org.apache.nutch.indexwriter.csv.CSVIndexWriter$Separator 2022-11-18 07:48:13,848 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] separator = , org.apache.nutch.indexwriter.csv.CSVIndexWriter$Separator 2022-11-18 07:48:13,880 WARN o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] Separator quotechar must be a char, only the first character '"' of """ is used org.apache.nutch.indexwriter.csv.CSVIndexWriter$Separator 2022-11-18 07:48:13,880 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] quotechar = " org.apache.nutch.indexwriter.csv.CSVIndexWriter$Separator 2022-11-18 07:48:13,881 WARN o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] Separator escapechar must be a char, only the first character '"' of """ is used org.apache.nutch.indexwriter.csv.CSVIndexWriter$Separator 2022-11-18 07:48:13,881 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] escapechar = " org.apache.nutch.indexwriter.csv.CSVIndexWriter$Separator 2022-11-18 07:48:13,882 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] valuesep = | org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:13,883 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] maxfieldlength = 8096 org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:13,884 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] maxfieldvalues = 120 org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:13,885 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] fields = org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:13,886 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] id org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:13,887 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] company org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:13,887 INFO 
o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] date org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:13,888 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] jobTitle org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:13,888 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] jobDescription org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:13,888 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] location org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:13,889 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] json org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:13,890 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] Writing output to csvindexwriter org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:13,891 WARN o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] Removing existing output path csvindexwriter/nutch.csv org.apache.nutch.indexer.IndexerOutputFormat 2022-11-18 07:48:14,059 INFO o.a.n.i.IndexerOutputFormat [pool-5-thread-1] Active IndexWriters : CSVIndexWriter: ┌──────────────┬─────────────────────────────────────────────────────┬─────────────────────────────────────────────────────┐ │fields │Ordered list of fields (columns) in the CSV file │id,company,date,jobTitle,jobDescription,location,json│ ├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤ │separator │Separator between fields (columns), default: ,│, │ │ │(U+002C, comma) │ │ ├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤ │quotechar │Quote character used to quote fields containing│" │ │ │separators or quotes, default: " (U+0022, quotation│ │ │ │mark) │ │ ├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤ │escapechar │Escape character used to escape a quote character,│" │ │ │default: " (U+0022, quotation mark) │ │ 
├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤ │valuesep │Separator between multiple values of one field,│| │ │ │default: | (U+007C) │ │ ├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤ │maxfieldvalues│Max. number of values of one field, useful for, e.g.,│120 │ │ │the anchor texts field, default: 12 │ │ ├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤ │maxfieldlength│Max. length of a single field value in characters,│8096 │ │ │default: 4096 │ │ ├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤ │charset │Encoding of CSV file, default: UTF-8 │UTF-8 │ ├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤ │header │Write CSV column headers, default: true │true │ ├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤ │outpath │Output path / directory, default: csvindexwriter. 
│csvindexwriter │ └──────────────┴─────────────────────────────────────────────────────┴─────────────────────────────────────────────────────┘ org.apache.nutch.indexer.anchor.AnchorIndexingFilter 2022-11-18 07:48:14,079 INFO o.a.n.i.a.AnchorIndexingFilter [pool-5-thread-1] Anchor deduplication is: off WARNING: An illegal reflective access operation has occurred WARNING: Illegal reflective access by com.sun.xml.bind.v2.runtime.reflect.opt.Injector$1 (file:/home/paulesco/Downloads/apache-nutch-1.19/lib/jaxb-impl-2.2.3-1.jar) to method java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int) WARNING: Please consider reporting this to the maintainers of com.sun.xml.bind.v2.runtime.reflect.opt.Injector$1 WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations WARNING: All illegal access operations will be denied in a future release ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18 07:48:14,875 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1] Indexing: https://www.linkedin.com/jobs/view/administration-assistant-at-apple-3358665327?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=hPPT6HwfoeW5O5x3hD19Og%3D%3D&position=15&pageNum=0&trk=public_jobs_jserp-result_search-card ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18 07:48:14,891 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1] Indexing: https://www.linkedin.com/jobs/view/business-development-music-content-at-apple-3303474256?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=WixmspxoAN5LwMiK85fGTQ%3D%3D&position=13&pageNum=0&trk=public_jobs_jserp-result_search-card ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18 07:48:14,894 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1] Indexing: 
https://www.linkedin.com/jobs/view/business-marketing-and-g-a-internships-at-apple-3109770600?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=76Rvg5XTnq%2BMLXkyvInKEw%3D%3D&position=1&pageNum=0&trk=public_jobs_jserp-result_search-card ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18 07:48:14,898 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1] Indexing: https://www.linkedin.com/jobs/view/engineering-program-management-internship-at-apple-3178528752?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=AkNO4ulHoq2VdFGV8zrX7Q%3D%3D&position=14&pageNum=0&trk=public_jobs_jserp-result_search-card ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18 07:48:14,900 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1] Indexing: https://www.linkedin.com/jobs/view/executive-administrative-assistant-at-apple-3178549204?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=0tgIj1%2F3UsEYVTatO5k8AQ%3D%3D&position=5&pageNum=0&trk=public_jobs_jserp-result_search-card ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18 07:48:14,905 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1] Indexing: https://www.linkedin.com/jobs/view/full-stack-web-developer-early-career-at-apple-3178543696?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=ASc%2FwLZwb%2BWxgCMD98xZjA%3D%3D&position=10&pageNum=0&trk=public_jobs_jserp-result_search-card ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18 07:48:14,908 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1] Indexing: https://www.linkedin.com/jobs/view/instructional-designer-and-facilitator-at-apple-3311380419?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=8jWxwc90ubxidsR7yCUa8g%3D%3D&position=23&pageNum=0&trk=public_jobs_jserp-result_search-card ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18 07:48:14,912 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1] Indexing: 
https://www.linkedin.com/jobs/view/marketing-specialist-payments-at-apple-3295802145?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=moSai8myEFTiBHfy86ZdfQ%3D%3D&position=12&pageNum=0&trk=public_jobs_jserp-result_search-card ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18 07:48:14,916 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1] Indexing: https://www.linkedin.com/jobs/view/partner-relationship-manager-at-apple-3335905674?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=yQNQPxWYOe5pA2zSupCXhw%3D%3D&position=11&pageNum=0&trk=public_jobs_jserp-result_search-card ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18 07:48:14,918 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1] Indexing: https://www.linkedin.com/jobs/view/software-engineer-early-career-at-apple-3083602420?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=syVQzNeq4uvv%2BV%2FnE5pMjw%3D%3D&position=9&pageNum=0&trk=public_jobs_jserp-result_search-card ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18 07:48:14,921 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1] Indexing: https://www.linkedin.com/jobs/view/software-engineer-early-career-at-apple-3142389594?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=LtuRytaw2JrWIPBarIZPRA%3D%3D&position=8&pageNum=0&trk=public_jobs_jserp-result_search-card ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18 07:48:14,924 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1] Indexing: https://www.linkedin.com/jobs/view/software-engineer-early-career-at-apple-3165763449?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=d3A78tGewvInBwuE1TY97A%3D%3D&position=4&pageNum=0&trk=public_jobs_jserp-result_search-card org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:14,930 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] Finished CSV index in csvindexwriter/nutch.csv org.apache.nutch.indexwriter.csv.CSVIndexWriter$Separator 2022-11-18 
07:48:15,071 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] separator = , org.apache.nutch.indexwriter.csv.CSVIndexWriter$Separator 2022-11-18 07:48:15,072 WARN o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] Separator quotechar must be a char, only the first character '"' of """ is used org.apache.nutch.indexwriter.csv.CSVIndexWriter$Separator 2022-11-18 07:48:15,072 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] quotechar = " org.apache.nutch.indexwriter.csv.CSVIndexWriter$Separator 2022-11-18 07:48:15,073 WARN o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] Separator escapechar must be a char, only the first character '"' of """ is used org.apache.nutch.indexwriter.csv.CSVIndexWriter$Separator 2022-11-18 07:48:15,073 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] escapechar = " org.apache.nutch.indexwriter.csv.CSVIndexWriter$Separator 2022-11-18 07:48:15,074 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] valuesep = | org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:15,074 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] maxfieldlength = 8096 org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:15,074 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] maxfieldvalues = 120 org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:15,075 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] fields = org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:15,075 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] id org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:15,076 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] company org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:15,076 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] date org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:15,077 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] jobTitle org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:15,077 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] jobDescription 
org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:15,077 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] location org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:15,078 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] json org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:15,079 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] Writing output to csvindexwriter org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:15,080 WARN o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] Removing existing output path csvindexwriter/nutch.csv org.apache.nutch.indexer.IndexerOutputFormat 2022-11-18 07:48:15,117 INFO o.a.n.i.IndexerOutputFormat [pool-5-thread-1] Active IndexWriters : CSVIndexWriter: ┌──────────────┬─────────────────────────────────────────────────────┬─────────────────────────────────────────────────────┐ │fields │Ordered list of fields (columns) in the CSV file │id,company,date,jobTitle,jobDescription,location,json│ ├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤ │separator │Separator between fields (columns), default: ,│, │ │ │(U+002C, comma) │ │ ├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤ │quotechar │Quote character used to quote fields containing│" │ │ │separators or quotes, default: " (U+0022, quotation│ │ │ │mark) │ │ ├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤ │escapechar │Escape character used to escape a quote character,│" │ │ │default: " (U+0022, quotation mark) │ │ ├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤ │valuesep │Separator between multiple values of one field,│| │ │ │default: | (U+007C) │ │ 
├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤ │maxfieldvalues│Max. number of values of one field, useful for, e.g.,│120 │ │ │the anchor texts field, default: 12 │ │ ├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤ │maxfieldlength│Max. length of a single field value in characters,│8096 │ │ │default: 4096 │ │ ├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤ │charset │Encoding of CSV file, default: UTF-8 │UTF-8 │ ├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤ │header │Write CSV column headers, default: true │true │ ├──────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┤ │outpath │Output path / directory, default: csvindexwriter. 
│csvindexwriter │ └──────────────┴─────────────────────────────────────────────────────┴─────────────────────────────────────────────────────┘ ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18 07:48:15,154 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1] Indexing: https://www.linkedin.com/jobs/view/content-strategist-at-apple-3183050156?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=3n3SZTr2DDL%2BuLJG80tF5A%3D%3D&position=17&pageNum=0&trk=public_jobs_jserp-result_search-card ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18 07:48:15,158 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1] Indexing: https://www.linkedin.com/jobs/view/corporate-fp-a-financial-analyst-at-apple-3299573611?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=v9%2F3SUQVjBpc7kyqFpz%2BGw%3D%3D&position=16&pageNum=0&trk=public_jobs_jserp-result_search-card ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18 07:48:15,160 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1] Indexing: https://www.linkedin.com/jobs/view/customer-support-account-representative-at-apple-3276378529?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=mcqQ08GV2r%2BhQGjrKUBV3g%3D%3D&position=24&pageNum=0&trk=public_jobs_jserp-result_search-card ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18 07:48:15,164 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1] Indexing: https://www.linkedin.com/jobs/view/executive-assistant-at-apple-3343515422?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=6GofJN8fsMPysOPQF4p%2FVA%3D%3D&position=25&pageNum=0&trk=public_jobs_jserp-result_search-card ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18 07:48:15,168 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1] Indexing: 
https://www.linkedin.com/jobs/view/global-supply-manager-at-apple-3122122362?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=6gEcpGvSLAZQDo0J6CEP5w%3D%3D&position=18&pageNum=0&trk=public_jobs_jserp-result_search-card ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18 07:48:15,171 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1] Indexing: https://www.linkedin.com/jobs/view/instructional-designer-and-facilitator-at-apple-3320714845?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=2LtFgvgbFnFky52wmV6%2BVw%3D%3D&position=22&pageNum=0&trk=public_jobs_jserp-result_search-card ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18 07:48:15,173 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1] Indexing: https://www.linkedin.com/jobs/view/instructional-designer-at-apple-3299571683?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=1O2wuFrYl7seVDay0vY9Dg%3D%3D&position=21&pageNum=0&trk=public_jobs_jserp-result_search-card ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18 07:48:15,175 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1] Indexing: https://www.linkedin.com/jobs/view/jr-software-developer-c-c%2B%2B-at-apple-2995935448?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=OoO8lg0lxNY3lZsoKICCJQ%3D%3D&position=20&pageNum=0&trk=public_jobs_jserp-result_search-card ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18 07:48:15,178 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1] Indexing: https://www.linkedin.com/jobs/view/partner-success-manager-at-apple-3238337934?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=jkjzk0WHT79R40TGmVOTsA%3D%3D&position=3&pageNum=0&trk=public_jobs_jserp-result_search-card ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18 07:48:15,181 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1] Indexing: 
https://www.linkedin.com/jobs/view/people-operations-hris-analyst-at-apple-3217837096?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=Gusmq8ZxlihLpNTzAXfPdg%3D%3D&position=19&pageNum=0&trk=public_jobs_jserp-result_search-card ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18 07:48:15,184 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1] Indexing: https://www.linkedin.com/jobs/view/people-support-specialist-at-apple-3296942621?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=tdx1V7OXKAuLLt76scpuaQ%3D%3D&position=7&pageNum=0&trk=public_jobs_jserp-result_search-card ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18 07:48:15,187 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1] Indexing: https://www.linkedin.com/jobs/view/software-engineer-early-career-at-apple-2944352450?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=91p8jFJwx2KAh6bwE%2Bsv2Q%3D%3D&position=6&pageNum=0&trk=public_jobs_jserp-result_search-card ir.co.bayan.simorq.zal.extractor.nutch.ExtractorIndexingFilter 2022-11-18 07:48:15,190 DEBUG i.c.b.s.z.e.n.ExtractorIndexingFilter [pool-5-thread-1] Indexing: https://www.linkedin.com/jobs/view/software-engineering-internship-at-apple-3109778916?refId=hvliCRqZF9ja3gH7wNJ6OQ%3D%3D&trackingId=U0qyMZ4ai%2FquB19uZyoEKQ%3D%3D&position=2&pageNum=0&trk=public_jobs_jserp-result_search-card org.apache.nutch.indexwriter.csv.CSVIndexWriter 2022-11-18 07:48:15,197 INFO o.a.n.i.c.CSVIndexWriter [pool-5-thread-1] Finished CSV index in csvindexwriter/nutch.csv org.apache.nutch.indexer.IndexingJob 2022-11-18 07:48:15,983 INFO o.a.n.i.IndexingJob [main] Indexer: number of documents indexed, deleted, or skipped: org.apache.nutch.indexer.IndexingJob 2022-11-18 07:48:15,999 INFO o.a.n.i.IndexingJob [main] Indexer: 25 indexed (add/update) org.apache.nutch.indexer.IndexingJob 2022-11-18 07:48:16,005 INFO o.a.n.i.IndexingJob [main] Indexer: finished at 2022-11-18 07:48:15, elapsed: 00:00:05 vie nov 18 07:48:16 -05 2022 : 
Finished loop with 2 iterations
-----------------------------------------------------------------------------------------------------------

index-writers.xml:

<writer id="indexer_csv_1" class="org.apache.nutch.indexwriter.csv.CSVIndexWriter">
  <parameters>
    <!-- <param name="fields" value="id,title,content"/> -->
    <param name="fields" value="id,company,date,jobTitle,jobDescription,location,json"/>
    <param name="charset" value="UTF-8"/>
    <param name="separator" value=","/>
    <param name="valuesep" value="|"/>
    <param name="quotechar" value="""/>
    <param name="escapechar" value="""/>
    <param name="maxfieldlength" value="8096"/>
    <param name="maxfieldvalues" value="120"/>
    <param name="header" value="true"/>
    <param name="outpath" value="csvindexwriter"/>
  </parameters>
  <mapping>
    <copy />
    <rename />
    <remove />
  </mapping>
</writer>

I should also mention that I'm using the Bayan Group extractor plugin to extract some specific fields from LinkedIn job posts.

Thanks,

--
Paul Escobar Mossos
skype: paulescom
phone: +57 1 3006815404

