[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15216400#comment-15216400 ]
Sean Owen commented on SPARK-14103:
-----------------------------------

You show the length of one line there, not the max. Just double-check. In any event, the question is why the parser sees something different.

> Python DataFrame CSV load on large file is writing to console in IPython
> -------------------------------------------------------------------------
>
>                 Key: SPARK-14103
>                 URL: https://issues.apache.org/jira/browse/SPARK-14103
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>         Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from master branch
>            Reporter: Shubhanshu Mishra
>              Labels: csv, csvparser, dataframe, pyspark
>
> I am using Spark from the master branch, and when I run the following command on a large tab-separated file, the contents of the file are written to stderr:
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false", inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of the output:
> {code}
> ^M[Stage 1:>                                                          (0 + 2) / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 2)
> com.univocity.parsers.common.TextParsingException: Error processing input: Length of parsed input (1000001) exceeds the maximum number of characters defined in your parser settings (1000000). Identified line separator characters in the parsed content. This may be the cause of the error. The line separator in your parser settings is set to '\n'. Parsed content:
> Privacy-shake",: a haptic interface for managing privacy settings in mobile location sharing applications privacy shake a haptic interface for managing privacy settings in mobile location sharing applications 2010 2010/09/07 international conference on human computer interaction interact 43331058 19371[\n]
> 3D4F6CA1 Between the Profiles: Another such Bias. Technology Acceptance Studies on Social Network Services between the profiles another such bias technology acceptance studies on social network services 2015 2015/08/02 10.1007/978-3-319-21383-5_12 international conference on human-computer interaction interact 43331058 19502[\n]
> .......
> .........
> web snippets 2008 2008/05/04 10.1007/978-3-642-01344-7_13 international conference on web information systems and technologies webist 44F29802 19489
> 06FA3FFA Interactive 3D User Interfaces for Neuroanatomy Exploration interactive 3d user interfaces for neuroanatomy exploration 2009 internationa]
> 	at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
> 	at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
> 	at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
> 	at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
> 	at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> 	at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
> 	at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
> 	at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
> 	at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
> 	at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
> 	at org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> 	at org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> 	at org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> 	at org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:82)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> 16/03/23 14:01:03 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times; aborting job
> ^M[Stage 1:>                                                          (0 + 1) / 2]
> {code}
> For a small sample (<10,000 lines) of the data I do not get any error, but as soon as I go above 100,000 samples the error starts appearing.
> I don't think Spark should ever write the actual data to stderr, as it hurts readability.
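
A quick way to double-check Sean's point above (the maximum line length across the whole file, not the length of a single line) is a short pass over the file in plain Python. This is only a sketch based on the report: {{temp.txt}} and the tab delimiter are taken from the load() call quoted above, and the 1,000,000-character figure is the limit quoted in the parser error; everything else is illustrative.

{code}
# Sketch: find the longest raw line and the longest tab-separated field in
# temp.txt, to see whether any single value really approaches the
# 1,000,000-character limit reported by the CSV parser.
max_line = 0
max_field = 0
with open("temp.txt", "rb") as f:
    for line in f:
        max_line = max(max_line, len(line))
        for field in line.rstrip(b"\n").split(b"\t"):
            max_field = max(max_field, len(field))

print("longest line (bytes):", max_line)
print("longest field (bytes):", max_field)
{code}

If the longest field genuinely approaches the limit, later Spark releases expose a maxCharsPerColumn option on the CSV reader that can raise it; whether the master build used in this report already supports that option would need to be verified.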