[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217981#comment-15217981 ]
Shubhanshu Mishra commented on SPARK-14103:
-------------------------------------------

[~hyukjin.kwon] the temp.txt file is actually just the first 100,000 lines of the Papers.txt file from this URL: https://academicgraph.blob.core.windows.net/graph-2016-02-05/Papers.zip

This file is part of the Microsoft Academic Graph, which is free to download, but you need to accept the license. The license can be found at: http://research.microsoft.com/en-us/projects/mag/

Steps for downloading are as follows (a sketch pulling these steps together is appended at the end of this message):
* Go to http://research.microsoft.com/en-us/projects/mag/
* Accept the terms and click on "get the data"
* Click the link 2016-02-05 under West United States
* Under individual files, download Papers (9.05 GB); it is a zip file.
* Unzip Papers.zip
* Run the command {code}head -n 100000 Papers.txt > temp.txt{code}
* Try to load temp.txt into a Spark DataFrame:
{code}
df = sqlContext.read.load("temp.txt", format="csv", header="false", inferSchema="true", delimiter="\t", maxCharsPerColumn=2679350) # Gives error
{code}

I am using the master branch of Spark from GitHub; it was updated yesterday. Ubuntu, Python 2.7.11, Anaconda 2.5.0.

> Python DataFrame CSV load on large file is writing to console in Ipython
> ------------------------------------------------------------------------
>
>                 Key: SPARK-14103
>                 URL: https://issues.apache.org/jira/browse/SPARK-14103
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>         Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master branch
>            Reporter: Shubhanshu Mishra
>              Labels: csv, csvparser, dataframe, pyspark
>
> I am using Spark from the master branch, and when I run the following command on a large tab-separated file, the contents of the file are written to stderr:
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false", inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of the output:
> {code}
> ^M[Stage 1:> (0 + 2) / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 2)
> com.univocity.parsers.common.TextParsingException: Error processing input: Length of parsed input (1000001) exceeds the maximum number of characters defined in your parser settings (1000000). Identified line separator characters in the parsed content. This may be the cause of the error. The line separator in your parser settings is set to '\n'. Parsed content:
> Privacy-shake",: a haptic interface for managing privacy settings in mobile location sharing applications privacy shake a haptic interface for managing privacy settings in mobile location sharing applications 2010 2010/09/07 international conference on human computer interaction interact 43331058 19371[\n]
> 3D4F6CA1 Between the Profiles: Another such Bias. Technology Acceptance Studies on Social Network Services between the profiles another such bias technology acceptance studies on social network services 2015 2015/08/02 10.1007/978-3-319-21383-5_12 international conference on human-computer interaction interact 43331058 19502[\n]
> .......
> .........
> web snippets 2008 2008/05/04 10.1007/978-3-642-01344-7_13 international conference on web information systems and technologies webist 44F29802 19489
> 06FA3FFA Interactive 3D User Interfaces for Neuroanatomy Exploration interactive 3d user interfaces for neuroanatomy exploration 2009 internationa]
>         at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
>         at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
>         at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
>         at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>         at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
>         at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
>         at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
>         at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
>         at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
>         at org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
>         at org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
>         at org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
>         at org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
>         at org.apache.spark.scheduler.Task.run(Task.scala:82)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> 16/03/23 14:01:03 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times; aborting job
> ^M[Stage 1:> (0 + 1) / 2]
> {code}
>
> For a small sample (<10,000 lines) of the data, I am not getting any error, but as soon as I go above 100,000 lines, I start getting the error.
> I don't think Spark should ever write the actual data to stderr, as it decreases readability.
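Below is a minimal PySpark sketch that pulls the reproduction steps above into one script. It assumes Papers.zip has already been downloaded and unzipped into the working directory and that a SQLContext named sqlContext is available, as in the snippets above; the maxCharsPerColumn value is the one used in the comment, and even with it the reporter still sees the failure on this input.

{code}
# Reproduction sketch (PySpark, Spark master circa March 2016), assuming
# Papers.txt from the Microsoft Academic Graph is in the working directory
# and sqlContext is an existing SQLContext.
import subprocess

# Keep only the first 100,000 lines of the large file, as in the steps above.
subprocess.check_call("head -n 100000 Papers.txt > temp.txt", shell=True)

# Same call as in the comment above: maxCharsPerColumn raises the CSV
# parser's limit of 1,000,000 characters per column mentioned in the
# exception, yet the reporter still hits a TextParsingException whose
# message echoes the offending rows to stderr.
df = sqlContext.read.load("temp.txt",
                          format="csv",
                          header="false",
                          inferSchema="true",
                          delimiter="\t",
                          maxCharsPerColumn=2679350)
{code}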
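As a rough, hypothetical sanity check (not part of the original report), the snippet below scans temp.txt with plain Python to see whether any tab-separated field genuinely exceeds the 1,000,000-character limit mentioned in the exception, or whether the oversized "column" only arises because the parser is misreading line separators, which the exception message itself flags as a possible cause.

{code}
# Hypothetical diagnostic, independent of Spark: measure the longest line
# and the longest tab-separated field in temp.txt (runs on Python 2.7 and 3).
max_line = 0
max_field = 0
with open("temp.txt", "rb") as f:
    for line in f:
        max_line = max(max_line, len(line))
        for field in line.rstrip(b"\r\n").split(b"\t"):
            max_field = max(max_field, len(field))

print("longest line:  %d bytes" % max_line)
print("longest field: %d bytes" % max_field)
# If max_field stays far below 1,000,000, the TextParsingException points to
# a row-delimiting problem in the CSV reader rather than genuinely huge data.
{code}

If the fields really are that long, raising maxCharsPerColumn further would be the natural knob; if they are not, the parser is concatenating rows, which would also explain why the echoed "parsed content" spans many records.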