[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15229485#comment-15229485 ]

Apache Spark commented on SPARK-14103:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/12226

> Python DataFrame CSV load on large file is writing to console in Ipython
> -------------------------------------------------------------------------
>
>                 Key: SPARK-14103
>                 URL: https://issues.apache.org/jira/browse/SPARK-14103
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>         Environment: Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master branch
>            Reporter: Shubhanshu Mishra
>              Labels: csv, csvparser, dataframe, pyspark
>
> I am using Spark from the master branch, and when I run the following command on a large tab-separated file, the contents of the file are written to stderr:
> {code}
> df = sqlContext.read.load("temp.txt", format="csv", header="false",
>                           inferSchema="true", delimiter="\t")
> {code}
> Here is a sample of the output:
> {code}
> ^M[Stage 1:>                                                          (0 + 2) / 2]
> 16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 2)
> com.univocity.parsers.common.TextParsingException: Error processing input: Length of parsed input (101) exceeds the maximum number of characters defined in your parser settings (100). Identified line separator characters in the parsed content. This may be the cause of the error. The line separator in your parser settings is set to '\n'. Parsed content:
> 	Privacy-shake",: a haptic interface for managing privacy settings in mobile location sharing applications	privacy shake a haptic interface for managing privacy settings in mobile location sharing applications	2010	2010/09/07	international conference on human computer interaction	interact	43331058	19371[\n]
> 	3D4F6CA1	Between the Profiles: Another such Bias. Technology Acceptance Studies on Social Network Services	between the profiles another such bias technology acceptance studies on social network services	2015	2015/08/02	10.1007/978-3-319-21383-5_12	international conference on human-computer interaction	interact	43331058	19502[\n]
> ...
> 	web snippets	2008	2008/05/04	10.1007/978-3-642-01344-7_13	international conference on web information systems and technologies	webist	44F29802	19489
> 	06FA3FFA	Interactive 3D User Interfaces for Neuroanatomy Exploration	interactive 3d user interfaces for neuroanatomy exploration	2009	internationa]
> 	at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
> 	at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
> 	at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
> 	at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
> 	at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> 	at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
> 	at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
> 	at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
> 	at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
> 	at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
> 	at org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> 	at org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
> 	at org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> 	at org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:82)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> 16/03/23 14:01:03 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times; aborting job
> ^M[Stage 1:>                                                          (0 + 1) / 2]
> {code}
> For a small sample (<10,000 lines) of the
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225305#comment-15225305 ]

Hyukjin Kwon commented on SPARK-14103:
--------------------------------------

Just to cut it short: the input is read as a byte stream, byte by byte, from each line produced by an {{Iterator}}, with the line separators inserted manually.
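(For context, a minimal sketch of the mechanism described above; this is illustrative Scala, not Spark's actual {{BulkCsvReader}} code, and the class name is made up.)
{code}
import java.io.Reader

// Feeds each line from an Iterator[String] to a Reader, re-inserting '\n'
// between lines, so the consumer sees one continuous character stream
// rather than discrete lines.
class IteratorReader(lines: Iterator[String]) extends Reader {
  private var current: String = ""
  private var pos = 0

  override def read(cbuf: Array[Char], off: Int, len: Int): Int = {
    if (pos >= current.length) {
      if (!lines.hasNext) return -1
      current = lines.next() + "\n"   // manually inserted line separator
      pos = 0
    }
    val n = math.min(len, current.length - pos)
    current.getChars(pos, pos + n, cbuf, off)
    pos += n
    n
  }
  override def close(): Unit = ()
}
{code}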
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225263#comment-15225263 ]

Hyukjin Kwon commented on SPARK-14103:
--------------------------------------

Oh, sorry, I should have mentioned that once the parser meets a quote character that is never closed, it reads all the data (including row separators) regardless of the line separator. In {{BulkCsvReader}}, the data goes through a {{Reader}} converted from an {{Iterator}}, which means that from the Univocity parser's point of view the data is not processed line by line. If each line were handed to the parser as a separate input via the {{Iterator}}, it would behave just as you said; instead, the whole data is given as one input via the {{Reader}}. So line separators as well as delimiters are ignored, and everything after the quote ends up being read as a single value. (This is actually one of the reasons I am thinking of changing this library to Apache's.)
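(A self-contained sketch of the behaviour described above, assuming the univocity-parsers 2.x API; with the affected versions, the unclosed quote swallows the delimiter and line separator that follow.)
{code}
import java.io.StringReader
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

object UnclosedQuoteDemo extends App {
  val settings = new CsvParserSettings
  settings.getFormat.setDelimiter('\t')
  settings.getFormat.setLineSeparator("\n")

  val parser = new CsvParser(settings)
  // The opening quote before 'a' is never closed, so the tab and newline
  // that follow are consumed into one quoted value instead of splitting
  // fields and rows.
  val rows = parser.parseAll(new StringReader("\"a\tb\nc\td\n"))
  println(rows.size())   // 1 row instead of 2 with the affected versions
}
{code}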
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15224386#comment-15224386 ]

Sean Owen commented on SPARK-14103:
-----------------------------------

Go for it. There aren't actually long lines in the file though, so I'm sort of confused how this might resolve it. It wouldn't look past the end of the line for a quote, would it? There seems to be a line separator issue in here somewhere.
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15223650#comment-15223650 ]

Hyukjin Kwon commented on SPARK-14103:
--------------------------------------

This issue in Univocity has been fixed and they will release {{2.0.2}} (see https://github.com/uniVocity/univocity-parsers/issues/60). Could I bump up this version to solve this issue?
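(For anyone wanting to try the fixed release locally before the Spark build is updated, the sbt coordinate would look like this; the actual change belongs in Spark's Maven poms.)
{code}
libraryDependencies += "com.univocity" % "univocity-parsers" % "2.0.2"
{code}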
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15223006#comment-15223006 ]

Shubhanshu Mishra commented on SPARK-14103:
-------------------------------------------

[~hyukjin.kwon] thanks for pointing this out. I passed {{quote=""}} as a value and the dataframe reader was able to parse the file correctly.
{code}
df = sqlContext.read.load("temp.txt", format="csv", header="false", quote="",
                          inferSchema="true", delimiter="\t") # WORKS
{code}
After your comment, I looked at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala, which sets the default quote character to {{"}}; however, in the {{getChar}} function, if the length of the option is 0, the value is set to the null unicode char {{\u0000}}. I think this fixes the issue. However, the long error message should still be taken care of.
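(A simplified sketch of the {{getChar}} behaviour described above; the real implementation is in CSVOptions.scala and may differ in its details.)
{code}
// Absent option -> the default; empty string -> '\u0000' (no quote
// character); one char -> that char; anything longer is rejected.
def getChar(paramValue: Option[String], default: Char): Char = paramValue match {
  case None                     => default
  case Some(s) if s.isEmpty     => '\u0000'
  case Some(s) if s.length == 1 => s.charAt(0)
  case Some(s) =>
    throw new RuntimeException(s"$s cannot be more than one character")
}
{code}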
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222938#comment-15222938 ]

Sean Owen commented on SPARK-14103:
-----------------------------------

I don't think this case is ambiguous. The second " appears alone, without a preceding \ or a following ". However, I don't know if it's valid to quote only part of a field in CSV. And it doesn't seem to match the intent; the content should escape those quotes. I think you could argue it's a bad-input problem, but the error is odd.
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222931#comment-15222931 ]

Hyukjin Kwon commented on SPARK-14103:
--------------------------------------

After thinking further, I realised that this might be the right behaviour in a way. I just checked the [Univocity parser API|http://docs.univocity.com/parsers/1.5.0/com/univocity/parsers/csv/CsvFormat.html], and it mentions this case under the quote option. I think they intended it to work like this: the value in the case above starts with a {{quote}} character, which implies it is a value running up to the next closing {{quote}} character. Maybe those quotes have to be escaped, or {{quote}} has to be set to another character or {{null}} (not sure if {{null}} works though). I haven't checked how Apache CSV handles this. Let me test it soon and I will update here if there is anything else I should mention.
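(A sketch of that suggestion against the DataFrame reader, using the Scala API; per the {{getChar}} behaviour discussed above, the empty string maps to {{\u0000}} and effectively disables quoting.)
{code}
val df = sqlContext.read
  .format("csv")
  .option("header", "false")
  .option("inferSchema", "true")
  .option("delimiter", "\t")
  .option("quote", "")   // empty -> '\u0000', i.e. no quote character
  .load("temp.txt")
{code}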
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222912#comment-15222912 ]

Sean Owen commented on SPARK-14103:
-----------------------------------

I've used the Apache parser in the past and it has been fine. I have never used this one. That's a funny bug here. Any idea if it's known or easily fixable? That's the ideal way forward.
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222879#comment-15222879 ]

Hyukjin Kwon commented on SPARK-14103:
--------------------------------------

cc [~falaki] [~r...@databricks.com]
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222792#comment-15222792 ]

Hyukjin Kwon commented on SPARK-14103:
--------------------------------------

[~shubhanshumis...@gmail.com] Right, it looks like an issue in the Univocity parser. I could reproduce this error with the data below:
{code}
"a"b	ccc	ddd
{code}
and the code below:
{code}
val path = "temp.tsv"
sqlContext.read
  .format("csv")
  .option("maxCharsPerColumn", "4")
  .option("delimiter", "\t")
  .load(path)
{code}
It looks like the Univocity parser gets confused when it meets a {{quote}} character while parsing a value and the value does not end with that character. When this happens, it treats all the rows and values that follow as one quoted value. So it looks like your data has such rows, for example:
{code}
7C0E15CD	"I did it my way": moving away from the tyranny of turn-by-turn pedestrian navigation	i did it my way moving away from the tyranny of turn by turn pedestrian navigation	2010	2010/09/07	10.1145/1851600.1851660	international conference on human computer interaction	interact	43331058	18871
{code}
All the data after {{"I did it my way}} was treated as a single quoted value.

[~sowen] Actually, I have been a bit doubtful whether Spark should use the Univocity parser. It is generally true that the library itself is faster than the Apache CSV parser, but it has brought code complexity, there is some pretty messy additional logic needed to use Univocity for now, and it has become pretty difficult to diagnose issues like this. I am thinking about changing Univocity to the Apache CSV parser after performance tests. Do you think this makes sense?
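(Since switching parsers is floated above, a hedged sketch of checking the same unmatched-quote input with Apache Commons CSV, assuming the org.apache.commons:commons-csv artifact; {{withQuote(null)}} disables quote handling entirely.)
{code}
import java.io.StringReader
import org.apache.commons.csv.CSVFormat
import scala.collection.JavaConverters._

object CommonsCsvCheck extends App {
  val noQuote: Character = null   // null disables quote handling
  val format = CSVFormat.DEFAULT.withDelimiter('\t').withQuote(noQuote)
  val records = format.parse(new StringReader("\"a\tb\nc\td\n")).getRecords.asScala
  // With quoting disabled, the two physical lines come back as two records
  // and the stray '"' is kept as literal data.
  records.foreach(r => println(r.iterator().asScala.mkString("|")))
}
{code}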
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218001#comment-15218001 ]

Hyukjin Kwon commented on SPARK-14103:
--------------------------------------

Thanks for the detailed directions. Fortunately, I think I found some clues. Let me provide some examples and an explanation in a PR if I'm on the right track.
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217981#comment-15217981 ]

Shubhanshu Mishra commented on SPARK-14103:
-------------------------------------------

[~hyukjin.kwon] The temp.txt file is actually just the first 100,000 lines of the Papers.txt file from this URL: https://academicgraph.blob.core.windows.net/graph-2016-02-05/Papers.zip
This file is part of the Microsoft Academic Graph, which is free to download, but you need to accept the license. The license can be found at: http://research.microsoft.com/en-us/projects/mag/
Steps for downloading are as follows:
* Go to http://research.microsoft.com/en-us/projects/mag/
* Accept the terms and click on "get the data"
* Click the link 2016-02-05 under West United States
* Under individual files, download Papers (9.05GB); it is a zip file.
* Unzip Papers.zip
* Run the command {code}head -n 100000 Papers.txt > temp.txt{code}
* Try to load temp.txt into a Spark dataframe:
{code}
df = sqlContext.read.load("temp.txt", format="csv", header="false",
                          inferSchema="true", delimiter="\t",
                          maxCharsPerColumn=2679350) # Gives error
{code}
I am using the master branch of Spark from GitHub; it was updated yesterday. Ubuntu, Python 2.7.11, Anaconda 2.5.0.
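(Since Papers.txt is ~9 GB, a hedged sketch of a tiny dummy TSV that should trigger the same failure mode, per the unmatched-quote analysis above; the file name and row contents are made up.)
{code}
import java.io.PrintWriter

object MakeDummyTsv extends App {
  val out = new PrintWriter("temp.txt")
  // The first row contains an unmatched '"' mid-field, the identified trigger.
  out.println("7C0E15CD\t\"I did it my way: an unclosed quote\t2010")
  out.println("3D4F6CA1\tplain value\t2015")
  out.close()
}
{code}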
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217853#comment-15217853 ] Hyukjin Kwon commented on SPARK-14103: --- [~shubhanshumis...@gmail.com] I just wonder if I could have that {{temp.txt}} file, if you are okay with it, although I will try to reproduce this with a dummy file.
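A dummy file along the following lines might reproduce the failure without downloading the full dataset. This is only a minimal sketch, not the reporter's data: the 2,000,000-character field width is an assumption, chosen to exceed the parser's {{maxCharsPerColumn}} limit mentioned in the stack trace.
{code}
# Hypothetical repro: a small tab-separated file whose last record has
# one field far longer than the parser's per-column character limit.
with open("dummy.txt", "w") as f:
    for i in range(100):
        f.write("short\tfields\t%d\n" % i)
    f.write("huge\t%s\t0\n" % ("x" * 2000000))

df = sqlContext.read.load("dummy.txt", format="csv", header="false",
                          inferSchema="true", delimiter="\t")
{code}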
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217850#comment-15217850 ] Hyukjin Kwon commented on SPARK-14103: --- Thank you so much for cutting it short. Currently I'm not too sure. Let me investigate this.
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217821#comment-15217821 ] Sean Owen commented on SPARK-14103: --- The exception is in the description:
{code}
com.univocity.parsers.common.TextParsingException: Error processing input: Length of parsed input (101) exceeds the maximum number of characters defined in your parser settings (100). Identified line separator characters in the parsed content. This may be the cause of the error. The line separator in your parser settings is set to '\n'. Parsed content:
{code}
It really does sound like a very long line, but the tests above seem to show that the lines are quite normal and short.
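The workaround exercised later in this thread is to raise that per-column limit through the DataFrame reader option. A hedged sketch; the limit value below is illustrative, not a recommendation:
{code}
# Raise univocity's per-column character cap well above the longest
# field the file could plausibly contain; 10000000 here is arbitrary.
df = sqlContext.read.load("temp.txt", format="csv", header="false",
                          inferSchema="true", delimiter="\t",
                          maxCharsPerColumn=10000000)
{code}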
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217212#comment-15217212 ] Hyukjin Kwon commented on SPARK-14103: --- For the long messages, there is already a JIRA open for that: SPARK-13792.
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15217198#comment-15217198 ] Hyukjin Kwon commented on SPARK-14103: --- As [~sowen] said, CRLF is dealt with in {{TextInputFormat}}, which calls [LineReader.readDefaultLine(...)|https://github.com/apache/hadoop/blob/7fd00b3db4b7d73afd41276ba9a06ec06a0e1762/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L186] in the end, meaning it should not be a problem. [~shubhanshumis...@gmail.com] So, the code below:
{code}
df = sqlContext.read.load("temp.txt", format="csv", header="false", inferSchema="true", delimiter="\t", maxCharsPerColumn=2679350) # Gives error
{code}
throws an exception, right? Could you maybe share the error message? I (or maybe my blind eyes) cannot find the exception message for this.
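If CRLF handling is the suspect, a quick driver-side check can confirm whether any carriage returns survive Hadoop's line splitting. A minimal sketch, assuming the same {{temp.txt}} as in the reports above:
{code}
# Count records that still contain '\r' after TextInputFormat has
# split the file; a nonzero count would point at stray carriage
# returns inside records rather than at line ends.
n_cr = sc.textFile("temp.txt").filter(lambda line: "\r" in line).count()
print(n_cr)
{code}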
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216705#comment-15216705 ] Sean Owen commented on SPARK-14103: --- Printing the bad line is helpful, but it would be great if the length were capped. I don't know how much we can control this if it comes from the library.
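Until the message is capped at the source, one driver-side band-aid is to truncate the exception text before it reaches the notebook. A sketch only: {{Py4JJavaError}} is what PySpark raises when the job aborts, and this does not touch the executor logs themselves.
{code}
from py4j.protocol import Py4JJavaError

# Truncate the (potentially enormous) Java-side error text on the
# driver; schema inference runs a job, so the failure surfaces here.
try:
    df = sqlContext.read.load("temp.txt", format="csv", header="false",
                              inferSchema="true", delimiter="\t")
except Py4JJavaError as e:
    print(str(e)[:2000])
{code}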
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216660#comment-15216660 ] Shubhanshu Mishra commented on SPARK-14103: --- [~srowen] Yes, you are right. The issue is not with "\r\n" but something else. I converted the "\r\n" in the file to "\n" and the error persists. Another issue which should be addressed is the long debug message printed to the console with the whole content of the file. This is annoying: it should print just the stack trace and the error message, not the whole string of file contents read by the parser.
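As a stopgap for the console noise in an interactive session, raising the driver's log level hides the executor ERROR dump, at the cost of hiding other errors too. A workaround sketch, not a fix:
{code}
# Suppress ERROR-level output in the console; FATAL silences the
# noisy TextParsingException dump along with any other ERROR logs.
sc.setLogLevel("FATAL")
{code}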
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216639#comment-15216639 ] Sean Owen commented on SPARK-14103: --- I'm saying that this is already done by {{TextInputFormat}}... or should be. Certainly that's what you're already using in your examples, because they call textFile(); otherwise they wouldn't work, right? And it looks like the CSV parser consumes the already-parsed lines, meaning I don't see how the parser's notion of record separator matters. But this could be where some problem lies. It is suspicious that you have this problem with Windows-formatted newlines. However, breaking on \n should cause it to break on \r\n and just leave the \r in, so I also don't immediately see how this leads to a line-too-long problem. [~hyukjin.kwon] is any of this ringing a bell?
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216547#comment-15216547 ] Shubhanshu Mishra commented on SPARK-14103: --- Another issue with your [#comment-15216228] is that I am actually reading a file, which was generated on a Windows system, on a Linux system. So the direct conversion of CRLF to LF will not take place, and hence a proper assignment of rowSeparator is needed.
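A raw-byte inspection would settle whether the file really carries CRLF terminators, independently of which OS reads it. A minimal sketch in plain Python against the same {{temp.txt}}:
{code}
# Look at the raw bytes: a nonzero "\r\n" count confirms Windows-style
# line endings regardless of the OS doing the reading.
with open("temp.txt", "rb") as f:
    sample = f.read(1 << 20)  # first 1 MB
print(sample.count(b"\r\n"), sample.count(b"\n"))
{code}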
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216541#comment-15216541 ] Shubhanshu Mishra commented on SPARK-14103: --- OK, I tried your suggestion of increasing maxCharsPerColumn to an insanely high value, and that made the code load my file into the DataFrame:
{code}
wc -l temp.txt
# Output is
10 3181726 25693963 temp.txt

# Any number larger than 3181726 works
df = sqlContext.read.load("temp.txt", format="csv", header="false", inferSchema="true", delimiter="\t", maxCharsPerColumn=3181726) # Works
df = sqlContext.read.load("temp.txt", format="csv", header="false", inferSchema="true", delimiter="\t", maxCharsPerColumn=1000) # Works
df = sqlContext.read.load("temp.txt", format="csv", header="false", inferSchema="true", delimiter="\t", maxCharsPerColumn=2679360) # Works

# However, this one fails
df = sqlContext.read.load("temp.txt", format="csv", header="false", inferSchema="true", delimiter="\t", maxCharsPerColumn=2679350) # Gives error
{code}
I checked the file at that point using the following bash commands:
{code}
$ head -c2679360 temp1.txt | tail -n3
5F59257APerformance of multicarrier CDMA technique combined with space-time block coding over Rayleigh channel performance of multicarrier cdma technique combined with space time block coding over rayleigh channel 2002 200210.1109/ISSSTA.2002.1048562 international symposium on information theory and its applications isita 44B587D117005
6C9A7181Compressive receiver sidelobes suppression based on mismatching algorithms compressive receiver sidelobes suppression based on mismatching algorithms 19981998 10.1109/ISSSTA.1998.722528 international symposium on information theory and its applications isita 44B587D117166
777FD068UE Counting Mechanism for MBMS Considering PtM Macro Diversity Combining Support in UMTS Networks ue counting mechanism for mbms considering ptm macro diversity combining support in umts networks 2006 2006/08 10.1109/ISSSTA.2006.311795 internation
{code}
I don't see any issues with the data in these lines, especially between the following characters:
{code}
$ head -c2679360 temp1.txt | tail -c20
6.311795\tinternation
{code}
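To pin down which record (if any) actually carries an oversized field, a plain-Python scan is independent of both Spark and univocity. A sketch, assuming tab-delimited, newline-terminated records as described above; the threshold is illustrative:
{code}
# Scan the raw file and report any field longer than a threshold.
# This bypasses Spark and univocity entirely, so disagreement with
# the parser's view would point at record splitting, not the data.
threshold = 100000  # illustrative limit
with open("temp.txt") as f:
    for lineno, line in enumerate(f, 1):
        for field in line.rstrip("\n").split("\t"):
            if len(field) > threshold:
                print(lineno, len(field))
{code}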
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216437#comment-15216437 ] Shubhanshu Mishra commented on SPARK-14103: --- I just double-checked using the following code, and this is the output I found:
{code}
In [8]: data = sc.textFile("temp.txt").map(lambda x: len(x)).collect()
   ...: max(data), min(data)
Out[8]: (696, 96)

In [11]: data = sc.textFile("temp.txt").map(lambda x: len(x.split("\t"))).collect()
    ...: max(data), min(data)
Out[11]: (11, 11)

In [12]: data = sc.textFile("temp.txt").map(lambda x: max([len(k) for k in x.split("\t")])).collect()
    ...: max(data), min(data)
Out[12]: (286, 20)
{code}
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216400#comment-15216400 ] Sean Owen commented on SPARK-14103:
---
You show the length of one line there, not the max. Just double-check. In any event, the question is why the parser sees something different.
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216394#comment-15216394 ] Shubhanshu Mishra commented on SPARK-14103:
---
[~srowen] In comment [#comment-15215064] I mentioned how I extracted the length of each line and what the maximum line length was. I will also try raising the limit.
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216378#comment-15216378 ] Sean Owen commented on SPARK-14103:
---
Did you compute the maximum line length? That's not quite what you show earlier, and not what you show in this snippet. Assuming you did, it still doesn't explain why the parser thinks there is a run of text far too long to process. Your example does not exercise the CSV parser here. You should try raising the limit just to confirm it fixes the problem (then you have your workaround), and then do a little more debugging to understand why the parser is seeing a very long line.
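For concreteness, a minimal sketch of that suggestion (hypothetical: it assumes the master-branch CSV reader forwards a maxCharsPerColumn option through to the underlying univocity parser, which the parser-settings dump further down this thread suggests it does):
{code}
# Hedged sketch: raise the univocity per-column character limit to confirm
# the failure really is about one very long run of text being treated as a
# single field. maxCharsPerColumn is assumed to be passed through to the
# parser by the CSV data source.
df = sqlContext.read.load("temp.txt", format="csv", header="false",
                          inferSchema="true", delimiter="\t",
                          maxCharsPerColumn=1000000)
df.count()  # force a full pass over the file
{code}
If this succeeds, the higher limit is the workaround; the remaining question is why the parser sees such a long run in the first place.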
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216350#comment-15216350 ] Shubhanshu Mishra commented on SPARK-14103:
---
[~srowen] As I mentioned above, the maximum number of characters in any line is 697, which is well within the default maxCharsPerColumn limit. Also, as mentioned above, reading the file through the Spark context works correctly:
{code}
data = sc.textFile("temp.txt").map(lambda x: x.split("\t")).collect()

In [1]: data = sc.textFile("temp.txt").map(lambda x: len(x.split("\t"))).collect()

In [2]: max(data)
Out[2]: 11

In [3]: min(data)
Out[3]: 11
{code}
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216228#comment-15216228 ] Sean Owen commented on SPARK-14103:
---
It shouldn't be platform-dependent; configurable, maybe. I actually suspect this field does nothing, since the parser operates on lines already split by Hadoop's {{TextInputFormat}}, which reads CR, LF, or CRLF as a delimiter. But that means my explanation is wrong. See https://github.com/databricks/spark-csv/pull/307, which is the same issue. It suggests this happens because of one very large field. In that case, the fix is to increase "maxCharsPerColumn" to a value that works for your data.
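One way to sanity-check the {{TextInputFormat}} point (a hypothetical diagnostic, not something run in this thread): compare the line count Spark reports with the raw LF and CRLF counts in the file. If they agree, the record-splitting layer already copes with CRLF, and the rowSeparator setting is irrelevant at that level.
{code}
# Hypothetical check: sc.textFile goes through Hadoop's TextInputFormat,
# which splits records on CR, LF, or CRLF, so its line count should match
# the raw "\n" count even for a CRLF-terminated file.
num_lines = sc.textFile("temp.txt").count()
with open("temp.txt", "rb") as f:
    raw = f.read()
print num_lines, raw.count("\n"), raw.count("\r\n")
{code}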
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15215076#comment-15215076 ] Shubhanshu Mishra commented on SPARK-14103:
---
[~srowen] I just checked the Spark code on GitHub and found that the line separator (named rowSeparator in the code) is hard-coded as "\n":
https://github.com/apache/spark/blob/e474088144cdd2632cf2fef6b2cf10b3cd191c23/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala
Ideally, the line separator should default to the platform-dependent setting (CRLF on Windows, LF on Unix-based systems), and a user-defined rowSeparator should override that default. The hard-coded value might be what is causing the issue. I can send a PR that accepts custom line separators and defaults to the system-specific setting.
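A possible interim workaround, sketched under the assumption that the problem is purely a mismatch between CRLF data and the hard-coded "\n" separator (note that the next comment below reports `dos2unix` did not make the error go away, so this may not be sufficient on its own):
{code}
# Hypothetical workaround: rewrite the file with CRLF normalized to LF so
# that the reader's hard-coded "\n" row separator matches the input.
# Reads the whole file into memory, which is fine only for modest files.
with open("temp.txt", "rb") as src, open("temp_unix.txt", "wb") as dst:
    dst.write(src.read().replace("\r\n", "\n"))

df = sqlContext.read.load("temp_unix.txt", format="csv", header="false",
                          inferSchema="true", delimiter="\t")
{code}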
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15215064#comment-15215064 ] Shubhanshu Mishra commented on SPARK-14103:
---
[~srowen] Thanks for the reply. As mentioned above, I tried setting rowSeparator to "\r\n" but couldn't get it to work. When I process the same file using `sc.textFile("temp.txt").map(lambda x: len(x))`, I get the correct maximum of 697. Even `sc.textFile("temp.txt").map(lambda x: x.split("\t")).collect()` works. I also tested whether the number of columns is inconsistent across rows and found it is the same for all of them:
{code}
In [1]: data = sc.textFile("temp.txt").map(lambda x: len(x.split("\t"))).collect()

In [2]: max(data)
Out[2]: 11

In [3]: min(data)
Out[3]: 11
{code}
So the issue is in the parsing function of the DataFrame CSV reader. If I don't use the inferSchema option, the error still happens as soon as I run any operation that reads the full DataFrame. It also persists even after using the `dos2unix` command to convert each `\r\n` in the file to `\n`.
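To make the inferSchema point concrete, a minimal repro sketch (hypothetical, assuming the same file and session as above):
{code}
# Without inferSchema, load() returns without scanning the whole file; the
# TextParsingException only surfaces once an action forces a full pass.
df = sqlContext.read.load("temp.txt", format="csv", header="false",
                          delimiter="\t")
df.count()  # full scan of the DataFrame; this is where the error appears
{code}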
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15215027#comment-15215027 ] Shubhanshu Mishra commented on SPARK-14103:
---
Yes, the error does say so. However, I have checked the file: it has `\r\n` style line endings and otherwise looks perfectly fine. I am able to process it correctly using `cut`, but not using the CSV format reader. The maximum number of characters in a line of my file is 697, with 97 as the minimum.

It looks like the line-ending characters are causing the issue, and Spark is normalizing the line ending to just "\n" instead of "\r\n". I ran my command with the following settings and was able to confirm this intuition. Note in the parser configuration dump below that maxColumns=20 and maxCharsPerColumn=1000 were honored, while the line separator is still "\n" and the input buffer size is still 128, even though I passed rowSeparator="\r\n" and inputBufferSize=500:
{code}
In [2]: df = sqlContext.read.load("temp.txt", format="csv", header="false", inferSchema="true", delimiter="\t", rowSeparator="\r\n", inputBufferSize=500, maxColumns=20, maxCharsPerColumn=1000)

16/03/28 17:17:39 ERROR Executor: Exception in task 1.0 in stage 3.0 (TID 5)
com.univocity.parsers.common.TextParsingException: Error processing input: Length of parsed input (1001) exceeds the maximum number of characters defined in your parser settings (1000). Identified line separator characters in the parsed content. This may be the cause of the error. The line separator in your parser settings is set to '\n'. Parsed content:
I did it my way": moving away from the tyranny of turn-by-turn pedestrian navigationi did it my way moving away from the tyranny of turn by turn pedestrian navigation 2010 2010/09/07 10.1145/1851600.1851660 international conference on human computer interaction interact4333105818871[\n]
770CA612Fixed in time and "time in motion": mobility of vision through a SenseCam lens fixed in time and time in motion mobility of vision through a sensecam lens 2009 2009/09/15 10.1145/1613858.1613861 international conference on human computer interaction interact 4333105819370[\n]
7B5DE5DEAssistive Wearable Technology for Visually Impaired assistive wearable technology for visually impaired 20152015/08/24 international conference on human computer interaction interact 4333105819555[\n]
085BEC09HOUDINI: Introducing Object Tracking and Pen Recognition for LLP Tabletops houdini introducing object tracking and pen recognition for llp tabletops 2014 2014/06/22 10.1007/978-3-319-07230-2_23international c
Parser Configuration: CsvParserSettings:
    Column reordering enabled=true
    Empty value=null
    Header extraction enabled=false
    Headers=[C0, C1, C2, C3, C4, C5, C6, C7, C8, C9, C10]
    Ignore leading whitespaces=false
    Ignore trailing whitespaces=false
    Input buffer size=128
    Input reading on separate thread=false
    Line separator detection enabled=false
    Maximum number of characters per column=1000
    Maximum number of columns=20
    Null value=
    Number of records to read=all
    Parse unescaped quotes=true
    Row processor=none
    Selected fields=none
    Skip empty lines=true
Format configuration: CsvFormat:
    Comment character=\0
    Field delimiter=\t
    Line separator (normalized)=\n
    Line separator sequence=\n
    Quote character="
    Quote escape character=quote escape
    Quote escape escape character=\0, line=36, char=9828.
Content parsed: [I did it my way": moving away from the tyranny of turn-by-turn pedestrian navigation i did it my way moving away from the tyranny of turn by turn pedestrian navigation 20102010/09/07 10.1145/1851600.1851660 international conference on human computer interaction interact 43331058 18871 770CA612Fixed in time and "time in motion": mobility of vision through a SenseCam lens fixed in time and time in motion mobility of vision through a sensecam lens 20092009/09/15 10.1145/1613858.1613861 international conference on human computer interaction interact43331058 19370 7B5DE5DEAssistive Wearable Technology for Visually Impaired assistive wearable technology for visually impaired 20152015/08/24 international conference on human computer interaction interact 4333105819555 085BEC09HOUDINI: Introducing Object Tracking and Pen Recognition for LLP Tabletops houdini introducing object tracking and pen recognition for llp tabletops 20142014/06/22 10.1007/978-3-319-07230-2_23 international c]
    at
{code}
[jira] [Commented] (SPARK-14103) Python DataFrame CSV load on large file is writing to console in Ipython
[ https://issues.apache.org/jira/browse/SPARK-14103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209096#comment-15209096 ] Sean Owen commented on SPARK-14103:
---
Isn't this saying that one _line_ is extremely long? The error pretty much tells you that.