[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
[ https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265993#comment-16265993 ] ABHISHEK CHOUDHARY commented on SPARK-18016: I found the same issue in Latest spark 2.2.0 while using with pyspark. Number of columns I am expecting is more than 50K , do you think, the patch will fix that kind of huge number as well ? > Code Generation: Constant Pool Past Limit for Wide/Nested Dataset > - > > Key: SPARK-18016 > URL: https://issues.apache.org/jira/browse/SPARK-18016 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Aleksander Eskilson >Assignee: Aleksander Eskilson > Fix For: 2.3.0 > > > When attempting to encode collections of large Java objects to Datasets > having very wide or deeply nested schemas, code generation can fail, yielding: > {code} > Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for > class > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection > has grown past JVM limit of 0x > at > org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499) > at > org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439) > at > org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358) > at > org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547) > at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) > at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) > at > 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) > at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309) > at
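As a rough illustration of the commenter's question, the sketch below (not from the ticket; the column count and transformation are arbitrary) builds a very wide DataFrame in PySpark and projects over every column, which is the kind of workload that drove the generated SpecificUnsafeProjection class past the constant pool limit on affected versions.

{code:python}
# Illustrative only: a very wide projection that stresses code generation.
# The 50K-column case from the comment would be num_cols = 50000.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wide-schema-stress").getOrCreate()

num_cols = 5000  # raise this toward 50000 to approach the reported scale
base = spark.range(10).select(
    *[F.col("id").alias("c%d" % i) for i in range(num_cols)])

# Touching every column in one projection forces codegen over the full width;
# on affected releases this can fail with the JaninoRuntimeException above.
wide = base.select(*[(F.col("c%d" % i) + 1).alias("c%d" % i)
                     for i in range(num_cols)])
wide.count()
{code}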
[jira] [Updated] (SPARK-18005) optional binary Dataframe Column throws (UTF8) is not a group while loading a Dataframe
[ https://issues.apache.org/jira/browse/SPARK-18005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ABHISHEK CHOUDHARY updated SPARK-18005: --- Summary: optional binary Dataframe Column throws (UTF8) is not a group while loading a Dataframe (was: optional binary CertificateChains (UTF8) is not a group while loading a Dataframe) > optional binary Dataframe Column throws (UTF8) is not a group while loading a > Dataframe > --- > > Key: SPARK-18005 > URL: https://issues.apache.org/jira/browse/SPARK-18005 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: ABHISHEK CHOUDHARY > > In some scenario, while loading a Parquet file, spark is throwing exception > as- > java.lang.ClassCastException: optional binary CertificateChains (UTF8) is not > a group > Entire Dataframe is not corrupted as I managed to load starting 20 rows of > the data but trying to load the next one throws the error and any operations > over entire dataset throws the same exception like count. > Full Exception Stack - > {quote} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in > stage 594.0 failed 4 times, most recent failure: Lost task 2.3 in stage 594.0 > (TID 6726, ): java.lang.ClassCastException: optional binary CertificateChains > (UTF8) is not a group > at org.apache.parquet.schema.Type.asGroupType(Type.java:202) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.org$apache$spark$sql$execution$datasources$parquet$ParquetReadSupport$$clipParquetType(ParquetReadSupport.scala:122) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1$$anonfun$apply$1.apply(ParquetReadSupport.scala:272) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1$$anonfun$apply$1.apply(ParquetReadSupport.scala:272) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1.apply(ParquetReadSupport.scala:272) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1.apply(ParquetReadSupport.scala:269) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at org.apache.spark.sql.types.StructType.map(StructType.scala:95) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.clipParquetGroupFields(ParquetReadSupport.scala:269) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.clipParquetSchema(ParquetReadSupport.scala:111) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport.init(ParquetReadSupport.scala:67) > at > org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:168) > at > org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:192) > at > org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140) > at > 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:377) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:339) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$3$$anon$1.hasNext(InMemoryRelation.scala:151) > at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:213) > at >
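For context, one way to hit exactly this ClassCastException is a mismatch between the schema Spark asks for and the schema physically stored in the file: clipParquetType ends up treating a primitive binary (UTF8) column as a Parquet group. The sketch below is hypothetical (paths and field names are made up) and only illustrates that mismatch; the ticket does not say how the reporter's files became inconsistent.

{code:python}
# Hypothetical reproduction of the schema clash: the file stores
# CertificateChains as a plain string, but the reader schema declares it as a
# struct (a Parquet "group"). Paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Writer: the column is a simple string (binary UTF8 in Parquet terms).
spark.createDataFrame([("abc",)], ["CertificateChains"]) \
     .write.mode("overwrite").parquet("/tmp/certs_parquet")

# Reader: a schema that treats the same column as a group.
group_schema = StructType([
    StructField("CertificateChains",
                StructType([StructField("pem", StringType())]))
])

# Clipping the primitive column to a group type is what raises
# "optional binary CertificateChains (UTF8) is not a group".
spark.read.schema(group_schema).parquet("/tmp/certs_parquet").count()
{code}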
[jira] [Updated] (SPARK-18005) optional binary CertificateChains (UTF8) is not a group while loading a Dataframe
[ https://issues.apache.org/jira/browse/SPARK-18005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ABHISHEK CHOUDHARY updated SPARK-18005: --- Description: In some scenario, while loading a Parquet file, spark is throwing exception as- java.lang.ClassCastException: optional binary CertificateChains (UTF8) is not a group Entire Dataframe is not corrupted as I managed to load starting 20 rows of the data but trying to load the next one throws the error and any operations over entire dataset throws the same exception like count. Full Exception Stack - {quote} org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 594.0 failed 4 times, most recent failure: Lost task 2.3 in stage 594.0 (TID 6726, ): java.lang.ClassCastException: optional binary CertificateChains (UTF8) is not a group at org.apache.parquet.schema.Type.asGroupType(Type.java:202) at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.org$apache$spark$sql$execution$datasources$parquet$ParquetReadSupport$$clipParquetType(ParquetReadSupport.scala:122) at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1$$anonfun$apply$1.apply(ParquetReadSupport.scala:272) at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1$$anonfun$apply$1.apply(ParquetReadSupport.scala:272) at scala.Option.map(Option.scala:146) at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1.apply(ParquetReadSupport.scala:272) at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1.apply(ParquetReadSupport.scala:269) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at org.apache.spark.sql.types.StructType.map(StructType.scala:95) at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.clipParquetGroupFields(ParquetReadSupport.scala:269) at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.clipParquetSchema(ParquetReadSupport.scala:111) at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport.init(ParquetReadSupport.scala:67) at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:168) at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:192) at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:377) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:339) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$3$$anon$1.hasNext(InMemoryRelation.scala:151) at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:213) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910) at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668) at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330) at org.apache.spark.rdd.RDD.iterator(RDD.scala:281) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at
[jira] [Created] (SPARK-18005) optional binary CertificateChains (UTF8) is not a group while loading a Dataframe
ABHISHEK CHOUDHARY created SPARK-18005: -- Summary: optional binary CertificateChains (UTF8) is not a group while loading a Dataframe Key: SPARK-18005 URL: https://issues.apache.org/jira/browse/SPARK-18005 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.0.0 Reporter: ABHISHEK CHOUDHARY In some scenario, while loading a Parquet file, spark is throwing exception as- java.lang.ClassCastException: optional binary CertificateChains (UTF8) is not a group Entire Dataframe is not corrupted as I managed to load starting 20 rows of the data but trying to load the next one throws the error and any operations over entire dataset throws the same exception like count. Full Exception Stack - org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 594.0 failed 4 times, most recent failure: Lost task 2.3 in stage 594.0 (TID 6726, ): java.lang.ClassCastException: optional binary CertificateChains (UTF8) is not a group at org.apache.parquet.schema.Type.asGroupType(Type.java:202) at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.org$apache$spark$sql$execution$datasources$parquet$ParquetReadSupport$$clipParquetType(ParquetReadSupport.scala:122) at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1$$anonfun$apply$1.apply(ParquetReadSupport.scala:272) at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1$$anonfun$apply$1.apply(ParquetReadSupport.scala:272) at scala.Option.map(Option.scala:146) at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1.apply(ParquetReadSupport.scala:272) at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1.apply(ParquetReadSupport.scala:269) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at org.apache.spark.sql.types.StructType.map(StructType.scala:95) at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.clipParquetGroupFields(ParquetReadSupport.scala:269) at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.clipParquetSchema(ParquetReadSupport.scala:111) at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport.init(ParquetReadSupport.scala:67) at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:168) at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:192) at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:377) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:339) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) 
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$3$$anon$1.hasNext(InMemoryRelation.scala:151) at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:213) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919) at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910) at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668) at
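Since the first 20 rows load fine but a full count fails, comparing per-file schemas is a reasonable next step. A hedged, local-filesystem sketch follows (the directory path is a placeholder; for HDFS, list the part files with hdfs dfs -ls instead of glob).

{code:python}
# Compare the schema of the directory as a whole with each part file's schema.
# A part file whose CertificateChains column differs in type from the others
# would explain why only some tasks hit the ClassCastException.
import glob
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.read.parquet("/data/certs").printSchema()            # merged view

for path in sorted(glob.glob("/data/certs/part-*.parquet")):
    print(path)
    spark.read.parquet(path).printSchema()                 # per-file view
{code}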
[jira] [Commented] (SPARK-10189) python rdd socket connection problem
[ https://issues.apache.org/jira/browse/SPARK-10189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729618#comment-14729618 ] ABHISHEK CHOUDHARY commented on SPARK-10189: Well the problem was actually with Java Version. pyspark is raising socket connection problem while using Java 1.8. I tried with java 1.7 and its working fine. > python rdd socket connection problem > > > Key: SPARK-10189 > URL: https://issues.apache.org/jira/browse/SPARK-10189 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.1 >Reporter: ABHISHEK CHOUDHARY > Labels: pyspark, socket > > I am trying to use wholeTextFiles with pyspark , and now I am getting the > same error - > {code} > textFiles = sc.wholeTextFiles('/file/content') > textFiles.take(1) > Traceback (most recent call last): > File "", line 1, in > File > "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py", > line 1277, in take > res = self.context.runJob(self, takeUpToNumLeft, p, True) > File > "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py", > line 898, in runJob > return list(_load_from_socket(port, mappedRDD._jrdd_deserializer)) > File > "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py", > line 138, in _load_from_socket > raise Exception("could not open socket") > Exception: could not open socket > >>> 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator > java.net.SocketTimeoutException: Accept timed out > at java.net.PlainSocketImpl.socketAccept(Native Method) > at > java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404) > at java.net.ServerSocket.implAccept(ServerSocket.java:545) > at java.net.ServerSocket.accept(ServerSocket.java:513) > at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623) > {code} > Current piece of code in rdd.py- > {code:title=rdd.py|borderStyle=solid} > def _load_from_socket(port, serializer): > sock = None > # Support for both IPv4 and IPv6. > # On most of IPv6-ready systems, IPv6 will take precedence. > for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, > socket.SOCK_STREAM): > af, socktype, proto, canonname, sa = res > try: > sock = socket.socket(af, socktype, proto) > sock.settimeout(3) > sock.connect(sa) > except socket.error: > sock = None > continue > break > if not sock: > raise Exception("could not open socket") > try: > rf = sock.makefile("rb", 65536) > for item in serializer.load_stream(rf): > yield item > finally: > sock.close() > {code} > On further investigate the issue , i realized that in context.py , runJob is > not actually triggering the server and so there is nothing to connect - > {code:title=context.py|borderStyle=solid} > port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
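Given that the fix here turned out to be the JVM version rather than PySpark itself, a quick environment check plus the original repro looks like this (path and app name are placeholders):

{code:python}
# Print the JVM version the gateway will use, then run the failing call.
# On the reporter's setup this worked under Java 1.7 but raised
# "could not open socket" under Java 1.8.
import subprocess
from pyspark import SparkContext

# `java -version` prints to stderr, hence the redirect.
print(subprocess.check_output(["java", "-version"],
                              stderr=subprocess.STDOUT).decode())

sc = SparkContext(appName="wholeTextFiles-repro")
textFiles = sc.wholeTextFiles("/file/content")
print(textFiles.take(1))
{code}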
[jira] [Updated] (SPARK-10189) python rdd socket connection problem
[ https://issues.apache.org/jira/browse/SPARK-10189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ABHISHEK CHOUDHARY updated SPARK-10189: --- Description: I am trying to use wholeTextFiles with pyspark , and now I am getting the same error - ``` textFiles = sc.wholeTextFiles('/file/content') textFiles.take(1) ``` ``` Traceback (most recent call last): File stdin, line 1, in module File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, line 1277, in take res = self.context.runJob(self, takeUpToNumLeft, p, True) File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py, line 898, in runJob return list(_load_from_socket(port, mappedRDD._jrdd_deserializer)) File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, line 138, in _load_from_socket raise Exception(could not open socket) Exception: could not open socket 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator java.net.SocketTimeoutException: Accept timed out at java.net.PlainSocketImpl.socketAccept(Native Method) at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404) at java.net.ServerSocket.implAccept(ServerSocket.java:545) at java.net.ServerSocket.accept(ServerSocket.java:513) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623) ``` Current piece of code in rdd.py- ``` def _load_from_socket(port, serializer): sock = None # Support for both IPv4 and IPv6. # On most of IPv6-ready systems, IPv6 will take precedence. for res in socket.getaddrinfo(localhost, port, socket.AF_UNSPEC, socket.SOCK_STREAM): af, socktype, proto, canonname, sa = res try: sock = socket.socket(af, socktype, proto) sock.settimeout(3) sock.connect(sa) except socket.error: sock = None continue break if not sock: raise Exception(could not open socket) try: rf = sock.makefile(rb, 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() ``` On further investigate the issue , i realized that in context.py , runJob is not actually triggering the server and so there is nothing to connect - ``` port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions) ``` was: I am trying to use wholeTextFiles with pyspark , and now I am getting the same error - ``` textFiles = sc.wholeTextFiles('/file/content') textFiles.take(1) ``` ``` Traceback (most recent call last): File stdin, line 1, in module File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, line 1277, in take res = self.context.runJob(self, takeUpToNumLeft, p, True) File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py, line 898, in runJob return list(_load_from_socket(port, mappedRDD._jrdd_deserializer)) File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, line 138, in _load_from_socket raise Exception(could not open socket) Exception: could not open socket 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator java.net.SocketTimeoutException: Accept timed out at java.net.PlainSocketImpl.socketAccept(Native Method) at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404) at java.net.ServerSocket.implAccept(ServerSocket.java:545) at java.net.ServerSocket.accept(ServerSocket.java:513) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623) ``` Current piece of code in rdd.py- ``` def _load_from_socket(port, serializer): sock = None # Support for both IPv4 and IPv6. 
# On most of IPv6-ready systems, IPv6 will take precedence. for res in socket.getaddrinfo(localhost, port, socket.AF_UNSPEC, socket.SOCK_STREAM): af, socktype, proto, canonname, sa = res try: sock = socket.socket(af, socktype, proto) sock.settimeout(3) sock.connect(sa) except socket.error: sock = None continue break if not sock: raise Exception(could not open socket) try: rf = sock.makefile(rb, 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() ``` python rdd socket connection problem Key: SPARK-10189 URL: https://issues.apache.org/jira/browse/SPARK-10189 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.1 Reporter: ABHISHEK CHOUDHARY Labels: pyspark, socket I am trying to use wholeTextFiles with pyspark , and now I am getting the same error - ``` textFiles =
[jira] [Updated] (SPARK-10189) python rdd socket connection problem
[ https://issues.apache.org/jira/browse/SPARK-10189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ABHISHEK CHOUDHARY updated SPARK-10189: --- Description: I am trying to use wholeTextFiles with pyspark , and now I am getting the same error - {code} textFiles = sc.wholeTextFiles('/file/content') textFiles.take(1) Traceback (most recent call last): File stdin, line 1, in module File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, line 1277, in take res = self.context.runJob(self, takeUpToNumLeft, p, True) File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py, line 898, in runJob return list(_load_from_socket(port, mappedRDD._jrdd_deserializer)) File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, line 138, in _load_from_socket raise Exception(could not open socket) Exception: could not open socket 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator java.net.SocketTimeoutException: Accept timed out at java.net.PlainSocketImpl.socketAccept(Native Method) at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404) at java.net.ServerSocket.implAccept(ServerSocket.java:545) at java.net.ServerSocket.accept(ServerSocket.java:513) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623) {code} Current piece of code in rdd.py- {code:title=rdd.py|borderStyle=solid} def _load_from_socket(port, serializer): sock = None # Support for both IPv4 and IPv6. # On most of IPv6-ready systems, IPv6 will take precedence. for res in socket.getaddrinfo(localhost, port, socket.AF_UNSPEC, socket.SOCK_STREAM): af, socktype, proto, canonname, sa = res try: sock = socket.socket(af, socktype, proto) sock.settimeout(3) sock.connect(sa) except socket.error: sock = None continue break if not sock: raise Exception(could not open socket) try: rf = sock.makefile(rb, 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() {code} On further investigate the issue , i realized that in context.py , runJob is not actually triggering the server and so there is nothing to connect - {code:title=context.py|borderStyle=solid} port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions) {code} was: I am trying to use wholeTextFiles with pyspark , and now I am getting the same error - {code} textFiles = sc.wholeTextFiles('/file/content') textFiles.take(1) Traceback (most recent call last): File stdin, line 1, in module File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, line 1277, in take res = self.context.runJob(self, takeUpToNumLeft, p, True) File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py, line 898, in runJob return list(_load_from_socket(port, mappedRDD._jrdd_deserializer)) File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, line 138, in _load_from_socket raise Exception(could not open socket) Exception: could not open socket 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator java.net.SocketTimeoutException: Accept timed out at java.net.PlainSocketImpl.socketAccept(Native Method) at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404) at java.net.ServerSocket.implAccept(ServerSocket.java:545) at java.net.ServerSocket.accept(ServerSocket.java:513) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623) {code} Current piece of code in rdd.py- {code:title=rdd.py|borderStyle=solid} def 
_load_from_socket(port, serializer): sock = None # Support for both IPv4 and IPv6. # On most of IPv6-ready systems, IPv6 will take precedence. for res in socket.getaddrinfo(localhost, port, socket.AF_UNSPEC, socket.SOCK_STREAM): af, socktype, proto, canonname, sa = res try: sock = socket.socket(af, socktype, proto) sock.settimeout(3) sock.connect(sa) except socket.error: sock = None continue break if not sock: raise Exception(could not open socket) try: rf = sock.makefile(rb, 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() {code} On further investigate the issue , i realized that in context.py , runJob is not actually triggering the server and so there is nothing to connect - ``` port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions) ``` python rdd socket connection problem Key: SPARK-10189 URL:
[jira] [Updated] (SPARK-10189) python rdd socket connection problem
[ https://issues.apache.org/jira/browse/SPARK-10189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ABHISHEK CHOUDHARY updated SPARK-10189: --- Description: I am trying to use wholeTextFiles with pyspark , and now I am getting the same error - {code} textFiles = sc.wholeTextFiles('/file/content') textFiles.take(1) Traceback (most recent call last): File stdin, line 1, in module File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, line 1277, in take res = self.context.runJob(self, takeUpToNumLeft, p, True) File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py, line 898, in runJob return list(_load_from_socket(port, mappedRDD._jrdd_deserializer)) File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, line 138, in _load_from_socket raise Exception(could not open socket) Exception: could not open socket 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator java.net.SocketTimeoutException: Accept timed out at java.net.PlainSocketImpl.socketAccept(Native Method) at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404) at java.net.ServerSocket.implAccept(ServerSocket.java:545) at java.net.ServerSocket.accept(ServerSocket.java:513) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623) {code} Current piece of code in rdd.py- {code:title=rdd.py|borderStyle=solid} def _load_from_socket(port, serializer): sock = None # Support for both IPv4 and IPv6. # On most of IPv6-ready systems, IPv6 will take precedence. for res in socket.getaddrinfo(localhost, port, socket.AF_UNSPEC, socket.SOCK_STREAM): af, socktype, proto, canonname, sa = res try: sock = socket.socket(af, socktype, proto) sock.settimeout(3) sock.connect(sa) except socket.error: sock = None continue break if not sock: raise Exception(could not open socket) try: rf = sock.makefile(rb, 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() {code} On further investigate the issue , i realized that in context.py , runJob is not actually triggering the server and so there is nothing to connect - ``` port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions) ``` was: I am trying to use wholeTextFiles with pyspark , and now I am getting the same error - ``` textFiles = sc.wholeTextFiles('/file/content') textFiles.take(1) ``` ``` Traceback (most recent call last): File stdin, line 1, in module File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, line 1277, in take res = self.context.runJob(self, takeUpToNumLeft, p, True) File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py, line 898, in runJob return list(_load_from_socket(port, mappedRDD._jrdd_deserializer)) File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, line 138, in _load_from_socket raise Exception(could not open socket) Exception: could not open socket 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator java.net.SocketTimeoutException: Accept timed out at java.net.PlainSocketImpl.socketAccept(Native Method) at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404) at java.net.ServerSocket.implAccept(ServerSocket.java:545) at java.net.ServerSocket.accept(ServerSocket.java:513) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623) ``` Current piece of code in rdd.py- ``` def _load_from_socket(port, serializer): sock = None # Support for both IPv4 and 
IPv6. # On most of IPv6-ready systems, IPv6 will take precedence. for res in socket.getaddrinfo(localhost, port, socket.AF_UNSPEC, socket.SOCK_STREAM): af, socktype, proto, canonname, sa = res try: sock = socket.socket(af, socktype, proto) sock.settimeout(3) sock.connect(sa) except socket.error: sock = None continue break if not sock: raise Exception(could not open socket) try: rf = sock.makefile(rb, 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() ``` On further investigate the issue , i realized that in context.py , runJob is not actually triggering the server and so there is nothing to connect - ``` port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions) ``` python rdd socket connection problem Key: SPARK-10189 URL: https://issues.apache.org/jira/browse/SPARK-10189 Project: Spark
[jira] [Created] (SPARK-10189) python rdd socket connection problem
ABHISHEK CHOUDHARY created SPARK-10189: -- Summary: python rdd socket connection problem Key: SPARK-10189 URL: https://issues.apache.org/jira/browse/SPARK-10189 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.1 Reporter: ABHISHEK CHOUDHARY I am trying to use wholeTextFiles with pyspark , and now I am getting the same error - ``` textFiles = sc.wholeTextFiles('/file/content') textFiles.take(1) ``` ``` Traceback (most recent call last): File stdin, line 1, in module File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, line 1277, in take res = self.context.runJob(self, takeUpToNumLeft, p, True) File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py, line 898, in runJob return list(_load_from_socket(port, mappedRDD._jrdd_deserializer)) File /Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py, line 138, in _load_from_socket raise Exception(could not open socket) Exception: could not open socket 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator java.net.SocketTimeoutException: Accept timed out at java.net.PlainSocketImpl.socketAccept(Native Method) at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404) at java.net.ServerSocket.implAccept(ServerSocket.java:545) at java.net.ServerSocket.accept(ServerSocket.java:513) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623) ``` Current piece of code in rdd.py- ``` def _load_from_socket(port, serializer): sock = None # Support for both IPv4 and IPv6. # On most of IPv6-ready systems, IPv6 will take precedence. for res in socket.getaddrinfo(localhost, port, socket.AF_UNSPEC, socket.SOCK_STREAM): af, socktype, proto, canonname, sa = res try: sock = socket.socket(af, socktype, proto) sock.settimeout(3) sock.connect(sa) except socket.error: sock = None continue break if not sock: raise Exception(could not open socket) try: rf = sock.makefile(rb, 65536) for item in serializer.load_stream(rf): yield item finally: sock.close() ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
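The rdd.py excerpt quoted in the descriptions above lost its string quotes in the archive; restored to runnable form (Spark 1.4.x), the relevant helper is:

{code:python}
import socket

def _load_from_socket(port, serializer):
    sock = None
    # Support for both IPv4 and IPv6.
    # On most IPv6-ready systems, IPv6 will take precedence.
    for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC,
                                  socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        try:
            sock = socket.socket(af, socktype, proto)
            sock.settimeout(3)   # 3s connect timeout; the JVM side must already be listening
            sock.connect(sa)
        except socket.error:
            sock = None
            continue
        break
    if not sock:
        raise Exception("could not open socket")
    try:
        rf = sock.makefile("rb", 65536)
        for item in serializer.load_stream(rf):
            yield item
    finally:
        sock.close()
{code}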
[jira] [Closed] (SPARK-8296) Not able to load Dataframe using Python throws py4j.protocol.Py4JJavaError
[ https://issues.apache.org/jira/browse/SPARK-8296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ABHISHEK CHOUDHARY closed SPARK-8296. - Resolution: Done Fix Version/s: 1.3.1 When I debug I found that Spark was receiving wrong Hadoop URL , a minor mistake in configuration , but the error stacktrace didn't reveal that. So Its not a bug , its configuration issue Not able to load Dataframe using Python throws py4j.protocol.Py4JJavaError -- Key: SPARK-8296 URL: https://issues.apache.org/jira/browse/SPARK-8296 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.3.1 Environment: MAC OS Reporter: ABHISHEK CHOUDHARY Labels: test Fix For: 1.3.1 While trying to load a json file using sqlcontext in prebuilt spark-1.3.1-bin-hadoop2.4 version, it throws py4j.protocol.Py4JJavaError from pyspark.sql import SQLContext from pyspark import SparkContext sc = SparkContext() sqlContext = SQLContext(sc) # Create the DataFrame df = sqlContext.jsonFile(changes.json) # Show the content of the DataFrame df.show() Error thrown - File /Users/abhishekchoudhary/Work/python/evolveML/kaggle/avirto/test.py, line 11, in module df = sqlContext.jsonFile(changes.json) File /Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/pyspark/sql/context.py, line 377, in jsonFile df = self._ssql_ctx.jsonFile(path, samplingRatio) File /Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError On checking through the source code, I found that 'gateway_client' is not valid . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
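Since the root cause was a wrong Hadoop URL in the configuration, being explicit about the filesystem scheme avoids depending on a bad default. Paths and host names below are placeholders, not values from the report:

{code:python}
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# Explicit URIs bypass a misconfigured fs.defaultFS:
df = sqlContext.jsonFile("file:///Users/abhishekchoudhary/changes.json")
# df = sqlContext.jsonFile("hdfs://namenode-host:8020/data/changes.json")
df.show()
{code}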
[jira] [Updated] (SPARK-8296) Not able to load Dataframe using Python throws py4j.protocol.Py4JJavaError
[ https://issues.apache.org/jira/browse/SPARK-8296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ABHISHEK CHOUDHARY updated SPARK-8296: -- Environment: MAC OS Not able to load Dataframe using Python throws py4j.protocol.Py4JJavaError -- Key: SPARK-8296 URL: https://issues.apache.org/jira/browse/SPARK-8296 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.3.1 Environment: MAC OS Reporter: ABHISHEK CHOUDHARY Labels: test While trying to load a json file using sqlcontext in prebuilt spark-1.3.1-bin-hadoop2.4 version, it throws py4j.protocol.Py4JJavaError from pyspark.sql import SQLContext from pyspark import SparkContext sc = SparkContext() sqlContext = SQLContext(sc) # Create the DataFrame df = sqlContext.jsonFile(changes.json) # Show the content of the DataFrame df.show() Error thrown - File /Users/abhishekchoudhary/Work/python/evolveML/kaggle/avirto/test.py, line 11, in module df = sqlContext.jsonFile(changes.json) File /Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/pyspark/sql/context.py, line 377, in jsonFile df = self._ssql_ctx.jsonFile(path, samplingRatio) File /Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError On checking through the source code, I found that 'gateway_client' is not valid . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8296) Not able to load Dataframe using Python throws py4j.protocol.Py4JJavaError
ABHISHEK CHOUDHARY created SPARK-8296: - Summary: Not able to load Dataframe using Python throws py4j.protocol.Py4JJavaError Key: SPARK-8296 URL: https://issues.apache.org/jira/browse/SPARK-8296 Project: Spark Issue Type: Bug Affects Versions: 1.3.1 Reporter: ABHISHEK CHOUDHARY While trying to load a json file using sqlcontext in prebuilt spark-1.3.1-bin-hadoop2.4 version, it throws py4j.protocol.Py4JJavaError from pyspark.sql import SQLContext from pyspark import SparkContext sc = SparkContext() sqlContext = SQLContext(sc) # Create the DataFrame df = sqlContext.jsonFile(changes.json) # Show the content of the DataFrame df.show() Error thrown - File /Users/abhishekchoudhary/Work/python/evolveML/kaggle/avirto/test.py, line 11, in module df = sqlContext.jsonFile(changes.json) File /Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/pyspark/sql/context.py, line 377, in jsonFile df = self._ssql_ctx.jsonFile(path, samplingRatio) File /Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
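When jsonFile fails like this, the Py4JJavaError usually wraps a more informative JVM-side exception; a small diagnostic sketch (file name taken from the report):

{code:python}
from py4j.protocol import Py4JJavaError
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

try:
    sqlContext.jsonFile("changes.json").show()
except Py4JJavaError as e:
    # The wrapped JVM exception carries the real cause (here it turned out to
    # be a misconfigured Hadoop URL rather than a Spark bug).
    print(e.java_exception)
    print(e.java_exception.getMessage())
{code}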
[jira] [Updated] (SPARK-8296) Not able to load Dataframe using Python throws py4j.protocol.Py4JJavaError
[ https://issues.apache.org/jira/browse/SPARK-8296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ABHISHEK CHOUDHARY updated SPARK-8296: -- Description: While trying to load a json file using sqlcontext in prebuilt spark-1.3.1-bin-hadoop2.4 version, it throws py4j.protocol.Py4JJavaError from pyspark.sql import SQLContext from pyspark import SparkContext sc = SparkContext() sqlContext = SQLContext(sc) # Create the DataFrame df = sqlContext.jsonFile(changes.json) # Show the content of the DataFrame df.show() Error thrown - File /Users/abhishekchoudhary/Work/python/evolveML/kaggle/avirto/test.py, line 11, in module df = sqlContext.jsonFile(changes.json) File /Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/pyspark/sql/context.py, line 377, in jsonFile df = self._ssql_ctx.jsonFile(path, samplingRatio) File /Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError On checking through the source code, I found that 'gateway_client' is not valid . was: While trying to load a json file using sqlcontext in prebuilt spark-1.3.1-bin-hadoop2.4 version, it throws py4j.protocol.Py4JJavaError from pyspark.sql import SQLContext from pyspark import SparkContext sc = SparkContext() sqlContext = SQLContext(sc) # Create the DataFrame df = sqlContext.jsonFile(changes.json) # Show the content of the DataFrame df.show() Error thrown - File /Users/abhishekchoudhary/Work/python/evolveML/kaggle/avirto/test.py, line 11, in module df = sqlContext.jsonFile(changes.json) File /Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/pyspark/sql/context.py, line 377, in jsonFile df = self._ssql_ctx.jsonFile(path, samplingRatio) File /Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError Not able to load Dataframe using Python throws py4j.protocol.Py4JJavaError -- Key: SPARK-8296 URL: https://issues.apache.org/jira/browse/SPARK-8296 Project: Spark Issue Type: Bug Affects Versions: 1.3.1 Reporter: ABHISHEK CHOUDHARY Labels: test While trying to load a json file using sqlcontext in prebuilt spark-1.3.1-bin-hadoop2.4 version, it throws py4j.protocol.Py4JJavaError from pyspark.sql import SQLContext from pyspark import SparkContext sc = SparkContext() sqlContext = SQLContext(sc) # Create the DataFrame df = sqlContext.jsonFile(changes.json) # Show the content of the DataFrame df.show() Error thrown - File /Users/abhishekchoudhary/Work/python/evolveML/kaggle/avirto/test.py, line 11, in module df = sqlContext.jsonFile(changes.json) File /Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/pyspark/sql/context.py, line 377, in jsonFile df = self._ssql_ctx.jsonFile(path, samplingRatio) File /Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError On checking through the source code, I found that 'gateway_client' is not valid . 
[jira] [Updated] (SPARK-7054) Spark jobs hang for ~15 mins when a node goes down
[ https://issues.apache.org/jira/browse/SPARK-7054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Choudhary updated SPARK-7054: -- Summary: Spark jobs hang for ~15 mins when a node goes down (was: Spark joobs hang for ~15 mins when a node goes down) Spark jobs hang for ~15 mins when a node goes down -- Key: SPARK-7054 URL: https://issues.apache.org/jira/browse/SPARK-7054 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: Cent OS - 6 ,Java 8 Reporter: Abhishek Choudhary Priority: Blocker In a four node cluster (on VMs) having 2 Namenodes and 2 Datanodes with 10 executors (Yarn 2.4) Spark jobs are running in yarn-client mode. When a running vm is shut down, spark job hangs for ~15 mins . After ~45-50 seconds driver got information of lost block managers, From logs : 2015-04-22 09:50:30,000 [sparkDriver-akka.actor.default-dispatcher-15] WARN org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager BlockManagerId(9, ACUME-DN2, 40898) with no recent heart beats: 59674ms exceeds 45000ms 2015-04-22 09:50:30,000 [sparkDriver-akka.actor.default-dispatcher-15] WARN org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager BlockManagerId(5, ACUME-DN2, 37947) with no recent heart beats: 60044ms exceeds 45000ms 2015-04-22 09:50:30,001 [sparkDriver-akka.actor.default-dispatcher-15] WARN org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager BlockManagerId(3, ACUME-DN2, 49808) with no recent heart beats: 54637ms exceeds 45000ms 2015-04-22 09:50:30,001 [sparkDriver-akka.actor.default-dispatcher-15] WARN org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager BlockManagerId(1, ACUME-DN2, 44090) with no recent heart beats: 59049ms exceeds 45000ms 2015-04-22 09:50:30,001 [sparkDriver-akka.actor.default-dispatcher-15] WARN org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager BlockManagerId(7, ACUME-DN2, 47267) with no recent heart beats: 56879ms exceeds 45000ms After ~15 mins Spark driver got executor lost event and rescheduled failed tasks From logs : 2015-04-22 10:05:04,965 [sparkDriver-akka.actor.default-dispatcher-19] ERROR org.apache.spark.scheduler.cluster.YarnClientClusterScheduler - Lost executor 1 on ACUME-DN2: remote Akka client disassociated For these 15 mins all the jobs were stuck for executors running on shutdown vm . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
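For reference, the detection delay is governed by heartbeat and timeout settings rather than anything job-specific. The sketch below names the knob that matches the 45000ms warnings in the log, but property names and defaults vary across 1.x releases, so treat it as an assumption to verify against the running version:

{code:python}
from pyspark import SparkConf, SparkContext

# Akka-level settings (spark.akka.timeout, spark.akka.heartbeat.interval,
# spark.akka.heartbeat.pauses) also influence how quickly the driver declares
# "Lost executor ... remote Akka client disassociated".
conf = (SparkConf()
        .setAppName("failure-detection-tuning")
        # Heartbeat expiry behind the "exceeds 45000ms" warnings above; lowering
        # it makes the master drop block managers on a dead node sooner.
        .set("spark.storage.blockManagerSlaveTimeoutMs", "45000"))

sc = SparkContext(conf=conf)
{code}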
[jira] [Created] (SPARK-7054) Spark joobs hang for ~15 mins when a node goes down
Abhishek Choudhary created SPARK-7054: - Summary: Spark joobs hang for ~15 mins when a node goes down Key: SPARK-7054 URL: https://issues.apache.org/jira/browse/SPARK-7054 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Environment: Cent OS - 6 ,Java 8 Reporter: Abhishek Choudhary Priority: Blocker In a four node cluster (on VMs) having 2 Namenodes and 2 Datanodes with 10 executors (Yarn 2.4) Spark jobs are running in yarn-client mode. When a running vm is shut down, spark job hangs for ~15 mins . After ~45-50 seconds driver got information of lost block managers, From logs : 2015-04-22 09:50:30,000 [sparkDriver-akka.actor.default-dispatcher-15] WARN org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager BlockManagerId(9, ACUME-DN2, 40898) with no recent heart beats: 59674ms exceeds 45000ms 2015-04-22 09:50:30,000 [sparkDriver-akka.actor.default-dispatcher-15] WARN org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager BlockManagerId(5, ACUME-DN2, 37947) with no recent heart beats: 60044ms exceeds 45000ms 2015-04-22 09:50:30,001 [sparkDriver-akka.actor.default-dispatcher-15] WARN org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager BlockManagerId(3, ACUME-DN2, 49808) with no recent heart beats: 54637ms exceeds 45000ms 2015-04-22 09:50:30,001 [sparkDriver-akka.actor.default-dispatcher-15] WARN org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager BlockManagerId(1, ACUME-DN2, 44090) with no recent heart beats: 59049ms exceeds 45000ms 2015-04-22 09:50:30,001 [sparkDriver-akka.actor.default-dispatcher-15] WARN org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager BlockManagerId(7, ACUME-DN2, 47267) with no recent heart beats: 56879ms exceeds 45000ms After ~15 mins Spark driver got executor lost event and rescheduled failed tasks From logs : 2015-04-22 10:05:04,965 [sparkDriver-akka.actor.default-dispatcher-19] ERROR org.apache.spark.scheduler.cluster.YarnClientClusterScheduler - Lost executor 1 on ACUME-DN2: remote Akka client disassociated For these 15 mins all the jobs were stuck for executors running on shutdown vm . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org