[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2017-11-26 Thread ABHISHEK CHOUDHARY (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16265993#comment-16265993
 ] 

ABHISHEK CHOUDHARY commented on SPARK-18016:


I found the same issue in the latest Spark 2.2.0 while using PySpark.
The number of columns I am expecting is more than 50K; do you think the patch 
will handle that many columns as well?
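
For reference, a minimal PySpark sketch of the kind of very wide projection described here (the column count and names are illustrative assumptions, not taken from this report):

{code}
# Illustrative sketch only: builds an artificially wide DataFrame so that
# whole-stage code generation has to emit one very large projection class.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wide-schema-sketch").getOrCreate()

num_cols = 50000  # assumed, based on the comment above
wide = spark.range(10).select(
    [F.lit(i).alias("c{}".format(i)) for i in range(num_cols)]
)

# Any action that forces codegen over the full projection can hit the
# 64K constant pool limit on affected versions.
wide.count()
{code}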

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Aleksander Eskilson
> Fix For: 2.3.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0xFFFF
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at 

[jira] [Updated] (SPARK-18005) optional binary Dataframe Column throws (UTF8) is not a group while loading a Dataframe

2016-11-01 Thread ABHISHEK CHOUDHARY (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ABHISHEK CHOUDHARY updated SPARK-18005:
---
Summary: optional binary Dataframe Column throws (UTF8) is not a group 
while loading a Dataframe  (was: optional binary CertificateChains (UTF8) is 
not a group while loading a Dataframe)

> optional binary Dataframe Column throws (UTF8) is not a group while loading a 
> Dataframe
> ---
>
> Key: SPARK-18005
> URL: https://issues.apache.org/jira/browse/SPARK-18005
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: ABHISHEK CHOUDHARY
>
> In some scenarios, while loading a Parquet file, Spark throws the following 
> exception:
> java.lang.ClassCastException: optional binary CertificateChains (UTF8) is not 
> a group
> The entire DataFrame is not corrupted: I managed to load the first 20 rows of 
> the data, but trying to load more rows throws the error, and any operation 
> over the entire dataset (such as count) throws the same exception.
> Full Exception Stack -
> {quote}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in 
> stage 594.0 failed 4 times, most recent failure: Lost task 2.3 in stage 594.0 
> (TID 6726, ): java.lang.ClassCastException: optional binary CertificateChains 
> (UTF8) is not a group
>   at org.apache.parquet.schema.Type.asGroupType(Type.java:202)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.org$apache$spark$sql$execution$datasources$parquet$ParquetReadSupport$$clipParquetType(ParquetReadSupport.scala:122)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1$$anonfun$apply$1.apply(ParquetReadSupport.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1$$anonfun$apply$1.apply(ParquetReadSupport.scala:272)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1.apply(ParquetReadSupport.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1.apply(ParquetReadSupport.scala:269)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.clipParquetGroupFields(ParquetReadSupport.scala:269)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.clipParquetSchema(ParquetReadSupport.scala:111)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport.init(ParquetReadSupport.scala:67)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:168)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:192)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:377)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:339)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$3$$anon$1.hasNext(InMemoryRelation.scala:151)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:213)
>   at 
> 

[jira] [Updated] (SPARK-18005) optional binary CertificateChains (UTF8) is not a group while loading a Dataframe

2016-10-19 Thread ABHISHEK CHOUDHARY (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ABHISHEK CHOUDHARY updated SPARK-18005:
---
Description: 
In some scenarios, while loading a Parquet file, Spark throws the following exception:
java.lang.ClassCastException: optional binary CertificateChains (UTF8) is not a 
group

The entire DataFrame is not corrupted: I managed to load the first 20 rows of the 
data, but trying to load more rows throws the error, and any operation over the 
entire dataset (such as count) throws the same exception.
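
A minimal sketch of the access pattern described above (the file path and session setup are illustrative assumptions, not taken from this report):

{code}
# Illustrative sketch only: path and session setup are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-read-sketch").getOrCreate()

df = spark.read.parquet("/path/to/data.parquet")

df.show(20)   # reading the first rows may succeed
df.count()    # a full scan is the kind of operation reported to fail
{code}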



Full Exception Stack -
{quote}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in 
stage 594.0 failed 4 times, most recent failure: Lost task 2.3 in stage 594.0 
(TID 6726, ): java.lang.ClassCastException: optional binary CertificateChains 
(UTF8) is not a group
at org.apache.parquet.schema.Type.asGroupType(Type.java:202)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.org$apache$spark$sql$execution$datasources$parquet$ParquetReadSupport$$clipParquetType(ParquetReadSupport.scala:122)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1$$anonfun$apply$1.apply(ParquetReadSupport.scala:272)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1$$anonfun$apply$1.apply(ParquetReadSupport.scala:272)
at scala.Option.map(Option.scala:146)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1.apply(ParquetReadSupport.scala:272)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1.apply(ParquetReadSupport.scala:269)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.clipParquetGroupFields(ParquetReadSupport.scala:269)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.clipParquetSchema(ParquetReadSupport.scala:111)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport.init(ParquetReadSupport.scala:67)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:168)
at 
org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:192)
at 
org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:377)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:339)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$3$$anon$1.hasNext(InMemoryRelation.scala:151)
at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:213)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at 

[jira] [Created] (SPARK-18005) optional binary CertificateChains (UTF8) is not a group while loading a Dataframe

2016-10-19 Thread ABHISHEK CHOUDHARY (JIRA)
ABHISHEK CHOUDHARY created SPARK-18005:
--

 Summary: optional binary CertificateChains (UTF8) is not a group 
while loading a Dataframe
 Key: SPARK-18005
 URL: https://issues.apache.org/jira/browse/SPARK-18005
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.0.0
Reporter: ABHISHEK CHOUDHARY


In some scenarios, while loading a Parquet file, Spark throws the following exception:
java.lang.ClassCastException: optional binary CertificateChains (UTF8) is not a 
group

The entire DataFrame is not corrupted: I managed to load the first 20 rows of the 
data, but trying to load more rows throws the error, and any operation over the 
entire dataset (such as count) throws the same exception.



Full Exception Stack -
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in 
stage 594.0 failed 4 times, most recent failure: Lost task 2.3 in stage 594.0 
(TID 6726, ): java.lang.ClassCastException: optional binary CertificateChains 
(UTF8) is not a group
at org.apache.parquet.schema.Type.asGroupType(Type.java:202)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.org$apache$spark$sql$execution$datasources$parquet$ParquetReadSupport$$clipParquetType(ParquetReadSupport.scala:122)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1$$anonfun$apply$1.apply(ParquetReadSupport.scala:272)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1$$anonfun$apply$1.apply(ParquetReadSupport.scala:272)
at scala.Option.map(Option.scala:146)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1.apply(ParquetReadSupport.scala:272)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1.apply(ParquetReadSupport.scala:269)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.clipParquetGroupFields(ParquetReadSupport.scala:269)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.clipParquetSchema(ParquetReadSupport.scala:111)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport.init(ParquetReadSupport.scala:67)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:168)
at 
org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:192)
at 
org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:377)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:339)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$3$$anon$1.hasNext(InMemoryRelation.scala:151)
at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:213)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
at 

[jira] [Commented] (SPARK-10189) python rdd socket connection problem

2015-09-03 Thread ABHISHEK CHOUDHARY (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14729618#comment-14729618
 ] 

ABHISHEK CHOUDHARY commented on SPARK-10189:


Well, the problem was actually with the Java version.
PySpark raises the socket connection problem while using Java 1.8.
I tried with Java 1.7 and it works fine.
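
As a quick, illustrative way to confirm which JVM PySpark is actually running against (this relies on the internal _jvm py4j gateway of an existing SparkContext, so treat it as a debugging sketch rather than a stable API):

{code}
# Debugging sketch: print the Java version seen by the driver-side JVM.
print(sc._jvm.java.lang.System.getProperty("java.version"))
{code}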


> python rdd socket connection problem
> 
>
> Key: SPARK-10189
> URL: https://issues.apache.org/jira/browse/SPARK-10189
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.1
>Reporter: ABHISHEK CHOUDHARY
>  Labels: pyspark, socket
>
> I am trying to use wholeTextFiles with PySpark, and now I am getting the 
> same error:
> {code}
> textFiles = sc.wholeTextFiles('/file/content')
> textFiles.take(1)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py",
>  line 1277, in take
> res = self.context.runJob(self, takeUpToNumLeft, p, True)
>   File 
> "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py",
>  line 898, in runJob
> return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
>   File 
> "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py",
>  line 138, in _load_from_socket
> raise Exception("could not open socket")
> Exception: could not open socket
> >>> 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator
> java.net.SocketTimeoutException: Accept timed out
> at java.net.PlainSocketImpl.socketAccept(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404)
> at java.net.ServerSocket.implAccept(ServerSocket.java:545)
> at java.net.ServerSocket.accept(ServerSocket.java:513)
> at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623)
> {code}
> Current piece of code in rdd.py-
> {code:title=rdd.py|borderStyle=solid}
> def _load_from_socket(port, serializer):
>     sock = None
>     # Support for both IPv4 and IPv6.
>     # On most of IPv6-ready systems, IPv6 will take precedence.
>     for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
>         af, socktype, proto, canonname, sa = res
>         try:
>             sock = socket.socket(af, socktype, proto)
>             sock.settimeout(3)
>             sock.connect(sa)
>         except socket.error:
>             sock = None
>             continue
>         break
>     if not sock:
>         raise Exception("could not open socket")
>     try:
>         rf = sock.makefile("rb", 65536)
>         for item in serializer.load_stream(rf):
>             yield item
>     finally:
>         sock.close()
> {code}
> On further investigating the issue, I realized that in context.py, runJob is 
> not actually triggering the server, so there is nothing to connect to:
> {code:title=context.py|borderStyle=solid}
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
> {code}






[jira] [Updated] (SPARK-10189) python rdd socket connection problem

2015-08-30 Thread ABHISHEK CHOUDHARY (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ABHISHEK CHOUDHARY updated SPARK-10189:
---
Description: 
I am trying to use wholeTextFiles with PySpark, and now I am getting the same 
error:

```
textFiles = sc.wholeTextFiles('/file/content')
textFiles.take(1)
```

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1277, in take
    res = self.context.runJob(self, takeUpToNumLeft, p, True)
  File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py", line 898, in runJob
    return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
  File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py", line 138, in _load_from_socket
    raise Exception("could not open socket")
Exception: could not open socket
 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator
java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404)
at java.net.ServerSocket.implAccept(ServerSocket.java:545)
at java.net.ServerSocket.accept(ServerSocket.java:513)
at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623)
```

Current piece of code in rdd.py-

```
def _load_from_socket(port, serializer):
    sock = None
    # Support for both IPv4 and IPv6.
    # On most of IPv6-ready systems, IPv6 will take precedence.
    for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        try:
            sock = socket.socket(af, socktype, proto)
            sock.settimeout(3)
            sock.connect(sa)
        except socket.error:
            sock = None
            continue
        break
    if not sock:
        raise Exception("could not open socket")
    try:
        rf = sock.makefile("rb", 65536)
        for item in serializer.load_stream(rf):
            yield item
    finally:
        sock.close()
```


On further investigating the issue, I realized that in context.py, runJob is 
not actually triggering the server, so there is nothing to connect to:
```
port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
```
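
To make the expected handshake concrete, here is a toy sketch (plain Python sockets, not Spark code) of what should happen: the JVM side opens a listening server socket and hands back its port, and _load_from_socket then connects to that port to read the results. If the server side never starts, the connect/accept times out exactly as in the trace above.

```
# Toy illustration (not Spark code): a stand-in "JVM side" opens a server
# socket and returns its port; the "Python side" then connects to it.
import socket
import threading

def toy_jvm_side():
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("localhost", 0))       # let the OS pick a free port
    server.listen(1)

    def serve():
        conn, _ = server.accept()
        conn.sendall(b"rows")
        conn.close()

    threading.Thread(target=serve).start()
    return server.getsockname()[1]      # analogous to runJob returning a port

port = toy_jvm_side()
client = socket.create_connection(("localhost", port), timeout=3)
print(client.recv(16))                  # arrives only if the server side really started
client.close()
```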

  was:
I am trying to use wholeTextFiles with PySpark, and now I am getting the same 
error:

```
textFiles = sc.wholeTextFiles('/file/content')
textFiles.take(1)
```

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1277, in take
    res = self.context.runJob(self, takeUpToNumLeft, p, True)
  File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py", line 898, in runJob
    return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
  File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py", line 138, in _load_from_socket
    raise Exception("could not open socket")
Exception: could not open socket
 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator
java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404)
at java.net.ServerSocket.implAccept(ServerSocket.java:545)
at java.net.ServerSocket.accept(ServerSocket.java:513)
at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623)
```

Current piece of code in rdd.py-

```
def _load_from_socket(port, serializer):
    sock = None
    # Support for both IPv4 and IPv6.
    # On most of IPv6-ready systems, IPv6 will take precedence.
    for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        try:
            sock = socket.socket(af, socktype, proto)
            sock.settimeout(3)
            sock.connect(sa)
        except socket.error:
            sock = None
            continue
        break
    if not sock:
        raise Exception("could not open socket")
    try:
        rf = sock.makefile("rb", 65536)
        for item in serializer.load_stream(rf):
            yield item
    finally:
        sock.close()
```


 python rdd socket connection problem
 

 Key: SPARK-10189
 URL: https://issues.apache.org/jira/browse/SPARK-10189
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.1
Reporter: ABHISHEK CHOUDHARY
  Labels: pyspark, socket

 I am trying to use wholeTextFiles with PySpark, and now I am getting the 
 same error:
 ```
 textFiles = 

[jira] [Updated] (SPARK-10189) python rdd socket connection problem

2015-08-30 Thread ABHISHEK CHOUDHARY (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ABHISHEK CHOUDHARY updated SPARK-10189:
---
Description: 
I am trying to use wholeTextFiles with PySpark, and now I am getting the same 
error:

{code}
textFiles = sc.wholeTextFiles('/file/content')
textFiles.take(1)



Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1277, in take
    res = self.context.runJob(self, takeUpToNumLeft, p, True)
  File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py", line 898, in runJob
    return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
  File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py", line 138, in _load_from_socket
    raise Exception("could not open socket")
Exception: could not open socket
 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator
java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404)
at java.net.ServerSocket.implAccept(ServerSocket.java:545)
at java.net.ServerSocket.accept(ServerSocket.java:513)
at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623)
{code}

Current piece of code in rdd.py-

{code:title=rdd.py|borderStyle=solid}
def _load_from_socket(port, serializer):
    sock = None
    # Support for both IPv4 and IPv6.
    # On most of IPv6-ready systems, IPv6 will take precedence.
    for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        try:
            sock = socket.socket(af, socktype, proto)
            sock.settimeout(3)
            sock.connect(sa)
        except socket.error:
            sock = None
            continue
        break
    if not sock:
        raise Exception("could not open socket")
    try:
        rf = sock.makefile("rb", 65536)
        for item in serializer.load_stream(rf):
            yield item
    finally:
        sock.close()
{code}


On further investigating the issue, I realized that in context.py, runJob is 
not actually triggering the server, so there is nothing to connect to:
{code:title=context.py|borderStyle=solid}
port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
{code}

  was:
I am trying to use wholeTextFiles with PySpark, and now I am getting the same 
error:

{code}
textFiles = sc.wholeTextFiles('/file/content')
textFiles.take(1)



Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1277, in take
    res = self.context.runJob(self, takeUpToNumLeft, p, True)
  File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py", line 898, in runJob
    return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
  File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py", line 138, in _load_from_socket
    raise Exception("could not open socket")
Exception: could not open socket
 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator
java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404)
at java.net.ServerSocket.implAccept(ServerSocket.java:545)
at java.net.ServerSocket.accept(ServerSocket.java:513)
at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623)
{code}

Current piece of code in rdd.py-

{code:title=rdd.py|borderStyle=solid}
def _load_from_socket(port, serializer):
    sock = None
    # Support for both IPv4 and IPv6.
    # On most of IPv6-ready systems, IPv6 will take precedence.
    for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        try:
            sock = socket.socket(af, socktype, proto)
            sock.settimeout(3)
            sock.connect(sa)
        except socket.error:
            sock = None
            continue
        break
    if not sock:
        raise Exception("could not open socket")
    try:
        rf = sock.makefile("rb", 65536)
        for item in serializer.load_stream(rf):
            yield item
    finally:
        sock.close()
{code}


On further investigating the issue, I realized that in context.py, runJob is 
not actually triggering the server, so there is nothing to connect to:
```
port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
```


 python rdd socket connection problem
 

 Key: SPARK-10189
 URL: 

[jira] [Updated] (SPARK-10189) python rdd socket connection problem

2015-08-30 Thread ABHISHEK CHOUDHARY (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ABHISHEK CHOUDHARY updated SPARK-10189:
---
Description: 
I am trying to use wholeTextFiles with PySpark, and now I am getting the same 
error:

{code}
textFiles = sc.wholeTextFiles('/file/content')
textFiles.take(1)



Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1277, in take
    res = self.context.runJob(self, takeUpToNumLeft, p, True)
  File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py", line 898, in runJob
    return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
  File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py", line 138, in _load_from_socket
    raise Exception("could not open socket")
Exception: could not open socket
 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator
java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404)
at java.net.ServerSocket.implAccept(ServerSocket.java:545)
at java.net.ServerSocket.accept(ServerSocket.java:513)
at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623)
{code}

Current piece of code in rdd.py-

{code:title=rdd.py|borderStyle=solid}
def _load_from_socket(port, serializer):
    sock = None
    # Support for both IPv4 and IPv6.
    # On most of IPv6-ready systems, IPv6 will take precedence.
    for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        try:
            sock = socket.socket(af, socktype, proto)
            sock.settimeout(3)
            sock.connect(sa)
        except socket.error:
            sock = None
            continue
        break
    if not sock:
        raise Exception("could not open socket")
    try:
        rf = sock.makefile("rb", 65536)
        for item in serializer.load_stream(rf):
            yield item
    finally:
        sock.close()
{code}


On further investigating the issue, I realized that in context.py, runJob is 
not actually triggering the server, so there is nothing to connect to:
```
port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
```

  was:
I am trying to use wholeTextFiles with PySpark, and now I am getting the same 
error:

```
textFiles = sc.wholeTextFiles('/file/content')
textFiles.take(1)
```

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1277, in take
    res = self.context.runJob(self, takeUpToNumLeft, p, True)
  File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py", line 898, in runJob
    return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
  File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py", line 138, in _load_from_socket
    raise Exception("could not open socket")
Exception: could not open socket
 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator
java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404)
at java.net.ServerSocket.implAccept(ServerSocket.java:545)
at java.net.ServerSocket.accept(ServerSocket.java:513)
at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623)
```

Current piece of code in rdd.py-

```
def _load_from_socket(port, serializer):
    sock = None
    # Support for both IPv4 and IPv6.
    # On most of IPv6-ready systems, IPv6 will take precedence.
    for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        try:
            sock = socket.socket(af, socktype, proto)
            sock.settimeout(3)
            sock.connect(sa)
        except socket.error:
            sock = None
            continue
        break
    if not sock:
        raise Exception("could not open socket")
    try:
        rf = sock.makefile("rb", 65536)
        for item in serializer.load_stream(rf):
            yield item
    finally:
        sock.close()
```


On further investigating the issue, I realized that in context.py, runJob is 
not actually triggering the server, so there is nothing to connect to:
```
port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
```


 python rdd socket connection problem
 

 Key: SPARK-10189
 URL: https://issues.apache.org/jira/browse/SPARK-10189
 Project: Spark
  

[jira] [Created] (SPARK-10189) python rdd socket connection problem

2015-08-24 Thread ABHISHEK CHOUDHARY (JIRA)
ABHISHEK CHOUDHARY created SPARK-10189:
--

 Summary: python rdd socket connection problem
 Key: SPARK-10189
 URL: https://issues.apache.org/jira/browse/SPARK-10189
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.1
Reporter: ABHISHEK CHOUDHARY


I am trying to use wholeTextFiles with PySpark, and now I am getting the same 
error:

```
textFiles = sc.wholeTextFiles('/file/content')
textFiles.take(1)
```

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1277, in take
    res = self.context.runJob(self, takeUpToNumLeft, p, True)
  File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/context.py", line 898, in runJob
    return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
  File "/Volumes/work/bigdata/CHD5.4/spark-1.4.1-bin-hadoop2.6/python/pyspark/rdd.py", line 138, in _load_from_socket
    raise Exception("could not open socket")
Exception: could not open socket
 15/08/24 20:09:27 ERROR PythonRDD: Error while sending iterator
java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404)
at java.net.ServerSocket.implAccept(ServerSocket.java:545)
at java.net.ServerSocket.accept(ServerSocket.java:513)
at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:623)
```

Current piece of code in rdd.py-

```
def _load_from_socket(port, serializer):
    sock = None
    # Support for both IPv4 and IPv6.
    # On most of IPv6-ready systems, IPv6 will take precedence.
    for res in socket.getaddrinfo("localhost", port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        af, socktype, proto, canonname, sa = res
        try:
            sock = socket.socket(af, socktype, proto)
            sock.settimeout(3)
            sock.connect(sa)
        except socket.error:
            sock = None
            continue
        break
    if not sock:
        raise Exception("could not open socket")
    try:
        rf = sock.makefile("rb", 65536)
        for item in serializer.load_stream(rf):
            yield item
    finally:
        sock.close()
```






[jira] [Closed] (SPARK-8296) Not able to load Dataframe using Python throws py4j.protocol.Py4JJavaError

2015-06-11 Thread ABHISHEK CHOUDHARY (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ABHISHEK CHOUDHARY closed SPARK-8296.
-
   Resolution: Done
Fix Version/s: 1.3.1

When I debugged, I found that Spark was receiving the wrong Hadoop URL, a minor 
mistake in the configuration, but the error stack trace didn't reveal that.

So it's not a bug; it's a configuration issue.
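
For anyone hitting a similarly opaque Py4JJavaError, a small illustrative check of which filesystem URL the driver is actually configured with (this uses the internal _jsc handle of an existing SparkContext, so treat it as a debugging sketch, not a stable API):

{code}
# Debugging sketch: print the default filesystem the driver resolves paths
# against; a wrong value here points at a configuration problem rather than a
# Spark bug.
print(sc._jsc.hadoopConfiguration().get("fs.defaultFS"))
{code}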

 Not able to load Dataframe using Python throws py4j.protocol.Py4JJavaError
 --

 Key: SPARK-8296
 URL: https://issues.apache.org/jira/browse/SPARK-8296
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.3.1
 Environment: MAC OS
Reporter: ABHISHEK CHOUDHARY
  Labels: test
 Fix For: 1.3.1


 While trying to load a JSON file using SQLContext in the prebuilt 
 spark-1.3.1-bin-hadoop2.4 version, it throws py4j.protocol.Py4JJavaError:
 from pyspark.sql import SQLContext
 from pyspark import SparkContext
 sc = SparkContext()
 sqlContext = SQLContext(sc)
 # Create the DataFrame
 df = sqlContext.jsonFile("changes.json")
 # Show the content of the DataFrame
 df.show()
 Error thrown -
   File "/Users/abhishekchoudhary/Work/python/evolveML/kaggle/avirto/test.py", 
 line 11, in <module>
 df = sqlContext.jsonFile("changes.json")
   File 
 /Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/pyspark/sql/context.py,
  line 377, in jsonFile
 df = self._ssql_ctx.jsonFile(path, samplingRatio)
   File 
 /Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py,
  line 538, in __call__
   File 
 /Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py,
  line 300, in get_return_value
 py4j.protocol.Py4JJavaError
 On checking through the source code, I found that 'gateway_client' is not 
 valid .






[jira] [Updated] (SPARK-8296) Not able to load Dataframe using Python throws py4j.protocol.Py4JJavaError

2015-06-11 Thread ABHISHEK CHOUDHARY (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ABHISHEK CHOUDHARY updated SPARK-8296:
--
Environment: MAC OS

 Not able to load Dataframe using Python throws py4j.protocol.Py4JJavaError
 --

 Key: SPARK-8296
 URL: https://issues.apache.org/jira/browse/SPARK-8296
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.3.1
 Environment: MAC OS
Reporter: ABHISHEK CHOUDHARY
  Labels: test

 While trying to load a JSON file using SQLContext in the prebuilt 
 spark-1.3.1-bin-hadoop2.4 version, it throws py4j.protocol.Py4JJavaError:
 from pyspark.sql import SQLContext
 from pyspark import SparkContext
 sc = SparkContext()
 sqlContext = SQLContext(sc)
 # Create the DataFrame
 df = sqlContext.jsonFile("changes.json")
 # Show the content of the DataFrame
 df.show()
 Error thrown -
   File "/Users/abhishekchoudhary/Work/python/evolveML/kaggle/avirto/test.py", 
 line 11, in <module>
 df = sqlContext.jsonFile("changes.json")
   File 
 /Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/pyspark/sql/context.py,
  line 377, in jsonFile
 df = self._ssql_ctx.jsonFile(path, samplingRatio)
   File 
 /Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py,
  line 538, in __call__
   File 
 /Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py,
  line 300, in get_return_value
 py4j.protocol.Py4JJavaError
 On checking through the source code, I found that 'gateway_client' is not 
 valid .






[jira] [Created] (SPARK-8296) Not able to load Dataframe using Python throws py4j.protocol.Py4JJavaError

2015-06-10 Thread ABHISHEK CHOUDHARY (JIRA)
ABHISHEK CHOUDHARY created SPARK-8296:
-

 Summary: Not able to load Dataframe using Python throws 
py4j.protocol.Py4JJavaError
 Key: SPARK-8296
 URL: https://issues.apache.org/jira/browse/SPARK-8296
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.3.1
Reporter: ABHISHEK CHOUDHARY


While trying to load a JSON file using SQLContext in the prebuilt 
spark-1.3.1-bin-hadoop2.4 version, it throws py4j.protocol.Py4JJavaError:

from pyspark.sql import SQLContext
from pyspark import SparkContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# Create the DataFrame
df = sqlContext.jsonFile("changes.json")

# Show the content of the DataFrame
df.show()

Error thrown -

  File "/Users/abhishekchoudhary/Work/python/evolveML/kaggle/avirto/test.py", line 11, in <module>
    df = sqlContext.jsonFile("changes.json")
  File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/pyspark/sql/context.py", line 377, in jsonFile
    df = self._ssql_ctx.jsonFile(path, samplingRatio)
  File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError







[jira] [Updated] (SPARK-8296) Not able to load Dataframe using Python throws py4j.protocol.Py4JJavaError

2015-06-10 Thread ABHISHEK CHOUDHARY (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ABHISHEK CHOUDHARY updated SPARK-8296:
--
Description: 
While trying to load a JSON file using SQLContext in the prebuilt 
spark-1.3.1-bin-hadoop2.4 version, it throws py4j.protocol.Py4JJavaError:

from pyspark.sql import SQLContext
from pyspark import SparkContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# Create the DataFrame
df = sqlContext.jsonFile("changes.json")

# Show the content of the DataFrame
df.show()

Error thrown -

  File "/Users/abhishekchoudhary/Work/python/evolveML/kaggle/avirto/test.py", line 11, in <module>
    df = sqlContext.jsonFile("changes.json")
  File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/pyspark/sql/context.py", line 377, in jsonFile
    df = self._ssql_ctx.jsonFile(path, samplingRatio)
  File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError


On checking through the source code, I found that 'gateway_client' is not valid.

  was:
While trying to load a JSON file using SQLContext in the prebuilt 
spark-1.3.1-bin-hadoop2.4 version, it throws py4j.protocol.Py4JJavaError:

from pyspark.sql import SQLContext
from pyspark import SparkContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# Create the DataFrame
df = sqlContext.jsonFile("changes.json")

# Show the content of the DataFrame
df.show()

Error thrown -

  File "/Users/abhishekchoudhary/Work/python/evolveML/kaggle/avirto/test.py", line 11, in <module>
    df = sqlContext.jsonFile("changes.json")
  File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/pyspark/sql/context.py", line 377, in jsonFile
    df = self._ssql_ctx.jsonFile(path, samplingRatio)
  File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError



 Not able to load Dataframe using Python throws py4j.protocol.Py4JJavaError
 --

 Key: SPARK-8296
 URL: https://issues.apache.org/jira/browse/SPARK-8296
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.3.1
Reporter: ABHISHEK CHOUDHARY
  Labels: test

 While trying to load a JSON file using SQLContext in the prebuilt 
 spark-1.3.1-bin-hadoop2.4 version, it throws py4j.protocol.Py4JJavaError:
 from pyspark.sql import SQLContext
 from pyspark import SparkContext
 sc = SparkContext()
 sqlContext = SQLContext(sc)
 # Create the DataFrame
 df = sqlContext.jsonFile("changes.json")
 # Show the content of the DataFrame
 df.show()
 Error thrown -
   File "/Users/abhishekchoudhary/Work/python/evolveML/kaggle/avirto/test.py", 
 line 11, in <module>
 df = sqlContext.jsonFile("changes.json")
   File 
 /Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/pyspark/sql/context.py,
  line 377, in jsonFile
 df = self._ssql_ctx.jsonFile(path, samplingRatio)
   File 
 /Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py,
  line 538, in __call__
   File 
 /Users/abhishekchoudhary/bigdata/cdh5.2.0/spark-1.3.1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py,
  line 300, in get_return_value
 py4j.protocol.Py4JJavaError
 On checking through the source code, I found that 'gateway_client' is not 
 valid .






[jira] [Updated] (SPARK-7054) Spark jobs hang for ~15 mins when a node goes down

2015-04-22 Thread Abhishek Choudhary (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Choudhary updated SPARK-7054:
--
Summary: Spark jobs hang for ~15 mins when a node goes down  (was: Spark 
joobs hang for ~15 mins when a node goes down)

 Spark jobs hang for ~15 mins when a node goes down
 --

 Key: SPARK-7054
 URL: https://issues.apache.org/jira/browse/SPARK-7054
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1
 Environment: Cent OS - 6 ,Java 8
Reporter: Abhishek Choudhary
Priority: Blocker

 In a four-node cluster (on VMs) with 2 NameNodes and 2 DataNodes and 10 
 executors (YARN 2.4), Spark jobs run in yarn-client mode. When a 
 running VM is shut down, Spark jobs hang for ~15 mins.
 After ~45-50 seconds, the driver got information about the lost block managers.
 From the logs:
 2015-04-22 09:50:30,000 [sparkDriver-akka.actor.default-dispatcher-15] WARN  
 org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager 
 BlockManagerId(9, ACUME-DN2, 40898) with no recent heart beats: 59674ms 
 exceeds 45000ms
 2015-04-22 09:50:30,000 [sparkDriver-akka.actor.default-dispatcher-15] WARN  
 org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager 
 BlockManagerId(5, ACUME-DN2, 37947) with no recent heart beats: 60044ms 
 exceeds 45000ms
 2015-04-22 09:50:30,001 [sparkDriver-akka.actor.default-dispatcher-15] WARN  
 org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager 
 BlockManagerId(3, ACUME-DN2, 49808) with no recent heart beats: 54637ms 
 exceeds 45000ms
 2015-04-22 09:50:30,001 [sparkDriver-akka.actor.default-dispatcher-15] WARN  
 org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager 
 BlockManagerId(1, ACUME-DN2, 44090) with no recent heart beats: 59049ms 
 exceeds 45000ms
 2015-04-22 09:50:30,001 [sparkDriver-akka.actor.default-dispatcher-15] WARN  
 org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager 
 BlockManagerId(7, ACUME-DN2, 47267) with no recent heart beats: 56879ms 
 exceeds 45000ms
 After ~15 mins, the Spark driver got the executor-lost event and rescheduled 
 the failed tasks.
 From the logs:
 2015-04-22 10:05:04,965 [sparkDriver-akka.actor.default-dispatcher-19] ERROR 
 org.apache.spark.scheduler.cluster.YarnClientClusterScheduler - Lost executor 
 1 on ACUME-DN2: remote Akka client disassociated
 For these 15 mins, all the jobs were stuck for executors running on the 
 shut-down VM.






[jira] [Created] (SPARK-7054) Spark joobs hang for ~15 mins when a node goes down

2015-04-22 Thread Abhishek Choudhary (JIRA)
Abhishek Choudhary created SPARK-7054:
-

 Summary: Spark joobs hang for ~15 mins when a node goes down
 Key: SPARK-7054
 URL: https://issues.apache.org/jira/browse/SPARK-7054
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1
 Environment: Cent OS - 6 ,Java 8
Reporter: Abhishek Choudhary
Priority: Blocker


In a four-node cluster (on VMs) with 2 NameNodes and 2 DataNodes and 10 
executors (YARN 2.4), Spark jobs run in yarn-client mode. When a running 
VM is shut down, Spark jobs hang for ~15 mins.

After ~45-50 seconds, the driver got information about the lost block managers.
From the logs:

2015-04-22 09:50:30,000 [sparkDriver-akka.actor.default-dispatcher-15] WARN  
org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager 
BlockManagerId(9, ACUME-DN2, 40898) with no recent heart beats: 59674ms exceeds 
45000ms
2015-04-22 09:50:30,000 [sparkDriver-akka.actor.default-dispatcher-15] WARN  
org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager 
BlockManagerId(5, ACUME-DN2, 37947) with no recent heart beats: 60044ms exceeds 
45000ms
2015-04-22 09:50:30,001 [sparkDriver-akka.actor.default-dispatcher-15] WARN  
org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager 
BlockManagerId(3, ACUME-DN2, 49808) with no recent heart beats: 54637ms exceeds 
45000ms
2015-04-22 09:50:30,001 [sparkDriver-akka.actor.default-dispatcher-15] WARN  
org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager 
BlockManagerId(1, ACUME-DN2, 44090) with no recent heart beats: 59049ms exceeds 
45000ms
2015-04-22 09:50:30,001 [sparkDriver-akka.actor.default-dispatcher-15] WARN  
org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager 
BlockManagerId(7, ACUME-DN2, 47267) with no recent heart beats: 56879ms exceeds 
45000ms


After ~15 mins, the Spark driver got the executor-lost event and rescheduled the 
failed tasks.

From the logs:

2015-04-22 10:05:04,965 [sparkDriver-akka.actor.default-dispatcher-19] ERROR 
org.apache.spark.scheduler.cluster.YarnClientClusterScheduler - Lost executor 1 
on ACUME-DN2: remote Akka client disassociated

For these 15 mins, all the jobs were stuck for executors running on the shut-down VM.


