Installing packages with pyspark
Hi all, I had a couple of questions. 1. Is there documentation on how to add graphframes, or any other package for that matter, on the Google Dataproc managed Spark clusters? 2. Is there a way to add a package to an existing PySpark context through a Jupyter notebook? --aj
Re: Installing packages with pyspark
Thanks Jakob, Felix. I am aware you can do it with --packages, but I was wondering if there is a way to do something like "!pip install <package>" as I do for other Python packages from a Jupyter notebook. But I guess I cannot add a package once I launch the PySpark context, right?

On Thu, Mar 17, 2016 at 6:59 PM Felix Cheung wrote:
> For some, like graphframes, that are Spark packages, you could also use
> --packages in the command line of spark-submit or pyspark. See
> http://spark.apache.org/docs/latest/submitting-applications.html
>
> From: Jakob Odersky
> Sent: Thursday, March 17, 2016 6:40 PM
> Subject: Re: Installing packages with pyspark
> To: Ajinkya Kale
>
> Hi,
> regarding 1, packages are resolved locally. That means that when you
> specify a package, spark-submit will resolve the dependencies and
> download any jars on the local machine, before shipping* them to the
> cluster. So, without a priori knowledge of dataproc clusters, it
> should be no different to specify packages.
>
> Unfortunately I can't help with 2.
>
> --Jakob
>
> *shipping in this case means making them available via the network
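To answer the Jupyter part concretely: --packages must be supplied before the SparkContext is created, but a notebook can do that through the PYSPARK_SUBMIT_ARGS environment variable. A minimal sketch, assuming the graphframes coordinates below are right for your Spark version (they are illustrative; check spark-packages.org):

    import os

    # Must be set before the SparkContext is created; it has no effect on an
    # already-running context. The trailing "pyspark-shell" token is required.
    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--packages graphframes:graphframes:0.1.0-spark1.6 pyspark-shell"
    )

    from pyspark import SparkContext
    sc = SparkContext(appName="graphframes-demo")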
HBase 0.98.0 with Spark 1.5.3 issue in yarn-cluster mode
I have posted this on the hbase user list, but I thought it makes more sense on the spark user list. I am able to read the table in yarn-client mode from spark-shell, but I have exhausted all online forums for options to get it working in yarn-cluster mode through spark-submit. I am using this code example http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase to read an HBase table using Spark, with the only change of adding hbase.zookeeper.quorum through code, as it is not being picked up from hbase-site.xml.

Spark 1.5.3
HBase 0.98.0

Facing this error -

16/01/20 12:56:59 WARN client.ConnectionManager$HConnectionImplementation: Encountered problems when prefetch hbase:meta table:
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=3, exceptions:
Wed Jan 20 12:56:58 GMT-07:00 2016, org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, java.lang.IllegalAccessError: class com.google.protobuf.HBaseZeroCopyByteString cannot access its superclass com.google.protobuf.LiteralByteString
Wed Jan 20 12:56:58 GMT-07:00 2016, org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString
Wed Jan 20 12:56:59 GMT-07:00 2016, org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString

at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:136)
at org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:751)
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:147)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.prefetchRegionCache(ConnectionManager.java:1215)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1280)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1128)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1070)
at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:347)
at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:201)
at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:159)
at org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:111)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1281)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
at org.apache.spark.rdd.RDD.take(RDD.scala:1276)

I tried adding the hbase-protocol jar in spark-defaults.conf and in the driver classpath as suggested here http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-IllegalAccessError-class-com-google-protobuf-HBaseZeroCopyByteString-cannot-access-its-supg-td24303.html but no success. Any suggestions?
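For context, the linked example reduces to roughly the following, with the ZooKeeper quorum set in code since hbase-site.xml is not being picked up. A sketch only; the table name and ZooKeeper hostnames are placeholders:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    val hbaseConf = HBaseConfiguration.create()
    // Set in code because the cluster is not picking it up from hbase-site.xml.
    hbaseConf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com")
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")

    // Each row comes back as a (row key, Result) pair.
    val hbaseRdd = sc.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println(hbaseRdd.count())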
Re: HBase 0.98.0 with Spark 1.5.3 issue in yarn-cluster mode
Hi Ted, Thanks for responding. Is there a workaround for 0.98.0? Adding the hbase-protocol jar to HADOOP_CLASSPATH didn't work for me.

On Wed, Jan 20, 2016 at 6:14 PM Ted Yu wrote:
> 0.98.0 didn't have the fix from HBASE-8
>
> Please upgrade your hbase version and try again.
>
> If there is still a problem, please pastebin the stack trace.
>
> Thanks
Re: HBase 0.98.0 with Spark 1.5.3 issue in yarn-cluster mode
Unfortunately I cannot at this moment (not a decision I can make) :(

On Wed, Jan 20, 2016 at 6:46 PM Ted Yu wrote:
> I am not aware of a workaround.
>
> Can you upgrade to a 0.98.4+ release?
>
> Cheers
Re: HBase 0.98.0 with Spark 1.5.3 issue in yarn-cluster mode
Is this issue present only when the computations are distributed? If I do (pseudo code):

rdd.collect.call_to_hbase

I don't get this error, but if I do:

rdd.call_to_hbase.collect

it throws this error.
Re: HBase 0.98.0 with Spark 1.5.3 issue in yarn-cluster mode
Hi Ted,
Is there a way for the executors to have the hbase-protocol jar on their classpath?

On Fri, Jan 22, 2016 at 4:00 PM Ted Yu wrote:
> The class path formations on driver and executors are different.
>
> Cheers
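One way to get a jar onto both classpaths is the extraClassPath settings rather than --jars (--jars ships the jar with the application but does not necessarily put it at the front of the JVM classpath). A sketch of the spark-submit flags, with placeholder paths; the jar must exist at that path on every node, and whether this cures the 0.98.0 protobuf IllegalAccessError is an assumption:

    spark-submit --master yarn-cluster \
      --conf spark.driver.extraClassPath=/path/to/hbase-protocol-0.98.0-hadoop2.jar \
      --conf spark.executor.extraClassPath=/path/to/hbase-protocol-0.98.0-hadoop2.jar \
      --class com.example.MyHBaseJob my-hbase-job.jar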
Re: HBase 0.98.0 with Spark 1.5.3 issue in yarn-cluster mode
I tried --jars, which supposedly does that, but it did not work.
Reading multiple avro files from a dir - Spark 1.5.1
Trying to load avro from hdfs. I have around 1000 part avro files in a dir. I am using this to read them -

val df = sqlContext.read.format("com.databricks.spark.avro").load("path/to/avro/dir")
df.select("QUERY").take(50).foreach(println)

It works if I pass only 1 or 2 avro files in the path, but if I pass a dir with 400+ files I get the error below. Each avro file is around 300 MB.

org.apache.avro.AvroRuntimeException: java.io.IOException: Filesystem closed
at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:275)
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:197)
at org.apache.avro.mapred.AvroRecordReader.next(AvroRecordReader.java:64)
at org.apache.avro.mapred.AvroRecordReader.next(AvroRecordReader.java:32)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:248)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:216)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at com.databricks.spark.avro.AvroRelation$$anonfun$buildScan$1$$anonfun$4$$anon$1.advanceNextRecord(AvroRelation.scala:157)
at com.databricks.spark.avro.AvroRelation$$anonfun$buildScan$1$$anonfun$4$$anon$1.hasNext(AvroRelation.scala:166)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:413)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
at scala.collection.AbstractIterator.to(Iterator.scala:1194)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:905)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:905)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:776)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.avro.mapred.FsInput.read(FsInput.java:46)
at org.apache.avro.file.DataFileReader$SeekableInputStream.read(DataFileReader.java:210)
at org.apache.avro.io.BinaryDecoder$InputStreamByteSource.tryReadRaw(BinaryDecoder.java:839)
at org.apache.avro.io.BinaryDecoder.isEnd(BinaryDecoder.java:444)
at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:261)
... 36 more
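"Filesystem closed" usually means Hadoop's shared, cached FileSystem instance was closed by one reader while other tasks were still using it. A workaround sometimes suggested is to disable that cache before reading; whether it applies to this case is an assumption:

    // Give each reader its own FileSystem instance instead of the shared cached one.
    sc.hadoopConfiguration.setBoolean("fs.hdfs.impl.disable.cache", true)

    val df = sqlContext.read.format("com.databricks.spark.avro").load("path/to/avro/dir")
    df.select("QUERY").take(50).foreach(println)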
Re: Logistic Regression using ML Pipeline
Please take a look at the example here: http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline

On Thu, Feb 18, 2016 at 9:27 PM Arunkumar Pillai wrote:
> Hi
>
> I'm trying to build a logistic regression using the ML Pipeline:
>
> val lr = new LogisticRegression()
> lr.setFitIntercept(true)
> lr.setMaxIter(100)
> val model = lr.fit(data)
> println(model.summary)
>
> I'm getting coefficients but am not able to get the predicted and
> probability values.
>
> Please help
>
> --
> Thanks and Regards
> Arun
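To spell out the part the question is after: the fitted model is a Transformer, so the per-row predictions and probabilities come from transform(), not from the training summary. A minimal sketch continuing the quoted snippet ("prediction", "probability" and "rawPrediction" are the default output column names):

    // model is the LogisticRegressionModel returned by lr.fit(data)
    val predictions = model.transform(data)

    // One row per input row, with the predicted label and the class probabilities.
    predictions.select("prediction", "probability").show(5)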
Saving a pyspark.ml.feature.PCA model
Is there a way to save a pyspark.ml.feature.PCA model? I know mllib has that, but mllib does not have PCA afaik. How do people do model persistence for inference using the pyspark ml models? I did not find any documentation on model persistence for ml.

--ajinkya
Re: Saving a pyspark.ml.feature.PCA model
I am using Google Cloud Dataproc, which comes with Spark 1.6.1, so an upgrade is not really an option. Is there no way / hack to save the models in Spark 1.6.1?

On Tue, Jul 19, 2016 at 8:13 PM Shuai Lin wrote:
> It's added in the not-yet-released 2.0.0 version.
>
> https://issues.apache.org/jira/browse/SPARK-13036
> https://github.com/apache/spark/commit/83302c3b
>
> So I guess you need to wait for the 2.0 release (or use the current rc4).
Re: Saving a pyspark.ml.feature.PCA model
Just found that Google Dataproc has a preview of Spark 2.0. Tried it, and save/load works! Thanks Shuai.
Follow-up question - is there a way to export the pyspark.ml models to PMML? If not, what is the best way to integrate the model for inference in a production service?
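For reference, the Spark 2.0 save/load round trip looks roughly like this (the path and column names are placeholders):

    from pyspark.ml.feature import PCA, PCAModel

    pca = PCA(k=3, inputCol="features", outputCol="pca_features")
    model = pca.fit(df)

    # Persist the fitted model, then load it back later for inference.
    model.save("hdfs:///models/pca_model")
    same_model = PCAModel.load("hdfs:///models/pca_model")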