installing packages with pyspark

2016-03-19 Thread Ajinkya Kale
Hi all,

I had a couple of questions.
1. Is there documentation on how to add the graphframes package (or any other
package, for that matter) on Google Dataproc managed Spark clusters?

2. Is there a way to add a package to an existing pyspark context through a
Jupyter notebook?

--aj


Re: installing packages with pyspark

2016-03-19 Thread Ajinkya Kale
Thanks Jakob, Felix. I am aware you can do it with --packages, but I was
wondering if there is a way to do something like "!pip install " the way I do
for other Python packages from a Jupyter notebook. But I guess I cannot add a
package once I launch the pyspark context, right?
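
A minimal sketch of one way to do this from a notebook: declare the package
before the pyspark context is created, since it cannot be added once the
context exists. The PYSPARK_SUBMIT_ARGS route below is a common approach rather
than something confirmed in this thread, and the graphframes coordinates are
illustrative; pick the build matching your Spark and Scala versions.

    import os

    # Packages must be declared before the JVM gateway starts; they cannot be
    # added to an already-running SparkContext.
    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--packages graphframes:graphframes:0.1.0-spark1.6-s_2.10 pyspark-shell")

    from pyspark import SparkConf, SparkContext
    sc = SparkContext(conf=SparkConf().setAppName("graphframes-demo"))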

On Thu, Mar 17, 2016 at 6:59 PM Felix Cheung 
wrote:

> For some, like graphframes that are Spark packages, you could also use
> --packages in the command line of spark-submit or pyspark. See
> http://spark.apache.org/docs/latest/submitting-applications.html
>
> _
> From: Jakob Odersky 
> Sent: Thursday, March 17, 2016 6:40 PM
> Subject: Re: installing packages with pyspark
> To: Ajinkya Kale 
> Cc: 
>
>
> Hi,
> regarding 1, packages are resolved locally. That means that when you
> specify a package, spark-submit will resolve the dependencies and
> download any jars on the local machine, before shipping* them to the
> cluster. So, without a priori knowledge of dataproc clusters, it
> should be no different to specify packages.
>
> Unfortunately I can't help with 2.
>
> --Jakob
>
> *shipping in this case means making them available via the network
>
> On Thu, Mar 17, 2016 at 5:36 PM, Ajinkya Kale 
> wrote:
> > Hi all,
> >
> > I had couple of questions.
> > 1. Is there documentation on how to add the graphframes or any other
> package
> > for that matter on the google dataproc managed spark clusters ?
> >
> > 2. Is there a way to add a package to an existing pyspark context
> through a
> > jupyter notebook ?
> >
> > --aj
>


HBase 0.98.0 with Spark 1.5.3 issue in yarn-cluster mode

2016-01-20 Thread Ajinkya Kale
I have posted this on the hbase user list but I thought it makes more sense on
the spark user list.
I am able to read the table in yarn-client mode from spark-shell, but I have
exhausted all online forums for options to get it working in
yarn-cluster mode through spark-submit.

I am using this code example
http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase
to read an HBase table using Spark, with the only change being that I set
hbase.zookeeper.quorum in code because it is not picked up from
hbase-site.xml.

Spark 1.5.3

HBase 0.98.0


Facing this error -

 16/01/20 12:56:59 WARN client.ConnectionManager$HConnectionImplementation: Encountered problems when prefetch hbase:meta table:
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=3, exceptions:
Wed Jan 20 12:56:58 GMT-07:00 2016, org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, java.lang.IllegalAccessError: class com.google.protobuf.HBaseZeroCopyByteString cannot access its superclass com.google.protobuf.LiteralByteString
Wed Jan 20 12:56:58 GMT-07:00 2016, org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString
Wed Jan 20 12:56:59 GMT-07:00 2016, org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString

at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:136)
at org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:751)
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:147)
at 
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.prefetchRegionCache(ConnectionManager.java:1215)
at 
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1280)
at 
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1128)
at 
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:)
at 
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1070)
at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:347)
at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:201)
at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:159)
at 
org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:111)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1281)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
at org.apache.spark.rdd.RDD.take(RDD.scala:1276)

I tried adding the hbase-protocol jar in spark-defaults.conf and on the
driver classpath as suggested here
http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-IllegalAccessError-class-com-google-protobuf-HBaseZeroCopyByteString-cannot-access-its-supg-td24303.html
but with no success.
Any suggestions?


Re: HBase 0.98.0 with Spark 1.5.3 issue in yarn-cluster mode

2016-01-20 Thread Ajinkya Kale
Hi Ted,

Thanks for responding.
Is there a workaround for 0.98.0? Adding the hbase-protocol jar to
HADOOP_CLASSPATH didn't work for me.

On Wed, Jan 20, 2016 at 6:14 PM Ted Yu  wrote:

> 0.98.0 didn't have fix from HBASE-8
>
> Please upgrade your hbase version and try again.
>
> If still there is problem, please pastebin the stack trace.
>
> Thanks
>
> On Wed, Jan 20, 2016 at 5:41 PM, Ajinkya Kale 
> wrote:
>
>>
>> I have posted this on hbase user list but i thought makes more sense on
>> spark user list.
>> I am able to read the table in yarn-client mode from spark-shell but I
>> have exhausted all online forums for options to get it working in the
>> yarn-cluster mode through spark-submit.
>>
>> I am using this code-example
>> http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase
>>  to
>> read a hbase table using Spark with the only change of adding the
>> hbase.zookeeper.quorum through code as it is not picking it from the
>> hbase-site.xml.
>>
>> Spark 1.5.3
>>
>> HBase 0.98.0
>>
>>
>> Facing this error -
>>
>>  16/01/20 12:56:59 WARN client.ConnectionManager$HConnectionImplementation: 
>> Encountered problems when prefetch hbase:meta table:
>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
>> attempts=3, exceptions:Wed Jan 20 12:56:58 GMT-07:00 2016, 
>> org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
>> java.lang.IllegalAccessError: class 
>> com.google.protobuf.HBaseZeroCopyByteString cannot access its superclass 
>> com.google.protobuf.LiteralByteStringWed Jan 20 12:56:58 GMT-07:00 2016, 
>> org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
>> java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteStringWed 
>> Jan 20 12:56:59 GMT-07:00 2016, 
>> org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
>> java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString
>>
>> at 
>> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:136)
>> at org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:751)
>> at 
>> org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:147)
>> at 
>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.prefetchRegionCache(ConnectionManager.java:1215)
>> at 
>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1280)
>> at 
>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1128)
>> at 
>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:)
>> at 
>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1070)
>> at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:347)
>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:201)
>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:159)
>> at 
>> org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
>> at 
>> org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:111)
>> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>> at scala.Option.getOrElse(Option.scala:120)
>> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>> at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1281)
>> at 
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>> at 
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>> at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
>> at org.apache.spark.rdd.RDD.take(RDD.scala:1276)
>>
>> I tried adding the hbase protocol jar on spar-defaults.conf and in the
>> driver-classpath as suggested here
>> http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-IllegalAccessError-class-com-google-protobuf-HBaseZeroCopyByteString-cannot-access-its-supg-td24303.html
>>  but
>> no success.
>> Any suggestions ?
>>
>>
>


Re: HBase 0.98.0 with Spark 1.5.3 issue in yarn-cluster mode

2016-01-20 Thread Ajinkya Kale
Unfortunately I cannot at this moment (not a decision I can make) :(

On Wed, Jan 20, 2016 at 6:46 PM Ted Yu  wrote:

> I am not aware of a workaround.
>
> Can you upgrade to 0.98.4+ release ?
>
> Cheers
>
> On Wed, Jan 20, 2016 at 6:26 PM, Ajinkya Kale 
> wrote:
>
>> Hi Ted,
>>
>> Thanks for responding.
>> Is there a work around for 0.98.0 ? Adding the hbase-protocol jar to
>> HADOOP_CLASSPATH didnt work for me.
>>
>> On Wed, Jan 20, 2016 at 6:14 PM Ted Yu  wrote:
>>
>>> 0.98.0 didn't have fix from HBASE-8
>>>
>>> Please upgrade your hbase version and try again.
>>>
>>> If still there is problem, please pastebin the stack trace.
>>>
>>> Thanks
>>>
>>> On Wed, Jan 20, 2016 at 5:41 PM, Ajinkya Kale 
>>> wrote:
>>>
>>>>
>>>> I have posted this on hbase user list but i thought makes more sense on
>>>> spark user list.
>>>> I am able to read the table in yarn-client mode from spark-shell but I
>>>> have exhausted all online forums for options to get it working in the
>>>> yarn-cluster mode through spark-submit.
>>>>
>>>> I am using this code-example
>>>> http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase
>>>>  to
>>>> read a hbase table using Spark with the only change of adding the
>>>> hbase.zookeeper.quorum through code as it is not picking it from the
>>>> hbase-site.xml.
>>>>
>>>> Spark 1.5.3
>>>>
>>>> HBase 0.98.0
>>>>
>>>>
>>>> Facing this error -
>>>>
>>>>  16/01/20 12:56:59 WARN 
>>>> client.ConnectionManager$HConnectionImplementation: Encountered problems 
>>>> when prefetch hbase:meta table:
>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
>>>> attempts=3, exceptions:Wed Jan 20 12:56:58 GMT-07:00 2016, 
>>>> org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
>>>> java.lang.IllegalAccessError: class 
>>>> com.google.protobuf.HBaseZeroCopyByteString cannot access its superclass 
>>>> com.google.protobuf.LiteralByteStringWed Jan 20 12:56:58 GMT-07:00 2016, 
>>>> org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
>>>> java.lang.IllegalAccessError: 
>>>> com/google/protobuf/HBaseZeroCopyByteStringWed Jan 20 12:56:59 GMT-07:00 
>>>> 2016, org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
>>>> java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString
>>>>
>>>> at 
>>>> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:136)
>>>> at 
>>>> org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:751)
>>>> at 
>>>> org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:147)
>>>> at 
>>>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.prefetchRegionCache(ConnectionManager.java:1215)
>>>> at 
>>>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1280)
>>>> at 
>>>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1128)
>>>> at 
>>>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:)
>>>> at 
>>>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1070)
>>>> at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:347)
>>>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:201)
>>>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:159)
>>>> at 
>>>> org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:101)
>>>> at 
>>>> org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:111)
>>>> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>>> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>>> at scala.Option.getOrElse(Option.scala:120)
>>>> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>>> at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1281)
>>>> at 
>>>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>>>> at 
>>>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>>>> at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
>>>> at org.apache.spark.rdd.RDD.take(RDD.scala:1276)
>>>>
>>>> I tried adding the hbase protocol jar on spar-defaults.conf and in the
>>>> driver-classpath as suggested here
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-IllegalAccessError-class-com-google-protobuf-HBaseZeroCopyByteString-cannot-access-its-supg-td24303.html
>>>>  but
>>>> no success.
>>>> Any suggestions ?
>>>>
>>>>
>>>
>


Re: HBase 0.98.0 with Spark 1.5.3 issue in yarn-cluster mode

2016-01-22 Thread Ajinkya Kale
Is this issue present only when the computations run in distributed mode?
If I do (pseudo code):
rdd.collect.call_to_hbase, I don't get this error,

but if I do:
rdd.call_to_hbase.collect, it throws this error.
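
Expanding that pseudo code a little (call_to_hbase below is a hypothetical
helper standing in for whatever touches the HBase client API), the difference
is where the HBase client classes actually get loaded, which is what Ted points
out below:

    # Variant 1: the HBase call happens only on the driver, so only the
    # driver's classpath matters.
    for row in rdd.collect():
        call_to_hbase(row)

    # Variant 2: the HBase call happens inside executor tasks, so the
    # executors' classpaths matter as well.
    rdd.map(call_to_hbase).collect()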

On Wed, Jan 20, 2016 at 6:50 PM Ajinkya Kale  wrote:

> Unfortunately I cannot at this moment (not a decision I can make) :(
>
> On Wed, Jan 20, 2016 at 6:46 PM Ted Yu  wrote:
>
>> I am not aware of a workaround.
>>
>> Can you upgrade to 0.98.4+ release ?
>>
>> Cheers
>>
>> On Wed, Jan 20, 2016 at 6:26 PM, Ajinkya Kale 
>> wrote:
>>
>>> Hi Ted,
>>>
>>> Thanks for responding.
>>> Is there a work around for 0.98.0 ? Adding the hbase-protocol jar to
>>> HADOOP_CLASSPATH didnt work for me.
>>>
>>> On Wed, Jan 20, 2016 at 6:14 PM Ted Yu  wrote:
>>>
>>>> 0.98.0 didn't have fix from HBASE-8
>>>>
>>>> Please upgrade your hbase version and try again.
>>>>
>>>> If still there is problem, please pastebin the stack trace.
>>>>
>>>> Thanks
>>>>
>>>> On Wed, Jan 20, 2016 at 5:41 PM, Ajinkya Kale 
>>>> wrote:
>>>>
>>>>>
>>>>> I have posted this on hbase user list but i thought makes more sense
>>>>> on spark user list.
>>>>> I am able to read the table in yarn-client mode from spark-shell but I
>>>>> have exhausted all online forums for options to get it working in the
>>>>> yarn-cluster mode through spark-submit.
>>>>>
>>>>> I am using this code-example
>>>>> http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase
>>>>>  to
>>>>> read a hbase table using Spark with the only change of adding the
>>>>> hbase.zookeeper.quorum through code as it is not picking it from the
>>>>> hbase-site.xml.
>>>>>
>>>>> Spark 1.5.3
>>>>>
>>>>> HBase 0.98.0
>>>>>
>>>>>
>>>>> Facing this error -
>>>>>
>>>>>  16/01/20 12:56:59 WARN 
>>>>> client.ConnectionManager$HConnectionImplementation: Encountered problems 
>>>>> when prefetch hbase:meta table:
>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
>>>>> attempts=3, exceptions:Wed Jan 20 12:56:58 GMT-07:00 2016, 
>>>>> org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
>>>>> java.lang.IllegalAccessError: class 
>>>>> com.google.protobuf.HBaseZeroCopyByteString cannot access its superclass 
>>>>> com.google.protobuf.LiteralByteStringWed Jan 20 12:56:58 GMT-07:00 2016, 
>>>>> org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
>>>>> java.lang.IllegalAccessError: 
>>>>> com/google/protobuf/HBaseZeroCopyByteStringWed Jan 20 12:56:59 GMT-07:00 
>>>>> 2016, org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
>>>>> java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString
>>>>>
>>>>> at 
>>>>> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:136)
>>>>> at 
>>>>> org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:751)
>>>>> at 
>>>>> org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:147)
>>>>> at 
>>>>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.prefetchRegionCache(ConnectionManager.java:1215)
>>>>> at 
>>>>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1280)
>>>>> at 
>>>>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1128)
>>>>> at 
>>>>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:)
>>>>> at 
>>>>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1070)
>>>>> at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:347)
>>>>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:201)
>>>>> at org.apache.hadoop.hbase.client.HTable.(HTable.java:159)

Re: HBase 0.98.0 with Spark 1.5.3 issue in yarn-cluster mode

2016-01-22 Thread Ajinkya Kale
Hi Ted,
Is there a way for the executors to have the hbase-protocol jar on their
classpath ?

On Fri, Jan 22, 2016 at 4:00 PM Ted Yu  wrote:

> The class path formations on driver and executors are different.
>
> Cheers
>
> On Fri, Jan 22, 2016 at 3:25 PM, Ajinkya Kale 
> wrote:
>
>> Is this issue only when the computations are in distributed mode ?
>> If I do (pseudo code) :
>> rdd.collect.call_to_hbase  I dont get this error,
>>
>> but if I do :
>> rdd.call_to_hbase.collect it throws this error.
>>
>> On Wed, Jan 20, 2016 at 6:50 PM Ajinkya Kale 
>> wrote:
>>
>>> Unfortunately I cannot at this moment (not a decision I can make) :(
>>>
>>> On Wed, Jan 20, 2016 at 6:46 PM Ted Yu  wrote:
>>>
>>>> I am not aware of a workaround.
>>>>
>>>> Can you upgrade to 0.98.4+ release ?
>>>>
>>>> Cheers
>>>>
>>>> On Wed, Jan 20, 2016 at 6:26 PM, Ajinkya Kale 
>>>> wrote:
>>>>
>>>>> Hi Ted,
>>>>>
>>>>> Thanks for responding.
>>>>> Is there a work around for 0.98.0 ? Adding the hbase-protocol jar to
>>>>> HADOOP_CLASSPATH didnt work for me.
>>>>>
>>>>> On Wed, Jan 20, 2016 at 6:14 PM Ted Yu  wrote:
>>>>>
>>>>>> 0.98.0 didn't have fix from HBASE-8
>>>>>>
>>>>>> Please upgrade your hbase version and try again.
>>>>>>
>>>>>> If still there is problem, please pastebin the stack trace.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Wed, Jan 20, 2016 at 5:41 PM, Ajinkya Kale 
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> I have posted this on hbase user list but i thought makes more sense
>>>>>>> on spark user list.
>>>>>>> I am able to read the table in yarn-client mode from spark-shell but
>>>>>>> I have exhausted all online forums for options to get it working in the
>>>>>>> yarn-cluster mode through spark-submit.
>>>>>>>
>>>>>>> I am using this code-example
>>>>>>> http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase
>>>>>>>  to
>>>>>>> read a hbase table using Spark with the only change of adding the
>>>>>>> hbase.zookeeper.quorum through code as it is not picking it from the
>>>>>>> hbase-site.xml.
>>>>>>>
>>>>>>> Spark 1.5.3
>>>>>>>
>>>>>>> HBase 0.98.0
>>>>>>>
>>>>>>>
>>>>>>> Facing this error -
>>>>>>>
>>>>>>>  16/01/20 12:56:59 WARN 
>>>>>>> client.ConnectionManager$HConnectionImplementation: Encountered 
>>>>>>> problems when prefetch hbase:meta table:
>>>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
>>>>>>> attempts=3, exceptions:Wed Jan 20 12:56:58 GMT-07:00 2016, 
>>>>>>> org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
>>>>>>> java.lang.IllegalAccessError: class 
>>>>>>> com.google.protobuf.HBaseZeroCopyByteString cannot access its 
>>>>>>> superclass com.google.protobuf.LiteralByteStringWed Jan 20 12:56:58 
>>>>>>> GMT-07:00 2016, 
>>>>>>> org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
>>>>>>> java.lang.IllegalAccessError: 
>>>>>>> com/google/protobuf/HBaseZeroCopyByteStringWed Jan 20 12:56:59 
>>>>>>> GMT-07:00 2016, 
>>>>>>> org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
>>>>>>> java.lang.IllegalAccessError: 
>>>>>>> com/google/protobuf/HBaseZeroCopyByteString
>>>>>>>
>>>>>>> at 
>>>>>>> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:136)
>>>>>>> at 
>>>>>>> org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:751)
>>>>>>> at 
>>>>>>> org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:147)
>>>>>>> at 

Re: HBase 0.98.0 with Spark 1.5.3 issue in yarn-cluster mode

2016-01-22 Thread Ajinkya Kale
I tried --jars which supposedly does that but that did not work.
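
For completeness, the workaround most often reported for this particular
IllegalAccessError (not confirmed anywhere in this thread) is to put the
hbase-protocol jar on the extra classpath of both driver and executors rather
than relying on --jars. A hedged sketch in pyspark form, though the same
properties apply to any spark-submit job; the jar path is illustrative and has
to exist on every node:

    from pyspark import SparkConf, SparkContext

    # Illustrative path; stage the jar to the same location on every node first.
    HBASE_PROTOCOL_JAR = "/opt/hbase/lib/hbase-protocol-0.98.0-hadoop2.jar"

    conf = (SparkConf()
            # Applied when executors launch, so it can still be set in code.
            .set("spark.executor.extraClassPath", HBASE_PROTOCOL_JAR))

    # The driver-side equivalent (spark.driver.extraClassPath) usually has to
    # be supplied at launch time instead, e.g. in spark-defaults.conf or via
    # spark-submit --driver-class-path, because the driver JVM is already
    # running by the time this code executes.
    sc = SparkContext(conf=conf)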

On Fri, Jan 22, 2016 at 4:33 PM Ajinkya Kale  wrote:

> Hi Ted,
> Is there a way for the executors to have the hbase-protocol jar on their
> classpath ?
>
> On Fri, Jan 22, 2016 at 4:00 PM Ted Yu  wrote:
>
>> The class path formations on driver and executors are different.
>>
>> Cheers
>>
>> On Fri, Jan 22, 2016 at 3:25 PM, Ajinkya Kale 
>> wrote:
>>
>>> Is this issue only when the computations are in distributed mode ?
>>> If I do (pseudo code) :
>>> rdd.collect.call_to_hbase  I dont get this error,
>>>
>>> but if I do :
>>> rdd.call_to_hbase.collect it throws this error.
>>>
>>> On Wed, Jan 20, 2016 at 6:50 PM Ajinkya Kale 
>>> wrote:
>>>
>>>> Unfortunately I cannot at this moment (not a decision I can make) :(
>>>>
>>>> On Wed, Jan 20, 2016 at 6:46 PM Ted Yu  wrote:
>>>>
>>>>> I am not aware of a workaround.
>>>>>
>>>>> Can you upgrade to 0.98.4+ release ?
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Wed, Jan 20, 2016 at 6:26 PM, Ajinkya Kale 
>>>>> wrote:
>>>>>
>>>>>> Hi Ted,
>>>>>>
>>>>>> Thanks for responding.
>>>>>> Is there a work around for 0.98.0 ? Adding the hbase-protocol jar to
>>>>>> HADOOP_CLASSPATH didnt work for me.
>>>>>>
>>>>>> On Wed, Jan 20, 2016 at 6:14 PM Ted Yu  wrote:
>>>>>>
>>>>>>> 0.98.0 didn't have fix from HBASE-8
>>>>>>>
>>>>>>> Please upgrade your hbase version and try again.
>>>>>>>
>>>>>>> If still there is problem, please pastebin the stack trace.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> On Wed, Jan 20, 2016 at 5:41 PM, Ajinkya Kale >>>>>> > wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> I have posted this on hbase user list but i thought makes more
>>>>>>>> sense on spark user list.
>>>>>>>> I am able to read the table in yarn-client mode from spark-shell
>>>>>>>> but I have exhausted all online forums for options to get it working 
>>>>>>>> in the
>>>>>>>> yarn-cluster mode through spark-submit.
>>>>>>>>
>>>>>>>> I am using this code-example
>>>>>>>> http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase
>>>>>>>>  to
>>>>>>>> read a hbase table using Spark with the only change of adding the
>>>>>>>> hbase.zookeeper.quorum through code as it is not picking it from the
>>>>>>>> hbase-site.xml.
>>>>>>>>
>>>>>>>> Spark 1.5.3
>>>>>>>>
>>>>>>>> HBase 0.98.0
>>>>>>>>
>>>>>>>>
>>>>>>>> Facing this error -
>>>>>>>>
>>>>>>>>  16/01/20 12:56:59 WARN 
>>>>>>>> client.ConnectionManager$HConnectionImplementation: Encountered 
>>>>>>>> problems when prefetch hbase:meta table:
>>>>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
>>>>>>>> attempts=3, exceptions:Wed Jan 20 12:56:58 GMT-07:00 2016, 
>>>>>>>> org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
>>>>>>>> java.lang.IllegalAccessError: class 
>>>>>>>> com.google.protobuf.HBaseZeroCopyByteString cannot access its 
>>>>>>>> superclass com.google.protobuf.LiteralByteStringWed Jan 20 12:56:58 
>>>>>>>> GMT-07:00 2016, 
>>>>>>>> org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
>>>>>>>> java.lang.IllegalAccessError: 
>>>>>>>> com/google/protobuf/HBaseZeroCopyByteStringWed Jan 20 12:56:59 
>>>>>>>> GMT-07:00 2016, 
>>>>>>>> org.apache.hadoop.hbase.client.RpcRetryingCaller@111585e, 
>>>>>>>> java.lang.IllegalAccessError: 
>>>>>>>> com/google/protobuf/HBaseZeroCopyByteS

Reading multiple avro files from a dir - Spark 1.5.1

2016-01-29 Thread Ajinkya Kale
Trying to load Avro files from HDFS. I have around 1000 part avro files in a
dir. I am using this to read them -

 val df =
sqlContext.read.format("com.databricks.spark.avro").load("path/to/avro/dir")
 df.select("QUERY").take(50).foreach(println)

It works if I pass only 1 or 2 avro files in the path. But if I pass a
dir with 400+ files I get this error. Each avro file is around 300 MB.

org.apache.avro.AvroRuntimeException: java.io.IOException: Filesystem closed
at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:275)
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:197)
at org.apache.avro.mapred.AvroRecordReader.next(AvroRecordReader.java:64)
at org.apache.avro.mapred.AvroRecordReader.next(AvroRecordReader.java:32)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:248)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:216)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at
com.databricks.spark.avro.AvroRelation$$anonfun$buildScan$1$$anonfun$4$$anon$1.advanceNextRecord(AvroRelation.scala:157)
at
com.databricks.spark.avro.AvroRelation$$anonfun$buildScan$1$$anonfun$4$$anon$1.hasNext(AvroRelation.scala:166)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:413)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
at scala.collection.AbstractIterator.to(Iterator.scala:1194)
at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
at
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:905)
at
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:905)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707)
at
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:776)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.avro.mapred.FsInput.read(FsInput.java:46)
at
org.apache.avro.file.DataFileReader$SeekableInputStream.read(DataFileReader.java:210)
at
org.apache.avro.io.BinaryDecoder$InputStreamByteSource.tryReadRaw(BinaryDecoder.java:839)
at org.apache.avro.io.BinaryDecoder.isEnd(BinaryDecoder.java:444)
at org.apache.avro.file.DataFileStream.hasNextBlock(DataFileStream.java:261)
... 36 more


Re: Logistic Regression using ML Pipeline

2016-02-19 Thread Ajinkya Kale
Please take a look at the example here
http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline
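
One way to get the predicted labels and probabilities is to transform the
fitted model over a DataFrame. A minimal pyspark sketch of the same flow (the
question below uses the Scala API, which is parallel; data is assumed to be a
DataFrame with "label" and "features" columns):

    from pyspark.ml.classification import LogisticRegression

    lr = LogisticRegression(maxIter=100, fitIntercept=True)
    model = lr.fit(data)

    # transform() appends "rawPrediction", "probability" and "prediction"
    # columns to the input DataFrame.
    predictions = model.transform(data)
    predictions.select("label", "probability", "prediction").show(5)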

On Thu, Feb 18, 2016 at 9:27 PM Arunkumar Pillai 
wrote:

> Hi
>
> I'm trying to build logistic regression using ML Pipeline
>
>  val lr = new LogisticRegression()
>
> lr.setFitIntercept(true)
> lr.setMaxIter(100)
> val model = lr.fit(data)
>
> println(model.summary)
>
> I'm getting coefficients but not able to get the predicted and probability
> values.
>
> Please help
>
> --
> Thanks and Regards
> Arun
>


Saving a pyspark.ml.feature.PCA model

2016-07-19 Thread Ajinkya Kale
Is there a way to save a pyspark.ml.feature.PCA model? I know mllib has
that, but mllib does not have PCA afaik. How do people do model persistence
for inference using the pyspark ml models? I did not find any documentation
on model persistence for ml.

--ajinkya


Re: Saving a pyspark.ml.feature.PCA model

2016-07-19 Thread Ajinkya Kale
I am using Google Cloud Dataproc, which comes with Spark 1.6.1, so upgrading
is not really an option.
Is there no way / hack to save the models in Spark 1.6.1?

On Tue, Jul 19, 2016 at 8:13 PM Shuai Lin  wrote:

> It was added in the not-yet-released 2.0.0 version.
>
> https://issues.apache.org/jira/browse/SPARK-13036
> https://github.com/apache/spark/commit/83302c3b
>
> so i guess you need to wait for 2.0 release (or use the current rc4).
>
> On Wed, Jul 20, 2016 at 6:54 AM, Ajinkya Kale 
> wrote:
>
>> Is there a way to save a pyspark.ml.feature.PCA model ? I know mllib has
>> that but mllib does not have PCA afaik. How do people do model persistence
>> for inference using the pyspark ml models ? Did not find any documentation
>> on model persistency for ml.
>>
>> --ajinkya
>>
>
>


Re: Saving a pyspark.ml.feature.PCA model

2016-07-20 Thread Ajinkya Kale
Just found that Google Dataproc has a preview of Spark 2.0. Tried it, and
save/load works! Thanks Shuai.
Follow-up question - is there a way to export the pyspark.ml models to PMML?
If not, what is the best way to integrate the model for inference in a
production service?
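
For anyone else hitting this, the 2.0-style save/load that worked looks
roughly like the following (the column names and output path are illustrative;
df is assumed to be a DataFrame with a vector column named "features"):

    from pyspark.ml.feature import PCA, PCAModel

    pca = PCA(k=3, inputCol="features", outputCol="pca_features")
    model = pca.fit(df)

    # Persist the fitted model and load it back later for inference.
    model.save("gs://my-bucket/models/pca_model")
    reloaded = PCAModel.load("gs://my-bucket/models/pca_model")
    reloaded.transform(df).select("pca_features").show(5)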

On Tue, Jul 19, 2016 at 8:22 PM Ajinkya Kale  wrote:

> I am using google cloud dataproc which comes with spark 1.6.1. So upgrade
> is not really an option.
> No way / hack to save the models in spark 1.6.1 ?
>
> On Tue, Jul 19, 2016 at 8:13 PM Shuai Lin  wrote:
>
>> It was added in the not-yet-released 2.0.0 version.
>>
>> https://issues.apache.org/jira/browse/SPARK-13036
>> https://github.com/apache/spark/commit/83302c3b
>>
>> so i guess you need to wait for 2.0 release (or use the current rc4).
>>
>> On Wed, Jul 20, 2016 at 6:54 AM, Ajinkya Kale 
>> wrote:
>>
>>> Is there a way to save a pyspark.ml.feature.PCA model ? I know mllib has
>>> that but mllib does not have PCA afaik. How do people do model persistence
>>> for inference using the pyspark ml models ? Did not find any documentation
>>> on model persistency for ml.
>>>
>>> --ajinkya
>>>
>>
>>