Re: Third party library

2016-12-13 Thread vineet chadha
Thanks Jakob for sharing the link. Will try it out.

Regards,
Vineet

On Tue, Dec 13, 2016 at 3:00 PM, Jakob Odersky  wrote:

> Hi Vineet,
> great to see you solved the problem! Since this just appeared in my
> inbox, I wanted to take the opportunity for a shameless plug:
> https://github.com/jodersky/sbt-jni. In case you're using sbt and also
> developing the native library, this plugin may help with the pains of
> building and packaging JNI applications.
>
> cheers,
> --Jakob
>
> On Tue, Dec 13, 2016 at 11:02 AM, vineet chadha 
> wrote:
> > Thanks Steve and Kant. Apologies for late reply as I was out for
> vacation.
> > Got  it working.  For other users:
> >
> > def loadResources() {
> >
> > System.loadLibrary("foolib")
> >
> >  val MyInstance  = new MyClass
> >
> >  val retstr = MyInstance.foo("mystring") // method trying to
> invoke
> >
> >  }
> >
> > val conf = new
> > SparkConf().setAppName("SimpleApplication").set("SPARK_LIBRARY_PATH",
> > "/lib/location")
> >
> > val sc = new SparkContext(conf)
> >
> > sc.parallelize(1 to 10, 2).mapPartitions ( iter => {
> >
> > MySimpleApp.loadResources()
> >
> > iter
> >
> > }).count
> >
> >
> >
> > Regards,
> > Vineet
> >
> > On Sun, Nov 27, 2016 at 2:15 PM, Steve Loughran 
> > wrote:
> >>
> >>
> >> On 27 Nov 2016, at 02:55, kant kodali  wrote:
> >>
> >> I would say instead of LD_LIBRARY_PATH you might want to use
> >> java.library.path
> >>
> >> in the following way
> >>
> >> java -Djava.library.path=/path/to/my/library or pass java.library.path
> >> along with spark-submit
> >>
> >>
> >> This is only going to set up paths on the submitting system; to load JNI
> >> code in the executors, the binary needs to be sent to far end and then
> put
> >> on the Java load path there.
> >>
> >> Copy the relevant binary to somewhere on the PATH of the destination
> >> machine. Do that and you shouldn't have to worry about other JVM
> options,
> >> (though it's been a few years since I did any JNI).
> >>
> >> One trick: write a simple main() object/entry point which calls the JNI
> >> method, and doesn't attempt to use any spark libraries; have it log any
> >> exception and return an error code if the call failed. This will let
> you use
> >> it as a link test after deployment: if you can't run that class then
> things
> >> are broken, before you go near spark
> >>
> >>
> >> On Sat, Nov 26, 2016 at 6:44 PM, Gmail  wrote:
> >>>
> >>> Maybe you've already checked these out. Some basic questions that come
> to
> >>> my mind are:
> >>> 1) is this library "foolib" or "foo-C-library" available on the worker
> >>> node?
> >>> 2) if yes, is it accessible by the user/program (rwx)?
> >>>
> >>> Thanks,
> >>> Vasu.
> >>>
> >>> On Nov 26, 2016, at 5:08 PM, kant kodali  wrote:
> >>>
> >>> If it is working for standalone program I would think you can apply the
> >>> same settings across all the spark worker  and client machines and
> give that
> >>> a try. Lets start with that.
> >>>
> >>> On Sat, Nov 26, 2016 at 11:59 AM, vineet chadha <
> start.vin...@gmail.com>
> >>> wrote:
> 
>  Just subscribed to  Spark User.  So, forwarding message again.
> 
>  On Sat, Nov 26, 2016 at 11:50 AM, vineet chadha <
> start.vin...@gmail.com>
>  wrote:
> >
> > Thanks Kant. Can you give me a sample program which allows me to call
> > jni from executor task ?   I have jni working in standalone program
> in
> > scala/java.
> >
> > Regards,
> > Vineet
> >
> > On Sat, Nov 26, 2016 at 11:43 AM, kant kodali 
> > wrote:
> >>
> >> Yes this is a Java JNI question. Nothing to do with Spark really.
> >>
> >>  java.lang.UnsatisfiedLinkError typically would mean the way you
> setup
> >> LD_LIBRARY_PATH is wrong unless you tell us that it is working for
> other
> >> cases but not this one.
> >>
> >> On Sat, Nov 26, 2016 at 11:23 AM, Reynold Xin 
> >> wrote:
> >>>
> >>> That's just standard JNI and has nothing to do with Spark, does it?
> >>>
> >>>
> >>> On Sat, Nov 26, 2016 at 11:19 AM, vineet chadha
> >>>  wrote:
> 
>  Thanks Reynold for quick reply.
> 
>   I have tried following:
> 
>  class MySimpleApp {
>   // ---Native methods
>    @native def fooMethod (foo: String): String
>  }
> 
>  object MySimpleApp {
>    val flag = false
>    def loadResources() {
>  System.loadLibrary("foo-C-library")
>    val flag = true
>    }
>    def main() {
>  sc.parallelize(1 to 10).mapPartitions ( iter => {
>    if(flag == false){
>    MySimpleApp.loadResources()
>   val SimpleInstance = 

Re: Third party library

2016-12-13 Thread Jakob Odersky
Hi Vineet,
great to see you solved the problem! Since this just appeared in my
inbox, I wanted to take the opportunity for a shameless plug:
https://github.com/jodersky/sbt-jni. In case you're using sbt and also
developing the native library, this plugin may help with the pains of
building and packaging JNI applications.

cheers,
--Jakob

On Tue, Dec 13, 2016 at 11:02 AM, vineet chadha  wrote:
> Thanks Steve and Kant. Apologies for late reply as I was out for vacation.
> Got  it working.  For other users:
>
> def loadResources() {
>
> System.loadLibrary("foolib")
>
>  val MyInstance  = new MyClass
>
>  val retstr = MyInstance.foo("mystring") // method trying to invoke
>
>  }
>
> val conf = new
> SparkConf().setAppName("SimpleApplication").set("SPARK_LIBRARY_PATH",
> "/lib/location")
>
> val sc = new SparkContext(conf)
>
> sc.parallelize(1 to 10, 2).mapPartitions ( iter => {
>
> MySimpleApp.loadResources()
>
> iter
>
> }).count
>
>
>
> Regards,
> Vineet
>
> On Sun, Nov 27, 2016 at 2:15 PM, Steve Loughran 
> wrote:
>>
>>
>> On 27 Nov 2016, at 02:55, kant kodali  wrote:
>>
>> I would say instead of LD_LIBRARY_PATH you might want to use
>> java.library.path
>>
>> in the following way
>>
>> java -Djava.library.path=/path/to/my/library or pass java.library.path
>> along with spark-submit
>>
>>
>> This is only going to set up paths on the submitting system; to load JNI
>> code in the executors, the binary needs to be sent to far end and then put
>> on the Java load path there.
>>
>> Copy the relevant binary to somewhere on the PATH of the destination
>> machine. Do that and you shouldn't have to worry about other JVM options,
>> (though it's been a few years since I did any JNI).
>>
>> One trick: write a simple main() object/entry point which calls the JNI
>> method, and doesn't attempt to use any spark libraries; have it log any
>> exception and return an error code if the call failed. This will let you use
>> it as a link test after deployment: if you can't run that class then things
>> are broken, before you go near spark
>>
>>
>> On Sat, Nov 26, 2016 at 6:44 PM, Gmail  wrote:
>>>
>>> Maybe you've already checked these out. Some basic questions that come to
>>> my mind are:
>>> 1) is this library "foolib" or "foo-C-library" available on the worker
>>> node?
>>> 2) if yes, is it accessible by the user/program (rwx)?
>>>
>>> Thanks,
>>> Vasu.
>>>
>>> On Nov 26, 2016, at 5:08 PM, kant kodali  wrote:
>>>
>>> If it is working for standalone program I would think you can apply the
>>> same settings across all the spark worker  and client machines and give that
>>> a try. Lets start with that.
>>>
>>> On Sat, Nov 26, 2016 at 11:59 AM, vineet chadha 
>>> wrote:

 Just subscribed to  Spark User.  So, forwarding message again.

 On Sat, Nov 26, 2016 at 11:50 AM, vineet chadha 
 wrote:
>
> Thanks Kant. Can you give me a sample program which allows me to call
> jni from executor task ?   I have jni working in standalone program in
> scala/java.
>
> Regards,
> Vineet
>
> On Sat, Nov 26, 2016 at 11:43 AM, kant kodali 
> wrote:
>>
>> Yes this is a Java JNI question. Nothing to do with Spark really.
>>
>>  java.lang.UnsatisfiedLinkError typically would mean the way you setup
>> LD_LIBRARY_PATH is wrong unless you tell us that it is working for other
>> cases but not this one.
>>
>> On Sat, Nov 26, 2016 at 11:23 AM, Reynold Xin 
>> wrote:
>>>
>>> That's just standard JNI and has nothing to do with Spark, does it?
>>>
>>>
>>> On Sat, Nov 26, 2016 at 11:19 AM, vineet chadha
>>>  wrote:

 Thanks Reynold for quick reply.

  I have tried following:

 class MySimpleApp {
  // ---Native methods
   @native def fooMethod (foo: String): String
 }

 object MySimpleApp {
   val flag = false
   def loadResources() {
 System.loadLibrary("foo-C-library")
   val flag = true
   }
   def main() {
 sc.parallelize(1 to 10).mapPartitions ( iter => {
   if(flag == false){
   MySimpleApp.loadResources()
  val SimpleInstance = new MySimpleApp
   }
   SimpleInstance.fooMethod ("fooString")
   iter
 })
   }
 }

 I don't see a way to invoke fooMethod, which is implemented in
 foo-C-library. Am I missing something? If possible, can you point me to an
 existing implementation which I can refer to.

 Thanks again.

 ~


 On Fri, 

Re: Third party library

2016-12-13 Thread vineet chadha
Thanks Steve and Kant. Apologies for late reply as I was out for vacation.
Got  it working.  For other users:

def loadResources() {

System.loadLibrary("foolib")

 val MyInstance  = new MyClass

 val retstr = MyInstance.foo("mystring") // method trying to invoke

 }

val conf = new
SparkConf().setAppName("SimpleApplication").set("SPARK_LIBRARY_PATH",
 "/lib/location")

val sc = new SparkContext(conf)

sc.parallelize(1 to 10, 2).mapPartitions ( iter => {

MySimpleApp.loadResources()

iter

}).count


Regards,
Vineet
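
For readers who also need to ship the native library itself to each executor
(rather than pre-installing it under a fixed path on every worker), a minimal
sketch of one possible approach follows. It distributes the .so with
spark-submit --files and loads it through SparkFiles on each executor; the names
libfoolib.so and MyClass are placeholders mirroring the snippet above, not
necessarily what was done in this thread.

import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

// Hypothetical JNI wrapper, mirroring the snippet above.
class MyClass {
  @native def foo(s: String): String
}

object NativeLibExample {
  // Evaluated at most once per JVM, i.e. once per executor process.
  // SparkFiles.get resolves the local copy of a file shipped with
  //   spark-submit --files /path/to/libfoolib.so
  lazy val nativeLoaded: Boolean = {
    System.load(SparkFiles.get("libfoolib.so"))
    true
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("NativeLibExample"))
    val n = sc.parallelize(1 to 10, 2).mapPartitions { iter =>
      nativeLoaded                     // forces the one-time System.load on this executor
      val instance = new MyClass
      iter.map(i => instance.foo("mystring-" + i))
    }.count()
    println("processed " + n + " records")
    sc.stop()
  }
}

Submitted with something along the lines of
spark-submit --class NativeLibExample --files /path/to/libfoolib.so my-app.jar
(all paths and names above are placeholders).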

On Sun, Nov 27, 2016 at 2:15 PM, Steve Loughran 
wrote:

>
> On 27 Nov 2016, at 02:55, kant kodali  wrote:
>
> I would say instead of LD_LIBRARY_PATH you might want to use
> java.library.path
>
> in the following way
>
> java -Djava.library.path=/path/to/my/library or pass java.library.path
> along with spark-submit
>
>
> This is only going to set up paths on the submitting system; to load JNI
> code in the executors, the binary needs to be sent to far end and then put
> on the Java load path there.
>
> Copy the relevant binary to somewhere on the PATH of the destination
> machine. Do that and you shouldn't have to worry about other JVM options,
> (though it's been a few years since I did any JNI).
>
> One trick: write a simple main() object/entry point which calls the JNI
> method, and doesn't attempt to use any spark libraries; have it log any
> exception and return an error code if the call failed. This will let you
> use it as a link test after deployment: if you can't run that class then
> things are broken, before you go near spark
>
>
> On Sat, Nov 26, 2016 at 6:44 PM, Gmail  wrote:
>
>> Maybe you've already checked these out. Some basic questions that come to
>> my mind are:
>> 1) is this library "foolib" or "foo-C-library" available on the worker
>> node?
>> 2) if yes, is it accessible by the user/program (rwx)?
>>
>> Thanks,
>> Vasu.
>>
>> On Nov 26, 2016, at 5:08 PM, kant kodali  wrote:
>>
>> If it is working for standalone program I would think you can apply the
>> same settings across all the spark worker  and client machines and give
>> that a try. Lets start with that.
>>
>> On Sat, Nov 26, 2016 at 11:59 AM, vineet chadha 
>> wrote:
>>
>>> Just subscribed to  Spark User.  So, forwarding message again.
>>>
>>> On Sat, Nov 26, 2016 at 11:50 AM, vineet chadha 
>>> wrote:
>>>
 Thanks Kant. Can you give me a sample program which allows me to call
 jni from executor task ?   I have jni working in standalone program in
 scala/java.

 Regards,
 Vineet

 On Sat, Nov 26, 2016 at 11:43 AM, kant kodali 
 wrote:

> Yes this is a Java JNI question. Nothing to do with Spark really.
>
>  java.lang.UnsatisfiedLinkError typically would mean the way you
> setup LD_LIBRARY_PATH is wrong unless you tell us that it is working
> for other cases but not this one.
>
> On Sat, Nov 26, 2016 at 11:23 AM, Reynold Xin 
> wrote:
>
>> That's just standard JNI and has nothing to do with Spark, does it?
>>
>>
>> On Sat, Nov 26, 2016 at 11:19 AM, vineet chadha <
>> start.vin...@gmail.com> wrote:
>>
>>> Thanks Reynold for quick reply.
>>>
>>>  I have tried following:
>>>
>>> class MySimpleApp {
>>>  // ---Native methods
>>>   @native def fooMethod (foo: String): String
>>> }
>>>
>>> object MySimpleApp {
>>>   val flag = false
>>>   def loadResources() {
>>> System.loadLibrary("foo-C-library")
>>>   val flag = true
>>>   }
>>>   def main() {
>>> sc.parallelize(1 to 10).mapPartitions ( iter => {
>>>   if(flag == false){
>>>   MySimpleApp.loadResources()
>>>  val SimpleInstance = new MySimpleApp
>>>   }
>>>   SimpleInstance.fooMethod ("fooString")
>>>   iter
>>> })
>>>   }
>>> }
>>>
>>> I don't see a way to invoke fooMethod, which is implemented in
>>> foo-C-library. Am I missing something? If possible, can you point me to an
>>> existing implementation which I can refer to.
>>>
>>> Thanks again.
>>>
>>> ~
>>>
>>> On Fri, Nov 25, 2016 at 3:32 PM, Reynold Xin 
>>> wrote:
>>>
 bcc dev@ and add user@


 This is more a user@ list question rather than a dev@ list
 question. You can do something like this:

 object MySimpleApp {
   def loadResources(): Unit = // define some idempotent way to load
 resources, e.g. with a flag or lazy val

   def main() = {
 ...

 sc.parallelize(1 to 10).mapPartitions { iter =>
   MySimpleApp.loadResources()

 

Re: Third party library

2016-11-27 Thread Steve Loughran

On 27 Nov 2016, at 02:55, kant kodali 
> wrote:

I would say instead of LD_LIBRARY_PATH you might want to use java.library.path

in the following way

java -Djava.library.path=/path/to/my/library or pass java.library.path along 
with spark-submit


This is only going to set up paths on the submitting system; to load JNI code 
in the executors, the binary needs to be sent to far end and then put on the 
Java load path there.

Copy the relevant binary to somewhere on the PATH of the destination machine. 
Do that and you shouldn't have to worry about other JVM options, (though it's 
been a few years since I did any JNI).

One trick: write a simple main() object/entry point which calls the JNI method, 
and doesn't attempt to use any spark libraries; have it log any exception and 
return an error code if the call failed. This will let you use it as a link 
test after deployment: if you can't run that class then things are broken, 
before you go near spark
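
A minimal sketch of such a link test, assuming a native library called "foolib"
and a placeholder native signature; it deliberately has no Spark imports, so it
can be run with plain java/scala on any worker host:

class JniLinkTest {
  @native def foo(input: String): String   // placeholder signature
}

object JniLinkTest {
  def main(args: Array[String]): Unit = {
    try {
      System.loadLibrary("foolib")         // expects libfoolib.so on java.library.path
      val probe = new JniLinkTest
      println("native call returned: " + probe.foo("ping"))
    } catch {
      case t: Throwable =>                 // UnsatisfiedLinkError is an Error, so catch Throwable
        t.printStackTrace()
        sys.exit(1)                        // non-zero exit signals a broken deployment
    }
  }
}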


On Sat, Nov 26, 2016 at 6:44 PM, Gmail 
> wrote:
Maybe you've already checked these out. Some basic questions that come to my 
mind are:
1) is this library "foolib" or "foo-C-library" available on the worker node?
2) if yes, is it accessible by the user/program (rwx)?

Thanks,
Vasu.

On Nov 26, 2016, at 5:08 PM, kant kodali 
> wrote:

If it is working for standalone program I would think you can apply the same 
settings across all the spark worker  and client machines and give that a try. 
Lets start with that.

On Sat, Nov 26, 2016 at 11:59 AM, vineet chadha 
> wrote:
Just subscribed to  Spark User.  So, forwarding message again.

On Sat, Nov 26, 2016 at 11:50 AM, vineet chadha 
> wrote:
Thanks Kant. Can you give me a sample program which allows me to call jni from 
executor task ?   I have jni working in standalone program in scala/java.

Regards,
Vineet

On Sat, Nov 26, 2016 at 11:43 AM, kant kodali 
> wrote:
Yes this is a Java JNI question. Nothing to do with Spark really.

 java.lang.UnsatisfiedLinkError typically would mean the way you setup 
LD_LIBRARY_PATH is wrong unless you tell us that it is working for other cases 
but not this one.

On Sat, Nov 26, 2016 at 11:23 AM, Reynold Xin 
> wrote:
That's just standard JNI and has nothing to do with Spark, does it?


On Sat, Nov 26, 2016 at 11:19 AM, vineet chadha 
> wrote:
Thanks Reynold for quick reply.

 I have tried following:

class MySimpleApp {
 // ---Native methods
  @native def fooMethod (foo: String): String
}

object MySimpleApp {
  val flag = false
  def loadResources() {
System.loadLibrary("foo-C-library")
  val flag = true
  }
  def main() {
sc.parallelize(1 to 10).mapPartitions ( iter => {
  if(flag == false){
  MySimpleApp.loadResources()
 val SimpleInstance = new MySimpleApp
  }
  SimpleInstance.fooMethod ("fooString")
  iter
})
  }
}

I don't see a way to invoke fooMethod, which is implemented in foo-C-library. Am I
missing something? If possible, can you point me to an existing implementation
which I can refer to.

Thanks again.

~

On Fri, Nov 25, 2016 at 3:32 PM, Reynold Xin 
> wrote:
bcc dev@ and add user@


This is more a user@ list question rather than a dev@ list question. You can do 
something like this:

object MySimpleApp {
  def loadResources(): Unit = // define some idempotent way to load resources, 
e.g. with a flag or lazy val

  def main() = {
...

sc.parallelize(1 to 10).mapPartitions { iter =>
  MySimpleApp.loadResources()

  // do whatever you want with the iterator
}
  }
}





On Fri, Nov 25, 2016 at 2:33 PM, vineet chadha 
> wrote:
Hi,

I am trying to invoke C library from the Spark Stack using JNI interface (here 
is sample  application code)


class SimpleApp {
 // ---Native methods
@native def foo (Top: String): String
}

object SimpleApp  {
   def main(args: Array[String]) {

val conf = new 
SparkConf().setAppName("SimpleApplication").set("SPARK_LIBRARY_PATH", "lib")
val sc = new SparkContext(conf)
 System.loadLibrary("foolib")
//instantiate the class
 val SimpleAppInstance = new SimpleApp
//String passing - Working
val ret = SimpleAppInstance.foo("fooString")
  }

Above code works fine.

I have set up LD_LIBRARY_PATH and spark.executor.extraClassPath,
spark.executor.extraLibraryPath at the worker node.

How can I invoke the JNI library from the worker node? Where should I load it in
the executor?
Calling System.loadLibrary("foolib") inside the worker node gives me the following
error:

Exception in thread "main" 

Re: Third party library

2016-11-26 Thread kant kodali
I would say instead of LD_LIBRARY_PATH you might want to use java.library.path

in the following way

java -Djava.library.path=/path/to/my/library or pass java.library.path
along with spark-submit
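
As pointed out elsewhere in the thread, the setting also has to reach the
executors, and the library must already exist at that path on every executor
host. The spark-submit equivalents would be something along these lines (paths
are placeholders):

spark-submit \
  --driver-java-options "-Djava.library.path=/path/to/my/library" \
  --conf "spark.executor.extraJavaOptions=-Djava.library.path=/path/to/my/library" \
  ...

Alternatively, spark.driver.extraLibraryPath and spark.executor.extraLibraryPath
set an extra native library path for the driver and executor JVMs respectively.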

On Sat, Nov 26, 2016 at 6:44 PM, Gmail  wrote:

> Maybe you've already checked these out. Some basic questions that come to
> my mind are:
> 1) is this library "foolib" or "foo-C-library" available on the worker
> node?
> 2) if yes, is it accessible by the user/program (rwx)?
>
> Thanks,
> Vasu.
>
> On Nov 26, 2016, at 5:08 PM, kant kodali  wrote:
>
> If it is working for standalone program I would think you can apply the
> same settings across all the spark worker  and client machines and give
> that a try. Lets start with that.
>
> On Sat, Nov 26, 2016 at 11:59 AM, vineet chadha 
> wrote:
>
>> Just subscribed to  Spark User.  So, forwarding message again.
>>
>> On Sat, Nov 26, 2016 at 11:50 AM, vineet chadha 
>> wrote:
>>
>>> Thanks Kant. Can you give me a sample program which allows me to call
>>> jni from executor task ?   I have jni working in standalone program in
>>> scala/java.
>>>
>>> Regards,
>>> Vineet
>>>
>>> On Sat, Nov 26, 2016 at 11:43 AM, kant kodali 
>>> wrote:
>>>
 Yes this is a Java JNI question. Nothing to do with Spark really.

  java.lang.UnsatisfiedLinkError typically would mean the way you setup 
 LD_LIBRARY_PATH
 is wrong unless you tell us that it is working for other cases but not this
 one.

 On Sat, Nov 26, 2016 at 11:23 AM, Reynold Xin 
 wrote:

> That's just standard JNI and has nothing to do with Spark, does it?
>
>
> On Sat, Nov 26, 2016 at 11:19 AM, vineet chadha <
> start.vin...@gmail.com> wrote:
>
>> Thanks Reynold for quick reply.
>>
>>  I have tried following:
>>
>> class MySimpleApp {
>>  // ---Native methods
>>   @native def fooMethod (foo: String): String
>> }
>>
>> object MySimpleApp {
>>   val flag = false
>>   def loadResources() {
>> System.loadLibrary("foo-C-library")
>>   val flag = true
>>   }
>>   def main() {
>> sc.parallelize(1 to 10).mapPartitions ( iter => {
>>   if(flag == false){
>>   MySimpleApp.loadResources()
>>  val SimpleInstance = new MySimpleApp
>>   }
>>   SimpleInstance.fooMethod ("fooString")
>>   iter
>> })
>>   }
>> }
>>
>> I don't see a way to invoke fooMethod, which is implemented in
>> foo-C-library. Am I missing something? If possible, can you point me to an
>> existing implementation which I can refer to.
>>
>> Thanks again.
>>
>> ~
>>
>> On Fri, Nov 25, 2016 at 3:32 PM, Reynold Xin 
>> wrote:
>>
>>> bcc dev@ and add user@
>>>
>>>
>>> This is more a user@ list question rather than a dev@ list
>>> question. You can do something like this:
>>>
>>> object MySimpleApp {
>>>   def loadResources(): Unit = // define some idempotent way to load
>>> resources, e.g. with a flag or lazy val
>>>
>>>   def main() = {
>>> ...
>>>
>>> sc.parallelize(1 to 10).mapPartitions { iter =>
>>>   MySimpleApp.loadResources()
>>>
>>>   // do whatever you want with the iterator
>>> }
>>>   }
>>> }
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Nov 25, 2016 at 2:33 PM, vineet chadha <
>>> start.vin...@gmail.com> wrote:
>>>
 Hi,

 I am trying to invoke C library from the Spark Stack using JNI
 interface (here is sample  application code)


 class SimpleApp {
  // ---Native methods
 @native def foo (Top: String): String
 }

 object SimpleApp  {
def main(args: Array[String]) {

 val conf = new SparkConf().setAppName("Simple
 Application").set("SPARK_LIBRARY_PATH", "lib")
 val sc = new SparkContext(conf)
  System.loadLibrary("foolib")
 //instantiate the class
  val SimpleAppInstance = new SimpleApp
 //String passing - Working
 val ret = SimpleAppInstance.foo("fooString")
   }

 Above code works fine.
 
 I have set up LD_LIBRARY_PATH and spark.executor.extraClassPath,
 spark.executor.extraLibraryPath at the worker node.
 
 How can I invoke the JNI library from the worker node? Where should I load it
 in the executor?
 Calling System.loadLibrary("foolib") inside the worker node gives
 me the following error:

 Exception in thread "main" java.lang.UnsatisfiedLinkError:

 Any help would be really appreciated.







Re: Third party library

2016-11-26 Thread Gmail
Maybe you've already checked these out. Some basic questions that come to my 
mind are:
1) is this library "foolib" or "foo-C-library" available on the worker node?
2) if yes, is it accessible by the user/program (rwx)?

Thanks,
Vasu. 
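
Those two checks can be run directly on a worker host, for example (path is a
placeholder):

ls -l /lib/location/libfoolib.so    # present, and readable by the user running the executor?
ldd /lib/location/libfoolib.so      # do the library's own native dependencies resolve?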

> On Nov 26, 2016, at 5:08 PM, kant kodali  wrote:
> 
> If it is working for standalone program I would think you can apply the same 
> settings across all the spark worker  and client machines and give that a 
> try. Lets start with that.
> 
>> On Sat, Nov 26, 2016 at 11:59 AM, vineet chadha  
>> wrote:
>> Just subscribed to  Spark User.  So, forwarding message again.
>> 
>>> On Sat, Nov 26, 2016 at 11:50 AM, vineet chadha  
>>> wrote:
>>> Thanks Kant. Can you give me a sample program which allows me to call jni 
>>> from executor task ?   I have jni working in standalone program in 
>>> scala/java. 
>>> 
>>> Regards,
>>> Vineet
>>> 
 On Sat, Nov 26, 2016 at 11:43 AM, kant kodali  wrote:
 Yes this is a Java JNI question. Nothing to do with Spark really.
 
  java.lang.UnsatisfiedLinkError typically would mean the way you setup 
 LD_LIBRARY_PATH is wrong unless you tell us that it is working for other 
 cases but not this one.
 
> On Sat, Nov 26, 2016 at 11:23 AM, Reynold Xin  wrote:
 
> That's just standard JNI and has nothing to do with Spark, does it?
> 
> 
>> On Sat, Nov 26, 2016 at 11:19 AM, vineet chadha  
>> wrote:
>> Thanks Reynold for quick reply.
>> 
>>  I have tried following: 
>> 
>> class MySimpleApp {
>>  // ---Native methods
>>   @native def fooMethod (foo: String): String
>> }
>> 
>> object MySimpleApp {
>>   val flag = false
>>   def loadResources() {
>>   System.loadLibrary("foo-C-library")
>>   val flag = true
>>   }
>>   def main() {
>> sc.parallelize(1 to 10).mapPartitions ( iter => {
>>   if(flag == false){
>>  MySimpleApp.loadResources()
>>val SimpleInstance = new MySimpleApp
>>   }
>>   SimpleInstance.fooMethod ("fooString") 
>>   iter
>> })
>>   }
>> }
>> 
>> I don't see a way to invoke fooMethod, which is implemented in
>> foo-C-library. Am I missing something? If possible, can you point me to an
>> existing implementation which I can refer to.
>> 
>> Thanks again. 
>> ~
>> 
>> 
>>> On Fri, Nov 25, 2016 at 3:32 PM, Reynold Xin  
>>> wrote:
>>> bcc dev@ and add user@
>>> 
>>> 
>>> This is more a user@ list question rather than a dev@ list question. 
>>> You can do something like this:
>>> 
>>> object MySimpleApp {
>>>   def loadResources(): Unit = // define some idempotent way to load 
>>> resources, e.g. with a flag or lazy val
>>> 
>>>   def main() = {
>>> ...
>>>
>>> sc.parallelize(1 to 10).mapPartitions { iter =>
>>>   MySimpleApp.loadResources()
>>>   
>>>   // do whatever you want with the iterator
>>> }
>>>   }
>>> }
>>> 
>>> 
>>> 
>>> 
>>> 
 On Fri, Nov 25, 2016 at 2:33 PM, vineet chadha 
  wrote:
 Hi,
 
 I am trying to invoke C library from the Spark Stack using JNI 
 interface (here is sample  application code)
 
 
 class SimpleApp {
  // ---Native methods
 @native def foo (Top: String): String
 }
 
 object SimpleApp  {
def main(args: Array[String]) {
  
 val conf = new 
 SparkConf().setAppName("SimpleApplication").set("SPARK_LIBRARY_PATH", 
 "lib")
 val sc = new SparkContext(conf)
  System.loadLibrary("foolib")
 //instantiate the class
  val SimpleAppInstance = new SimpleApp
 //String passing - Working
 val ret = SimpleAppInstance.foo("fooString")
   }
 
 Above code works fine.
 
 I have set up LD_LIBRARY_PATH and spark.executor.extraClassPath,
 spark.executor.extraLibraryPath at the worker node.
 
 How can I invoke the JNI library from the worker node? Where should I load it
 in the executor?
 Calling System.loadLibrary("foolib") inside the worker node gives me the
 following error:
 
 Exception in thread "main" java.lang.UnsatisfiedLinkError: 
 Any help would be really appreciated.
 
 
 
 
 
 
 
 
 
 
 
 
 
>>> 
>> 
> 
 
>>> 
>> 
> 


Re: Third party library

2016-11-26 Thread kant kodali
If it is working for a standalone program, I would think you can apply the
same settings across all the Spark worker and client machines and give
that a try. Let's start with that.

On Sat, Nov 26, 2016 at 11:59 AM, vineet chadha 
wrote:

> Just subscribed to  Spark User.  So, forwarding message again.
>
> On Sat, Nov 26, 2016 at 11:50 AM, vineet chadha 
> wrote:
>
>> Thanks Kant. Can you give me a sample program which allows me to call jni
>> from executor task ?   I have jni working in standalone program in
>> scala/java.
>>
>> Regards,
>> Vineet
>>
>> On Sat, Nov 26, 2016 at 11:43 AM, kant kodali  wrote:
>>
>>> Yes this is a Java JNI question. Nothing to do with Spark really.
>>>
>>>  java.lang.UnsatisfiedLinkError typically would mean the way you setup 
>>> LD_LIBRARY_PATH
>>> is wrong unless you tell us that it is working for other cases but not this
>>> one.
>>>
>>> On Sat, Nov 26, 2016 at 11:23 AM, Reynold Xin 
>>> wrote:
>>>
 That's just standard JNI and has nothing to do with Spark, does it?


 On Sat, Nov 26, 2016 at 11:19 AM, vineet chadha  wrote:

> Thanks Reynold for quick reply.
>
>  I have tried following:
>
> class MySimpleApp {
>  // ---Native methods
>   @native def fooMethod (foo: String): String
> }
>
> object MySimpleApp {
>   val flag = false
>   def loadResources() {
> System.loadLibrary("foo-C-library")
>   val flag = true
>   }
>   def main() {
> sc.parallelize(1 to 10).mapPartitions ( iter => {
>   if(flag == false){
>   MySimpleApp.loadResources()
>  val SimpleInstance = new MySimpleApp
>   }
>   SimpleInstance.fooMethod ("fooString")
>   iter
> })
>   }
> }
>
> I don't see a way to invoke fooMethod, which is implemented in
> foo-C-library. Am I missing something? If possible, can you point me to an
> existing implementation which I can refer to.
>
> Thanks again.
>
> ~
>
> On Fri, Nov 25, 2016 at 3:32 PM, Reynold Xin 
> wrote:
>
>> bcc dev@ and add user@
>>
>>
>> This is more a user@ list question rather than a dev@ list question.
>> You can do something like this:
>>
>> object MySimpleApp {
>>   def loadResources(): Unit = // define some idempotent way to load
>> resources, e.g. with a flag or lazy val
>>
>>   def main() = {
>> ...
>>
>> sc.parallelize(1 to 10).mapPartitions { iter =>
>>   MySimpleApp.loadResources()
>>
>>   // do whatever you want with the iterator
>> }
>>   }
>> }
>>
>>
>>
>>
>>
>> On Fri, Nov 25, 2016 at 2:33 PM, vineet chadha <
>> start.vin...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am trying to invoke C library from the Spark Stack using JNI
>>> interface (here is sample  application code)
>>>
>>>
>>> class SimpleApp {
>>>  // ---Native methods
>>> @native def foo (Top: String): String
>>> }
>>>
>>> object SimpleApp  {
>>>def main(args: Array[String]) {
>>>
>>> val conf = new SparkConf().setAppName("Simple
>>> Application").set("SPARK_LIBRARY_PATH", "lib")
>>> val sc = new SparkContext(conf)
>>>  System.loadLibrary("foolib")
>>> //instantiate the class
>>>  val SimpleAppInstance = new SimpleApp
>>> //String passing - Working
>>> val ret = SimpleAppInstance.foo("fooString")
>>>   }
>>>
>>> Above code works fine.
>>>
>>> I have set up LD_LIBRARY_PATH and spark.executor.extraClassPath,
>>> spark.executor.extraLibraryPath at the worker node.
>>>
>>> How can I invoke the JNI library from the worker node? Where should I load
>>> it in the executor?
>>> Calling System.loadLibrary("foolib") inside the worker node gives me the
>>> following error:
>>>
>>> Exception in thread "main" java.lang.UnsatisfiedLinkError:
>>>
>>> Any help would be really appreciated.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>

>>>
>>
>


Re: Third party library

2016-11-26 Thread vineet chadha
Just subscribed to Spark User. So, forwarding the message again.

On Sat, Nov 26, 2016 at 11:50 AM, vineet chadha 
wrote:

> Thanks Kant. Can you give me a sample program which allows me to call jni
> from executor task ?   I have jni working in standalone program in
> scala/java.
>
> Regards,
> Vineet
>
> On Sat, Nov 26, 2016 at 11:43 AM, kant kodali  wrote:
>
>> Yes this is a Java JNI question. Nothing to do with Spark really.
>>
>>  java.lang.UnsatisfiedLinkError typically would mean the way you setup 
>> LD_LIBRARY_PATH
>> is wrong unless you tell us that it is working for other cases but not this
>> one.
>>
>> On Sat, Nov 26, 2016 at 11:23 AM, Reynold Xin 
>> wrote:
>>
>>> That's just standard JNI and has nothing to do with Spark, does it?
>>>
>>>
>>> On Sat, Nov 26, 2016 at 11:19 AM, vineet chadha 
>>> wrote:
>>>
 Thanks Reynold for quick reply.

  I have tried following:

 class MySimpleApp {
  // ---Native methods
   @native def fooMethod (foo: String): String
 }

 object MySimpleApp {
   val flag = false
   def loadResources() {
 System.loadLibrary("foo-C-library")
   val flag = true
   }
   def main() {
 sc.parallelize(1 to 10).mapPartitions ( iter => {
   if(flag == false){
   MySimpleApp.loadResources()
  val SimpleInstance = new MySimpleApp
   }
   SimpleInstance.fooMethod ("fooString")
   iter
 })
   }
 }

 I don't see a way to invoke fooMethod, which is implemented in
 foo-C-library. Am I missing something? If possible, can you point me to an
 existing implementation which I can refer to.

 Thanks again.

 ~

 On Fri, Nov 25, 2016 at 3:32 PM, Reynold Xin 
 wrote:

> bcc dev@ and add user@
>
>
> This is more a user@ list question rather than a dev@ list question.
> You can do something like this:
>
> object MySimpleApp {
>   def loadResources(): Unit = // define some idempotent way to load
> resources, e.g. with a flag or lazy val
>
>   def main() = {
> ...
>
> sc.parallelize(1 to 10).mapPartitions { iter =>
>   MySimpleApp.loadResources()
>
>   // do whatever you want with the iterator
> }
>   }
> }
>
>
>
>
>
> On Fri, Nov 25, 2016 at 2:33 PM, vineet chadha  > wrote:
>
>> Hi,
>>
>> I am trying to invoke C library from the Spark Stack using JNI
>> interface (here is sample  application code)
>>
>>
>> class SimpleApp {
>>  // ---Native methods
>> @native def foo (Top: String): String
>> }
>>
>> object SimpleApp  {
>>def main(args: Array[String]) {
>>
>> val conf = new SparkConf().setAppName("Simple
>> Application").set("SPARK_LIBRARY_PATH", "lib")
>> val sc = new SparkContext(conf)
>>  System.loadLibrary("foolib")
>> //instantiate the class
>>  val SimpleAppInstance = new SimpleApp
>> //String passing - Working
>> val ret = SimpleAppInstance.foo("fooString")
>>   }
>>
>> Above code works fine.
>>
>> I have set up LD_LIBRARY_PATH and spark.executor.extraClassPath,
>> spark.executor.extraLibraryPath at the worker node.
>>
>> How can I invoke the JNI library from the worker node? Where should I load
>> it in the executor?
>> Calling System.loadLibrary("foolib") inside the worker node gives me the
>> following error:
>>
>> Exception in thread "main" java.lang.UnsatisfiedLinkError:
>>
>> Any help would be really appreciated.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>

>>>
>>
>


Re: Third party library

2016-11-26 Thread kant kodali
Yes this is a Java JNI question. Nothing to do with Spark really.

 java.lang.UnsatisfiedLinkError typically would mean the way you set up
LD_LIBRARY_PATH is wrong, unless you tell us that it is working for other cases
but not this one.

On Sat, Nov 26, 2016 at 11:23 AM, Reynold Xin  wrote:

> That's just standard JNI and has nothing to do with Spark, does it?
>
>
> On Sat, Nov 26, 2016 at 11:19 AM, vineet chadha 
> wrote:
>
>> Thanks Reynold for quick reply.
>>
>>  I have tried following:
>>
>> class MySimpleApp {
>>  // ---Native methods
>>   @native def fooMethod (foo: String): String
>> }
>>
>> object MySimpleApp {
>>   val flag = false
>>   def loadResources() {
>> System.loadLibrary("foo-C-library")
>>   val flag = true
>>   }
>>   def main() {
>> sc.parallelize(1 to 10).mapPartitions ( iter => {
>>   if(flag == false){
>>   MySimpleApp.loadResources()
>>  val SimpleInstance = new MySimpleApp
>>   }
>>   SimpleInstance.fooMethod ("fooString")
>>   iter
>> })
>>   }
>> }
>>
>> I don't see a way to invoke fooMethod, which is implemented in
>> foo-C-library. Am I missing something? If possible, can you point me to an
>> existing implementation which I can refer to.
>>
>> Thanks again.
>>
>> ~
>>
>> On Fri, Nov 25, 2016 at 3:32 PM, Reynold Xin  wrote:
>>
>>> bcc dev@ and add user@
>>>
>>>
>>> This is more a user@ list question rather than a dev@ list question.
>>> You can do something like this:
>>>
>>> object MySimpleApp {
>>>   def loadResources(): Unit = // define some idempotent way to load
>>> resources, e.g. with a flag or lazy val
>>>
>>>   def main() = {
>>> ...
>>>
>>> sc.parallelize(1 to 10).mapPartitions { iter =>
>>>   MySimpleApp.loadResources()
>>>
>>>   // do whatever you want with the iterator
>>> }
>>>   }
>>> }
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Nov 25, 2016 at 2:33 PM, vineet chadha 
>>> wrote:
>>>
 Hi,

 I am trying to invoke C library from the Spark Stack using JNI
 interface (here is sample  application code)


 class SimpleApp {
  // ---Native methods
 @native def foo (Top: String): String
 }

 object SimpleApp  {
def main(args: Array[String]) {

 val conf = new SparkConf().setAppName("Simple
 Application").set("SPARK_LIBRARY_PATH", "lib")
 val sc = new SparkContext(conf)
  System.loadLibrary("foolib")
 //instantiate the class
  val SimpleAppInstance = new SimpleApp
 //String passing - Working
 val ret = SimpleAppInstance.foo("fooString")
   }

 Above code works fine.

 I have set up LD_LIBRARY_PATH and spark.executor.extraClassPath,
 spark.executor.extraLibraryPath at the worker node.

 How can I invoke the JNI library from the worker node? Where should I load it
 in the executor?
 Calling System.loadLibrary("foolib") inside the worker node gives me the
 following error:

 Exception in thread "main" java.lang.UnsatisfiedLinkError:

 Any help would be really appreciated.













>>>
>>
>


Re: Third party library

2016-11-26 Thread Reynold Xin
That's just standard JNI and has nothing to do with Spark, does it?


On Sat, Nov 26, 2016 at 11:19 AM, vineet chadha 
wrote:

> Thanks Reynold for quick reply.
>
>  I have tried following:
>
> class MySimpleApp {
>  // ---Native methods
>   @native def fooMethod (foo: String): String
> }
>
> object MySimpleApp {
>   val flag = false
>   def loadResources() {
> System.loadLibrary("foo-C-library")
>   val flag = true
>   }
>   def main() {
> sc.parallelize(1 to 10).mapPartitions ( iter => {
>   if(flag == false){
>   MySimpleApp.loadResources()
>  val SimpleInstance = new MySimpleApp
>   }
>   SimpleInstance.fooMethod ("fooString")
>   iter
> })
>   }
> }
>
> I don't see a way to invoke fooMethod, which is implemented in foo-C-library.
> Am I missing something? If possible, can you point me to an existing
> implementation which I can refer to.
>
> Thanks again.
>
> ~
>
> On Fri, Nov 25, 2016 at 3:32 PM, Reynold Xin  wrote:
>
>> bcc dev@ and add user@
>>
>>
>> This is more a user@ list question rather than a dev@ list question. You
>> can do something like this:
>>
>> object MySimpleApp {
>>   def loadResources(): Unit = // define some idempotent way to load
>> resources, e.g. with a flag or lazy val
>>
>>   def main() = {
>> ...
>>
>> sc.parallelize(1 to 10).mapPartitions { iter =>
>>   MySimpleApp.loadResources()
>>
>>   // do whatever you want with the iterator
>> }
>>   }
>> }
>>
>>
>>
>>
>>
>> On Fri, Nov 25, 2016 at 2:33 PM, vineet chadha 
>> wrote:
>>
>>> Hi,
>>>
>>> I am trying to invoke C library from the Spark Stack using JNI interface
>>> (here is sample  application code)
>>>
>>>
>>> class SimpleApp {
>>>  // ---Native methods
>>> @native def foo (Top: String): String
>>> }
>>>
>>> object SimpleApp  {
>>>def main(args: Array[String]) {
>>>
>>> val conf = new SparkConf().setAppName("Simple
>>> Application").set("SPARK_LIBRARY_PATH", "lib")
>>> val sc = new SparkContext(conf)
>>>  System.loadLibrary("foolib")
>>> //instantiate the class
>>>  val SimpleAppInstance = new SimpleApp
>>> //String passing - Working
>>> val ret = SimpleAppInstance.foo("fooString")
>>>   }
>>>
>>> Above code works fine.
>>>
>>> I have set up LD_LIBRARY_PATH and spark.executor.extraClassPath,
>>> spark.executor.extraLibraryPath at the worker node.
>>>
>>> How can I invoke the JNI library from the worker node? Where should I load
>>> it in the executor?
>>> Calling System.loadLibrary("foolib") inside the worker node gives me the
>>> following error:
>>>
>>> Exception in thread "main" java.lang.UnsatisfiedLinkError:
>>>
>>> Any help would be really appreciated.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>


Re: Third party library

2016-11-25 Thread Reynold Xin
bcc dev@ and add user@


This is more a user@ list question rather than a dev@ list question. You
can do something like this:

object MySimpleApp {
  def loadResources(): Unit = // define some idempotent way to load
resources, e.g. with a flag or lazy val

  def main() = {
...

sc.parallelize(1 to 10).mapPartitions { iter =>
  MySimpleApp.loadResources()

  // do whatever you want with the iterator
}
  }
}
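
A concrete sketch of the idempotent loading suggested above, using a lazy val
(the library name is a placeholder, and the library must still be resolvable on
every executor's java.library.path):

object MySimpleApp {
  // A lazy val body runs at most once per JVM, i.e. once per executor process,
  // no matter how many partitions (and hence loadResources calls) it handles.
  lazy val nativeLoaded: Boolean = {
    System.loadLibrary("foo-C-library")
    true
  }

  // Safe to call from every task/partition.
  def loadResources(): Unit = nativeLoaded
}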





On Fri, Nov 25, 2016 at 2:33 PM, vineet chadha 
wrote:

> Hi,
>
> I am trying to invoke C library from the Spark Stack using JNI interface
> (here is sample  application code)
>
>
> class SimpleApp {
>  // ---Native methods
> @native def foo (Top: String): String
> }
>
> object SimpleApp  {
>def main(args: Array[String]) {
>
> val conf = new 
> SparkConf().setAppName("SimpleApplication").set("SPARK_LIBRARY_PATH",
> "lib")
> val sc = new SparkContext(conf)
>  System.loadLibrary("foolib")
> //instantiate the class
>  val SimpleAppInstance = new SimpleApp
> //String passing - Working
> val ret = SimpleAppInstance.foo("fooString")
>   }
>
> Above code works fine.
>
> I have set up LD_LIBRARY_PATH and spark.executor.extraClassPath,
> spark.executor.extraLibraryPath at the worker node.
>
> How can I invoke the JNI library from the worker node? Where should I load it
> in the executor?
> Calling System.loadLibrary("foolib") inside the worker node gives me the
> following error:
>
> Exception in thread "main" java.lang.UnsatisfiedLinkError:
>
> Any help would be really appreciated.
>
>
>
>
>
>
>
>
>
>
>
>
>


Re: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread Lan Jiang
I am happy to report that after setting spark.driver.userClassPathFirst, I can use
protobuf 3 with spark-shell. Looks like the classloading issue was in the driver,
not the executor.

Marcelo, thank you very much for the tip!

Lan


> On Sep 15, 2015, at 1:40 PM, Marcelo Vanzin  wrote:
> 
> Hi,
> 
> Just "spark.executor.userClassPathFirst" is not enough. You should
> also set "spark.driver.userClassPathFirst". Also not that I don't
> think this was really tested with the shell, but that should work with
> regular apps started using spark-submit.
> 
> If that doesn't work, I'd recommend shading, as others already have.
> 
> On Tue, Sep 15, 2015 at 9:19 AM, Lan Jiang  wrote:
>> I used the --conf spark.files.userClassPathFirst=true option with spark-shell;
>> it still gave me the error java.lang.NoSuchFieldError: unknownFields
>> if I use protobuf 3.
>> 
>> The output says spark.files.userClassPathFirst is deprecated and suggests
>> using spark.executor.userClassPathFirst. I tried that and it did not work
>> either.
> 
> -- 
> Marcelo


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread Marcelo Vanzin
Hi,

Just "spark.executor.userClassPathFirst" is not enough. You should
also set "spark.driver.userClassPathFirst". Also not that I don't
think this was really tested with the shell, but that should work with
regular apps started using spark-submit.

If that doesn't work, I'd recommend shading, as others already have.
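
Concretely, with spark-submit that would look something like this (jar and class
names are placeholders):

spark-submit \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --class com.example.MyApp my-uber-with-protobuf3.jar

Both properties exist as of Spark 1.3; spark.files.userClassPathFirst and
spark.yarn.user.classpath.first are the older, deprecated names discussed
elsewhere in the thread.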

On Tue, Sep 15, 2015 at 9:19 AM, Lan Jiang  wrote:
> I used the --conf spark.files.userClassPathFirst=true option with spark-shell;
> it still gave me the error java.lang.NoSuchFieldError: unknownFields
> if I use protobuf 3.
>
> The output says spark.files.userClassPathFirst is deprecated and suggests
> using spark.executor.userClassPathFirst. I tried that and it did not work
> either.

-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread Steve Loughran


On 15 Sep 2015, at 05:47, Lan Jiang 
> wrote:

Hi, there,

I am using Spark 1.4.1. The protobuf 2.5 is included by Spark 1.4.1 by default. 
However, I would like to use Protobuf 3 in my spark application so that I can 
use some new features such as Map support.  Is there anyway to do that?

Right now if I build a uber.jar with dependencies including protobuf 3 classes 
and pass to spark-shell through --jars option, during the execution, I got the 
error java.lang.NoSuchFieldError: unknownFields.


protobuf is an absolute nightmare version-wise, as protoc generates 
incompatible java classes even across point versions. Hadoop 2.2+ is and will 
always be protobuf 2.5 only; that applies transitively to downstream projects  
(the great protobuf upgrade of 2013 was actually pushed by the HBase team, and 
required a co-ordinated change across multiple projects)


Is there anyway to use a different version of Protobuf other than the default 
one included in the Spark distribution? I guess I can generalize and extend the 
question to any third party libraries. How to deal with version conflict for 
any third party libraries included in the Spark distribution?

maven shading is the strategy. Generally it is needed less often, though the
troublesome binaries, across the entire apache big data stack, are:

google protobuf
google guava
kryo
jackson

you can generally bump up the other versions, at least by point releases.
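
As an illustration of the relocation idea only: the thread discusses the Maven
shade plugin, but the equivalent sketch with sbt-assembly in build.sbt looks
roughly like this (the target package name is a placeholder):

// Relocate protobuf 3 classes inside the uber jar so they cannot clash with the
// protobuf 2.5 that Spark/Hadoop already put on the classpath.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.protobuf.**" -> "myapp.shaded.protobuf.@1").inAll
)

The Maven shade plugin's relocation configuration achieves the same thing; in
both cases the references in your own compiled classes are rewritten to point at
the relocated copies.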


Re: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread Lan Jiang
I used the --conf spark.files.userClassPathFirst=true option with spark-shell;
it still gave me the error java.lang.NoSuchFieldError: unknownFields if
I use protobuf 3.

The output says spark.files.userClassPathFirst is deprecated and suggests using
spark.executor.userClassPathFirst. I tried that and it did not work either.

Lan



> On Sep 15, 2015, at 10:31 AM, java8964 <java8...@hotmail.com> wrote:
> 
> If you use Standalone mode, just start spark-shell like following:
> 
> spark-shell --jars your_uber_jar --conf spark.files.userClassPathFirst=true 
> 
> Yong
> 
> Date: Tue, 15 Sep 2015 09:33:40 -0500
> Subject: Re: Change protobuf version or any other third party library version 
> in Spark application
> From: ljia...@gmail.com
> To: java8...@hotmail.com
> CC: ste...@hortonworks.com; user@spark.apache.org
> 
> Steve,
> 
> Thanks for the input. You are absolutely right. When I use protobuf 2.6.1, I 
> also ran into method not defined errors. You suggest using the Maven shading
> strategy, but I have already built the uber jar to package all my custom 
> classes and its dependencies including protobuf 3. The problem is how to 
> configure spark shell to use my uber jar first. 
> 
> java8964 -- appreciate the link and I will try the configuration. Looks 
> promising. However, the "user classpath first" attribute does not apply to 
> spark-shell, am I correct? 
> 
> Lan
> 
> On Tue, Sep 15, 2015 at 8:24 AM, java8964 <java8...@hotmail.com 
> <mailto:java8...@hotmail.com>> wrote:
> It is a bad idea to use the major version change of protobuf, as it most 
> likely won't work.
> 
> But you really want to give it a try, set the "user classpath first", so the 
> protobuf 3 coming with your jar will be used.
> 
> The setting depends on your deployment mode, check this for the parameter:
> 
> https://issues.apache.org/jira/browse/SPARK-2996 
> <https://issues.apache.org/jira/browse/SPARK-2996>
> 
> Yong
> 
> Subject: Re: Change protobuf version or any other third party library version 
> in Spark application
> From: ste...@hortonworks.com <mailto:ste...@hortonworks.com>
> To: ljia...@gmail.com <mailto:ljia...@gmail.com>
> CC: user@spark.apache.org <mailto:user@spark.apache.org>
> Date: Tue, 15 Sep 2015 09:19:28 +
> 
> 
> 
> 
> On 15 Sep 2015, at 05:47, Lan Jiang <ljia...@gmail.com 
> <mailto:ljia...@gmail.com>> wrote:
> 
> Hi, there,
> 
> I am using Spark 1.4.1. The protobuf 2.5 is included by Spark 1.4.1 by 
> default. However, I would like to use Protobuf 3 in my spark application so 
> that I can use some new features such as Map support.  Is there anyway to do 
> that? 
> 
> Right now if I build a uber.jar with dependencies including protobuf 3 
> classes and pass to spark-shell through --jars option, during the execution, 
> I got the error java.lang.NoSuchFieldError: unknownFields. 
> 
> 
> protobuf is an absolute nightmare version-wise, as protoc generates 
> incompatible java classes even across point versions. Hadoop 2.2+ is and will 
> always be protobuf 2.5 only; that applies transitively to downstream projects 
>  (the great protobuf upgrade of 2013 was actually pushed by the HBase team, 
> and required a co-ordinated change across multiple projects)
> 
> 
> Is there anyway to use a different version of Protobuf other than the default 
> one included in the Spark distribution? I guess I can generalize and extend 
> the question to any third party libraries. How to deal with version conflict 
> for any third party libraries included in the Spark distribution? 
> 
> maven shading is the strategy. Generally it is less needed, though the 
> troublesome binaries are,  across the entire apache big data stack:
> 
> google protobuf
> google guava
> kryo
> jackson
> 
> you can generally bump up the other versions, at least by point releases.



Re: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread Guru Medasani
Hi Lan,

Reading the pull request below, it looks like you should be able to apply the
config to both drivers and executors. I would give it a try with spark-shell in
YARN client mode.

https://github.com/apache/spark/pull/3233 
<https://github.com/apache/spark/pull/3233>

Yarn's config option spark.yarn.user.classpath.first does not work the same way 
as
spark.files.userClassPathFirst; Yarn's version is a lot more dangerous, in that 
it
modifies the system classpath, instead of restricting the changes to the user's 
class
loader. So this change implements the behavior of the latter for Yarn, and 
deprecates
the more dangerous choice.

To be able to achieve feature-parity, I also implemented the option for drivers 
(the existing
option only applies to executors). So now there are two options, each 
controlling whether
to apply userClassPathFirst to the driver or executors. The old option was 
deprecated, and
aliased to the new one (spark.executor.userClassPathFirst).

The existing "child-first" class loader also had to be fixed. It didn't handle 
resources, and it
was also doing some things that ended up causing JVM errors depending on how 
things
were being called.


Guru Medasani
gdm...@gmail.com



> On Sep 15, 2015, at 9:33 AM, Lan Jiang <ljia...@gmail.com> wrote:
> 
> Steve,
> 
> Thanks for the input. You are absolutely right. When I use protobuf 2.6.1, I 
> also ran into method not defined errors. You suggest using the Maven shading
> strategy, but I have already built the uber jar to package all my custom 
> classes and its dependencies including protobuf 3. The problem is how to 
> configure spark shell to use my uber jar first. 
> 
> java8964 -- appreciate the link and I will try the configuration. Looks 
> promising. However, the "user classpath first" attribute does not apply to 
> spark-shell, am I correct? 
> 
> Lan
> 
> On Tue, Sep 15, 2015 at 8:24 AM, java8964 <java8...@hotmail.com 
> <mailto:java8...@hotmail.com>> wrote:
> It is a bad idea to use the major version change of protobuf, as it most 
> likely won't work.
> 
> But you really want to give it a try, set the "user classpath first", so the 
> protobuf 3 coming with your jar will be used.
> 
> The setting depends on your deployment mode, check this for the parameter:
> 
> https://issues.apache.org/jira/browse/SPARK-2996 
> <https://issues.apache.org/jira/browse/SPARK-2996>
> 
> Yong
> 
> Subject: Re: Change protobuf version or any other third party library version 
> in Spark application
> From: ste...@hortonworks.com <mailto:ste...@hortonworks.com>
> To: ljia...@gmail.com <mailto:ljia...@gmail.com>
> CC: user@spark.apache.org <mailto:user@spark.apache.org>
> Date: Tue, 15 Sep 2015 09:19:28 +
> 
> 
> 
> 
> On 15 Sep 2015, at 05:47, Lan Jiang <ljia...@gmail.com 
> <mailto:ljia...@gmail.com>> wrote:
> 
> Hi, there,
> 
> I am using Spark 1.4.1. The protobuf 2.5 is included by Spark 1.4.1 by 
> default. However, I would like to use Protobuf 3 in my spark application so 
> that I can use some new features such as Map support.  Is there anyway to do 
> that? 
> 
> Right now if I build a uber.jar with dependencies including protobuf 3 
> classes and pass to spark-shell through --jars option, during the execution, 
> I got the error java.lang.NoSuchFieldError: unknownFields. 
> 
> 
> protobuf is an absolute nightmare version-wise, as protoc generates 
> incompatible java classes even across point versions. Hadoop 2.2+ is and will 
> always be protobuf 2.5 only; that applies transitively to downstream projects 
>  (the great protobuf upgrade of 2013 was actually pushed by the HBase team, 
> and required a co-ordinated change across multiple projects)
> 
> 
> Is there anyway to use a different version of Protobuf other than the default 
> one included in the Spark distribution? I guess I can generalize and extend 
> the question to any third party libraries. How to deal with version conflict 
> for any third party libraries included in the Spark distribution? 
> 
> maven shading is the strategy. Generally it is less needed, though the 
> troublesome binaries are,  across the entire apache big data stack:
> 
> google protobuf
> google guava
> kryo
> jackson
> 
> you can generally bump up the other versions, at least by point releases.
> 



RE: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread java8964
If you use Standalone mode, just start spark-shell like the following:
spark-shell --jars your_uber_jar --conf spark.files.userClassPathFirst=true 
Yong
Date: Tue, 15 Sep 2015 09:33:40 -0500
Subject: Re: Change protobuf version or any other third party library version 
in Spark application
From: ljia...@gmail.com
To: java8...@hotmail.com
CC: ste...@hortonworks.com; user@spark.apache.org

Steve,
Thanks for the input. You are absolutely right. When I use protobuf 2.6.1, I 
also ran into method not defined errors. You suggest using the Maven shading
strategy, but I have already built the uber jar to package all my custom 
classes and its dependencies including protobuf 3. The problem is how to 
configure spark shell to use my uber jar first. 
java8964 -- appreciate the link and I will try the configuration. Looks 
promising. However, the "user classpath first" attribute does not apply to 
spark-shell, am I correct? 

Lan
On Tue, Sep 15, 2015 at 8:24 AM, java8964 <java8...@hotmail.com> wrote:



It is a bad idea to use the major version change of protobuf, as it most likely 
won't work.
But you really want to give it a try, set the "user classpath first", so the 
protobuf 3 coming with your jar will be used.
The setting depends on your deployment mode, check this for the parameter:
https://issues.apache.org/jira/browse/SPARK-2996
Yong

Subject: Re: Change protobuf version or any other third party library version 
in Spark application
From: ste...@hortonworks.com
To: ljia...@gmail.com
CC: user@spark.apache.org
Date: Tue, 15 Sep 2015 09:19:28 +













On 15 Sep 2015, at 05:47, Lan Jiang <ljia...@gmail.com> wrote:



Hi, there,



I am using Spark 1.4.1. The protobuf 2.5 is included by Spark 1.4.1 by default. 
However, I would like to use Protobuf 3 in my spark application so that I can 
use some new features such as Map support.  Is there anyway to do that? 



Right now if I build a uber.jar with dependencies including protobuf 3 classes 
and pass to spark-shell through --jars option, during the execution, I got the 
error java.lang.NoSuchFieldError: unknownFields. 









protobuf is an absolute nightmare version-wise, as protoc generates 
incompatible java classes even across point versions. Hadoop 2.2+ is and will 
always be protobuf 2.5 only; that applies transitively to downstream projects  
(the great protobuf upgrade
 of 2013 was actually pushed by the HBase team, and required a co-ordinated 
change across multiple projects)








Is there anyway to use a different version of Protobuf other than the default 
one included in the Spark distribution? I guess I can generalize and extend the 
question to any third party libraries. How to deal with version conflict for 
any third
 party libraries included in the Spark distribution? 







maven shading is the strategy. Generally it is less needed, though the 
troublesome binaries are,  across the entire apache big data stack:


google protobuf
google guava
kryo

jackson



you can generally bump up the other versions, at least by point releases.   
  

  

Re: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread Lan Jiang
Steve,

Thanks for the input. You are absolutely right. When I use protobuf 2.6.1,
I also ran into method not defined errors. You suggest using the Maven shading
strategy, but I have already built the uber jar to package all my custom
classes and its dependencies including protobuf 3. The problem is how to
configure spark shell to use my uber jar first.

java8964 -- appreciate the link and I will try the configuration. Looks
promising. However, the "user classpath first" attribute does not apply to
spark-shell, am I correct?

Lan

On Tue, Sep 15, 2015 at 8:24 AM, java8964 <java8...@hotmail.com> wrote:

> It is a bad idea to use the major version change of protobuf, as it most
> likely won't work.
>
> But you really want to give it a try, set the "user classpath first", so
> the protobuf 3 coming with your jar will be used.
>
> The setting depends on your deployment mode, check this for the parameter:
>
> https://issues.apache.org/jira/browse/SPARK-2996
>
> Yong
>
> ------
> Subject: Re: Change protobuf version or any other third party library
> version in Spark application
> From: ste...@hortonworks.com
> To: ljia...@gmail.com
> CC: user@spark.apache.org
> Date: Tue, 15 Sep 2015 09:19:28 +
>
>
>
>
> On 15 Sep 2015, at 05:47, Lan Jiang <ljia...@gmail.com> wrote:
>
> Hi, there,
>
> I am using Spark 1.4.1. The protobuf 2.5 is included by Spark 1.4.1 by
> default. However, I would like to use Protobuf 3 in my spark application so
> that I can use some new features such as Map support.  Is there anyway to
> do that?
>
> Right now if I build a uber.jar with dependencies including protobuf 3
> classes and pass to spark-shell through --jars option, during the
> execution, I got the error *java.lang.NoSuchFieldError: unknownFields. *
>
>
>
> protobuf is an absolute nightmare version-wise, as protoc generates
> incompatible java classes even across point versions. Hadoop 2.2+ is and
> will always be protobuf 2.5 only; that applies transitively to downstream
> projects  (the great protobuf upgrade of 2013 was actually pushed by the
> HBase team, and required a co-ordinated change across multiple projects)
>
>
> Is there anyway to use a different version of Protobuf other than the
> default one included in the Spark distribution? I guess I can generalize
> and extend the question to any third party libraries. How to deal with
> version conflict for any third party libraries included in the Spark
> distribution?
>
>
> maven shading is the strategy. Generally it is less needed, though the
> troublesome binaries are,  across the entire apache big data stack:
>
> google protobuf
> google guava
> kryo
> jackson
>
> you can generally bump up the other versions, at least by point releases.
>
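
For reference, a hedged sketch of how the "user classpath first" behaviour is
usually requested on the command line in the 1.3+/1.4.x releases. Whether it
takes effect for spark-shell is exactly the open question above; the property
names come from the standard Spark configuration page (where both are marked
experimental), not from this thread:

  spark-shell \
    --conf spark.driver.userClassPathFirst=true \
    --conf spark.executor.userClassPathFirst=true \
    --jars /path/to/my-uber.jar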


RE: Change protobuf version or any other third party library version in Spark application

2015-09-15 Thread java8964
It is a bad idea to use a major version change of protobuf, as it most likely
won't work.
But if you really want to give it a try, set "user classpath first", so the
protobuf 3 coming with your jar will be used.
The setting depends on your deployment mode; check this for the parameter:
https://issues.apache.org/jira/browse/SPARK-2996
Yong

Subject: Re: Change protobuf version or any other third party library version 
in Spark application
From: ste...@hortonworks.com
To: ljia...@gmail.com
CC: user@spark.apache.org
Date: Tue, 15 Sep 2015 09:19:28 +

On 15 Sep 2015, at 05:47, Lan Jiang <ljia...@gmail.com> wrote:

Hi, there,

I am using Spark 1.4.1. The protobuf 2.5 is included by Spark 1.4.1 by default.
However, I would like to use Protobuf 3 in my spark application so that I can
use some new features such as Map support. Is there anyway to do that?

Right now if I build a uber.jar with dependencies including protobuf 3 classes
and pass to spark-shell through --jars option, during the execution, I got the
error java.lang.NoSuchFieldError: unknownFields.


protobuf is an absolute nightmare version-wise, as protoc generates
incompatible java classes even across point versions. Hadoop 2.2+ is and will
always be protobuf 2.5 only; that applies transitively to downstream projects
(the great protobuf upgrade of 2013 was actually pushed by the HBase team, and
required a co-ordinated change across multiple projects)


Is there anyway to use a different version of Protobuf other than the default
one included in the Spark distribution? I guess I can generalize and extend the
question to any third party libraries. How to deal with version conflict for
any third party libraries included in the Spark distribution?


maven shading is the strategy. Generally it is less needed, though the
troublesome binaries are, across the entire apache big data stack:

google protobuf
google guava
kryo
jackson

you can generally bump up the other versions, at least by point releases.
  

Change protobuf version or any other third party library version in Spark application

2015-09-14 Thread Lan Jiang
Hi, there,

I am using Spark 1.4.1. Protobuf 2.5 is included by Spark 1.4.1 by
default. However, I would like to use Protobuf 3 in my Spark application so
that I can use some new features such as Map support. Is there any way to
do that?

Right now, if I build an uber jar with dependencies including protobuf 3
classes and pass it to spark-shell through the --jars option, I get the
error java.lang.NoSuchFieldError: unknownFields during execution.

Is there any way to use a different version of Protobuf other than the
default one included in the Spark distribution? I guess I can generalize
and extend the question to any third-party libraries. How should I deal
with version conflicts for third-party libraries included in the Spark
distribution?

Thanks!

Lan
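
As a quick way to confirm which protobuf actually wins on the classpath at run
time, one hedged diagnostic (plain Java, nothing Spark-specific; it can be run
on the driver and inside an executor task) is to print where a protobuf class
was loaded from:

  // Prints the jar that supplied com.google.protobuf.Descriptors. If it is a
  // Spark/Hadoop jar rather than the application's uber jar, the bundled
  // protobuf 2.5 is shadowing protobuf 3 and NoSuchFieldError is expected.
  System.out.println(
      com.google.protobuf.Descriptors.class
          .getProtectionDomain().getCodeSource().getLocation());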


Re: Why can't Spark Streaming recover from the checkpoint directory when using a third party library for processing multi-line JSON?

2015-03-04 Thread Emre Sevinc
I've also tried the following:

Configuration hadoopConfiguration = new Configuration();
hadoopConfiguration.set("multilinejsoninputformat.member", "itemSet");

JavaStreamingContext ssc =
JavaStreamingContext.getOrCreate(checkpointDirectory, hadoopConfiguration,
factory, false);


but I still get the same exception.

Why does getOrCreate ignore that Hadoop configuration during recovery (it
normally works, e.g. when not recovering)?

--
Emre


On Tue, Mar 3, 2015 at 3:36 PM, Emre Sevinc emre.sev...@gmail.com wrote:

 Hello,

 I have a Spark Streaming application (that uses Spark 1.2.1) that listens
 to an input directory, and when new JSON files are copied to that directory
 processes them, and writes them to an output directory.

 It uses a 3rd party library to process the multi-line JSON files (
 https://github.com/alexholmes/json-mapreduce). You can see the relevant
 part of the streaming application at:

   https://gist.github.com/emres/ec18ee264e4eb0dd8f1a

 When I run this application locally, it works perfectly fine. But then I
 wanted to test whether it could recover from failure, e.g. if I stopped it
 right in the middle of processing some files. I started the streaming
 application, copied 100 files to the input directory, and hit Ctrl+C when
 it has alread processed about 50 files:

 ...
 2015-03-03 15:06:20 INFO  FileInputFormat:280 - Total input paths to
 process : 1
 2015-03-03 15:06:20 INFO  FileInputFormat:280 - Total input paths to
 process : 1
 2015-03-03 15:06:20 INFO  FileInputFormat:280 - Total input paths to
 process : 1
 [Stage
 0:==
 (65 + 4) / 100]
 ^C

 Then I started the application again, expecting that it could recover from
 the checkpoint. For a while it started to read files again and then gave an
 exception:

 ...
 2015-03-03 15:06:39 INFO  FileInputFormat:280 - Total input paths to
 process : 1
 2015-03-03 15:06:39 INFO  FileInputFormat:280 - Total input paths to
 process : 1
 2015-03-03 15:06:39 WARN  SchemaValidatorDriver:145 -
  * * * hadoopConfiguration: itemSet
 2015-03-03 15:06:40 ERROR Executor:96 - Exception in task 0.0 in stage 0.0
 (TID 0)
 java.io.IOException: Missing configuration value for
 multilinejsoninputformat.member
 at
 com.alexholmes.json.mapreduce.MultiLineJsonInputFormat.createRecordReader(MultiLineJsonInputFormat.java:30)
 at
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:133)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:245)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:245)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)

 Since in the exception it refers to a missing configuration
 multilinejsoninputformat.member, I think it is about the following line:

ssc.ssc().sc().hadoopConfiguration().set(
 multilinejsoninputformat.member, itemSet);

 And this is why I also log the value of it, and as you can see above, just
 before it gives the exception in the recovery process, it shows that 
 multilinejsoninputformat.member
 is set to itemSet. But somehow it is not found during the recovery.
 This exception happens only when it tries to recover from a previously
 interrupted run.

 I've also tried moving the above line into the createContext method, but
 still had the same exception.

 Why is that?

 And how can I work around it?

 --
 Emre Sevinç
 http://www.bigindustries.be/




-- 
Emre Sevinc


Re: Why can't Spark Streaming recover from the checkpoint directory when using a third party library for processing multi-line JSON?

2015-03-04 Thread Tathagata Das
That could be a corner case bug. How do you add the 3rd party library to
the class path of the driver? Through spark-submit? Could you give the
command you used?

TD

On Wed, Mar 4, 2015 at 12:42 AM, Emre Sevinc emre.sev...@gmail.com wrote:

 I've also tried the following:

 Configuration hadoopConfiguration = new Configuration();
 hadoopConfiguration.set(multilinejsoninputformat.member, itemSet);

 JavaStreamingContext ssc =
 JavaStreamingContext.getOrCreate(checkpointDirectory, hadoopConfiguration,
 factory, false);


 but I still get the same exception.

 Why doesn't getOrCreate ignore that Hadoop configuration part (which
 normally works, e.g. when not recovering)?

 --
 Emre


 On Tue, Mar 3, 2015 at 3:36 PM, Emre Sevinc emre.sev...@gmail.com wrote:

 Hello,

 I have a Spark Streaming application (that uses Spark 1.2.1) that listens
 to an input directory, and when new JSON files are copied to that directory
 processes them, and writes them to an output directory.

 It uses a 3rd party library to process the multi-line JSON files (
 https://github.com/alexholmes/json-mapreduce). You can see the relevant
 part of the streaming application at:

   https://gist.github.com/emres/ec18ee264e4eb0dd8f1a

 When I run this application locally, it works perfectly fine. But then I
 wanted to test whether it could recover from failure, e.g. if I stopped it
 right in the middle of processing some files. I started the streaming
 application, copied 100 files to the input directory, and hit Ctrl+C when
 it has alread processed about 50 files:

 ...
 2015-03-03 15:06:20 INFO  FileInputFormat:280 - Total input paths to
 process : 1
 2015-03-03 15:06:20 INFO  FileInputFormat:280 - Total input paths to
 process : 1
 2015-03-03 15:06:20 INFO  FileInputFormat:280 - Total input paths to
 process : 1
 [Stage
 0:==
 (65 + 4) / 100]
 ^C

 Then I started the application again, expecting that it could recover
 from the checkpoint. For a while it started to read files again and then
 gave an exception:

 ...
 2015-03-03 15:06:39 INFO  FileInputFormat:280 - Total input paths to
 process : 1
 2015-03-03 15:06:39 INFO  FileInputFormat:280 - Total input paths to
 process : 1
 2015-03-03 15:06:39 WARN  SchemaValidatorDriver:145 -
  * * * hadoopConfiguration: itemSet
 2015-03-03 15:06:40 ERROR Executor:96 - Exception in task 0.0 in stage
 0.0 (TID 0)
 java.io.IOException: Missing configuration value for
 multilinejsoninputformat.member
 at
 com.alexholmes.json.mapreduce.MultiLineJsonInputFormat.createRecordReader(MultiLineJsonInputFormat.java:30)
 at
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:133)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:245)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:245)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)

 Since in the exception it refers to a missing configuration
 multilinejsoninputformat.member, I think it is about the following line:

ssc.ssc().sc().hadoopConfiguration().set(
 multilinejsoninputformat.member, itemSet);

 And this is why I also log the value of it, and as you can see above,
 just before it gives the exception in the recovery process, it shows that 
 multilinejsoninputformat.member
 is set to itemSet. But somehow it is not found during the recovery.
 This exception happens only when it tries to recover from a previously
 interrupted run.

 I've also tried moving the above line into the createContext method,
 but still had the same exception.

 Why is that?

 And how can I work around it?

 --
 Emre Sevinç
 http://www.bigindustries.be/




 

Re: Why can't Spark Streaming recover from the checkpoint directory when using a third party library for processing multi-line JSON?

2015-03-04 Thread Emre Sevinc
I'm adding this 3rd party library to my Maven pom.xml file so that it's
embedded into the JAR I send to spark-submit:

  <dependency>
    <groupId>json-mapreduce</groupId>
    <artifactId>json-mapreduce</artifactId>
    <version>1.0-SNAPSHOT</version>
    <exclusions>
      <exclusion>
        <groupId>javax.servlet</groupId>
        <artifactId>*</artifactId>
      </exclusion>
      <exclusion>
        <groupId>commons-io</groupId>
        <artifactId>*</artifactId>
      </exclusion>
      <exclusion>
        <groupId>commons-lang</groupId>
        <artifactId>*</artifactId>
      </exclusion>
      <exclusion>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
      </exclusion>
    </exclusions>
  </dependency>


Then I build my über JAR, and then I run my Spark Streaming application via
the command line:

 spark-submit --class com.example.schemavalidator.SchemaValidatorDriver
--master local[4] --deploy-mode client target/myapp-1.0-SNAPSHOT.jar

--
Emre Sevinç


On Wed, Mar 4, 2015 at 11:19 AM, Tathagata Das t...@databricks.com wrote:

 That could be a corner case bug. How do you add the 3rd party library to
 the class path of the driver? Through spark-submit? Could you give the
 command you used?

 TD

 On Wed, Mar 4, 2015 at 12:42 AM, Emre Sevinc emre.sev...@gmail.com
 wrote:

 I've also tried the following:

 Configuration hadoopConfiguration = new Configuration();
 hadoopConfiguration.set(multilinejsoninputformat.member, itemSet);

 JavaStreamingContext ssc =
 JavaStreamingContext.getOrCreate(checkpointDirectory, hadoopConfiguration,
 factory, false);


 but I still get the same exception.

 Why doesn't getOrCreate ignore that Hadoop configuration part (which
 normally works, e.g. when not recovering)?

 --
 Emre


 On Tue, Mar 3, 2015 at 3:36 PM, Emre Sevinc emre.sev...@gmail.com
 wrote:

 Hello,

 I have a Spark Streaming application (that uses Spark 1.2.1) that
 listens to an input directory, and when new JSON files are copied to that
 directory processes them, and writes them to an output directory.

 It uses a 3rd party library to process the multi-line JSON files (
 https://github.com/alexholmes/json-mapreduce). You can see the relevant
 part of the streaming application at:

   https://gist.github.com/emres/ec18ee264e4eb0dd8f1a

 When I run this application locally, it works perfectly fine. But then I
 wanted to test whether it could recover from failure, e.g. if I stopped it
 right in the middle of processing some files. I started the streaming
 application, copied 100 files to the input directory, and hit Ctrl+C when
 it has alread processed about 50 files:

 ...
 2015-03-03 15:06:20 INFO  FileInputFormat:280 - Total input paths to
 process : 1
 2015-03-03 15:06:20 INFO  FileInputFormat:280 - Total input paths to
 process : 1
 2015-03-03 15:06:20 INFO  FileInputFormat:280 - Total input paths to
 process : 1
 [Stage
 0:==
 (65 + 4) / 100]
 ^C

 Then I started the application again, expecting that it could recover
 from the checkpoint. For a while it started to read files again and then
 gave an exception:

 ...
 2015-03-03 15:06:39 INFO  FileInputFormat:280 - Total input paths to
 process : 1
 2015-03-03 15:06:39 INFO  FileInputFormat:280 - Total input paths to
 process : 1
 2015-03-03 15:06:39 WARN  SchemaValidatorDriver:145 -
  * * * hadoopConfiguration: itemSet
 2015-03-03 15:06:40 ERROR Executor:96 - Exception in task 0.0 in stage
 0.0 (TID 0)
 java.io.IOException: Missing configuration value for
 multilinejsoninputformat.member
 at
 com.alexholmes.json.mapreduce.MultiLineJsonInputFormat.createRecordReader(MultiLineJsonInputFormat.java:30)
 at
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:133)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:245)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:245)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at 

Why can't Spark Streaming recover from the checkpoint directory when using a third party library for processing multi-line JSON?

2015-03-03 Thread Emre Sevinc
Hello,

I have a Spark Streaming application (that uses Spark 1.2.1) that listens
to an input directory, and when new JSON files are copied to that directory
processes them, and writes them to an output directory.

It uses a 3rd party library to process the multi-line JSON files (
https://github.com/alexholmes/json-mapreduce). You can see the relevant
part of the streaming application at:

  https://gist.github.com/emres/ec18ee264e4eb0dd8f1a

When I run this application locally, it works perfectly fine. But then I
wanted to test whether it could recover from failure, e.g. if I stopped it
right in the middle of processing some files. I started the streaming
application, copied 100 files to the input directory, and hit Ctrl+C when
it had already processed about 50 files:

...
2015-03-03 15:06:20 INFO  FileInputFormat:280 - Total input paths to
process : 1
2015-03-03 15:06:20 INFO  FileInputFormat:280 - Total input paths to
process : 1
2015-03-03 15:06:20 INFO  FileInputFormat:280 - Total input paths to
process : 1
[Stage 0:==  (65 + 4) / 100]
^C

Then I started the application again, expecting that it could recover from
the checkpoint. For a while it started to read files again and then gave an
exception:

...
2015-03-03 15:06:39 INFO  FileInputFormat:280 - Total input paths to
process : 1
2015-03-03 15:06:39 INFO  FileInputFormat:280 - Total input paths to
process : 1
2015-03-03 15:06:39 WARN  SchemaValidatorDriver:145 -
 * * * hadoopConfiguration: itemSet
2015-03-03 15:06:40 ERROR Executor:96 - Exception in task 0.0 in stage 0.0
(TID 0)
java.io.IOException: Missing configuration value for
multilinejsoninputformat.member
at
com.alexholmes.json.mapreduce.MultiLineJsonInputFormat.createRecordReader(MultiLineJsonInputFormat.java:30)
at
org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:133)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:245)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:245)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Since in the exception it refers to a missing configuration
multilinejsoninputformat.member, I think it is about the following line:

   ssc.ssc().sc().hadoopConfiguration().set("multilinejsoninputformat.member", "itemSet");

And this is why I also log the value of it, and as you can see above, just
before it gives the exception in the recovery process, it shows that
multilinejsoninputformat.member
is set to itemSet. But somehow it is not found during the recovery. This
exception happens only when it tries to recover from a previously
interrupted run.

I've also tried moving the above line into the createContext method, but
still had the same exception.

Why is that?

And how can I work around it?

-- 
Emre Sevinç
http://www.bigindustries.be/
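
For readers following along, here is a minimal, hedged sketch of the
getOrCreate/factory pattern this thread is about, in the Spark 1.2-era Java
API that the messages above already use. It is not the code from the gist;
the class name, directory names and the DStream construction are placeholders,
and it is only meant to show where the Hadoop configuration value is set and
how recovery from the checkpoint differs from a fresh start:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.spark.SparkConf;
  import org.apache.spark.streaming.Duration;
  import org.apache.spark.streaming.api.java.JavaStreamingContext;
  import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;

  public class CheckpointedJsonApp {
    public static void main(String[] args) throws Exception {
      final String checkpointDirectory = "/tmp/checkpoint"; // placeholder
      final String inputDirectory = "/tmp/input";           // placeholder

      JavaStreamingContextFactory factory = new JavaStreamingContextFactory() {
        @Override
        public JavaStreamingContext create() {
          SparkConf conf = new SparkConf().setAppName("SchemaValidatorDriver");
          JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(30000));
          ssc.checkpoint(checkpointDirectory);
          // The setting discussed above, placed inside the factory. On a restart,
          // getOrCreate rebuilds the context from the checkpoint and does NOT call
          // this factory, which may be part of why moving the line in here did not
          // by itself make the value available during recovery.
          ssc.ssc().sc().hadoopConfiguration().set("multilinejsoninputformat.member", "itemSet");
          // ... build the multi-line JSON DStream here using MultiLineJsonInputFormat
          // (see the gist linked above) and wire up the validation/output logic ...
          return ssc;
        }
      };

      // Same call shape as in the messages above: checkpoint path, Hadoop conf,
      // factory, createOnError flag.
      JavaStreamingContext ssc = JavaStreamingContext.getOrCreate(
          checkpointDirectory, new Configuration(), factory, false);
      ssc.start();
      ssc.awaitTermination();
    }
  }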