Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

2016-07-24 Thread VG
Ping. Does anyone have any suggestions/advice for me?
It would be really helpful.

VG

On Sun, Jul 24, 2016 at 12:19 AM, VG  wrote:

> Sean,
>
> I did this just to test the model. When I do a split of my data as
> training to 80% and test to be 20%
>
> I get a Root-mean-square error = NaN
>
> So I am wondering where I might be going wrong
>
> Regards,
> VG
>
> On Sun, Jul 24, 2016 at 12:12 AM, Sean Owen  wrote:
>
>> No, that's certainly not to be expected. ALS works by computing a much
>> lower-rank representation of the input. It would not reproduce the
>> input exactly, and you don't want it to -- this would be seriously
>> overfit. This is why in general you don't evaluate a model on the
>> training set.
>>
>> On Sat, Jul 23, 2016 at 7:37 PM, VG  wrote:
>> > I am trying to run ml.ALS to compute some recommendations.
>> >
>> > Just to test, I am using the same dataset both for training the ALSModel
>> > and for predicting the results based on the model.
>> >
>> > When I evaluate the result using RegressionEvaluator I get a
>> > Root-mean-square error = 1.5544064263236066
>> >
>> > I think this should be 0. Any suggestions on what might be going wrong?
>> >
>> > Regards,
>> > Vipul
>>
>
>
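
For reference, a minimal Java sketch of the evaluation Sean is describing, with a held-out 20% test split. The column names ("userId", "itemId", "rating") and the input Dataset are assumptions rather than VG's actual code; the isnan filter is one common workaround for the NaN RMSE, since users or items that appear only in the test split get NaN predictions from ALS.

import org.apache.spark.ml.evaluation.RegressionEvaluator;
import org.apache.spark.ml.recommendation.ALS;
import org.apache.spark.ml.recommendation.ALSModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// "ratings" is a hypothetical Dataset<Row> with columns userId, itemId, rating.
Dataset<Row>[] splits = ratings.randomSplit(new double[]{0.8, 0.2});
Dataset<Row> training = splits[0];
Dataset<Row> test = splits[1];

ALS als = new ALS()
    .setUserCol("userId")
    .setItemCol("itemId")
    .setRatingCol("rating");
ALSModel model = als.fit(training);

// Users/items seen only in the test split get NaN predictions, which makes
// the overall RMSE NaN; drop those rows before evaluating.
Dataset<Row> predictions = model.transform(test).filter("NOT isnan(prediction)");

RegressionEvaluator evaluator = new RegressionEvaluator()
    .setMetricName("rmse")
    .setLabelCol("rating")
    .setPredictionCol("prediction");
double rmse = evaluator.evaluate(predictions);
System.out.println("Root-mean-square error = " + rmse);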


Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

2016-07-23 Thread VG
Any suggestions / ideas here ?



On Sun, Jul 24, 2016 at 12:19 AM, VG  wrote:

> Sean,
>
> I did this just to test the model. When I do a split of my data as
> training to 80% and test to be 20%
>
> I get a Root-mean-square error = NaN
>
> So I am wondering where I might be going wrong
>
> Regards,
> VG
>
> On Sun, Jul 24, 2016 at 12:12 AM, Sean Owen  wrote:
>
>> No, that's certainly not to be expected. ALS works by computing a much
>> lower-rank representation of the input. It would not reproduce the
>> input exactly, and you don't want it to -- this would be seriously
>> overfit. This is why in general you don't evaluate a model on the
>> training set.
>>
>> On Sat, Jul 23, 2016 at 7:37 PM, VG  wrote:
>> > I am trying to run ml.ALS to compute some recommendations.
>> >
>> > Just to test, I am using the same dataset both for training the ALSModel
>> > and for predicting the results based on the model.
>> >
>> > When I evaluate the result using RegressionEvaluator I get a
>> > Root-mean-square error = 1.5544064263236066
>> >
>> > I think this should be 0. Any suggestions on what might be going wrong?
>> >
>> > Regards,
>> > Vipul
>>
>
>


Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

2016-07-23 Thread VG
Sean,

I did this just to test the model. When I split my data into 80% for training
and 20% for test,

I get a Root-mean-square error = NaN

So I am wondering where I might be going wrong

Regards,
VG

On Sun, Jul 24, 2016 at 12:12 AM, Sean Owen  wrote:

> No, that's certainly not to be expected. ALS works by computing a much
> lower-rank representation of the input. It would not reproduce the
> input exactly, and you don't want it to -- this would be seriously
> overfit. This is why in general you don't evaluate a model on the
> training set.
>
> On Sat, Jul 23, 2016 at 7:37 PM, VG  wrote:
> > I am trying to run ml.ALS to compute some recommendations.
> >
> > Just to test, I am using the same dataset both for training the ALSModel
> > and for predicting the results based on the model.
> >
> > When I evaluate the result using RegressionEvaluator I get a
> > Root-mean-square error = 1.5544064263236066
> >
> > I think this should be 0. Any suggestions on what might be going wrong?
> >
> > Regards,
> > Vipul
>


Re: Error in collecting RDD as a Map - IOException in collectAsMap

2016-07-23 Thread VG
Hi Pedro,

Based on your suggestion, I deployed this on an AWS node and it worked fine.
Thanks for your advice.

I am still trying to figure out the issues in the local environment.
Anyway, thanks again.

-VG

On Sat, Jul 23, 2016 at 9:26 PM, Pedro Rodriguez 
wrote:

> Have you changed spark-env.sh or spark-defaults.conf from the default? It
> looks like spark is trying to address local workers based on a network
> address (eg 192.168……) instead of on localhost (localhost, 127.0.0.1,
> 0.0.0.0,…). Additionally, that network address doesn’t resolve correctly.
> You might also check /etc/hosts to make sure that you don’t have anything
> weird going on.
>
> Last thing to try perhaps is that are you running Spark within a VM and/or
> Docker? If networking isn’t setup correctly on those you may also run into
> trouble.
>
> What would be helpful is to know everything about your setup that might
> affect networking.
>
> —
> Pedro Rodriguez
> PhD Student in Large-Scale Machine Learning | CU Boulder
> Systems Oriented Data Scientist
> UC Berkeley AMPLab Alumni
>
> pedrorodriguez.io | 909-353-4423
> github.com/EntilZha | LinkedIn
> <https://www.linkedin.com/in/pedrorodriguezscience>
>
> On July 23, 2016 at 9:10:31 AM, VG (vlin...@gmail.com) wrote:
>
> Hi pedro,
>
> Apologies for not adding this earlier.
>
> This is running on a local cluster set up as follows.
> JavaSparkContext jsc = new JavaSparkContext("local[2]", "DR");
>
> Any suggestions based on this ?
>
> The ports are not blocked by firewall.
>
> Regards,
>
>
>
> On Sat, Jul 23, 2016 at 8:35 PM, Pedro Rodriguez 
> wrote:
>
>> Make sure that you don’t have ports firewalled. You don’t really give
>> much information to work from, but it looks like the master can’t access
>> the worker nodes for some reason. If you give more information on the
>> cluster, networking, etc, it would help.
>>
>> For example, on AWS you can create a security group which allows all
>> traffic to/from itself to itself. If you are using something like ufw on
>> ubuntu then you probably need to know the ip addresses of the worker nodes
>> beforehand.
>>
>> —
>> Pedro Rodriguez
>> PhD Student in Large-Scale Machine Learning | CU Boulder
>> Systems Oriented Data Scientist
>> UC Berkeley AMPLab Alumni
>>
>> pedrorodriguez.io | 909-353-4423
>> github.com/EntilZha | LinkedIn
>> <https://www.linkedin.com/in/pedrorodriguezscience>
>>
>> On July 23, 2016 at 7:38:01 AM, VG (vlin...@gmail.com) wrote:
>>
>> Please suggest if I am doing something wrong or an alternative way of
>> doing this.
>>
>> I have an RDD with two values as follows
>> JavaPairRDD rdd
>>
>> When I execute rdd.collectAsMap()
>> it always fails with IO exceptions.
>>
>>
>> 16/07/23 19:03:58 ERROR RetryingBlockFetcher: Exception while beginning
>> fetch of 1 outstanding blocks
>> java.io.IOException: Failed to connect to /192.168.1.3:58179
>> at
>> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
>> at
>> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
>> at
>> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:96)
>> at
>> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>> at
>> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>> at
>> org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:105)
>> at
>> org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:92)
>> at
>> org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:546)
>> at
>> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:76)
>> at
>> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>> at
>> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793)
>> at
>> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>> at java.lang.Thread.run(Unknown 

Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

2016-07-23 Thread VG
I am trying to run ml.ALS to compute some recommendations.

Just to test, I am using the same dataset both for training the ALSModel and
for predicting the results based on the model.

When I evaluate the result using RegressionEvaluator I get a
Root-mean-square error = 1.5544064263236066

I think this should be 0. Any suggestions on what might be going wrong?

Regards,
Vipul


Re: Error in collecting RDD as a Map - IOException in collectAsMap

2016-07-23 Thread VG
Hi Pedro,

Apologies for not adding this earlier.

This is running on a local cluster set up as follows.
JavaSparkContext jsc = new JavaSparkContext("local[2]", "DR");

Any suggestions based on this ?

The ports are not blocked by firewall.

Regards,



On Sat, Jul 23, 2016 at 8:35 PM, Pedro Rodriguez 
wrote:

> Make sure that you don’t have ports firewalled. You don’t really give much
> information to work from, but it looks like the master can’t access the
> worker nodes for some reason. If you give more information on the cluster,
> networking, etc, it would help.
>
> For example, on AWS you can create a security group which allows all
> traffic to/from itself to itself. If you are using something like ufw on
> ubuntu then you probably need to know the ip addresses of the worker nodes
> beforehand.
>
> —
> Pedro Rodriguez
> PhD Student in Large-Scale Machine Learning | CU Boulder
> Systems Oriented Data Scientist
> UC Berkeley AMPLab Alumni
>
> pedrorodriguez.io | 909-353-4423
> github.com/EntilZha | LinkedIn
> <https://www.linkedin.com/in/pedrorodriguezscience>
>
> On July 23, 2016 at 7:38:01 AM, VG (vlin...@gmail.com) wrote:
>
> Please suggest if I am doing something wrong or an alternative way of
> doing this.
>
> I have an RDD with two values as follows
> JavaPairRDD rdd
>
> When I execute rdd.collectAsMap()
> it always fails with IO exceptions.
>
>
> 16/07/23 19:03:58 ERROR RetryingBlockFetcher: Exception while beginning
> fetch of 1 outstanding blocks
> java.io.IOException: Failed to connect to /192.168.1.3:58179
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
> at
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:96)
> at
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
> at
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
> at
> org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:105)
> at
> org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:92)
> at
> org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:546)
> at
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:76)
> at
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
> at
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793)
> at
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.lang.Thread.run(Unknown Source)
> Caused by: java.net.ConnectException: Connection timed out: no further
> information: /192.168.1.3:58179
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
> at
> io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
> at
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> at
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> ... 1 more
> 16/07/23 19:03:58 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 1
> outstanding blocks after 5000 ms
>
>
>
>


Error in collecting RDD as a Map - IOException in collectAsMap

2016-07-23 Thread VG
Please suggest if I am doing something wrong or an alternative way of doing
this.

I have an RDD with two values as follows
JavaPairRDD rdd

When I execute rdd.collectAsMap()
it always fails with IO exceptions.


16/07/23 19:03:58 ERROR RetryingBlockFetcher: Exception while beginning
fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to /192.168.1.3:58179
at
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
at
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
at
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:96)
at
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at
org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
at
org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:105)
at
org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:92)
at
org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:546)
at
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:76)
at
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
at
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793)
at
org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.net.ConnectException: Connection timed out: no further
information: /192.168.1.3:58179
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
at
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
16/07/23 19:03:58 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 1
outstanding blocks after 5000 ms
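
One thing worth trying for the local setup, following Pedro's observation that the fetch is going to 192.168.1.3 instead of localhost: pin the driver to the loopback address in the SparkConf (or export SPARK_LOCAL_IP=127.0.0.1). This is only a sketch of a possible workaround, not a confirmed fix; the master string and app name are taken from the earlier message.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
    .setMaster("local[2]")
    .setAppName("DR")
    // Keep driver <-> executor traffic on localhost instead of the
    // 192.168.x.x interface that is failing to connect.
    .set("spark.driver.host", "localhost");
JavaSparkContext jsc = new JavaSparkContext(conf);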


How to search on a Dataset / RDD

2016-07-22 Thread VG
Any suggestions here, please.

I basically need an ability to look up *name -> index* and *index -> name*
in the code

-VG

On Fri, Jul 22, 2016 at 6:40 PM, VG  wrote:

> Hi All,
>
> I am really confused how to proceed further. Please help.
>
> I have a dataset created as follows:
> Dataset b = sqlContext.sql("SELECT bid, name FROM business");
>
> Now I need to map each name with a unique index and I did the following
> JavaPairRDD indexedBId = business.javaRDD()
>.zipWithIndex();
>
> In later part of the code I need to change a datastructure and update name
> with index value generated above .
> I am unable to figure out how to do a look up here..
>
> Please suggest.
>
> If there is a better way to do this please suggest that.
>
> Regards
> VG
>
>


Re: Error in running JavaALSExample example from spark examples

2016-07-22 Thread VG
Great, thanks a ton for helping out on this, Sean.
I somehow messed this up (and was running in loops for the last 2 hours).

thanks again

-VG

On Fri, Jul 22, 2016 at 11:28 PM, Sean Owen  wrote:

> You mark these provided, which is correct. If the version of Scala
> provided at runtime differs, you'll have a problem.
>
> In fact you can also see you mixed Scala versions in your dependencies
> here. MLlib is on 2.10.
>
> On Fri, Jul 22, 2016 at 6:49 PM, VG  wrote:
> > Sean,
> >
> > I am only using the Maven dependencies for Spark in my pom file.
> > I don't have anything else. I guess the Maven dependency should resolve to
> > the correct Scala version, shouldn't it? Any ideas.
> >
> > <dependency>
> >   <groupId>org.apache.spark</groupId>
> >   <artifactId>spark-core_2.11</artifactId>
> >   <version>2.0.0-preview</version>
> >   <scope>provided</scope>
> > </dependency>
> >
> > <dependency>
> >   <groupId>org.apache.spark</groupId>
> >   <artifactId>spark-sql_2.11</artifactId>
> >   <version>2.0.0-preview</version>
> >   <scope>provided</scope>
> > </dependency>
> >
> > <dependency>
> >   <groupId>org.apache.spark</groupId>
> >   <artifactId>spark-streaming_2.11</artifactId>
> >   <version>2.0.0-preview</version>
> >   <scope>provided</scope>
> > </dependency>
> >
> > <dependency>
> >   <groupId>org.apache.spark</groupId>
> >   <artifactId>spark-mllib_2.10</artifactId>
> >   <version>2.0.0-preview</version>
> >   <scope>provided</scope>
> > </dependency>
> >
> >
> >
> > On Fri, Jul 22, 2016 at 11:16 PM, Sean Owen  wrote:
> >>
> >> -dev
> >> Looks like you are mismatching the version of Spark you deploy on at
> >> runtime then. Sounds like it was built for Scala 2.10
> >>
> >> On Fri, Jul 22, 2016 at 6:43 PM, VG  wrote:
> >> > Using 2.0.0-preview using maven
> >> > So all dependencies should be correct I guess
> >> >
> >> > <dependency>
> >> >   <groupId>org.apache.spark</groupId>
> >> >   <artifactId>spark-core_2.11</artifactId>
> >> >   <version>2.0.0-preview</version>
> >> >   <scope>provided</scope>
> >> > </dependency>
> >> >
> >> > I see in maven dependencies that this brings in
> >> > scala-reflect-2.11.4
> >> > scala-compiler-2.11.0
> >> >
> >> > and so on
> >> >
> >> >
> >> >
> >> > On Fri, Jul 22, 2016 at 11:04 PM, Aaron Ilovici  >
> >> > wrote:
> >> >>
> >> >> What version of Spark/Scala are you running?
> >> >>
> >> >>
> >> >>
> >> >> -Aaron
> >> >
> >> >
> >
> >
>


Re: Error in running JavaALSExample example from spark examples

2016-07-22 Thread VG
Sean,

I am only using the Maven dependencies for Spark in my pom file.
I don't have anything else. I guess the Maven dependency should resolve to the
correct Scala version, shouldn't it? Any ideas.


<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.0.0-preview</version>
  <scope>provided</scope>
</dependency>

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.0.0-preview</version>
  <scope>provided</scope>
</dependency>

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.11</artifactId>
  <version>2.0.0-preview</version>
  <scope>provided</scope>
</dependency>

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.10</artifactId>
  <version>2.0.0-preview</version>
  <scope>provided</scope>
</dependency>



On Fri, Jul 22, 2016 at 11:16 PM, Sean Owen  wrote:

> -dev
> Looks like you are mismatching the version of Spark you deploy on at
> runtime then. Sounds like it was built for Scala 2.10
>
> On Fri, Jul 22, 2016 at 6:43 PM, VG  wrote:
> > Using 2.0.0-preview using maven
> > So all dependencies should be correct I guess
> >
> > <dependency>
> >   <groupId>org.apache.spark</groupId>
> >   <artifactId>spark-core_2.11</artifactId>
> >   <version>2.0.0-preview</version>
> >   <scope>provided</scope>
> > </dependency>
> >
> > I see in maven dependencies that this brings in
> > scala-reflect-2.11.4
> > scala-compiler-2.11.0
> >
> > and so on
> >
> >
> >
> > On Fri, Jul 22, 2016 at 11:04 PM, Aaron Ilovici 
> > wrote:
> >>
> >> What version of Spark/Scala are you running?
> >>
> >>
> >>
> >> -Aaron
> >
> >
>


Re: Error in running JavaALSExample example from spark examples

2016-07-22 Thread VG
Using 2.0.0-preview with Maven,
so all dependencies should be correct, I guess.


<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.0.0-preview</version>
  <scope>provided</scope>
</dependency>

I see in maven dependencies that this brings in
scala-reflect-2.11.4
scala-compiler-2.11.0

and so on



On Fri, Jul 22, 2016 at 11:04 PM, Aaron Ilovici 
wrote:

> What version of Spark/Scala are you running?
>
>
>
> -Aaron
>


Error in running JavaALSExample example from spark examples

2016-07-22 Thread VG
I am getting the following error

Exception in thread "main" java.lang.NoSuchMethodError:
scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror;
at org.apache.spark.ml.recommendation.ALS.fit(ALS.scala:452)

Any suggestions to resolve this

VG


Re: Dataset , RDD zipWithIndex -- How to use as a map .

2016-07-22 Thread VG
Hi All,

Any suggestions for this

Regards,
VG

On Fri, Jul 22, 2016 at 6:40 PM, VG  wrote:

> Hi All,
>
> I am really confused how to proceed further. Please help.
>
> I have a dataset created as follows:
> Dataset b = sqlContext.sql("SELECT bid, name FROM business");
>
> Now I need to map each name with a unique index and I did the following
> JavaPairRDD indexedBId = business.javaRDD()
>.zipWithIndex();
>
> In later part of the code I need to change a datastructure and update name
> with index value generated above .
> I am unable to figure out how to do a look up here..
>
> Please suggest.
>
> If there is a better way to do this please suggest that.
>
> Regards
> VG
>
>


Re: ml ALS.fit(..) issue

2016-07-22 Thread VG
Can someone please help here.

I tried both scala 2.10 and 2.11 on the system



On Fri, Jul 22, 2016 at 7:59 PM, VG  wrote:

> I am using version 2.0.0-preview
>
>
>
> On Fri, Jul 22, 2016 at 7:47 PM, VG  wrote:
>
>> I am running into the following error when running ALS
>>
>> Exception in thread "main" java.lang.NoSuchMethodError:
>> scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror;
>> at org.apache.spark.ml.recommendation.ALS.fit(ALS.scala:452)
>> at yelp.TestUser.main(TestUser.java:101)
>>
>> here line 101 in the above error is the following in code.
>>
>> ALSModel model = als.fit(training);
>>
>>
>> Does anyone have a suggestion as to what is going on here and where I might be
>> going wrong?
>> Please suggest
>>
>> -VG
>>
>
>


Re: ml ALS.fit(..) issue

2016-07-22 Thread VG
I am using version 2.0.0-preview



On Fri, Jul 22, 2016 at 7:47 PM, VG  wrote:

> I am running into the following error when running ALS
>
> Exception in thread "main" java.lang.NoSuchMethodError:
> scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror;
> at org.apache.spark.ml.recommendation.ALS.fit(ALS.scala:452)
> at yelp.TestUser.main(TestUser.java:101)
>
> here line 101 in the above error is the following in code.
>
> ALSModel model = als.fit(training);
>
>
> Does anyone have a suggestion as to what is going on here and where I might be
> going wrong?
> Please suggest
>
> -VG
>


ml ALS.fit(..) issue

2016-07-22 Thread VG
I am running into the following error when running ALS

Exception in thread "main" java.lang.NoSuchMethodError:
scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror;
at org.apache.spark.ml.recommendation.ALS.fit(ALS.scala:452)
at yelp.TestUser.main(TestUser.java:101)

Here, line 101 in the above error is the following line in the code:

ALSModel model = als.fit(training);


Does anyone have a suggestion as to what is going on here and where I might be
going wrong?
Please suggest

-VG


Dataset , RDD zipWithIndex -- How to use as a map .

2016-07-22 Thread VG
Hi All,

I am really confused how to proceed further. Please help.

I have a dataset created as follows:
Dataset b = sqlContext.sql("SELECT bid, name FROM business");

Now I need to map each name to a unique index, and I did the following:
JavaPairRDD indexedBId = business.javaRDD()
   .zipWithIndex();

In a later part of the code I need to change a data structure and update the
name with the index value generated above.
I am unable to figure out how to do a lookup here.

Please suggest.

If there is a better way to do this, please suggest that.

Regards
VG
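
One possible way to get both lookups (a sketch, not something from the thread): map the rows to names, zip with an index, and collect the pair RDD to the driver as a Map in each direction. It assumes the name is the second selected column, that names are unique, and that the distinct names fit comfortably in driver memory; the variable business stands for the Dataset created in the question above.

import java.util.Map;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

JavaPairRDD<String, Long> indexedNames = business.javaRDD()
    .map(row -> row.getString(1))   // assumes "name" is the second column
    .zipWithIndex();

// name -> index
Map<String, Long> nameToIndex = indexedNames.collectAsMap();
// index -> name
Map<Long, String> indexToName = indexedNames
    .mapToPair(Tuple2::swap)
    .collectAsMap();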


Re: MLlib, Java, and DataFrame

2016-07-21 Thread VG
Interesting. Thanks for this information.

On Fri, Jul 22, 2016 at 11:26 AM, Bryan Cutler  wrote:

> ML has a DataFrame based API, while MLlib is RDDs and will be deprecated
> as of Spark 2.0.
>
> On Thu, Jul 21, 2016 at 10:41 PM, VG  wrote:
>
>> Why do we have these 2 packages ... ml and mllib?
>> What is the difference between these?
>>
>>
>>
>> On Fri, Jul 22, 2016 at 11:09 AM, Bryan Cutler  wrote:
>>
>>> Hi JG,
>>>
>>> If you didn't know this, Spark MLlib has 2 APIs, one of which uses
>>> DataFrames.  Take a look at this example
>>> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaLinearRegressionWithElasticNetExample.java
>>>
>>> This example uses a Dataset, which is type equivalent to a
>>> DataFrame.
>>>
>>>
>>> On Thu, Jul 21, 2016 at 8:41 PM, Jean Georges Perrin 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am looking for some really super basic examples of MLlib (like a
>>>> linear regression over a list of values) in Java. I have found a few, but I
>>>> only saw them using JavaRDD... and not DataFrame.
>>>>
>>>> I was kind of hoping to take my current DataFrame and send them in
>>>> MLlib. Am I too optimistic? Do you know/have any example like that?
>>>>
>>>> Thanks!
>>>>
>>>> jg
>>>>
>>>>
>>>> Jean Georges Perrin
>>>> j...@jgp.net / @jgperrin
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
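
For completeness, a minimal Java sketch of the DataFrame-based org.apache.spark.ml API that Bryan is pointing to, adapted from the linked JavaLinearRegressionWithElasticNetExample; the data path is the sample file shipped with the Spark source, and the parameter values are just placeholders.

import org.apache.spark.ml.regression.LinearRegression;
import org.apache.spark.ml.regression.LinearRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// "libsvm" input already comes with "label" and "features" columns.
Dataset<Row> training = sqlContext.read()
    .format("libsvm")
    .load("data/mllib/sample_linear_regression_data.txt");

LinearRegression lr = new LinearRegression()
    .setMaxIter(10)
    .setRegParam(0.3)
    .setElasticNetParam(0.8);

LinearRegressionModel model = lr.fit(training);
System.out.println("Coefficients: " + model.coefficients()
    + ", intercept: " + model.intercept());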


Re: MLlib, Java, and DataFrame

2016-07-21 Thread VG
Why do we have these 2 packages ... ml and mllib?
What is the difference between these?



On Fri, Jul 22, 2016 at 11:09 AM, Bryan Cutler  wrote:

> Hi JG,
>
> If you didn't know this, Spark MLlib has 2 APIs, one of which uses
> DataFrames.  Take a look at this example
> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaLinearRegressionWithElasticNetExample.java
>
> This example uses a Dataset, which is type equivalent to a DataFrame.
>
>
> On Thu, Jul 21, 2016 at 8:41 PM, Jean Georges Perrin  wrote:
>
>> Hi,
>>
>> I am looking for some really super basic examples of MLlib (like a linear
>> regression over a list of values) in Java. I have found a few, but I only
>> saw them using JavaRDD... and not DataFrame.
>>
>> I was kind of hoping to take my current DataFrame and send them in MLlib.
>> Am I too optimistic? Do you know/have any example like that?
>>
>> Thanks!
>>
>> jg
>>
>>
>> Jean Georges Perrin
>> j...@jgp.net / @jgperrin
>>
>>
>>
>>
>>
>


Re: spark-xml - xml parsing when rows only have attributes

2016-06-17 Thread VG
Great, thanks for pointing this out.



On Fri, Jun 17, 2016 at 6:21 PM, Ted Yu  wrote:

> Please see https://github.com/databricks/spark-xml/issues/92
>
> On Fri, Jun 17, 2016 at 5:19 AM, VG  wrote:
>
>> I am using spark-xml for loading data and creating a data frame.
>>
>> If xml element has sub elements and values, then it works fine. Example
>>  if the xml element is like
>>
>> 
>>  test
>> 
>>
>> however if the xml element is bare with just attributes, then it does not
>> work - Any suggestions.
>>   Does not load the data
>>
>>
>>
>> Any suggestions to fix this
>>
>>
>>
>>
>>
>>
>> On Fri, Jun 17, 2016 at 4:28 PM, Siva A  wrote:
>>
>>> Use Spark XML version 0.3.3:
>>> <dependency>
>>>   <groupId>com.databricks</groupId>
>>>   <artifactId>spark-xml_2.10</artifactId>
>>>   <version>0.3.3</version>
>>> </dependency>
>>>
>>> On Fri, Jun 17, 2016 at 4:25 PM, VG  wrote:
>>>
>>>> Hi Siva
>>>>
>>>> This is what i have for jars. Did you manage to run with these or
>>>> different versions ?
>>>>
>>>>
>>>> 
>>>> <dependency>
>>>>   <groupId>org.apache.spark</groupId>
>>>>   <artifactId>spark-core_2.10</artifactId>
>>>>   <version>1.6.1</version>
>>>> </dependency>
>>>> <dependency>
>>>>   <groupId>org.apache.spark</groupId>
>>>>   <artifactId>spark-sql_2.10</artifactId>
>>>>   <version>1.6.1</version>
>>>> </dependency>
>>>> <dependency>
>>>>   <groupId>com.databricks</groupId>
>>>>   <artifactId>spark-xml_2.10</artifactId>
>>>>   <version>0.2.0</version>
>>>> </dependency>
>>>> <dependency>
>>>>   <groupId>org.scala-lang</groupId>
>>>>   <artifactId>scala-library</artifactId>
>>>>   <version>2.10.6</version>
>>>> </dependency>
>>>>
>>>> Thanks
>>>> VG
>>>>
>>>>
>>>> On Fri, Jun 17, 2016 at 4:16 PM, Siva A 
>>>> wrote:
>>>>
>>>>> Hi Marco,
>>>>>
>>>>> I did run in IDE(Intellij) as well. It works fine.
>>>>> VG, make sure the right jar is in classpath.
>>>>>
>>>>> --Siva
>>>>>
>>>>> On Fri, Jun 17, 2016 at 4:11 PM, Marco Mistroni 
>>>>> wrote:
>>>>>
>>>>>> and  your eclipse path is correct?
>>>>>> i suggest, as Siva did before, to build your jar and run it via
>>>>>> spark-submit  by specifying the --packages option
>>>>>> it's as simple as run this command
>>>>>>
>>>>>> spark-submit   --packages
>>>>>> com.databricks:spark-xml_:   --class >>>>> of
>>>>>> your class containing main> 
>>>>>>
>>>>>> Indeed, if you have only these lines to run, why dont you try them in
>>>>>> spark-shell ?
>>>>>>
>>>>>> hth
>>>>>>
>>>>>> On Fri, Jun 17, 2016 at 11:32 AM, VG  wrote:
>>>>>>
>>>>>>> nopes. eclipse.
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jun 17, 2016 at 3:58 PM, Siva A 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> If you are running from IDE, Are you using Intellij?
>>>>>>>>
>>>>>>>> On Fri, Jun 17, 2016 at 3:20 PM, Siva A 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Can you try to package as a jar and run using spark-submit
>>>>>>>>>
>>>>>>>>> Siva
>>>>>>>>>
>>>>>>>>> On Fri, Jun 17, 2016 at 3:17 PM, VG  wrote:
>>>>>>>>>
>>>>>>>>>> I am trying to run from IDE and everything else is working fine.
>>>>>>>>>> I added spark-xml jar and now I ended up into this dependency
>>>>>>>>>>
>>>>>>>>>> 6/06/17 15:15:57 INFO BlockManagerMaster: Registered BlockManager
>>>>>>>>>> Exception in thread "main" *java.lang.NoClassDefFoundError:
>>>>>>>>>> scala/collection/GenTraversableOnce$class*
>>>>>>>>>> at
>>>>>>>>>> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.(ddl.scala:150)
>>>>>>>>>> at
>>>>>>>>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:154)
>>>>>>>>>> at
>>>>>>>>>> org.apache.spark.sql.Data

spark-xml - xml parsing when rows only have attributes

2016-06-17 Thread VG
I am using spark-xml for loading data and creating a data frame.

If the XML element has sub-elements and values, then it works fine; for example,
if the XML element is like


 test


however, if the XML element is bare with just attributes, then it does not
work: the data does not get loaded.



Any suggestions to fix this
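
Not from the thread, but possibly useful context: in spark-xml, attributes of the row element are exposed as columns with an attribute prefix (the attributePrefix option, which defaults to "_" in recent versions), and whether an attribute-only, self-closing row loads at all depends on the spark-xml version (see the issue Ted linked above). A rough sketch of what reading such data can look like on a version that supports it; the attribute name "_id" is made up.

import org.apache.spark.sql.DataFrame;

DataFrame df = sqlContext.read()
    .format("com.databricks.spark.xml")
    .option("rowTag", "row")
    .option("attributePrefix", "_")   // attribute columns appear as _attrName
    .load("A.xml");
df.select("_id").show();              // hypothetical attribute column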






On Fri, Jun 17, 2016 at 4:28 PM, Siva A  wrote:

> Use Spark XML version 0.3.3:
> <dependency>
>   <groupId>com.databricks</groupId>
>   <artifactId>spark-xml_2.10</artifactId>
>   <version>0.3.3</version>
> </dependency>
>
> On Fri, Jun 17, 2016 at 4:25 PM, VG  wrote:
>
>> Hi Siva
>>
>> This is what i have for jars. Did you manage to run with these or
>> different versions ?
>>
>>
>> 
>> <dependency>
>>   <groupId>org.apache.spark</groupId>
>>   <artifactId>spark-core_2.10</artifactId>
>>   <version>1.6.1</version>
>> </dependency>
>> <dependency>
>>   <groupId>org.apache.spark</groupId>
>>   <artifactId>spark-sql_2.10</artifactId>
>>   <version>1.6.1</version>
>> </dependency>
>> <dependency>
>>   <groupId>com.databricks</groupId>
>>   <artifactId>spark-xml_2.10</artifactId>
>>   <version>0.2.0</version>
>> </dependency>
>> <dependency>
>>   <groupId>org.scala-lang</groupId>
>>   <artifactId>scala-library</artifactId>
>>   <version>2.10.6</version>
>> </dependency>
>>
>> Thanks
>> VG
>>
>>
>> On Fri, Jun 17, 2016 at 4:16 PM, Siva A  wrote:
>>
>>> Hi Marco,
>>>
>>> I did run in IDE(Intellij) as well. It works fine.
>>> VG, make sure the right jar is in classpath.
>>>
>>> --Siva
>>>
>>> On Fri, Jun 17, 2016 at 4:11 PM, Marco Mistroni 
>>> wrote:
>>>
>>>> and  your eclipse path is correct?
>>>> i suggest, as Siva did before, to build your jar and run it via
>>>> spark-submit  by specifying the --packages option
>>>> it's as simple as run this command
>>>>
>>>> spark-submit   --packages
>>>> com.databricks:spark-xml_:   --class >>> your class containing main> 
>>>>
>>>> Indeed, if you have only these lines to run, why dont you try them in
>>>> spark-shell ?
>>>>
>>>> hth
>>>>
>>>> On Fri, Jun 17, 2016 at 11:32 AM, VG  wrote:
>>>>
>>>>> nopes. eclipse.
>>>>>
>>>>>
>>>>> On Fri, Jun 17, 2016 at 3:58 PM, Siva A 
>>>>> wrote:
>>>>>
>>>>>> If you are running from IDE, Are you using Intellij?
>>>>>>
>>>>>> On Fri, Jun 17, 2016 at 3:20 PM, Siva A 
>>>>>> wrote:
>>>>>>
>>>>>>> Can you try to package as a jar and run using spark-submit
>>>>>>>
>>>>>>> Siva
>>>>>>>
>>>>>>> On Fri, Jun 17, 2016 at 3:17 PM, VG  wrote:
>>>>>>>
>>>>>>>> I am trying to run from IDE and everything else is working fine.
>>>>>>>> I added spark-xml jar and now I ended up into this dependency
>>>>>>>>
>>>>>>>> 6/06/17 15:15:57 INFO BlockManagerMaster: Registered BlockManager
>>>>>>>> Exception in thread "main" *java.lang.NoClassDefFoundError:
>>>>>>>> scala/collection/GenTraversableOnce$class*
>>>>>>>> at
>>>>>>>> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.(ddl.scala:150)
>>>>>>>> at
>>>>>>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:154)
>>>>>>>> at
>>>>>>>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>>>>>>>> at
>>>>>>>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>>>>>>>> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
>>>>>>>> Caused by:* java.lang.ClassNotFoundException:
>>>>>>>> scala.collection.GenTraversableOnce$class*
>>>>>>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>>>>>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>>>>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>>>>>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>>>>>> ... 5 more
>>>>>>>> 16/06/17 15:15:58 INFO SparkContext: Invoking stop() from shutdown
>>>>>>>> hook
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jun 17, 2016 at 2:59 PM, Marco Mistroni <
>>>>>>>> mmistr...@gmail.com> wr

Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread VG
It proceeded with the jars I mentioned.
However, no data is getting loaded into the data frame...

sob sob :(

On Fri, Jun 17, 2016 at 4:25 PM, VG  wrote:

> Hi Siva
>
> This is what i have for jars. Did you manage to run with these or
> different versions ?
>
>
> 
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-core_2.10</artifactId>
>   <version>1.6.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-sql_2.10</artifactId>
>   <version>1.6.1</version>
> </dependency>
> <dependency>
>   <groupId>com.databricks</groupId>
>   <artifactId>spark-xml_2.10</artifactId>
>   <version>0.2.0</version>
> </dependency>
> <dependency>
>   <groupId>org.scala-lang</groupId>
>   <artifactId>scala-library</artifactId>
>   <version>2.10.6</version>
> </dependency>
>
> Thanks
> VG
>
>
> On Fri, Jun 17, 2016 at 4:16 PM, Siva A  wrote:
>
>> Hi Marco,
>>
>> I did run in IDE(Intellij) as well. It works fine.
>> VG, make sure the right jar is in classpath.
>>
>> --Siva
>>
>> On Fri, Jun 17, 2016 at 4:11 PM, Marco Mistroni 
>> wrote:
>>
>>> and  your eclipse path is correct?
>>> i suggest, as Siva did before, to build your jar and run it via
>>> spark-submit  by specifying the --packages option
>>> it's as simple as run this command
>>>
>>> spark-submit   --packages
>>> com.databricks:spark-xml_:   --class >> your class containing main> 
>>>
>>> Indeed, if you have only these lines to run, why dont you try them in
>>> spark-shell ?
>>>
>>> hth
>>>
>>> On Fri, Jun 17, 2016 at 11:32 AM, VG  wrote:
>>>
>>>> nopes. eclipse.
>>>>
>>>>
>>>> On Fri, Jun 17, 2016 at 3:58 PM, Siva A 
>>>> wrote:
>>>>
>>>>> If you are running from IDE, Are you using Intellij?
>>>>>
>>>>> On Fri, Jun 17, 2016 at 3:20 PM, Siva A 
>>>>> wrote:
>>>>>
>>>>>> Can you try to package as a jar and run using spark-submit
>>>>>>
>>>>>> Siva
>>>>>>
>>>>>> On Fri, Jun 17, 2016 at 3:17 PM, VG  wrote:
>>>>>>
>>>>>>> I am trying to run from IDE and everything else is working fine.
>>>>>>> I added spark-xml jar and now I ended up into this dependency
>>>>>>>
>>>>>>> 6/06/17 15:15:57 INFO BlockManagerMaster: Registered BlockManager
>>>>>>> Exception in thread "main" *java.lang.NoClassDefFoundError:
>>>>>>> scala/collection/GenTraversableOnce$class*
>>>>>>> at
>>>>>>> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.(ddl.scala:150)
>>>>>>> at
>>>>>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:154)
>>>>>>> at
>>>>>>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>>>>>>> at
>>>>>>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>>>>>>> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
>>>>>>> Caused by:* java.lang.ClassNotFoundException:
>>>>>>> scala.collection.GenTraversableOnce$class*
>>>>>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>>>>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>>>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>>>>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>>>>> ... 5 more
>>>>>>> 16/06/17 15:15:58 INFO SparkContext: Invoking stop() from shutdown
>>>>>>> hook
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jun 17, 2016 at 2:59 PM, Marco Mistroni >>>>>> > wrote:
>>>>>>>
>>>>>>>> So you are using spark-submit  or spark-shell?
>>>>>>>>
>>>>>>>> you will need to launch either by passing --packages option (like
>>>>>>>> in the example below for spark-csv). you will need to iknow
>>>>>>>>
>>>>>>>> --packages com.databricks:spark-xml_:>>>>>>> version>
>>>>>>>>
>>>>>>>> hth
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jun 17, 2016 at 10:20 AM, VG  wrote:
>>>>>>>>
>>>>>>>>> Apologies for that.
>>>>>>>>

Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread VG
Hi Siva

This is what I have for jars. Did you manage to run with these or different
versions?



<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.6.1</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.10</artifactId>
  <version>1.6.1</version>
</dependency>
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>spark-xml_2.10</artifactId>
  <version>0.2.0</version>
</dependency>
<dependency>
  <groupId>org.scala-lang</groupId>
  <artifactId>scala-library</artifactId>
  <version>2.10.6</version>
</dependency>

Thanks
VG


On Fri, Jun 17, 2016 at 4:16 PM, Siva A  wrote:

> Hi Marco,
>
> I did run in IDE(Intellij) as well. It works fine.
> VG, make sure the right jar is in classpath.
>
> --Siva
>
> On Fri, Jun 17, 2016 at 4:11 PM, Marco Mistroni 
> wrote:
>
>> and  your eclipse path is correct?
>> i suggest, as Siva did before, to build your jar and run it via
>> spark-submit  by specifying the --packages option
>> it's as simple as run this command
>>
>> spark-submit   --packages
>> com.databricks:spark-xml_:   --class > your class containing main> 
>>
>> Indeed, if you have only these lines to run, why dont you try them in
>> spark-shell ?
>>
>> hth
>>
>> On Fri, Jun 17, 2016 at 11:32 AM, VG  wrote:
>>
>>> nopes. eclipse.
>>>
>>>
>>> On Fri, Jun 17, 2016 at 3:58 PM, Siva A 
>>> wrote:
>>>
>>>> If you are running from IDE, Are you using Intellij?
>>>>
>>>> On Fri, Jun 17, 2016 at 3:20 PM, Siva A 
>>>> wrote:
>>>>
>>>>> Can you try to package as a jar and run using spark-submit
>>>>>
>>>>> Siva
>>>>>
>>>>> On Fri, Jun 17, 2016 at 3:17 PM, VG  wrote:
>>>>>
>>>>>> I am trying to run from IDE and everything else is working fine.
>>>>>> I added spark-xml jar and now I ended up into this dependency
>>>>>>
>>>>>> 6/06/17 15:15:57 INFO BlockManagerMaster: Registered BlockManager
>>>>>> Exception in thread "main" *java.lang.NoClassDefFoundError:
>>>>>> scala/collection/GenTraversableOnce$class*
>>>>>> at
>>>>>> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.(ddl.scala:150)
>>>>>> at
>>>>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:154)
>>>>>> at
>>>>>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>>>>>> at
>>>>>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>>>>>> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
>>>>>> Caused by:* java.lang.ClassNotFoundException:
>>>>>> scala.collection.GenTraversableOnce$class*
>>>>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>>>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>>>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>>>> ... 5 more
>>>>>> 16/06/17 15:15:58 INFO SparkContext: Invoking stop() from shutdown
>>>>>> hook
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Jun 17, 2016 at 2:59 PM, Marco Mistroni 
>>>>>> wrote:
>>>>>>
>>>>>>> So you are using spark-submit  or spark-shell?
>>>>>>>
>>>>>>> you will need to launch either by passing --packages option (like in
>>>>>>> the example below for spark-csv). you will need to iknow
>>>>>>>
>>>>>>> --packages com.databricks:spark-xml_:
>>>>>>>
>>>>>>> hth
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jun 17, 2016 at 10:20 AM, VG  wrote:
>>>>>>>
>>>>>>>> Apologies for that.
>>>>>>>> I am trying to use spark-xml to load data of a xml file.
>>>>>>>>
>>>>>>>> here is the exception
>>>>>>>>
>>>>>>>> 16/06/17 14:49:04 INFO BlockManagerMaster: Registered BlockManager
>>>>>>>> Exception in thread "main" java.lang.ClassNotFoundException: Failed
>>>>>>>> to find data source: org.apache.spark.xml. Please find packages at
>>>>>>>> http://spark-packages.org
>>>>>>>> at
>>>>>>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(Reso

Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread VG
Nope, Eclipse.


On Fri, Jun 17, 2016 at 3:58 PM, Siva A  wrote:

> If you are running from IDE, Are you using Intellij?
>
> On Fri, Jun 17, 2016 at 3:20 PM, Siva A  wrote:
>
>> Can you try to package as a jar and run using spark-submit
>>
>> Siva
>>
>> On Fri, Jun 17, 2016 at 3:17 PM, VG  wrote:
>>
>>> I am trying to run from IDE and everything else is working fine.
>>> I added spark-xml jar and now I ended up into this dependency
>>>
>>> 6/06/17 15:15:57 INFO BlockManagerMaster: Registered BlockManager
>>> Exception in thread "main" *java.lang.NoClassDefFoundError:
>>> scala/collection/GenTraversableOnce$class*
>>> at
>>> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.(ddl.scala:150)
>>> at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:154)
>>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>>> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
>>> Caused by:* java.lang.ClassNotFoundException:
>>> scala.collection.GenTraversableOnce$class*
>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>> ... 5 more
>>> 16/06/17 15:15:58 INFO SparkContext: Invoking stop() from shutdown hook
>>>
>>>
>>>
>>> On Fri, Jun 17, 2016 at 2:59 PM, Marco Mistroni 
>>> wrote:
>>>
>>>> So you are using spark-submit  or spark-shell?
>>>>
>>>> you will need to launch either by passing --packages option (like in
>>>> the example below for spark-csv). you will need to iknow
>>>>
>>>> --packages com.databricks:spark-xml_:
>>>>
>>>> hth
>>>>
>>>>
>>>>
>>>> On Fri, Jun 17, 2016 at 10:20 AM, VG  wrote:
>>>>
>>>>> Apologies for that.
>>>>> I am trying to use spark-xml to load data of a xml file.
>>>>>
>>>>> here is the exception
>>>>>
>>>>> 16/06/17 14:49:04 INFO BlockManagerMaster: Registered BlockManager
>>>>> Exception in thread "main" java.lang.ClassNotFoundException: Failed to
>>>>> find data source: org.apache.spark.xml. Please find packages at
>>>>> http://spark-packages.org
>>>>> at
>>>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
>>>>> at
>>>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
>>>>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>>>>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>>>>> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
>>>>> Caused by: java.lang.ClassNotFoundException:
>>>>> org.apache.spark.xml.DefaultSource
>>>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>>> at
>>>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
>>>>> at
>>>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
>>>>> at scala.util.Try$.apply(Try.scala:192)
>>>>> at
>>>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
>>>>> at
>>>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
>>>>> at scala.util.Try.orElse(Try.scala:84)
>>>>> at
>>>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
>>>>> ... 4 more
>>>>>
>>>>> Code
>>>>> SQLContext sqlContext = new SQLContext(sc);
>>>>> DataFrame df = sqlContext.read()
>>>>> .format("org.apache.spark.xml")
>>>>> .option("rowTag", "row")
>>>>> .load("A.xml");
>>>>>
>>>>> Any suggestions please ..
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jun 17, 2016 at 2:42 PM, Marco Mistroni 
>>>>> wrote:
>>>>>
>>>>>> too little info
>>>>>> it'll help if you can post the exception and show your sbt file (if
>>>>>> you are using sbt), and provide minimal details on what you are doing
>>>>>> kr
>>>>>>
>>>>>> On Fri, Jun 17, 2016 at 10:08 AM, VG  wrote:
>>>>>>
>>>>>>> Failed to find data source: com.databricks.spark.xml
>>>>>>>
>>>>>>> Any suggestions to resolve this
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread VG
I am trying to run from IDE and everything else is working fine.
I added spark-xml jar and now I ended up into this dependency

6/06/17 15:15:57 INFO BlockManagerMaster: Registered BlockManager
Exception in thread "main" *java.lang.NoClassDefFoundError:
scala/collection/GenTraversableOnce$class*
at
org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.(ddl.scala:150)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:154)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
Caused by:* java.lang.ClassNotFoundException:
scala.collection.GenTraversableOnce$class*
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 5 more
16/06/17 15:15:58 INFO SparkContext: Invoking stop() from shutdown hook



On Fri, Jun 17, 2016 at 2:59 PM, Marco Mistroni  wrote:

> So you are using spark-submit  or spark-shell?
>
> you will need to launch either by passing --packages option (like in the
> example below for spark-csv). you will need to iknow
>
> --packages com.databricks:spark-xml_:
>
> hth
>
>
>
> On Fri, Jun 17, 2016 at 10:20 AM, VG  wrote:
>
>> Apologies for that.
>> I am trying to use spark-xml to load data of a xml file.
>>
>> here is the exception
>>
>> 16/06/17 14:49:04 INFO BlockManagerMaster: Registered BlockManager
>> Exception in thread "main" java.lang.ClassNotFoundException: Failed to
>> find data source: org.apache.spark.xml. Please find packages at
>> http://spark-packages.org
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.spark.xml.DefaultSource
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
>> at scala.util.Try$.apply(Try.scala:192)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
>> at scala.util.Try.orElse(Try.scala:84)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
>> ... 4 more
>>
>> Code
>> SQLContext sqlContext = new SQLContext(sc);
>> DataFrame df = sqlContext.read()
>> .format("org.apache.spark.xml")
>> .option("rowTag", "row")
>> .load("A.xml");
>>
>> Any suggestions please ..
>>
>>
>>
>>
>> On Fri, Jun 17, 2016 at 2:42 PM, Marco Mistroni 
>> wrote:
>>
>>> too little info
>>> it'll help if you can post the exception and show your sbt file (if you
>>> are using sbt), and provide minimal details on what you are doing
>>> kr
>>>
>>> On Fri, Jun 17, 2016 at 10:08 AM, VG  wrote:
>>>
>>>> Failed to find data source: com.databricks.spark.xml
>>>>
>>>> Any suggestions to resolve this
>>>>
>>>>
>>>>
>>>
>>
>


Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread VG
Hi Siva,

I still get a similar exception (See the highlighted section - It is
looking for DataSource)
16/06/17 15:11:37 INFO BlockManagerMaster: Registered BlockManager
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find
data source: xml. Please find packages at http://spark-packages.org
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
*Caused by: java.lang.ClassNotFoundException: xml.DefaultSource*
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
at scala.util.Try$.apply(Try.scala:192)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
at scala.util.Try.orElse(Try.scala:84)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
... 4 more
16/06/17 15:11:38 INFO SparkContext: Invoking stop() from shutdown hook



On Fri, Jun 17, 2016 at 2:56 PM, Siva A  wrote:

> Just try to use "xml" as format like below,
>
> SQLContext sqlContext = new SQLContext(sc);
> DataFrame df = sqlContext.read()
> .format("xml")
> .option("rowTag", "row")
> .load("A.xml");
>
> FYR: https://github.com/databricks/spark-xml
>
> --Siva
>
> On Fri, Jun 17, 2016 at 2:50 PM, VG  wrote:
>
>> Apologies for that.
>> I am trying to use spark-xml to load data of a xml file.
>>
>> here is the exception
>>
>> 16/06/17 14:49:04 INFO BlockManagerMaster: Registered BlockManager
>> Exception in thread "main" java.lang.ClassNotFoundException: Failed to
>> find data source: org.apache.spark.xml. Please find packages at
>> http://spark-packages.org
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.spark.xml.DefaultSource
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
>> at scala.util.Try$.apply(Try.scala:192)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
>> at scala.util.Try.orElse(Try.scala:84)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
>> ... 4 more
>>
>> Code
>> SQLContext sqlContext = new SQLContext(sc);
>> DataFrame df = sqlContext.read()
>> .format("org.apache.spark.xml")
>> .option("rowTag", "row")
>> .load("A.xml");
>>
>> Any suggestions please ..
>>
>>
>>
>>
>> On Fri, Jun 17, 2016 at 2:42 PM, Marco Mistroni 
>> wrote:
>>
>>> too little info
>>> it'll help if you can post the exception and show your sbt file (if you
>>> are using sbt), and provide minimal details on what you are doing
>>> kr
>>>
>>> On Fri, Jun 17, 2016 at 10:08 AM, VG  wrote:
>>>
>>>> Failed to find data source: com.databricks.spark.xml
>>>>
>>>> Any suggestions to resolve this
>>>>
>>>>
>>>>
>>>
>>
>


Re: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread VG
Apologies for that.
I am trying to use spark-xml to load data from an XML file.

here is the exception

16/06/17 14:49:04 INFO BlockManagerMaster: Registered BlockManager
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find
data source: org.apache.spark.xml. Please find packages at
http://spark-packages.org
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
Caused by: java.lang.ClassNotFoundException:
org.apache.spark.xml.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4$$anonfun$apply$1.apply(ResolvedDataSource.scala:62)
at scala.util.Try$.apply(Try.scala:192)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$4.apply(ResolvedDataSource.scala:62)
at scala.util.Try.orElse(Try.scala:84)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
... 4 more

Code
SQLContext sqlContext = new SQLContext(sc);
DataFrame df = sqlContext.read()
.format("org.apache.spark.xml")
.option("rowTag", "row")
.load("A.xml");

Any suggestions please ..




On Fri, Jun 17, 2016 at 2:42 PM, Marco Mistroni  wrote:

> too little info
> it'll help if you can post the exception and show your sbt file (if you
> are using sbt), and provide minimal details on what you are doing
> kr
>
> On Fri, Jun 17, 2016 at 10:08 AM, VG  wrote:
>
>> Failed to find data source: com.databricks.spark.xml
>>
>> Any suggestions to resolve this
>>
>>
>>
>
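
Pulling the fixes from this thread together: with a spark-xml jar that matches Spark's Scala version actually on the runtime classpath, the read can use the package's fully qualified data source name (or the short name "xml" that Siva suggested) instead of "org.apache.spark.xml". A sketch against the Spark 1.6 API used above:

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

SQLContext sqlContext = new SQLContext(sc);   // sc is the existing SparkContext
DataFrame df = sqlContext.read()
    .format("com.databricks.spark.xml")       // fully qualified data source name
    .option("rowTag", "row")
    .load("A.xml");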


java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark-packages.org

2016-06-17 Thread VG
Failed to find data source: com.databricks.spark.xml

Any suggestions to resolve this?


Re: ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks

2016-06-15 Thread VG
Any suggestions on this, please?

On Wed, Jun 15, 2016 at 10:42 PM, VG  wrote:

> I have a very simple driver which loads a textFile and filters a
>> sub-string from each line in the text file.
>> When the collect action is executed, I am getting an exception. (The
>> file is only 90 MB, so I am confused about what is going on.) I am running
>> on a local standalone cluster.
>>
>> 16/06/15 19:45:22 INFO BlockManagerInfo: Removed broadcast_2_piece0 on
>> 192.168.56.1:56413 in memory (size: 2.5 KB, free: 2.4 GB)
>> 16/06/15 19:45:22 INFO BlockManagerInfo: Removed broadcast_1_piece0 on
>> 192.168.56.1:56413 in memory (size: 1900.0 B, free: 2.4 GB)
>> 16/06/15 19:45:22 INFO BlockManagerInfo: Added rdd_2_1 on disk on
>> 192.168.56.1:56413 (size: 2.7 MB)
>> 16/06/15 19:45:22 INFO MemoryStore: Block taskresult_7 stored as bytes in
>> memory (estimated size 2.7 MB, free 2.4 GB)
>> 16/06/15 19:45:22 INFO BlockManagerInfo: Added taskresult_7 in memory on
>> 192.168.56.1:56413 (size: 2.7 MB, free: 2.4 GB)
>> 16/06/15 19:45:22 INFO Executor: Finished task 1.0 in stage 2.0 (TID 7).
>> 2823777 bytes result sent via BlockManager)
>> 16/06/15 19:45:22 INFO TaskSetManager: Starting task 2.0 in stage 2.0
>> (TID 8, localhost, partition 2, PROCESS_LOCAL, 5422 bytes)
>> 16/06/15 19:45:22 INFO Executor: Running task 2.0 in stage 2.0 (TID 8)
>> 16/06/15 19:45:22 INFO HadoopRDD: Input split:
>> file:/C:/Users/i303551/Downloads/ariba-logs/ssws/access.2016.04.26/access.2016.04.26:67108864+25111592
>> 16/06/15 19:45:22 INFO BlockManagerInfo: Added rdd_2_2 on disk on
>> 192.168.56.1:56413 (size: 2.0 MB)
>> 16/06/15 19:45:22 INFO MemoryStore: Block taskresult_8 stored as bytes in
>> memory (estimated size 2.0 MB, free 2.4 GB)
>> 16/06/15 19:45:22 INFO BlockManagerInfo: Added taskresult_8 in memory on
>> 192.168.56.1:56413 (size: 2.0 MB, free: 2.4 GB)
>> 16/06/15 19:45:22 INFO Executor: Finished task 2.0 in stage 2.0 (TID 8).
>> 2143771 bytes result sent via BlockManager)
>> 16/06/15 19:45:43 ERROR RetryingBlockFetcher: Exception while beginning
>> fetch of 1 outstanding blocks
>> java.io.IOException: Failed to connect to /192.168.56.1:56413
>> at
>> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
>> at
>> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
>> at
>> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:96)
>> at
>> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>> at
>> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>> at
>> org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:105)
>> at
>> org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:92)
>> at
>> org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:546)
>> at
>> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:76)
>> at
>> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>> at
>> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793)
>> at
>> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>> at java.lang.Thread.run(Unknown Source)
>> Caused by: java.net.ConnectException: Connection timed out: no further
>> information: /192.168.56.1:56413
>> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>> at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
>> at
>> io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
>> at
>> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
>> at
>> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
>> at
>> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>> at
>> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>> at
>> io.netty.util.concurrent.SingleThre
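
For reference, a minimal sketch of the kind of driver described above. The input
path, the filter string, and the spark.driver.host setting are assumptions;
192.168.56.1 is typically a VirtualBox host-only adapter address, so pinning the
driver to a reachable interface is only one thing to try, not a confirmed diagnosis:

import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FilterDriverSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("FilterDriverSketch")
        .setMaster("local[*]")
        // Assumption: bind the driver to loopback instead of the host-only
        // adapter address that the block fetcher fails to reach in the log above.
        .set("spark.driver.host", "127.0.0.1");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Placeholder path and sub-string for the ~90 MB access log mentioned above.
    JavaRDD<String> lines = sc.textFile("access.2016.04.26");
    JavaRDD<String> matches = lines.filter(line -> line.contains("GET"));

    // collect() pulls every matching line back to the driver; this is the
    // action that triggers the result block fetch seen in the log above.
    List<String> result = matches.collect();
    System.out.println("Matched lines: " + result.size());
    sc.stop();
  }
}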

Fwd: ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks

2016-06-15 Thread VG
>
> I have a very simple driver which loads a textFile and filters a
> sub-string from each line in the text file.
> When the collect action is executed, I am getting an exception. (The
> file is only 90 MB, so I am confused about what is going on.) I am running
> on a local standalone cluster.
>
> 16/06/15 19:45:22 INFO BlockManagerInfo: Removed broadcast_2_piece0 on
> 192.168.56.1:56413 in memory (size: 2.5 KB, free: 2.4 GB)
> 16/06/15 19:45:22 INFO BlockManagerInfo: Removed broadcast_1_piece0 on
> 192.168.56.1:56413 in memory (size: 1900.0 B, free: 2.4 GB)
> 16/06/15 19:45:22 INFO BlockManagerInfo: Added rdd_2_1 on disk on
> 192.168.56.1:56413 (size: 2.7 MB)
> 16/06/15 19:45:22 INFO MemoryStore: Block taskresult_7 stored as bytes in
> memory (estimated size 2.7 MB, free 2.4 GB)
> 16/06/15 19:45:22 INFO BlockManagerInfo: Added taskresult_7 in memory on
> 192.168.56.1:56413 (size: 2.7 MB, free: 2.4 GB)
> 16/06/15 19:45:22 INFO Executor: Finished task 1.0 in stage 2.0 (TID 7).
> 2823777 bytes result sent via BlockManager)
> 16/06/15 19:45:22 INFO TaskSetManager: Starting task 2.0 in stage 2.0 (TID
> 8, localhost, partition 2, PROCESS_LOCAL, 5422 bytes)
> 16/06/15 19:45:22 INFO Executor: Running task 2.0 in stage 2.0 (TID 8)
> 16/06/15 19:45:22 INFO HadoopRDD: Input split:
> file:/C:/Users/i303551/Downloads/ariba-logs/ssws/access.2016.04.26/access.2016.04.26:67108864+25111592
> 16/06/15 19:45:22 INFO BlockManagerInfo: Added rdd_2_2 on disk on
> 192.168.56.1:56413 (size: 2.0 MB)
> 16/06/15 19:45:22 INFO MemoryStore: Block taskresult_8 stored as bytes in
> memory (estimated size 2.0 MB, free 2.4 GB)
> 16/06/15 19:45:22 INFO BlockManagerInfo: Added taskresult_8 in memory on
> 192.168.56.1:56413 (size: 2.0 MB, free: 2.4 GB)
> 16/06/15 19:45:22 INFO Executor: Finished task 2.0 in stage 2.0 (TID 8).
> 2143771 bytes result sent via BlockManager)
> 16/06/15 19:45:43 ERROR RetryingBlockFetcher: Exception while beginning
> fetch of 1 outstanding blocks
> java.io.IOException: Failed to connect to /192.168.56.1:56413
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
> at
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:96)
> at
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
> at
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
> at
> org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:105)
> at
> org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:92)
> at
> org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:546)
> at
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:76)
> at
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
> at
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793)
> at
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.lang.Thread.run(Unknown Source)
> Caused by: java.net.ConnectException: Connection timed out: no further
> information: /192.168.56.1:56413
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
> at
> io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
> at
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> at
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> ... 1 more
> 16/06/15 19:45:43 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 1
> outstanding blocks after 5000 ms
> 16/06/15 19:46:04 ERROR RetryingBlockFetcher: Exception while beginning
> fetch of 1 outstanding blocks
> java.io.IOException: Failed to connect to /192.168.56.1:56413
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
> at
> org.apache.spark.network.netty.NettyBlockTransferService$$anon