union of multiple twitter streams [spark-streaming-twitter_2.11]

2018-07-02 Thread Imran Rajjad
Hello,

Has anybody tried to union two streams of Twitter Statuses? I am
instantiating two Twitter streams with two different sets of credentials
and passing them through a union function, but the console shows no
activity and there are no errors.


--static function that returns JavaReceiverInputDStream--


public static JavaReceiverInputDStream<Status> getTwitterStream(JavaStreamingContext spark,
    String consumerKey, String consumerSecret, String accessToken,
    String accessTokenSecret, String[] filter) {
  // Enable OAuth
  ConfigurationBuilder cb = new ConfigurationBuilder();
  cb.setDebugEnabled(false)
    .setOAuthConsumerKey(consumerKey)
    .setOAuthConsumerSecret(consumerSecret)
    .setOAuthAccessToken(accessToken)
    .setOAuthAccessTokenSecret(accessTokenSecret)
    .setJSONStoreEnabled(true);
  TwitterFactory tf = new TwitterFactory(cb.build());
  Twitter twitter = tf.getInstance();

  // Create the stream
  return TwitterUtils.createStream(spark, twitter.getAuthorization(), filter);
}
---trying to union two twitter streams---

JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.minutes(5));

jssc.sparkContext().setLogLevel("ERROR");

JavaReceiverInputDStream<Status> twitterStreamByHashtag =
    TwitterUtil.getTwitterStream(jssc, consumerKey1, consumerSecret1,
        accessToken1, accessTokenSecret1, new String[]{"#Twitter"});
JavaReceiverInputDStream<Status> twitterStreamByUser =
    TwitterUtil.getTwitterStream(jssc, consumerKey2, consumerSecret2,
        accessToken2, accessTokenSecret2, new String[]{"@Twitter"});

JavaDStream<String> statuses = twitterStreamByHashtag
    .union(twitterStreamByUser)
    .map(s -> s.getText());
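
For completeness, a minimal sketch of how the unioned stream is normally driven (reusing the jssc and statuses variables above): with Spark Streaming, nothing reaches the console until an output operation is registered and the streaming context is started.

// Sketch only: register an output operation and start the streaming context.
statuses.print();          // simple console sink for sanity checking
jssc.start();              // starts the Twitter receivers and the micro-batch scheduler
jssc.awaitTermination();   // keeps the driver alive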


regards,
Imran

-- 
I.R


Re: unable to connect to cluster 2.2.0

2017-12-06 Thread Imran Rajjad
thanks

The machine where the Spark job was being submitted had SPARK_HOME pointing
to the old 2.1.1 directory.



On Wed, Dec 6, 2017 at 1:35 PM, Qiao, Richard <richard.q...@capitalone.com>
wrote:

> Are you now building your app using spark 2.2 or 2.1?
>
>
>
> Best Regards
>
> Richard
>
>
>
>
>
> *From: *Imran Rajjad <raj...@gmail.com>
> *Date: *Wednesday, December 6, 2017 at 2:45 AM
> *To: *"user @spark" <user@spark.apache.org>
> *Subject: *unable to connect to cluster 2.2.0
>
>
>
> Hi,
>
>
>
> Recently upgraded from 2.1.1 to 2.2.0. My Streaming job seems to have
> broken. The submitted application is unable to connect to the cluster, when
> all is running.
>
>
>
> below is my stack trace
>
> Spark Master:spark://192.168.10.207:7077
> Job Arguments:
> -appName orange_watch -directory /u01/watch/stream/
> Spark Configuration:
> [spark.executor.memory, spark.driver.memory, spark.app.name,
> spark.executor.cores]:6g
> [spark.executor.memory, spark.driver.memory, spark.app.name,
> spark.executor.cores]:4g
> [spark.executor.memory, spark.driver.memory, spark.app.name,
> spark.executor.cores]:orange_watch
> [spark.executor.memory, spark.driver.memory, spark.app.name,
> spark.executor.cores]:2
>
> Spark Arguments:
> [--packages]:graphframes:graphframes:0.5.0-spark2.1-s_2.11
>
> Using properties file: /home/my_user/spark-2.2.0-bin-
> hadoop2.7/conf/spark-defaults.conf
> Adding default property: spark.jars.packages=
> graphframes:graphframes:0.5.0-spark2.1-s_2.11
> Parsed arguments:
>   master  spark://192.168.10.207:7077
>   deployMode  null
>   executorMemory  6g
>   executorCores   2
>   totalExecutorCores  null
>   propertiesFile  /home/my_user/spark-2.2.0-bin-
> hadoop2.7/conf/spark-defaults.conf
>   driverMemory4g
>   driverCores null
>   driverExtraClassPathnull
>   driverExtraLibraryPath  null
>   driverExtraJavaOptions  null
>   supervise   false
>   queue   null
>   numExecutorsnull
>   files   null
>   pyFiles null
>   archivesnull
>   mainClass   com.my_user.MainClassWatch
>   primaryResource file:/home/my_user/cluster-testing/job.jar
>   nameorange_watch
>   childArgs   [-watchId 3199 -appName orange_watch -directory
> /u01/watch/stream/]
>   jarsnull
>   packagesgraphframes:graphframes:0.5.0-spark2.1-s_2.11
>   packagesExclusions  null
>   repositoriesnull
>   verbose true
>
> Spark properties used, including those specified through
>  --conf and those from the properties file /home/my_user/spark-2.2.0-bin-
> hadoop2.7/conf/spark-defaults.conf:
>   (spark.driver.memory,4g)
>   (spark.executor.memory,6g)
>   (spark.jars.packages,graphframes:graphframes:0.5.0-spark2.1-s_2.11)
>   (spark.app.name,orange_watch)
>   (spark.executor.cores,2)
>
>
> Ivy Default Cache set to: /home/my_user/.ivy2/cache
> The jars for the packages stored in: /home/my_user/.ivy2/jars
> :: loading settings :: url = jar:file:/home/my_user/spark-
> 2.2.0-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/
> core/settings/ivysettings.xml
> graphframes#graphframes added as a dependency
> :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
> confs: [default]
> found graphframes#graphframes;0.5.0-spark2.1-s_2.11 in spark-list
> found com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 in
> central
> found com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2
> in central
> found org.scala-lang#scala-reflect;2.11.0 in central
> found org.slf4j#slf4j-api;1.7.7 in spark-list
> :: resolution report :: resolve 191ms :: artifacts dl 5ms
> :: modules in use:
> com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 from
> central in [default]
> com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 from
> central in [default]
> graphframes#graphframes;0.5.0-spark2.1-s_2.11 from spark-list in
> [default]
> org.scala-lang#scala-reflect;2.11.0 from central in [default]
> org.slf4j#slf4j-api;1.7.7 from spark-list in [default]
> 
>         ---------------------------------------------------------------------
>         |                  |            modules            ||   artifacts   |
>         |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
>         ---------------------------------------------------------------------

unable to connect to cluster 2.2.0

2017-12-05 Thread Imran Rajjad
Hi,

Recently upgraded from 2.1.1 to 2.2.0. My streaming job seems to have
broken. The submitted application is unable to connect to the cluster, even
though everything is running.

below is my stack trace
Spark Master:spark://192.168.10.207:7077
Job Arguments:
-appName orange_watch -directory /u01/watch/stream/
Spark Configuration:
[spark.executor.memory, spark.driver.memory, spark.app.name,
spark.executor.cores]:6g
[spark.executor.memory, spark.driver.memory, spark.app.name,
spark.executor.cores]:4g
[spark.executor.memory, spark.driver.memory, spark.app.name,
spark.executor.cores]:orange_watch
[spark.executor.memory, spark.driver.memory, spark.app.name,
spark.executor.cores]:2
Spark Arguments:
[--packages]:graphframes:graphframes:0.5.0-spark2.1-s_2.11
Using properties file:
/home/my_user/spark-2.2.0-bin-hadoop2.7/conf/spark-defaults.conf
Adding default property:
spark.jars.packages=graphframes:graphframes:0.5.0-spark2.1-s_2.11
Parsed arguments:
  master  spark://192.168.10.207:7077
  deployMode  null
  executorMemory  6g
  executorCores   2
  totalExecutorCores  null
  propertiesFile
/home/my_user/spark-2.2.0-bin-hadoop2.7/conf/spark-defaults.conf
  driverMemory4g
  driverCores null
  driverExtraClassPathnull
  driverExtraLibraryPath  null
  driverExtraJavaOptions  null
  supervise   false
  queue   null
  numExecutorsnull
  files   null
  pyFiles null
  archivesnull
  mainClass   com.my_user.MainClassWatch
  primaryResource file:/home/my_user/cluster-testing/job.jar
  nameorange_watch
  childArgs   [-watchId 3199 -appName orange_watch -directory
/u01/watch/stream/]
  jarsnull
  packagesgraphframes:graphframes:0.5.0-spark2.1-s_2.11
  packagesExclusions  null
  repositoriesnull
  verbose true
Spark properties used, including those specified through
 --conf and those from the properties file
/home/my_user/spark-2.2.0-bin-hadoop2.7/conf/spark-defaults.conf:
  (spark.driver.memory,4g)
  (spark.executor.memory,6g)
  (spark.jars.packages,graphframes:graphframes:0.5.0-spark2.1-s_2.11)
  (spark.app.name,orange_watch)
  (spark.executor.cores,2)

Ivy Default Cache set to: /home/my_user/.ivy2/cache
The jars for the packages stored in: /home/my_user/.ivy2/jars
:: loading settings :: url =
jar:file:/home/my_user/spark-2.2.0-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found graphframes#graphframes;0.5.0-spark2.1-s_2.11 in spark-list
found com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 in
central
found com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 in
central
found org.scala-lang#scala-reflect;2.11.0 in central
found org.slf4j#slf4j-api;1.7.7 in spark-list
:: resolution report :: resolve 191ms :: artifacts dl 5ms
:: modules in use:
com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 from
central in [default]
com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 from
central in [default]
graphframes#graphframes;0.5.0-spark2.1-s_2.11 from spark-list in
[default]
org.scala-lang#scala-reflect;2.11.0 from central in [default]
org.slf4j#slf4j-api;1.7.7 from spark-list in [default]

        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   5   |   0   |   0   |   0   ||   5   |   0   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 5 already retrieved (0kB/7ms)
Main class:
com.my_user.MainClassWatch
Arguments:
-watchId
3199
-appName
orange_watch
-directory
/u01/watch/stream/
System properties:
(spark.executor.memory,6g)
(spark.driver.memory,4g)
(SPARK_SUBMIT,true)
(spark.jars.packages,graphframes:graphframes:0.5.0-spark2.1-s_2.11)
(spark.app.name,orange_watch)
(spark.jars,file:/home/my_user/.ivy2/jars/graphframes_graphframes-0.5.0-spark2.1-s_2.11.jar,file:/home/my_user/.ivy2/jars/com.typesafe.scala-logging_scala-logging-api_2.11-2.1.2.jar,file:/home/my_user/.ivy2/jars/com.typesafe.scala-logging_scala-logging-slf4j_2.11-2.1.2.jar,file:/home/my_user/.ivy2/jars/org.scala-lang_scala-reflect-2.11.0.jar,file:/home/my_user/.ivy2/jars/org.slf4j_slf4j-api-1.7.7.jar,file:/home/my_user/cluster-testing/job.jar)
(spark.submit.deployMode,client)
(spark.master,spark://192.168.10.207:7077)

spark structured csv file stream not detecting new files

2017-11-15 Thread Imran Rajjad
Greetings,
I am running a unit test designed to stream a folder into which I am manually
copying csv files. The files do not always get picked up; they only get
detected when the job starts with the files already in the folder.

I even tried using the fileNameOnly option newly included in 2.2.0. Have
I missed something in the documentation? This problem does not seem to
occur in the DStreams examples.


DataStreamReader reader = spark.readStream()
    .option("fileNameOnly", true)
    .option("header", true)
    .schema(userSchema);

Dataset<Row> csvDF = reader.csv(watchDir);

Dataset<Row> results = csvDF.groupBy("myCol").count();
MyForEach forEachObj = new MyForEach();
query = results
    .writeStream()
    .foreach(forEachObj) // foreach never gets called
    .outputMode("complete")
    .start();
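
In case it helps anyone hitting the same thing, a small sketch of hypothetical test scaffolding (reusing the query above): if the test method returns right away after start(), the query may be stopped before the newly copied files are ever seen.

// Hypothetical test scaffolding: block until Structured Streaming has processed
// everything currently visible in watchDir, instead of returning right after start().
query.processAllAvailable();
// For a long-running job, query.awaitTermination() would be used instead.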

-- 
I.R


spark-stream memory table global?

2017-11-10 Thread Imran Rajjad
Hi,

Is the in-memory table that Spark Structured Streaming results are sunk
into available to other Spark applications on the cluster? Is it global by
default, or is it only available to the context where the streaming is
being done?
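
For reference, a minimal sketch of the memory sink in question (names are illustrative, and counts is assumed to be a streaming aggregation result); the sink backs a temporary view registered in the SparkSession that runs the stream:

// Write the streaming aggregation to the memory sink under a temp view name.
StreamingQuery query = counts.writeStream()
    .format("memory")
    .queryName("stream_counts")   // illustrative view name
    .outputMode("complete")
    .start();

// The view is queryable from the same SparkSession while the stream runs.
spark.sql("SELECT * FROM stream_counts").show();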

thanks
Imran

-- 
I.R


unable to run spark streaming example

2017-11-03 Thread Imran Rajjad
I am trying out the network word count example and my unit test is
producing the below console output with an exception:

Exception in thread "dispatcher-event-loop-5"
java.lang.NoClassDefFoundError:
scala/runtime/AbstractPartialFunction$mcVL$sp
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
 at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
 at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 at
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint.receive(ReceiverTracker.scala:476)
 at
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
 at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
 at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
 at
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
 at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException:
scala.runtime.AbstractPartialFunction$mcVL$sp
 at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 ... 20 more
---
Time: 1509716745000 ms
---
---
Time: 1509716746000 ms
---

From DOS I am pushing a text file through netcat with the following command:
nc -l -p  < license.txt

...

below are my spark related maven dependencies

<spark.version>2.1.1</spark.version>

  <dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-launcher_2.10</artifactId>
   <version>${spark.version}</version>
  </dependency>
  <dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-core_2.11</artifactId>
   <version>${spark.version}</version>
   <scope>provided</scope>
  </dependency>
  <dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-graphx_2.11</artifactId>
   <version>${spark.version}</version>
   <scope>provided</scope>
  </dependency>
  <dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-sql_2.11</artifactId>
   <version>${spark.version}</version>
   <scope>provided</scope>
  </dependency>
  <dependency>
   <groupId>graphframes</groupId>
   <artifactId>graphframes</artifactId>
   <version>0.5.0-spark2.1-s_2.11</version>
  </dependency>
  <dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-mllib_2.10</artifactId>
   <version>${spark.version}</version>
   <scope>provided</scope>
  </dependency>
  <dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-streaming_2.10</artifactId>
   <version>${spark.version}</version>
   <scope>provided</scope>
  </dependency>


-- 
I.R


Re: partition by multiple columns/keys

2017-10-23 Thread Imran Rajjad
Strangely, this is working only for a very small dataset of rows; for very
large datasets the partitioning apparently does not work. Is there a limit
to the number of columns or rows when repartitioning by multiple
columns?
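
For reference, a sketch of the explicit-partition-count variant (names as in the code quoted below); repartitioning by expressions alone hashes the key columns into spark.sql.shuffle.partitions (200 by default) partitions, so distinct key combinations can still land in the same partition:

// Sketch only: same pipeline as the quoted code, but with an explicit partition count.
Dataset<Row> grouped = ds.groupBy("a", "b", "c")
    .count()
    .repartition(400, new Column("a"), new Column("b")) // 400 is an arbitrary example
    .mapPartitions(mp, encoder);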

regards,
Imran

On Wed, Oct 18, 2017 at 11:00 AM, Imran Rajjad <raj...@gmail.com> wrote:

> yes..I think I figured out something like below
>
> Serialized Java Class
> -
> public class MyMapPartition implements Serializable, MapPartitionsFunction<Row, Row> {
>  @Override
>  public Iterator<Row> call(Iterator<Row> iter) throws Exception {
>   ArrayList<Row> list = new ArrayList<>();
>   // ArrayNode array = mapper.createArrayNode();
>   Row row = null;
>   System.out.println("");
>   while (iter.hasNext()) {
>    row = iter.next();
>    System.out.println(row);
>    list.add(row);
>   }
>   System.out.println(">>>>");
>   return list.iterator();
>  }
> }
>
> Unit Test
> ---
> JavaRDD<Row> rdd = jsc.parallelize(Arrays.asList(
>    RowFactory.create(11L, 21L, 1L)
>   ,RowFactory.create(11L, 22L, 2L)
>   ,RowFactory.create(11L, 22L, 1L)
>   ,RowFactory.create(12L, 23L, 3L)
>   ,RowFactory.create(12L, 24L, 3L)
>   ,RowFactory.create(12L, 22L, 4L)
>   ,RowFactory.create(13L, 22L, 4L)
>   ,RowFactory.create(14L, 22L, 4L)
> ));
>   StructType structType = new StructType();
>   structType = structType.add("a", DataTypes.LongType, false)
>     .add("b", DataTypes.LongType, false)
>     .add("c", DataTypes.LongType, false);
>   ExpressionEncoder<Row> encoder = RowEncoder.apply(structType);
>
>   Dataset<Row> ds = spark.createDataFrame(rdd, encoder.schema());
>   ds.show();
>
>   MyMapPartition mp = new MyMapPartition();
>   // Iterator
>   // .repartition(new Column("a"), new Column("b"))
>   Dataset<Row> grouped = ds.groupBy("a", "b", "c")
>     .count()
>     .repartition(new Column("a"), new Column("b"))
>     .mapPartitions(mp, encoder);
>
>   grouped.count();
>
> ---
>
> output
> 
> 
> [12,23,3,1]
> >>>>
> 
> [14,22,4,1]
> >>>>
> 
> [12,24,3,1]
> >>>>
> 
> [12,22,4,1]
> >>>>
> 
> [11,22,1,1]
> [11,22,2,1]
> >>>>
> 
> [11,21,1,1]
> >>>>
> 
> [13,22,4,1]
> >>>>
>
>
> On Wed, Oct 18, 2017 at 10:29 AM, ayan guha <guha.a...@gmail.com> wrote:
>
>> How or what you want to achieve? Ie are planning to do some aggregation
>> on group by c1,c2?
>>
>> On Wed, 18 Oct 2017 at 4:13 pm, Imran Rajjad <raj...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have a set of rows that are a result of a
>>> groupBy(col1,col2,col3).count().
>>>
>>> Is it possible to map rows belong to unique combination inside an
>>> iterator?
>>>
>>> e.g
>>>
>>> col1 col2 col3
>>> a  1  a1
>>> a  1  a2
>>> b  2  b1
>>> b  2  b2
>>>
>>> how can I separate rows with col1 and col2 = (a,1) and (b,2)?
>>>
>>> regards,
>>> Imran
>>>
>>> --
>>> I.R
>>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>
>
> --
> I.R
>



-- 
I.R


Re: jar file problem

2017-10-19 Thread Imran Rajjad
A simple way is to have a network volume mounted with the same name on all
nodes, to make things easy.
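
Alternatively, a sketch of the configuration-based route (paths are illustrative): jars listed under spark.jars (or passed with --jars) are served by the driver, so on a standalone cluster they only need to exist on the machine the job is submitted from.

// Illustrative sketch: the application jar lives only on the submitting machine;
// executors download it from the driver when the job starts.
SparkConf conf = new SparkConf()
    .setAppName("my-job")                                // illustrative
    .setMaster("spark://master:7077")                    // illustrative
    .set("spark.jars", "/path/on/submit/host/job.jar");  // illustrative path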

On Thu, 19 Oct 2017 at 8:24 PM Uğur Sopaoğlu  wrote:

> Hello,
>
> I have a very easy problem. How I run a spark job, I must copy jar file to
> all worker nodes. Is there any way to do simple?.
>
>
> --
> Uğur Sopaoğlu
>
>
> --
Sent from Gmail Mobile


Re: partition by multiple columns/keys

2017-10-18 Thread Imran Rajjad
yes..I think I figured out something like below

Serialized Java Class
-
public class MyMapPartition implements Serializable, MapPartitionsFunction<Row, Row> {
 @Override
 public Iterator<Row> call(Iterator<Row> iter) throws Exception {
  ArrayList<Row> list = new ArrayList<>();
  // ArrayNode array = mapper.createArrayNode();
  Row row = null;
  System.out.println("");
  while (iter.hasNext()) {
   row = iter.next();
   System.out.println(row);
   list.add(row);
  }
  System.out.println(">>>>");
  return list.iterator();
 }
}

Unit Test
---
JavaRDD<Row> rdd = jsc.parallelize(Arrays.asList(
   RowFactory.create(11L, 21L, 1L)
  ,RowFactory.create(11L, 22L, 2L)
  ,RowFactory.create(11L, 22L, 1L)
  ,RowFactory.create(12L, 23L, 3L)
  ,RowFactory.create(12L, 24L, 3L)
  ,RowFactory.create(12L, 22L, 4L)
  ,RowFactory.create(13L, 22L, 4L)
  ,RowFactory.create(14L, 22L, 4L)
));
  StructType structType = new StructType();
  structType = structType.add("a", DataTypes.LongType, false)
    .add("b", DataTypes.LongType, false)
    .add("c", DataTypes.LongType, false);
  ExpressionEncoder<Row> encoder = RowEncoder.apply(structType);

  Dataset<Row> ds = spark.createDataFrame(rdd, encoder.schema());
  ds.show();

  MyMapPartition mp = new MyMapPartition();
  // Iterator
  // .repartition(new Column("a"), new Column("b"))
  Dataset<Row> grouped = ds.groupBy("a", "b", "c")
    .count()
    .repartition(new Column("a"), new Column("b"))
    .mapPartitions(mp, encoder);

  grouped.count();

---

output


[12,23,3,1]
>>>>

[14,22,4,1]
>>>>

[12,24,3,1]
>>>>

[12,22,4,1]
>>>>

[11,22,1,1]
[11,22,2,1]
>>>>

[11,21,1,1]
>>>>

[13,22,4,1]
>>>>


On Wed, Oct 18, 2017 at 10:29 AM, ayan guha <guha.a...@gmail.com> wrote:

> How or what you want to achieve? Ie are planning to do some aggregation on
> group by c1,c2?
>
> On Wed, 18 Oct 2017 at 4:13 pm, Imran Rajjad <raj...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a set of rows that are a result of a groupBy(col1,col2,col3).count(
>> ).
>>
>> Is it possible to map rows belong to unique combination inside an
>> iterator?
>>
>> e.g
>>
>> col1 col2 col3
>> a  1  a1
>> a  1  a2
>> b  2  b1
>> b  2  b2
>>
>> how can I separate rows with col1 and col2 = (a,1) and (b,2)?
>>
>> regards,
>> Imran
>>
>> --
>> I.R
>>
> --
> Best Regards,
> Ayan Guha
>



-- 
I.R


Re: No space left on device

2017-10-17 Thread Imran Rajjad
I don't think so. Check out the documentation for this method.
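
To make the difference concrete, a small sketch (variable names are illustrative): unpersist() tells the executors to drop the cached blocks right away, while nulling the reference only lets the ContextCleaner remove them later, after the driver-side object has been garbage collected.

// Illustrative: an RDD cached while processing a batch.
JavaRDD<String> cached = input.map(v -> v.trim()).cache();
long n = cached.count();

// Explicitly free the cached blocks on the executors once the batch is done.
cached.unpersist();       // unpersist(false) for a non-blocking variant

// Merely dropping the reference is not equivalent: cleanup happens lazily, if at all soon.
cached = null;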

On Wed, Oct 18, 2017 at 10:11 AM, Mina Aslani <aslanim...@gmail.com> wrote:

> I have not tried rdd.unpersist(), I thought using rdd = null is the same,
> is it not?
>
> On Wed, Oct 18, 2017 at 1:07 AM, Imran Rajjad <raj...@gmail.com> wrote:
>
>> did you try calling rdd.unpersist()
>>
>> On Wed, Oct 18, 2017 at 10:04 AM, Mina Aslani <aslanim...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I get "No space left on device" error in my spark worker:
>>>
>>> Error writing stream to file /usr/spark-2.2.0/work/app-.../0/stderr
>>> java.io.IOException: No space left on device
>>>
>>> In my spark cluster, I have one worker and one master.
>>> My program consumes stream of data from kafka and publishes the result
>>> into kafka. I set my RDD = null after I finish working, so that
>>> intermediate shuffle files are removed quickly.
>>>
>>> How can I avoid "No space left on device"?
>>>
>>> Best regards,
>>> Mina
>>>
>>
>>
>>
>> --
>> I.R
>>
>
>


-- 
I.R


partition by multiple columns/keys

2017-10-17 Thread Imran Rajjad
Hi,

I have a set of rows that are the result of a groupBy(col1,col2,col3).count().

Is it possible to map rows belonging to a unique combination inside an iterator?

e.g

col1 col2 col3
a  1  a1
a  1  a2
b  2  b1
b  2  b2

how can I separate rows with col1 and col2 = (a,1) and (b,2)?

regards,
Imran

-- 
I.R


Re: No space left on device

2017-10-17 Thread Imran Rajjad
Did you try calling rdd.unpersist()?

On Wed, Oct 18, 2017 at 10:04 AM, Mina Aslani  wrote:

> Hi,
>
> I get "No space left on device" error in my spark worker:
>
> Error writing stream to file /usr/spark-2.2.0/work/app-.../0/stderr
> java.io.IOException: No space left on device
>
> In my spark cluster, I have one worker and one master.
> My program consumes stream of data from kafka and publishes the result
> into kafka. I set my RDD = null after I finish working, so that
> intermediate shuffle files are removed quickly.
>
> How can I avoid "No space left on device"?
>
> Best regards,
> Mina
>



-- 
I.R


task not serializable on simple operations

2017-10-16 Thread Imran Rajjad
Is there a way around implementing a separate Java class that implements
the Serializable interface even for small arithmetic operations?

below is code from simple decision tree example

Double testMSE = predictionAndLabel.map(new Function<Tuple2<Double, Double>, Double>() {
   @Override
  public Double call(Tuple2<Double, Double> pl) {
    Double diff = pl._1() - pl._2();
    return diff * diff;
  }
}).reduce(new Function2<Double, Double, Double>() {
   @Override
  public Double call(Double a, Double b) {
    return a + b;
  }
}) / testDataCount;

There are no complex Java objects involved here, just simple arithmetic
operations.
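
One way around it, sketched below with Java 8 lambdas (assuming predictionAndLabel is a JavaPairRDD<Double, Double> as in the decision tree example): anonymous inner classes keep a reference to the enclosing object, which then has to be serializable too, whereas lambdas and static nested classes do not capture the enclosing instance unless they use its members.

// Sketch: the same arithmetic expressed as lambdas, so no enclosing instance is serialized.
double testMSE = predictionAndLabel
    .map(pl -> {
        double diff = pl._1() - pl._2();
        return diff * diff;
    })
    .reduce((a, b) -> a + b) / testDataCount;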

regards,
Imran
-- 
I.R


Re: Apache Spark-Subtract two datasets

2017-10-12 Thread Imran Rajjad
If the datasets hold objects of different classes, then you will have to
convert both of them to RDDs and rename the columns before you call
rdd1.subtract(rdd2).
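
A rough sketch of another route (types and column names are hypothetical): if the two sides can be brought to a common schema, Dataset.except does the subtraction directly, without dropping to RDDs.

// Hypothetical: dsA is a Dataset<A>, dsB a Dataset<B> describing the same logical record.
Dataset<Row> left  = dsA.toDF("id", "name");   // illustrative column names
Dataset<Row> right = dsB.toDF("id", "name");

// Records present in dsA but not in dsB, compared on the common schema.
Dataset<Row> onlyInLeft = left.except(right);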

On Thu, Oct 12, 2017 at 10:16 PM, Shashikant Kulkarni <
shashikant.kulka...@gmail.com> wrote:

> Hello,
>
> I have 2 datasets, Dataset and other is Dataset. I want
> the list of records which are in Dataset but not in
> Dataset. How can I do this in Apache Spark using Java Connector? I
> am using Apache Spark 2.2.0
>
> Thank you
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
I.R


Re: best spark spatial lib?

2017-10-11 Thread Imran Rajjad
Thanks guys for the responses.

Basically I am migrating an Oracle PL/SQL procedure to Spark with Java. In
Oracle I have a table with a geometry column, on which I am able to do a

"where col = 1 and geom.within(another_geom)"

I am looking for a straightforward way to port such queries to Spark. I
will give these libraries a shot.

@Ram .. Magellan I guess does not support Java

regards,
Imran

On Wed, Oct 11, 2017 at 12:07 AM, Ram Sriharsha <sriharsha@gmail.com>
wrote:

> why can't you do this in Magellan?
> Can you post a sample query that you are trying to run that has spatial
> and logical operators combined? Maybe I am not understanding the issue
> properly
>
> Ram
>
> On Tue, Oct 10, 2017 at 2:21 AM, Imran Rajjad <raj...@gmail.com> wrote:
>
>> I need to have a location column inside my Dataframe so that I can do
>> spatial queries and geometry operations. Are there any third-party packages
>> that perform this kind of operations. I have seen a few like Geospark and
>> megalan but they don't support operations where spatial and logical
>> operators can be combined.
>>
>> regards,
>> Imran
>>
>> --
>> I.R
>>
>
>


-- 
I.R


best spark spatial lib?

2017-10-10 Thread Imran Rajjad
I need to have a location column inside my DataFrame so that I can do
spatial queries and geometry operations. Are there any third-party packages
that perform these kinds of operations? I have seen a few like GeoSpark and
Magellan, but they don't support operations where spatial and logical
operators can be combined.

regards,
Imran

-- 
I.R


Re: [Spark-Submit] Where to store data files while running job in cluster mode?

2017-09-29 Thread Imran Rajjad
Try Tachyon... it's less fuss


On Fri, 29 Sep 2017 at 8:32 PM lucas.g...@gmail.com 
wrote:

> We use S3, there are caveats and issues with that but it can be made to
> work.
>
> If interested let me know and I'll show you our workarounds.  I wouldn't
> do it naively though, there's lots of potential problems.  If you already
> have HDFS use that, otherwise all things told it's probably less effort to
> use S3.
>
> Gary
>
> On 29 September 2017 at 05:03, Arun Rai  wrote:
>
>> Or you can try mounting that drive to all node.
>>
>> On Fri, Sep 29, 2017 at 6:14 AM Jörn Franke  wrote:
>>
>>> You should use a distributed filesystem such as HDFS. If you want to use
>>> the local filesystem then you have to copy each file to each node.
>>>
>>>
>>> > On 29. Sep 2017, at 12:05, Gaurav1809  wrote:
>>> >
>>> > Hi All,
>>> >
>>> > I have multi node architecture of (1 master,2 workers) Spark cluster, the
>>> > job runs to read CSV file data and it works fine when run on local mode
>>> > (Local(*)).
>>> > However, when the same job is ran in cluster mode(Spark://HOST:PORT), it is
>>> > not able to read it.
>>> > I want to know how to reference the files Or where to store them? Currently
>>> > the CSV data file is on master(from where the job is submitted).
>>> >
>>> > Following code works fine in local mode but not in cluster mode.
>>> >
>>> > val spark = SparkSession
>>> >  .builder()
>>> >  .appName("SampleFlightsApp")
>>> >  .master("spark://masterIP:7077") // change it to .master("local[*]) for local mode
>>> >  .getOrCreate()
>>> >
>>> >    val flightDF =
>>> > spark.read.option("header",true).csv("/home/username/sampleflightdata")
>>> >    flightDF.printSchema()
>>> >
>>> > Error: FileNotFoundException: File file:/home/username/sampleflightdata does
>>> > not exist
>>> >
>>> > --
>>> > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>> >
>>> > -
>>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>>
>
>
> --
Sent from Gmail Mobile


Re: graphframes on cluster

2017-09-22 Thread Imran Rajjad
sorry for posting without complete information

I am connecting to the Spark cluster with the driver program running as the
backend of a web application. This is intended to listen to job progress and
do some other work. Below is how I am connecting to the cluster:

sparkConf = new SparkConf().setAppName("isolated test")
    .setMaster("spark://master:7077")
    .set("spark.executor.memory", "6g")
    .set("spark.driver.memory", "6g")
    .set("spark.driver.maxResultSize", "2g")
    .set("spark.executor.extraJavaOptions", "-Xmx8g")
    .set("spark.jars.packages", "graphframes:graphframes:0.5.0-spark2.1-s_2.11")
    // this is a shared location on the Linux machines and has the required Java classes
    .set("spark.jars", "/home/usr/jobs.jar");

the crash occurs at

gFrame.connectedComponents().setBroadcastThreshold(2).run();

with exception

Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 0.0 (TID 5, 10.112.29.80):
java.lang.ClassCastException: cannot assign instance of
scala.collection.immutable.List$SerializationProxy to field
org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type
scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
at
java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133)
at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2024)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:71)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

After googling around, this appears to be related to dependencies, but I
don't have many dependencies apart from a few POJOs, which have been
included through the context.

regards,
Imran




On Wed, Sep 20, 2017 at 9:00 PM, Felix Cheung <felixcheun...@hotmail.com>
wrote:

> Could you include the code where it fails?
> Generally the best way to use gf is to use the --packages options with
> spark-submit command
>
> --
> *From:* Imran Rajjad <raj...@gmail.com>
> *Sent:* Wednesday, September 20, 2017 5:47:27 AM
> *To:* user @spark
> *Subject:* graphframes on cluster
>
> Trying to run graph frames on a spark cluster. Do I need to include the
> package in spark context settings? or the only the driver program is
> suppose to have the graphframe libraries in its class path? Currently the
> job is crashing when action function is invoked on graphframe classes.
>
> regards,
> Imran
>
> --
> I.R
>



-- 
I.R


graphframes on cluster

2017-09-20 Thread Imran Rajjad
Trying to run GraphFrames on a Spark cluster. Do I need to include the
package in the Spark context settings, or is only the driver program
supposed to have the graphframes libraries on its classpath? Currently the
job crashes when an action function is invoked on graphframe classes.

regards,
Imran

-- 
I.R


Re: graphframe out of memory

2017-09-08 Thread Imran Rajjad
No I did not, I thought Spark would take care of that itself since I have
put in the arguments.
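
For anyone else hitting this: when Spark runs embedded in local mode the driver is the application's own JVM, so spark.driver.memory set in SparkConf after startup cannot resize the heap; only the -Xmx of the launching process counts. A tiny sketch to verify what the JVM actually got:

// Quick sanity check inside the embedded application (no Spark API involved).
long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
System.out.println("Driver JVM max heap: " + maxHeapMb + " MB");
// If this is far below the intended 8g, pass -Xmx8g to the JVM that starts the app.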

On Thu, Sep 7, 2017 at 9:26 PM, Lukas Bradley <lukasbrad...@gmail.com>
wrote:

> Did you also increase the size of the heap of the Java app that is
> starting Spark?
>
> https://alvinalexander.com/blog/post/java/java-xmx-xms-
> memory-heap-size-control
>
> On Thu, Sep 7, 2017 at 12:16 PM, Imran Rajjad <raj...@gmail.com> wrote:
>
>> I am getting Out of Memory error while running connectedComponents job on
>> graph with around 12000 vertices and 134600 edges.
>> I am running spark in embedded mode in a standalone Java application and
>> have tried to increase the memory but it seems that its not taking any
>> effect
>>
>> sparkConf = new SparkConf().setAppName("SOME APP
>> NAME").setMaster("local[10]")
>> .set("spark.executor.memory","5g")
>> .set("spark.driver.memory","8g")
>> .set("spark.driver.maxResultSize","1g")
>> .set("spark.sql.warehouse.dir", "file:///d:/spark/tmp")
>> .set("hadoop.home.dir", "file:///D:/spark-2.1.0-bin-hadoop2.7/bin");
>>
>>   spark = SparkSession.builder().config(sparkConf).getOrCreate();
>>   spark.sparkContext().setLogLevel("ERROR");
>>   spark.sparkContext().setCheckpointDir("D:/spark/tmp");
>>
>> the stack trace
>> java.lang.OutOfMemoryError: Java heap space
>>  at java.util.Arrays.copyOf(Arrays.java:3332)
>>  at java.lang.AbstractStringBuilder.ensureCapacityInternal(Abstr
>> actStringBuilder.java:124)
>>  at java.lang.AbstractStringBuilder.append(AbstractStringBuilder
>> .java:448)
>>  at java.lang.StringBuilder.append(StringBuilder.java:136)
>>  at scala.StringContext.standardInterpolator(StringContext.scala:126)
>>  at scala.StringContext.s(StringContext.scala:95)
>>  at org.apache.spark.sql.execution.QueryExecution.toString(
>> QueryExecution.scala:230)
>>  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutio
>> nId(SQLExecution.scala:54)
>>  at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2788)
>>  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$e
>> xecute$1(Dataset.scala:2385)
>>  at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$D
>> ataset$$collect$1.apply(Dataset.scala:2390)
>>  at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$D
>> ataset$$collect$1.apply(Dataset.scala:2390)
>>  at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2801)
>>  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$c
>> ollect(Dataset.scala:2390)
>>  at org.apache.spark.sql.Dataset.collect(Dataset.scala:2366)
>>  at org.graphframes.lib.ConnectedComponents$.skewedJoin(Connecte
>> dComponents.scala:239)
>>  at org.graphframes.lib.ConnectedComponents$.org$graphframes$
>> lib$ConnectedComponents$$run(ConnectedComponents.scala:308)
>>  at org.graphframes.lib.ConnectedComponents.run(ConnectedCompone
>> nts.scala:139)
>>
>> GraphFrame version is 0.5.0 and Spark version is 2.1.1
>>
>> regards,
>> Imran
>>
>> --
>> I.R
>>
>
>


-- 
I.R


graphframe out of memory

2017-09-07 Thread Imran Rajjad
I am getting an Out of Memory error while running a connectedComponents job
on a graph with around 12000 vertices and 134600 edges.
I am running Spark in embedded mode in a standalone Java application and
have tried to increase the memory, but it seems that it's not taking any
effect.

sparkConf = new SparkConf().setAppName("SOME APP NAME").setMaster("local[10]")
    .set("spark.executor.memory", "5g")
    .set("spark.driver.memory", "8g")
    .set("spark.driver.maxResultSize", "1g")
    .set("spark.sql.warehouse.dir", "file:///d:/spark/tmp")
    .set("hadoop.home.dir", "file:///D:/spark-2.1.0-bin-hadoop2.7/bin");

  spark = SparkSession.builder().config(sparkConf).getOrCreate();
  spark.sparkContext().setLogLevel("ERROR");
  spark.sparkContext().setCheckpointDir("D:/spark/tmp");

the stack trace
java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:3332)
 at
java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
 at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
 at java.lang.StringBuilder.append(StringBuilder.java:136)
 at scala.StringContext.standardInterpolator(StringContext.scala:126)
 at scala.StringContext.s(StringContext.scala:95)
 at
org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:230)
 at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:54)
 at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2788)
 at org.apache.spark.sql.Dataset.org
$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2385)
 at
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2390)
 at
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2390)
 at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2801)
 at org.apache.spark.sql.Dataset.org
$apache$spark$sql$Dataset$$collect(Dataset.scala:2390)
 at org.apache.spark.sql.Dataset.collect(Dataset.scala:2366)
 at
org.graphframes.lib.ConnectedComponents$.skewedJoin(ConnectedComponents.scala:239)
 at
org.graphframes.lib.ConnectedComponents$.org$graphframes$lib$ConnectedComponents$$run(ConnectedComponents.scala:308)
 at
org.graphframes.lib.ConnectedComponents.run(ConnectedComponents.scala:139)

GraphFrame version is 0.5.0 and Spark version is 2.1.1

regards,
Imran

-- 
I.R


unable to import graphframes

2017-08-29 Thread Imran Rajjad
Dear list,

I am following the graphframes documentation and have started the Scala
shell using the following command:


D:\spark-2.1.0-bin-hadoop2.7\bin>spark-shell --master local[2] --packages
graphframes:graphframes:0.5.0-spark2.1-s_2.10

Ivy Default Cache set to: C:\Users\user\.ivy2\cache
The jars for the packages stored in: C:\Users\user\.ivy2\jars
:: loading settings :: url =
jar:file:/D:/spark-2.1.0-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found graphframes#graphframes;0.5.0-spark2.1-s_2.10 in
spark-packages
found com.typesafe.scala-logging#scala-logging-api_2.10;2.1.2 in
central
found com.typesafe.scala-logging#scala-logging-slf4j_2.10;2.1.2 in
central
found org.scala-lang#scala-reflect;2.10.4 in central
found org.slf4j#slf4j-api;1.7.7 in local-m2-cache
:: resolution report :: resolve 288ms :: artifacts dl 7ms
:: modules in use:
com.typesafe.scala-logging#scala-logging-api_2.10;2.1.2 from
central in [default]
com.typesafe.scala-logging#scala-logging-slf4j_2.10;2.1.2 from
central in [default]
graphframes#graphframes;0.5.0-spark2.1-s_2.10 from spark-packages
in [default]
org.scala-lang#scala-reflect;2.10.4 from central in [default]
org.slf4j#slf4j-api;1.7.7 from local-m2-cache in [default]

        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   5   |   0   |   0   |   0   ||   5   |   0   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 5 already retrieved (0kB/7ms)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
setLogLevel(newLevel).
2017-08-29 12:10:23,089 [main] WARN  NativeCodeLoader  - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2017-08-29 12:10:25,128 [main] WARN  General  - Plugin (Bundle)
"org.datanucleus.store.rdbms" is already registered. Ensure you dont have
multiple JAR versions of the same plugin in the classpath. The URL
"file:/D:/spark-2.1.0-bin-hadoop2.7/jars/datanucleus-rdbms-3.2.9.jar" is
already registered, and you are trying to register an identical plugin
located at URL
"file:/D:/spark-2.1.0-bin-hadoop2.7/bin/../jars/datanucleus-rdbms-3.2.9.jar."
2017-08-29 12:10:25,137 [main] WARN  General  - Plugin (Bundle)
"org.datanucleus" is already registered. Ensure you dont have multiple JAR
versions of the same plugin in the classpath. The URL
"file:/D:/spark-2.1.0-bin-hadoop2.7/jars/datanucleus-core-3.2.10.jar" is
already registered, and you are trying to register an identical plugin
located at URL
"file:/D:/spark-2.1.0-bin-hadoop2.7/bin/../jars/datanucleus-core-3.2.10.jar."
2017-08-29 12:10:25,141 [main] WARN  General  - Plugin (Bundle)
"org.datanucleus.api.jdo" is already registered. Ensure you dont have
multiple JAR versions of the same plugin in the classpath. The URL
"file:/D:/spark-2.1.0-bin-hadoop2.7/bin/../jars/datanucleus-api-jdo-3.2.6.jar"
is already registered, and you are trying to register an identical plugin
located at URL
"file:/D:/spark-2.1.0-bin-hadoop2.7/jars/datanucleus-api-jdo-3.2.6.jar."
2017-08-29 12:10:27,744 [main] WARN  ObjectStore  - Failed to get database
global_temp, returning NoSuchObjectException
Spark context Web UI available at http://192.168.10.60:4040
Spark context available as 'sc' (master = local[2], app id =
local-1503990623864).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java
1.8.0_112)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
scala> import org.graphframes._
<console>:23: error: object graphframes is not a member of package org
       import org.graphframes._

Is there something missing?

regards,
Imran

-- 
I.R


Re: Oracle Table not resolved [Spark 2.1.1]

2017-08-28 Thread Imran Rajjad
The JDBC URL is invalid, but strangely it should have thrown an ORA- exception.
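
For comparison, a sketch of the thin-driver URL form the Oracle JDBC driver expects (host, port and SID taken from the original post):

// Oracle thin driver URL: jdbc:oracle:thin:@<host>:<port>:<SID>
String url = "jdbc:oracle:thin:@localhost:1521:ora";

Dataset<Row> jdbcDF = spark.read()
    .format("jdbc")
    .option("driver", "oracle.jdbc.driver.OracleDriver")
    .option("url", url)
    .option("dbtable", "INCIDENTS")
    .option("user", "user1")
    .option("password", "pass1")
    .load();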

On Mon, Aug 28, 2017 at 4:55 PM, Naga G <gudurun...@gmail.com> wrote:

> Not able to find the database name.
> ora is the database in the below url ?
>
> Sent from Naga iPad
>
> > On Aug 28, 2017, at 4:06 AM, Imran Rajjad <raj...@gmail.com> wrote:
> >
> > Hello,
> >
> > I am trying to retrieve an oracle table into Dataset using
> following code
> >
> > String url = "jdbc:oracle@localhost:1521:ora";
> >   Dataset jdbcDF = spark.read()
> >   .format("jdbc")
> >   .option("driver", "oracle.jdbc.driver.OracleDriver")
> >   .option("url", url)
> >   .option("dbtable", "INCIDENTS")
> >   .option("user", "user1")
> >   .option("password", "pass1")
> >   .load();
> >
> >   System.out.println(jdbcDF.count());
> >
> > below is the stack trace
> >
> > java.lang.NullPointerException
> >  at org.apache.spark.sql.execution.datasources.jdbc.
> JDBCRDD$.resolveTable(JDBCRDD.scala:72)
> >  at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(
> JDBCRelation.scala:113)
> >  at org.apache.spark.sql.execution.datasources.jdbc.
> JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:45)
> >  at org.apache.spark.sql.execution.datasources.
> DataSource.resolveRelation(DataSource.scala:330)
> >  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
> >  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
> >  at com.elogic.hazelcast_test.JDBCTest.test(JDBCTest.java:56)
> >  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >  at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:62)
> >  at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
> >  at java.lang.reflect.Method.invoke(Method.java:498)
> >  at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(
> FrameworkMethod.java:50)
> >  at org.junit.internal.runners.model.ReflectiveCallable.run(
> ReflectiveCallable.java:12)
> >  at org.junit.runners.model.FrameworkMethod.invokeExplosively(
> FrameworkMethod.java:47)
> >  at org.junit.internal.runners.statements.InvokeMethod.
> evaluate(InvokeMethod.java:17)
> >  at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> >  at org.junit.runners.BlockJUnit4ClassRunner.runChild(
> BlockJUnit4ClassRunner.java:78)
> >  at org.junit.runners.BlockJUnit4ClassRunner.runChild(
> BlockJUnit4ClassRunner.java:57)
> >  at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> >  at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> >  at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> >  at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> >  at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> >  at org.junit.internal.runners.statements.RunBefores.
> evaluate(RunBefores.java:26)
> >  at org.junit.internal.runners.statements.RunAfters.evaluate(
> RunAfters.java:27)
> >  at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> >  at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(
> JUnit4TestReference.java:86)
> >  at org.eclipse.jdt.internal.junit.runner.TestExecution.
> run(TestExecution.java:38)
> >  at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.
> runTests(RemoteTestRunner.java:459)
> >  at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.
> runTests(RemoteTestRunner.java:678)
> >  at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.
> run(RemoteTestRunner.java:382)
> >  at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.
> main(RemoteTestRunner.java:192)
> >
> > Apparently the connection is made but Table is not being detected. Any
> ideas whats wrong with the code?
> >
> > regards,
> > Imran
> > --
> > I.R
>



-- 
I.R


Re: Spark SQL vs HiveQL

2017-08-28 Thread Imran Rajjad
If reading directly from a file, then Spark SQL should be your choice.


On Mon, Aug 28, 2017 at 10:25 PM Michael Artz 
wrote:

> Just to be clear, I'm referring to having Spark reading from a file, not
> from a Hive table.  And it will have tungsten engine off heap serialization
> after 2.1, so if it was a test with like 1.63 it won't be as helpful.
>
>
> On Mon, Aug 28, 2017 at 10:50 AM, Michael Artz 
> wrote:
>
>> Hi,
>>   There isn't any good source to answer the question if Hive as an
>> SQL-On-Hadoop engine just as fast as Spark SQL now? I just want to know if
>> there has been a comparison done lately for HiveQL vs Spark SQL on Spark
>> versions 2.1 or later.  I have a large ETL process, with many table joins
>> and some string manipulation. I don't think anyone has done this kind of
>> testing in a while.  With Hive LLAP being so performant, I am trying to
>> make the case for using Spark and some of the architects are light on
>> experience so they are scared of Scala.
>>
>> Thanks
>>
>>
>>
>
>
> --
Sent from Gmail Mobile


Oracle Table not resolved [Spark 2.1.1]

2017-08-28 Thread Imran Rajjad
Hello,

I am trying to retrieve an Oracle table into a Dataset<Row> using the following
code:

String url = "jdbc:oracle@localhost:1521:ora";
  Dataset<Row> jdbcDF = spark.read()
      .format("jdbc")
      .option("driver", "oracle.jdbc.driver.OracleDriver")
      .option("url", url)
      .option("dbtable", "INCIDENTS")
      .option("user", "user1")
      .option("password", "pass1")
      .load();

  System.out.println(jdbcDF.count());

below is the stack trace

java.lang.NullPointerException
 at
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:72)
 at
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:113)
 at
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:45)
 at
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
 at com.elogic.hazelcast_test.JDBCTest.test(JDBCTest.java:56)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
 at
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
 at
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
 at
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
 at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
 at
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
 at
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
 at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
 at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
 at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
 at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
 at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
 at
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
 at
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
 at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
 at
org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
 at
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
 at
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
 at
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:678)
 at
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
 at
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)

Apparently the connection is made but the table is not being detected. Any
ideas what's wrong with the code?

regards,
Imran
-- 
I.R


Thrift-Server JDBC ResultSet Cursor Reset or Previous

2017-08-16 Thread Imran Rajjad
Dear List,

Are there any future plans to implement cursor reset or previous-record
functionality in the Thrift Server's JDBC driver? Are there any other
alternatives?

java.sql.SQLException: Method not supported
at
org.apache.hive.jdbc.HiveBaseResultSet.previous(HiveBaseResultSet.java:643)
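
Since the Hive JDBC driver exposes a forward-only result set, one workaround is to buffer rows on the client and navigate the buffer instead; a small sketch (query, connection and column list are illustrative):

// Read forward-only, buffer client-side, then move backwards/forwards freely.
List<Object[]> buffer = new ArrayList<>();
try (Statement st = connection.createStatement();
     ResultSet rs = st.executeQuery("SELECT id, name FROM some_table")) {
    while (rs.next()) {
        buffer.add(new Object[] { rs.getObject(1), rs.getObject(2) });
    }
}
// buffer.get(i) can now be revisited in any order, unlike the live cursor.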

regards
Imran

-- 
I.R


solr data source not working

2017-07-20 Thread Imran Rajjad
I am unable to register Solr Cloud as a data source in Spark 2.1.0.
Following the documentation at
https://github.com/lucidworks/spark-solr#import-jar-file-via-spark-shell, I
have used the 3.0.0.beta3 version.

The system path is displaying the added jar as
spark://172.31.208.1:55730/jars/spark-solr-3.0.0-beta3-shaded.jar Added By
User

OS:Win10
Hadoop : 2.7 (x64 winutils)
Spark:2.1.0
Solr-Spark:3.0.0-beta3

The same was tried with Spark 2.2.0 and solr-spark 3.1.0.

ERROR
scala> val df = spark.read.format("solr").options(Map("collection" ->
"cdr1","zkhost" -> "localhost:9983")).load
java.lang.ClassNotFoundException: Failed to find data source: solr. Please
find packages at http://spark.apache.org/third-party-projects.html
  at
org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:569)
  at
org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86)
  at
org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
  at
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:325)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
  ... 48 elided
Caused by: java.lang.ClassNotFoundException: solr.DefaultSource
  at
scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  at
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$25$$anonfun$apply$13.apply(DataSource.scala:554)
  at
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$25$$anonfun$apply$13.apply(DataSource.scala:554)
  at scala.util.Try$.apply(Try.scala:192)
  at
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$25.apply(DataSource.scala:554)
  at
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$25.apply(DataSource.scala:554)
  at scala.util.Try.orElse(Try.scala:84)
  at
org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:554)



-- 
I.R


Slow response on Solr Cloud with Spark

2017-07-19 Thread Imran Rajjad
Greetings,

We are trying out Spark 2 + Thrift Server to join multiple
collections from a Solr Cloud (6.4.x). I have followed this blog:
https://lucidworks.com/2015/08/20/solr-spark-sql-datasource/

I understand that initially Spark populates the temporary table with 18633014
records and takes its due time; however, any subsequent SQL on the temporary
table takes the same amount of time. It seems the temporary table is not
being re-used or cached. The fields in the Solr collection do not have
docValues enabled; could that be the reason? Apparently I have missed a trick.
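
One thing worth ruling out, sketched below in the Java API (collection and view names are illustrative): a registered temporary view is lazily re-evaluated, i.e. re-read from Solr, on every SQL statement unless it is explicitly cached.

// Read the collection once and register it as a view.
Dataset<Row> cdr = spark.read().format("solr")
    .option("zkhost", "localhost:9983")
    .option("collection", "cdr1")
    .load();
cdr.createOrReplaceTempView("cdr1");

// Cache it so subsequent joins reuse the in-memory copy instead of hitting Solr again.
spark.sql("CACHE TABLE cdr1");   // or cdr.cache() followed by an action such as cdr.count()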

regards,
Imran

-- 
I.R