union of multiple twitter streams [spark-streaming-twitter_2.11]
Hello, Has anybody tried to union two streams of Twitter Statuses? I am instantiating two Twitter streams through two different sets of credentials and passing them through a union function, but the console shows no activity, nor are there any errors.

--static function that returns JavaReceiverInputDStream--

public static JavaReceiverInputDStream getTwitterStream(JavaStreamingContext spark, String consumerKey, String consumerSecret, String accessToken, String accessTokenSecret, String[] filter) {
    // Enable OAuth
    ConfigurationBuilder cb = new ConfigurationBuilder();
    cb.setDebugEnabled(false)
      .setOAuthConsumerKey(consumerKey).setOAuthConsumerSecret(consumerSecret)
      .setOAuthAccessToken(accessToken).setOAuthAccessTokenSecret(accessTokenSecret)
      .setJSONStoreEnabled(true);
    TwitterFactory tf = new TwitterFactory(cb.build());
    Twitter twitter = tf.getInstance();
    // Create stream
    return TwitterUtils.createStream(spark, twitter.getAuthorization(), filter);
}

---trying to union two twitter streams---

JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.minutes(5));
jssc.sparkContext().setLogLevel("ERROR");
JavaReceiverInputDStream twitterStreamByHashtag = TwitterUtil.getTwitterStream(jssc, consumerKey1, consumerSecret1, accessToken1, accessTokenSecret1, new String[]{"#Twitter"});
JavaReceiverInputDStream twitterStreamByUser = TwitterUtil.getTwitterStream(jssc, consumerKey2, consumerSecret2, accessToken2, accessTokenSecret2, new String[]{"@Twitter"});
JavaDStream statuses = twitterStreamByHashtag
    .union(twitterStreamByUser)
    .map(s -> { return s.getText(); });

regards, Imran -- I.R
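For anyone comparing notes, here is a minimal end-to-end sketch (TwitterUtil.getTwitterStream is the helper above; the credential variables and filters are placeholders). Two receiver-based streams occupy two cores, so the context needs spare cores for processing, and nothing reaches the console until the context is started:

```java
SparkConf conf = new SparkConf().setAppName("twitter-union").setMaster("local[4]"); // >= receivers + 1 cores
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.minutes(5));

JavaReceiverInputDStream<Status> byHashtag = TwitterUtil.getTwitterStream(
    jssc, consumerKey1, consumerSecret1, accessToken1, accessTokenSecret1, new String[]{"#Twitter"});
JavaReceiverInputDStream<Status> byUser = TwitterUtil.getTwitterStream(
    jssc, consumerKey2, consumerSecret2, accessToken2, accessTokenSecret2, new String[]{"@Twitter"});

JavaDStream<String> statuses = byHashtag.union(byUser).map(Status::getText);
statuses.print();          // force an output operation so activity shows on the console

jssc.start();              // receivers only begin pulling tweets after start()
jssc.awaitTermination();
```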
Re: unable to connect to cluster 2.2.0
Thanks - the machine where the Spark job was being submitted had SPARK_HOME pointing to the old 2.1.1 directory.

On Wed, Dec 6, 2017 at 1:35 PM, Qiao, Richard <richard.q...@capitalone.com> wrote:
> Are you now building your app using spark 2.2 or 2.1?
>
> Best Regards
> Richard
unable to connect to cluster 2.2.0
Hi, Recently upgraded from 2.1.1 to 2.2.0. My Streaming job seems to have broken. The submitted application is unable to connect to the cluster, when all is running. below is my stack trace Spark Master:spark://192.168.10.207:7077 Job Arguments: -appName orange_watch -directory /u01/watch/stream/ Spark Configuration: [spark.executor.memory, spark.driver.memory, spark.app.name, spark.executor.cores]:6g [spark.executor.memory, spark.driver.memory, spark.app.name, spark.executor.cores]:4g [spark.executor.memory, spark.driver.memory, spark.app.name, spark.executor.cores]:orange_watch [spark.executor.memory, spark.driver.memory, spark.app.name, spark.executor.cores]:2 Spark Arguments: [--packages]:graphframes:graphframes:0.5.0-spark2.1-s_2.11 Using properties file: /home/my_user/spark-2.2.0-bin-hadoop2.7/conf/spark-defaults.conf Adding default property: spark.jars.packages=graphframes:graphframes:0.5.0-spark2.1-s_2.11 Parsed arguments: master spark://192.168.10.207:7077 deployMode null executorMemory 6g executorCores 2 totalExecutorCores null propertiesFile /home/my_user/spark-2.2.0-bin-hadoop2.7/conf/spark-defaults.conf driverMemory4g driverCores null driverExtraClassPathnull driverExtraLibraryPath null driverExtraJavaOptions null supervise false queue null numExecutorsnull files null pyFiles null archivesnull mainClass com.my_user.MainClassWatch primaryResource file:/home/my_user/cluster-testing/job.jar nameorange_watch childArgs [-watchId 3199 -appName orange_watch -directory /u01/watch/stream/] jarsnull packagesgraphframes:graphframes:0.5.0-spark2.1-s_2.11 packagesExclusions null repositoriesnull verbose true Spark properties used, including those specified through --conf and those from the properties file /home/my_user/spark-2.2.0-bin-hadoop2.7/conf/spark-defaults.conf: (spark.driver.memory,4g) (spark.executor.memory,6g) (spark.jars.packages,graphframes:graphframes:0.5.0-spark2.1-s_2.11) (spark.app.name,orange_watch) (spark.executor.cores,2) Ivy Default Cache set to: /home/my_user/.ivy2/cache The jars for the packages stored in: /home/my_user/.ivy2/jars :: loading settings :: url = jar:file:/home/my_user/spark-2.2.0-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml graphframes#graphframes added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 confs: [default] found graphframes#graphframes;0.5.0-spark2.1-s_2.11 in spark-list found com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 in central found com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 in central found org.scala-lang#scala-reflect;2.11.0 in central found org.slf4j#slf4j-api;1.7.7 in spark-list :: resolution report :: resolve 191ms :: artifacts dl 5ms :: modules in use: com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 from central in [default] com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 from central in [default] graphframes#graphframes;0.5.0-spark2.1-s_2.11 from spark-list in [default] org.scala-lang#scala-reflect;2.11.0 from central in [default] org.slf4j#slf4j-api;1.7.7 from spark-list in [default] - | |modules|| artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| - | default | 5 | 0 | 0 | 0 || 5 | 0 | - :: retrieving :: org.apache.spark#spark-submit-parent confs: [default] 0 artifacts copied, 5 already retrieved (0kB/7ms) Main class: com.my_user.MainClassWatch Arguments: -watchId 3199 -appName orange_watch -directory /u01/watch/stream/ System properties: (spark.executor.memory,6g) (spark.driver.memory,4g) 
(SPARK_SUBMIT,true) (spark.jars.packages,graphframes:graphframes:0.5.0-spark2.1-s_2.11) (spark.app.name,orange_watch) (spark.jars,file:/home/my_user/.ivy2/jars/graphframes_graphframes-0.5.0-spark2.1-s_2.11.jar,file:/home/my_user/.ivy2/jars/com.typesafe.scala-logging_scala-logging-api_2.11-2.1.2.jar,file:/home/my_user/.ivy2/jars/com.typesafe.scala-logging_scala-logging-slf4j_2.11-2.1.2.jar,file:/home/my_user/.ivy2/jars/org.scala-lang_scala-reflect-2.11.0.jar,file:/home/my_user/.ivy2/jars/org.slf4j_slf4j-api-1.7.7.jar,file:/home/my_user/cluster-testing/job.jar) (spark.submit.deployMode,client) (spark.master,spark://192.168.10.207:7077)
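As a side note, when an upgrade like this misbehaves, a quick sanity check (just a suggestion, not part of the original post) is to print which build the submitting JVM actually resolves:

```java
// Confirms the Spark build seen by the driver matches the 2.2.0 cluster.
System.out.println("SPARK_HOME  = " + System.getenv("SPARK_HOME"));
System.out.println("Spark build = " + spark.version());   // spark is the SparkSession
```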
spark strucured csv file stream not detecting new files
Greetings, I am running a unit test designed to stream a folder into which I am manually copying CSV files. The files do not always get picked up; they only get detected when the job starts with the files already in the folder. I even tried using the fileNameOnly option newly included in 2.2.0. Have I missed something in the documentation? This problem does not seem to occur in the DStreams examples.

DataStreamReader reader = spark.readStream()
    .option("fileNameOnly", true)
    .option("header", true)
    .schema(userSchema);
Dataset csvDF = reader.csv(watchDir);
Dataset results = csvDF.groupBy("myCol").count();
MyForEach forEachObj = new MyForEach();
query = results
    .writeStream()
    .foreach(forEachObj) // foreach never gets called
    .outputMode("complete")
    .start();

-- I.R
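For comparison, a stripped-down variant that tends to behave (a sketch only; the column names are made up and the console sink stands in for the custom foreach): the file source needs an explicit schema, and new files are detected reliably only when they are moved or renamed atomically into the watched directory after the query has started.

```java
StructType userSchema = new StructType()
    .add("myCol", DataTypes.StringType)      // hypothetical columns
    .add("value", DataTypes.LongType);

Dataset<Row> csvDF = spark.readStream()
    .option("header", "true")
    .schema(userSchema)
    .csv(watchDir);

StreamingQuery query = csvDF.groupBy("myCol").count()
    .writeStream()
    .outputMode("complete")
    .format("console")                       // swap back to .foreach(forEachObj) once files are seen
    .start();

query.awaitTermination();
```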
spark-stream memory table global?
Hi, Is the in-memory table that Spark Structured Streaming results are sunk into available to other Spark applications on the cluster? Is it global by default, or only available to the context where the streaming is being done? thanks Imran -- I.R
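As far as I know the answer is no: the memory sink only registers a temporary view in the SparkSession that owns the streaming query, backed by the driver's memory, so other applications on the cluster cannot see it. A minimal sketch (counts is assumed to be the streaming result Dataset):

```java
StreamingQuery query = counts.writeStream()
    .format("memory")
    .queryName("stream_counts")     // becomes the temp view name
    .outputMode("complete")
    .start();

// Only this SparkSession (and this JVM) can query it:
spark.sql("SELECT * FROM stream_counts").show();
```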
unable to run spark streaming example
I am trying out the network word count example and my unit test is producing the below console output with an exception:

Exception in thread "dispatcher-event-loop-5" java.lang.NoClassDefFoundError: scala/runtime/AbstractPartialFunction$mcVL$sp
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
 at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
 at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint.receive(ReceiverTracker.scala:476)
 at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
 at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
 at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
 at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: scala.runtime.AbstractPartialFunction$mcVL$sp
 at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 ... 20 more

--- Time: 1509716745000 ms ---
--- Time: 1509716746000 ms ---

From DOS I am pushing a text file through netcat with the following command: nc -l -p < license.txt ...

Below are my Spark-related Maven dependencies (spark.version = 2.1.1):
- org.apache.spark : spark-launcher_2.10 : ${spark.version}
- org.apache.spark : spark-core_2.11 : ${spark.version} (provided)
- org.apache.spark : spark-graphx_2.11 : ${spark.version} (provided)
- org.apache.spark : spark-sql_2.11 : ${spark.version} (provided)
- graphframes : graphframes : 0.5.0-spark2.1-s_2.11
- org.apache.spark : spark-mllib_2.10 : ${spark.version} (provided)
- org.apache.spark : spark-streaming_2.10 : ${spark.version} (provided)

-- I.R
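The missing scala.runtime.AbstractPartialFunction$mcVL$sp points at mixed Scala binary versions: spark-streaming_2.10 and spark-mllib_2.10 are being loaded next to _2.11 artifacts. A hedged sketch of the fix is simply to keep every Spark artifact on the same Scala suffix, for example:

```xml
<!-- Assumption: standardizing on Scala 2.11 to match spark-core_2.11 -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.11</artifactId>
  <version>${spark.version}</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.11</artifactId>
  <version>${spark.version}</version>
  <scope>provided</scope>
</dependency>
```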
Re: partition by multiple columns/keys
Strangely, this is working only for a very small set of rows; for very large datasets the partitioning apparently does not work. Is there a limit to the number of columns or rows when repartitioning by multiple columns? regards, Imran

On Wed, Oct 18, 2017 at 11:00 AM, Imran Rajjad <raj...@gmail.com> wrote:
> yes..I think I figured out something like below

-- I.R
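One possible explanation (an assumption, not verified): repartition(col("a"), col("b")) only hash-partitions the data into spark.sql.shuffle.partitions buckets, so on a large dataset many distinct (a, b) combinations land in the same partition and mapPartitions sees them mixed together; with tiny data each bucket happened to hold a single combination. If the goal is one group per key, groupByKey/mapGroups guarantees it. A sketch reusing the schema from the example below:

```java
KeyValueGroupedDataset<String, Row> byKey = ds.groupByKey(
    (MapFunction<Row, String>) r -> r.getLong(0) + "_" + r.getLong(1),   // key = (a, b)
    Encoders.STRING());

Dataset<String> perGroup = byKey.mapGroups(
    (MapGroupsFunction<String, Row, String>) (key, rows) -> {
        long n = 0;
        while (rows.hasNext()) { rows.next(); n++; }   // all rows for one (a, b) key
        return key + ":" + n;
    },
    Encoders.STRING());
```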
Re: jar file problem
A simple way is to have a network volume mounted with the same name on all nodes, to make things easy.

On Thu, 19 Oct 2017 at 8:24 PM Uğur Sopaoğlu wrote:
> Hello,
>
> I have a very easy problem. When I run a Spark job, I must copy the jar file to all worker nodes. Is there any simpler way to do this?
>
> --
> Uğur Sopaoğlu

-- Sent from Gmail Mobile
Re: partition by multiple columns/keys
Yes - I think I figured out something like below.

Serialized Java Class
-
public class MyMapPartition implements Serializable, MapPartitionsFunction {
    @Override
    public Iterator call(Iterator iter) throws Exception {
        ArrayList list = new ArrayList();
        // ArrayNode array = mapper.createArrayNode();
        Row row = null;
        System.out.println("");
        while (iter.hasNext()) {
            row = (Row) iter.next();
            System.out.println(row);
            list.add(row);
        }
        System.out.println(">>>>");
        return list.iterator();
    }
}

Unit Test
---
JavaRDD rdd = jsc.parallelize(Arrays.asList(
    RowFactory.create(11L,21L,1L),
    RowFactory.create(11L,22L,2L),
    RowFactory.create(11L,22L,1L),
    RowFactory.create(12L,23L,3L),
    RowFactory.create(12L,24L,3L),
    RowFactory.create(12L,22L,4L),
    RowFactory.create(13L,22L,4L),
    RowFactory.create(14L,22L,4L)));
StructType structType = new StructType();
structType = structType.add("a", DataTypes.LongType, false)
    .add("b", DataTypes.LongType, false)
    .add("c", DataTypes.LongType, false);
ExpressionEncoder encoder = RowEncoder.apply(structType);

Dataset ds = spark.createDataFrame(rdd, encoder.schema());
ds.show();

MyMapPartition mp = new MyMapPartition();
// Iterator
// .repartition(new Column("a"), new Column("b"))
Dataset grouped = ds.groupBy("a", "b", "c")
    .count()
    .repartition(new Column("a"), new Column("b"))
    .mapPartitions(mp, encoder);

grouped.count();

---
output

[12,23,3,1]
>>>>
[14,22,4,1]
>>>>
[12,24,3,1]
>>>>
[12,22,4,1]
>>>>
[11,22,1,1]
[11,22,2,1]
>>>>
[11,21,1,1]
>>>>
[13,22,4,1]
>>>>

On Wed, Oct 18, 2017 at 10:29 AM, ayan guha <guha.a...@gmail.com> wrote:
> How or what you want to achieve? Ie are planning to do some aggregation on group by c1,c2?
>
> On Wed, 18 Oct 2017 at 4:13 pm, Imran Rajjad <raj...@gmail.com> wrote:
>> Hi, I have a set of rows that are a result of a groupBy(col1,col2,col3).count(). Is it possible to map rows belonging to a unique combination inside an iterator?

-- I.R
Re: No space left on device
I don't think so - check out the documentation for this method.

On Wed, Oct 18, 2017 at 10:11 AM, Mina Aslani <aslanim...@gmail.com> wrote:
> I have not tried rdd.unpersist(), I thought using rdd = null is the same, is it not?
>
> On Wed, Oct 18, 2017 at 1:07 AM, Imran Rajjad <raj...@gmail.com> wrote:
>> did you try calling rdd.unpersist()

-- I.R
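A small sketch of the difference, for the archives (rdd stands for whatever cached RDD is involved):

```java
rdd.unpersist(true);   // asks Spark to drop the cached blocks now; blocking=true waits for it
rdd = null;            // only releases the driver-side reference; cleanup happens lazily via GC
```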
partition by multiple columns/keys
Hi, I have a set of rows that are the result of a groupBy(col1,col2,col3).count(). Is it possible to map the rows belonging to each unique combination inside an iterator? e.g.

col1 col2 col3
a    1    a1
a    1    a2
b    2    b1
b    2    b2

How can I separate the rows with (col1, col2) = (a,1) from those with (b,2)? regards, Imran -- I.R
Re: No space left on device
did you try calling rdd.unpersist()

On Wed, Oct 18, 2017 at 10:04 AM, Mina Aslani wrote:
> Hi,
>
> I get "No space left on device" error in my spark worker:
>
> Error writing stream to file /usr/spark-2.2.0/work/app-.../0/stderr
> java.io.IOException: No space left on device
>
> In my spark cluster, I have one worker and one master.
> My program consumes stream of data from kafka and publishes the result into kafka. I set my RDD = null after I finish working, so that intermediate shuffle files are removed quickly.
>
> How can I avoid "No space left on device"?
>
> Best regards,
> Mina

-- I.R
task not serializable on simple operations
Is there a way around having to implement a separate Java class that implements the Serializable interface, even for small, petty arithmetic operations? Below is code from the simple decision tree example:

Double testMSE = predictionAndLabel.map(new Function<Tuple2<Double, Double>, Double>() {
    @Override
    public Double call(Tuple2<Double, Double> pl) {
        Double diff = pl._1() - pl._2();
        return diff * diff;
    }
}).reduce(new Function2<Double, Double, Double>() {
    @Override
    public Double call(Double a, Double b) {
        return a + b;
    }
}) / testDataCount;

There are no complex Java objects involved here, just simple arithmetic operations. regards, Imran -- I.R
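For what it's worth, with Java 8 the same computation can be written with lambdas, which also sidesteps the usual cause of "Task not serializable" (an anonymous inner class capturing its enclosing, non-serializable class); a sketch assuming predictionAndLabel is a JavaPairRDD<Double, Double> as in the MLlib example:

```java
double testMSE = predictionAndLabel
    .map(pl -> {
        double diff = pl._1() - pl._2();   // lambdas capture only the variables they use
        return diff * diff;
    })
    .reduce((a, b) -> a + b) / testDataCount;
```

If lambdas are not an option, a static nested class (or a top-level class) avoids dragging the outer instance into the closure.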
Re: Apache Spark-Subtract two datasets
If the datasets hold objects of different classes, then you will have to convert both of them to RDDs and then rename the columns before you call rdd1.subtract(rdd2).

On Thu, Oct 12, 2017 at 10:16 PM, Shashikant Kulkarni <shashikant.kulka...@gmail.com> wrote:
> Hello,
>
> I have 2 datasets, Dataset and other is Dataset. I want the list of records which are in Dataset but not in Dataset. How can I do this in Apache Spark using Java Connector? I am using Apache Spark 2.2.0
>
> Thank you

-- I.R
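Alternatively, if both Datasets can be projected onto the same, identically named columns, except() does this without dropping to RDDs (a sketch; the column names are made up):

```java
// Rows present in ds1 but absent from ds2, compared on the selected columns.
Dataset<Row> onlyInFirst = ds1.select("id", "name")
                              .except(ds2.select("id", "name"));
```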
Re: best spark spatial lib?
Thanks guys for the response. Basically I am migrating an Oracle PL/SQL procedure to Spark/Java. In Oracle I have a table with a geometry column, on which I am able to do a "where col = 1 and geom.within(another_geom)". I am looking for a less complicated way to port such queries into Spark. I will give these libraries a shot. @Ram .. Magellan, I guess, does not support Java. regards, Imran

On Wed, Oct 11, 2017 at 12:07 AM, Ram Sriharsha <sriharsha@gmail.com> wrote:
> why can't you do this in Magellan?
> Can you post a sample query that you are trying to run that has spatial and logical operators combined? Maybe I am not understanding the issue properly
>
> Ram
>
> On Tue, Oct 10, 2017 at 2:21 AM, Imran Rajjad <raj...@gmail.com> wrote:
>> I need to have a location column inside my Dataframe so that I can do spatial queries and geometry operations.

-- I.R
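In case it helps, one low-tech way to get a "where col = 1 and geom.within(...)" style query in Spark is a geometry UDF over WKT, e.g. with the JTS library; everything below (the view name incidents, the column geom_wkt, the areaWkt polygon) is hypothetical:

```java
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.io.WKTReader;

Geometry area = new WKTReader().read(areaWkt);        // the query polygon
spark.udf().register("withinArea",
    (UDF1<String, Boolean>) wkt -> new WKTReader().read(wkt).within(area),
    DataTypes.BooleanType);

// Spatial predicate combined with an ordinary logical filter:
Dataset<Row> hits = spark.sql(
    "SELECT * FROM incidents WHERE col = 1 AND withinArea(geom_wkt)");
```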
best spark spatial lib?
I need to have a location column inside my Dataframe so that I can do spatial queries and geometry operations. Are there any third-party packages that perform these kinds of operations? I have seen a few, like GeoSpark and Magellan, but they don't support operations where spatial and logical operators can be combined. regards, Imran -- I.R
Re: [Spark-Submit] Where to store data files while running job in cluster mode?
Try Tachyon.. it's less fuss.

On Fri, 29 Sep 2017 at 8:32 PM lucas.g...@gmail.com wrote:
> We use S3, there are caveats and issues with that but it can be made to work.
>
> If interested let me know and I'll show you our workarounds. I wouldn't do it naively though, there's lots of potential problems. If you already have HDFS use that, otherwise all things told it's probably less effort to use S3.
>
> Gary
>
> On 29 September 2017 at 05:03, Arun Rai wrote:
>> Or you can try mounting that drive to all nodes.
>>
>> On Fri, Sep 29, 2017 at 6:14 AM Jörn Franke wrote:
>>> You should use a distributed filesystem such as HDFS. If you want to use the local filesystem then you have to copy each file to each node.
>>>
>>> > On 29. Sep 2017, at 12:05, Gaurav1809 wrote:
>>> > Hi All,
>>> > I have a multi-node (1 master, 2 workers) Spark cluster; the job runs to read CSV file data and it works fine when run in local mode (local[*]). However, when the same job is run in cluster mode (spark://HOST:PORT), it is not able to read it. I want to know how to reference the files, or where to store them? Currently the CSV data file is on the master (from where the job is submitted).
>>> >
>>> > Following code works fine in local mode but not in cluster mode.
>>> >
>>> > val spark = SparkSession
>>> >   .builder()
>>> >   .appName("SampleFlightsApp")
>>> >   .master("spark://masterIP:7077") // change it to .master("local[*]") for local mode
>>> >   .getOrCreate()
>>> >
>>> > val flightDF = spark.read.option("header", true).csv("/home/username/sampleflightdata")
>>> > flightDF.printSchema()
>>> >
>>> > Error: FileNotFoundException: File file:/home/username/sampleflightdata does not exist
>>> >
>>> > --
>>> > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-- Sent from Gmail Mobile
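Whichever shared store ends up being used (HDFS, S3, a mounted volume), the only change in the job itself is the path scheme; a tiny sketch with a made-up namenode address:

```java
Dataset<Row> flightDF = spark.read()
    .option("header", "true")
    .csv("hdfs://namenode-host:8020/data/sampleflightdata");   // hypothetical HDFS URI
flightDF.printSchema();
```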
Re: graphframes on cluster
Sorry for posting without complete information. I am connecting to the Spark cluster with the driver program as the backend of a web application; this is intended to listen to job progress and do some other work. Below is how I am connecting to the cluster:

sparkConf = new SparkConf().setAppName("isolated test")
    .setMaster("spark://master:7077")
    .set("spark.executor.memory","6g")
    .set("spark.driver.memory","6g")
    .set("spark.driver.maxResultSize","2g")
    .set("spark.executor.extrajavaoptions","-Xmx8g")
    .set("spark.jars.packages","graphframes:graphframes:0.5.0-spark2.1-s_2.11")
    .set("spark.jars","/home/usr/jobs.jar"); // this is a shared location on the Linux machines and has the required java classes

The crash occurs at

gFrame.connectedComponents().setBroadcastThreshold(2).run();

with exception

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 5, 10.112.29.80): java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
 at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133)
 at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305)
 at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2024)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
 at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
 at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
 at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
 at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
 at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:71)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
 at org.apache.spark.scheduler.Task.run(Task.scala:86)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)

After googling around, this appears to be related to dependencies, but I don't have many dependencies apart from a few POJOs, which have been included through the context.

regards, Imran

On Wed, Sep 20, 2017 at 9:00 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:
> Could you include the code where it fails?
> Generally the best way to use gf is to use the --packages options with spark-submit command
Currently the > job is crashing when action function is invoked on graphframe classes. > > regards, > Imran > > -- > I.R > -- I.R
graphframes on cluster
Trying to run graph frames on a spark cluster. Do I need to include the package in spark context settings? or the only the driver program is suppose to have the graphframe libraries in its class path? Currently the job is crashing when action function is invoked on graphframe classes. regards, Imran -- I.R
Re: graphframe out of memory
No, I did not; I thought Spark would take care of that itself since I have put in the arguments.

On Thu, Sep 7, 2017 at 9:26 PM, Lukas Bradley <lukasbrad...@gmail.com> wrote:
> Did you also increase the size of the heap of the Java app that is starting Spark?
>
> https://alvinalexander.com/blog/post/java/java-xmx-xms-memory-heap-size-control

-- I.R
graphframe out of memory
I am getting an Out of Memory error while running a connectedComponents job on a graph with around 12000 vertices and 134600 edges. I am running Spark in embedded mode in a standalone Java application and have tried to increase the memory, but it seems that it's not taking any effect.

sparkConf = new SparkConf().setAppName("SOME APP NAME").setMaster("local[10]")
    .set("spark.executor.memory","5g")
    .set("spark.driver.memory","8g")
    .set("spark.driver.maxResultSize","1g")
    .set("spark.sql.warehouse.dir", "file:///d:/spark/tmp")
    .set("hadoop.home.dir", "file:///D:/spark-2.1.0-bin-hadoop2.7/bin");

spark = SparkSession.builder().config(sparkConf).getOrCreate();
spark.sparkContext().setLogLevel("ERROR");
spark.sparkContext().setCheckpointDir("D:/spark/tmp");

the stack trace

java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:3332)
 at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
 at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
 at java.lang.StringBuilder.append(StringBuilder.java:136)
 at scala.StringContext.standardInterpolator(StringContext.scala:126)
 at scala.StringContext.s(StringContext.scala:95)
 at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:230)
 at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:54)
 at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2788)
 at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2385)
 at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2390)
 at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2390)
 at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2801)
 at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2390)
 at org.apache.spark.sql.Dataset.collect(Dataset.scala:2366)
 at org.graphframes.lib.ConnectedComponents$.skewedJoin(ConnectedComponents.scala:239)
 at org.graphframes.lib.ConnectedComponents$.org$graphframes$lib$ConnectedComponents$$run(ConnectedComponents.scala:308)
 at org.graphframes.lib.ConnectedComponents.run(ConnectedComponents.scala:139)

GraphFrame version is 0.5.0 and Spark version is 2.1.1.

regards, Imran -- I.R
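One thing to double-check (an assumption about the cause): with setMaster("local[10]") the driver and executors run inside the JVM that is already started, so spark.driver.memory / spark.executor.memory set in code cannot enlarge the heap; only the JVM's own -Xmx (on the java command line or in the IDE run configuration) can. A quick way to see what the embedded Spark actually has:

```java
// Prints the real heap ceiling of the JVM hosting the embedded driver/executors.
long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
System.out.println("JVM max heap: " + maxHeapMb + " MB");
```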
unable to import graphframes
Dear list, I am following the documentation of graphframe and have started the scala shell using following command D:\spark-2.1.0-bin-hadoop2.7\bin>spark-shell --master local[2] --packages graphframes:graphframes:0.5.0-spark2.1-s_2.10 Ivy Default Cache set to: C:\Users\user\.ivy2\cache The jars for the packages stored in: C:\Users\user\.ivy2\jars :: loading settings :: url = jar:file:/D:/spark-2.1.0-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml graphframes#graphframes added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 confs: [default] found graphframes#graphframes;0.5.0-spark2.1-s_2.10 in spark-packages found com.typesafe.scala-logging#scala-logging-api_2.10;2.1.2 in central found com.typesafe.scala-logging#scala-logging-slf4j_2.10;2.1.2 in central found org.scala-lang#scala-reflect;2.10.4 in central found org.slf4j#slf4j-api;1.7.7 in local-m2-cache :: resolution report :: resolve 288ms :: artifacts dl 7ms :: modules in use: com.typesafe.scala-logging#scala-logging-api_2.10;2.1.2 from central in [default] com.typesafe.scala-logging#scala-logging-slf4j_2.10;2.1.2 from central in [default] graphframes#graphframes;0.5.0-spark2.1-s_2.10 from spark-packages in [default] org.scala-lang#scala-reflect;2.10.4 from central in [default] org.slf4j#slf4j-api;1.7.7 from local-m2-cache in [default] - | |modules|| artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| - | default | 5 | 0 | 0 | 0 || 5 | 0 | - :: retrieving :: org.apache.spark#spark-submit-parent confs: [default] 0 artifacts copied, 5 already retrieved (0kB/7ms) Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 2017-08-29 12:10:23,089 [main] WARN NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2017-08-29 12:10:25,128 [main] WARN General - Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/D:/spark-2.1.0-bin-hadoop2.7/jars/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/D:/spark-2.1.0-bin-hadoop2.7/bin/../jars/datanucleus-rdbms-3.2.9.jar." 2017-08-29 12:10:25,137 [main] WARN General - Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/D:/spark-2.1.0-bin-hadoop2.7/jars/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/D:/spark-2.1.0-bin-hadoop2.7/bin/../jars/datanucleus-core-3.2.10.jar." 2017-08-29 12:10:25,141 [main] WARN General - Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/D:/spark-2.1.0-bin-hadoop2.7/bin/../jars/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/D:/spark-2.1.0-bin-hadoop2.7/jars/datanucleus-api-jdo-3.2.6.jar." 2017-08-29 12:10:27,744 [main] WARN ObjectStore - Failed to get database global_temp, returning NoSuchObjectException Spark context Web UI available at http://192.168.10.60:4040 Spark context available as 'sc' (master = local[2], app id = local-1503990623864). Spark session available as 'spark'. 
Welcome to Spark version 2.1.0
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.graphframes._
:23: error: object graphframes is not a member of package org
       import org.graphframes._

Is there something missing? regards, Imran -- I.R
Re: Oracle Table not resolved [Spark 2.1.1]
The JDBC URL is invalid, but strangely it should have thrown an ORA- exception.

On Mon, Aug 28, 2017 at 4:55 PM, Naga G <gudurun...@gmail.com> wrote:
> Not able to find the database name.
> ora is the database in the below url ?
>
> Sent from Naga iPad
>
> > On Aug 28, 2017, at 4:06 AM, Imran Rajjad <raj...@gmail.com> wrote:
> > Hello,
> > I am trying to retrieve an oracle table into a Dataset using the following code
> > String url = "jdbc:oracle@localhost:1521:ora";

-- I.R
Re: Spark SQL vs HiveQL
If reading directly from a file, then Spark SQL should be your choice.

On Mon, Aug 28, 2017 at 10:25 PM Michael Artz wrote:
> Just to be clear, I'm referring to having Spark reading from a file, not from a Hive table. And it will have the Tungsten engine's off-heap serialization after 2.1, so if it was a test with something like 1.6.3 it won't be as helpful.
>
> On Mon, Aug 28, 2017 at 10:50 AM, Michael Artz wrote:
>> Hi,
>> There isn't any good source to answer the question of whether Hive, as an SQL-on-Hadoop engine, is just as fast as Spark SQL now. I just want to know if there has been a comparison done lately for HiveQL vs Spark SQL on Spark versions 2.1 or later. I have a large ETL process, with many table joins and some string manipulation. I don't think anyone has done this kind of testing in a while. With Hive LLAP being so performant, I am trying to make the case for using Spark, and some of the architects are light on experience so they are scared of Scala.
>>
>> Thanks

-- Sent from Gmail Mobile
Oracle Table not resolved [Spark 2.1.1]
Hello, I am trying to retrieve an oracle table into Dataset using following code String url = "jdbc:oracle@localhost:1521:ora"; Dataset jdbcDF = spark.read() .format("jdbc") .option("driver", "oracle.jdbc.driver.OracleDriver") .option("url", url) .option("dbtable", "INCIDENTS") .option("user", "user1") .option("password", "pass1") .load(); System.out.println(jdbcDF.count()); below is the stack trace java.lang.NullPointerException at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:72) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:113) at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:45) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125) at com.elogic.hazelcast_test.JDBCTest.test(JDBCTest.java:56) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner.run(ParentRunner.java:363) at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86) at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:678) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192) Apparently the connection is made but Table is not being detected. Any ideas whats wrong with the code? regards, Imran -- I.R
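Following up on the reply above: the NullPointerException aside, the URL is not in the form the Oracle thin driver expects; a hedged sketch of the usual formats (host, port, and SID are placeholders):

```java
// Either jdbc:oracle:thin:@host:port:SID or jdbc:oracle:thin:@//host:port/service_name
String url = "jdbc:oracle:thin:@localhost:1521:ora";

Dataset<Row> jdbcDF = spark.read()
    .format("jdbc")
    .option("driver", "oracle.jdbc.driver.OracleDriver")
    .option("url", url)
    .option("dbtable", "INCIDENTS")
    .option("user", "user1")
    .option("password", "pass1")
    .load();
```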
Thrift-Server JDBC ResultSet Cursor Reset or Previous
Dear List, Are there any future plans to implement cursor reset or previous-record functionality in the Thrift Server's JDBC driver? Are there any other alternatives?

java.sql.SQLException: Method not supported
 at org.apache.hive.jdbc.HiveBaseResultSet.previous(HiveBaseResultSet.java:643)

regards Imran -- I.R
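I'm not aware of plans for a scrollable ResultSet in the Hive/Thrift JDBC driver; a client-side workaround (with the obvious memory cost) is to copy the forward-only result into a CachedRowSet, which does support previous() and beforeFirst(). A sketch, assuming resultSet came from the Hive JDBC statement:

```java
import javax.sql.rowset.CachedRowSet;
import javax.sql.rowset.RowSetProvider;

CachedRowSet cached = RowSetProvider.newFactory().createCachedRowSet();
cached.populate(resultSet);            // pulls all rows to the client once

while (cached.next()) { /* forward pass */ }
cached.previous();                     // step back to the last row - now supported
cached.beforeFirst();                  // or reset the cursor entirely and iterate again
```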
solr data source not working
I am unable to register the Solr Cloud as data source in Spark 2.1.0. Following the documentation at https://github.com/lucidworks/spark-solr#import-jar-file-via-spark-shell, I have used the 3.0.0.beta3 version. The system path is displaying the added jar as spark://172.31.208.1:55730/jars/spark-solr-3.0.0-beta3-shaded.jar Added By User OS:Win10 Hadoop : 2.7 (x64 winutils) Spark:2.1.0 Solr-Spark:3.0.0-beta3 same was tried with spark 2.2.0 with solr-spark:3.1.0 ERROR scala> val df = spark.read.format("solr").options(Map("collection" -> "cdr1","zkhost" -> "localhost:9983")).load java.lang.ClassNotFoundException: Failed to find data source: solr. Please find packages at http://spark.apache.org/third-party-projects.html at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:569) at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86) at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:325) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125) ... 48 elided Caused by: java.lang.ClassNotFoundException: solr.DefaultSource at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$25$$anonfun$apply$13.apply(DataSource.scala:554) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$25$$anonfun$apply$13.apply(DataSource.scala:554) at scala.util.Try$.apply(Try.scala:192) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$25.apply(DataSource.scala:554) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$25.apply(DataSource.scala:554) at scala.util.Try.orElse(Try.scala:84) at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:554) -- I.R
Slow response on Solr Cloud with Spark
Greetings, We are trying out Spark 2 + Thrift Server to join multiple collections from a Solr Cloud (6.4.x). I have followed this blog: https://lucidworks.com/2015/08/20/solr-spark-sql-datasource/ I understand that Spark initially populates the temporary table with 18633014 records and takes its due time; however, any following SQL on the temporary table takes the same amount of time. It seems the temporary table is not being re-used or cached. The fields in the Solr collection do not have docValues enabled - could that be the reason? Apparently I have missed a trick. regards, Imran -- I.R
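If the intent is to pay the Solr scan once and run several SQLs against it, the registered view has to be cached explicitly; a sketch (the ZooKeeper address and names are placeholders, following the blog's data source):

```java
Dataset<Row> solrDF = spark.read()
    .format("solr")
    .option("zkhost", "localhost:9983")
    .option("collection", "my_collection")
    .load();

solrDF.createOrReplaceTempView("solr_view");
spark.sql("CACHE TABLE solr_view");                    // materializes the view in executor memory
spark.sql("SELECT COUNT(*) FROM solr_view").show();    // later queries hit the cache, not Solr
```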