Re: Spark jdbc postgres numeric array
Hi, I also filed a JIRA yesterday: https://issues.apache.org/jira/browse/SPARK-26538 Looks like one of the two needs to be closed as a duplicate. Sorry for the late update. Best regards
Spark jdbc postgres numeric array
Hi, I came across strange behavior when dealing with Postgres columns of type numeric[] using Spark 2.3.2 and PostgreSQL 10.4 / 9.6.9. Consider the following table definition:

create table test1 (
  v numeric[],
  d numeric
);
insert into test1 values('{.222,.332}', 222.4555);

When reading the table into a Dataframe, I get the following schema:

root
 |-- v: array (nullable = true)
 |    |-- element: decimal(0,0) (containsNull = true)
 |-- d: decimal(38,18) (nullable = true)

Notice that precision and scale were not specified for either column, but in the case of the array element both were set to 0, while in the other case the defaults were applied. Later, when I try to read the Dataframe, I get the following error:

java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 exceeds max precision 0
 at scala.Predef$.require(Predef.scala:224)
 at org.apache.spark.sql.types.Decimal.set(Decimal.scala:114)
 at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:453)
 at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$16$$anonfun$apply$6$$anonfun$apply$7.apply(JdbcUtils.scala:474)
 ...

I would expect to get array elements of type decimal(38,18) and no error when reading in this case. Should this be considered a bug? Is there a workaround other than changing the column's array type definition to include explicit precision and scale? Best regards, Alexey
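A possible workaround, sketched below and not verified against this exact Spark/Postgres combination, is to push an explicit cast down to Postgres by reading a subquery instead of the raw table, so the JDBC driver reports a concrete precision and scale for the array elements. The connection URL and credentials are illustrative; the table and column names match the example above.

// Assumes a spark-shell / SparkSession named `spark` (Spark 2.x).
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/mydb")   // assumed connection URL
  .option("user", "user")                                   // assumed credentials
  .option("password", "password")
  // Cast numeric[] to numeric(38,18)[] on the Postgres side so Spark sees a usable decimal type.
  .option("dbtable", "(select v::numeric(38,18)[] as v, d from test1) as t")
  .load()

df.printSchema()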
mapWithState() without data checkpointing
Hello! I would like to avoid data checkpointing when processing a DStream. Basically, we do not care if the intermediate data are lost. Is there a way to achieve that? Is there an extension point or a class encapsulating all the associated activities? Thanks! Sincerely yours, — Alexey Kharlamov
Re: VectorUDT with spark.ml.linalg.Vector
Hi Yanbo, Thanks for your reply. I will keep an eye on that pull request. For now, I decided to just put my code inside org.apache.spark.ml to be able to access the private classes. Thanks, Alexey

On Tue, Aug 16, 2016 at 11:13 PM, Yanbo Liang <yblia...@gmail.com> wrote:
> It seems that VectorUDT is private and cannot be accessed outside of Spark
> currently. It should be public, but we need to do some refactoring before making
> it public. You can refer to the discussion at https://github.com/apache/spark/pull/12259 .
>
> Thanks
> Yanbo
>
> 2016-08-16 9:48 GMT-07:00 alexeys <alex...@princeton.edu>:
>
>> I am writing a UDAF to be applied to a data frame column of type Vector
>> (spark.ml.linalg.Vector). I rely on spark/ml/linalg so that I do not have to
>> go back and forth between dataframe and RDD.
>>
>> Inside the UDAF, I have to specify a data type for the input, buffer, and
>> output (as usual). VectorUDT is what I would use with
>> spark.mllib.linalg.Vector:
>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala
>>
>> However, when I try to import it from spark.ml instead: import
>> org.apache.spark.ml.linalg.VectorUDT
>> I get a runtime error (no errors during the build):
>>
>> class VectorUDT in package linalg cannot be accessed in package
>> org.apache.spark.ml.linalg
>>
>> Is it expected / can you suggest a workaround?
>>
>> I am using Spark 2.0.0
>>
>> Thanks,
>> Alexey
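A minimal sketch of the package-placement workaround mentioned above, assuming Spark 2.0 and the UserDefinedAggregateFunction API; the class name and aggregation logic (a per-dimension vector sum) are illustrative, and the only point being demonstrated is that VectorUDT becomes visible once the source file lives under package org.apache.spark.ml:

// Placed in a source file declared in package org.apache.spark.ml so that the
// package-private VectorUDT from spark.ml.linalg is visible.
package org.apache.spark.ml

import org.apache.spark.ml.linalg.{Vector, VectorUDT}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class VectorSumUDAF(size: Int) extends UserDefinedAggregateFunction {
  // VectorUDT can be instantiated here only because we are inside org.apache.spark.ml.
  override def inputSchema: StructType = new StructType().add("features", new VectorUDT)
  override def bufferSchema: StructType = new StructType().add("sums", ArrayType(DoubleType))
  override def dataType: DataType = ArrayType(DoubleType)
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Array.fill(size)(0.0)

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val sums = buffer.getSeq[Double](0).toArray
    input.getAs[Vector](0).foreachActive { (i, v) => sums(i) += v }
    buffer(0) = sums
  }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val merged = buffer1.getSeq[Double](0).zip(buffer2.getSeq[Double](0)).map { case (a, b) => a + b }
    buffer1(0) = merged.toArray
  }

  override def evaluate(buffer: Row): Any = buffer.getSeq[Double](0)
}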
Re: Ideas to put a Spark ML model in production
From my personal experience - we're reading the metadata of the features column in the dataframe to extract the mapping of feature indices to the original feature names, and use this mapping to translate the model coefficients into a JSON string that maps the original feature names to their weights. The production environment has a simple piece of code that evaluates a logistic model based on this JSON string and the real inputs. I would be very interested to find a more straightforward approach to export the model into a format that's readable by systems without Spark installed on them.

On Sat, Jul 2, 2016 at 10:45 AM, Yanbo Liang wrote:
> Let's suppose you have trained a LogisticRegressionModel and saved it at
> "/tmp/lr-model". You can copy the directory to the production environment and
> use it to make predictions on users' new data. You can refer to the following
> code snippets:
>
> val model = LogisticRegressionModel.load("/tmp/lr-model")
> val data = newDataset
> val prediction = model.transform(data)
>
> However, usually we save/load a PipelineModel which includes the necessary
> feature transformers and the model training process rather than the single
> model, but they are similar operations.
>
> Thanks
> Yanbo
>
> 2016-06-23 10:54 GMT-07:00 Saurabh Sardeshpande :
>
>> Hi all,
>>
>> How do you reliably deploy a spark model in production? Let's say I've
>> done a lot of analysis and come up with a model that performs great. I have
>> this "model file" and I'm not sure what to do with it. I want to build some
>> kind of service around it that takes some inputs, converts them into a
>> feature, runs the equivalent of 'transform', i.e. predicts the output, and
>> returns the output.
>>
>> At the Spark Summit I heard a lot of talk about how this will be easy to
>> do in Spark 2.0, but I'm looking for a solution sooner. PMML support is
>> limited and the model I have can't be exported in that format.
>>
>> I would appreciate any ideas around this, especially from personal
>> experiences.
>>
>> Regards,
>> Saurabh
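A rough sketch of the metadata-based export described in the reply above, assuming an ml LogisticRegressionModel and a training DataFrame whose "features" column was produced by a VectorAssembler; the function name and the "__intercept__" key are illustrative, and json4s is used only because it ships with Spark:

import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.sql.DataFrame
import org.json4s.DefaultFormats
import org.json4s.jackson.Serialization

// `train` is the assembled training DataFrame, `model` the fitted model (both assumed to exist).
def exportModelAsJson(train: DataFrame, model: LogisticRegressionModel): String = {
  // Recover original feature names from the metadata of the "features" column.
  val attrGroup = AttributeGroup.fromStructField(train.schema("features"))
  val featureNames: Array[String] = attrGroup.attributes
    .map(_.map(a => a.name.getOrElse(a.index.getOrElse(-1).toString)))
    .getOrElse(Array.tabulate(model.coefficients.size)(_.toString))

  // Map each original feature name to its learned weight, plus the intercept.
  val weights = featureNames.zip(model.coefficients.toArray).toMap +
    ("__intercept__" -> model.intercept)

  implicit val formats = DefaultFormats
  Serialization.write(weights)   // JSON string to ship to the Spark-less serving side
}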
Re: cache datframe
What's the reason for your first cache call? It looks like you use the data only once to transform it, without reusing it, so there's no need for the first cache call; you need only the second one (and even that depends on the rest of your code).

On Thu, Jun 16, 2016 at 3:17 PM, pseudo oduesp wrote:
> hi,
> if I cache the same data frame, then transform it and add columns, should I cache
> it a second time?
>
> df.cache()
>
> transformation
> add new columns
>
> df.cache()
> ?
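A minimal sketch of the pattern being discussed, with illustrative column names: cache once, at the point where the resulting DataFrame is actually reused, and unpersist when done.

import org.apache.spark.sql.functions.col

val enriched = df
  .withColumn("total", col("price") * col("quantity"))  // illustrative transformation
  .cache()                                              // cache only the DataFrame that is reused

val bigOrders   = enriched.filter(col("total") > 100).count()   // first reuse
val smallOrders = enriched.filter(col("total") <= 100).count()  // second reuse

enriched.unpersist()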
sliding Top N window
Good day, I have the following task: a stream of "page views" coming into a Kafka topic. Each view contains a list of product IDs from the visited page. The task: to have the top N products in "real time". I am interested in a solution that requires a minimum of intermediate writes... So I need to build a sliding window for the top N products, where the product counters change dynamically and the window presents the top products for the specified period of time. I believe there is no way to avoid maintaining all product counters in memory/storage. But at least I would like to do all the logic and all the calculation on the fly, in memory, without spilling multiple RDDs from memory to disk. So I see one way of doing it: take a msg from Kafka and line up all the elementary actions (increase the counter for product PID by 1). Each action will be implemented as a call to HTable.increment() // or, easier, with incrementColumnValue()... After each increment I can apply my own "offer" operation, which ensures that only the top N products with their counters are kept in another HBase table (also with atomic operations). But there is another stream of events: decreasing product counters when a view expires past the length of the sliding window... So my question: does anybody know/have, and can share, a piece of code / know-how on how to implement a "sliding top N window" better? If nothing is offered, I will share what I end up doing myself. Thank you, Alexey
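One possible in-memory alternative to the HBase-counter design described above is Spark Streaming's reduceByKeyAndWindow with an inverse reduce function, which handles the "view expires out of the window" case automatically. This is a sketch, not the poster's solution: the DStream of parsed views, the window sizes, and top-N size are illustrative, and the inverse-function variant requires checkpointing to be enabled on the StreamingContext.

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// views: DStream[Seq[String]] — product IDs per page view, parsed from the Kafka messages upstream (assumed).
def printTopProducts(views: DStream[Seq[String]], topN: Int = 10): Unit = {
  val productCounts = views
    .flatMap(identity)
    .map(pid => (pid, 1L))
    .reduceByKeyAndWindow(
      _ + _,          // add counts entering the window
      _ - _,          // subtract counts leaving the window (needs checkpointing)
      Seconds(600),   // window length
      Seconds(10))    // slide interval

  productCounts.foreachRDD { rdd =>
    // Top N products for the current window, computed per batch.
    val top = rdd.top(topN)(Ordering.by[(String, Long), Long](_._2))
    println(top.mkString(", "))
  }
}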
[streaming] reading Kafka direct stream throws kafka.common.OffsetOutOfRangeException
Hi, I have a simple spark-streaming job (8 executors with 1 core each, on an 8-node cluster) that reads from a Kafka topic (3 brokers, 8 partitions) and saves to Cassandra. The problem is that when I increase the number of incoming messages in the topic, the job starts to fail with kafka.common.OffsetOutOfRangeException. The job fails starting from 100 events per second. Thanks in advance
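OffsetOutOfRangeException from the direct stream usually means the requested offsets were already deleted by Kafka's retention before the job read them. A common mitigation, sketched below and not a confirmed fix for this particular job, is to cap how much each batch reads and let backpressure keep the job from falling behind (increasing the topic's retention is the other half of the story); the application name and rate value are illustrative:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kafka-to-cassandra")                          // illustrative app name
  .set("spark.streaming.backpressure.enabled", "true")       // adapt ingestion rate to processing speed
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")  // hard cap: messages per partition per second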
Re: [streaming] DStream with window performance issue
Ok. Spark 1.4.1 on yarn. Here is my application. I have 4 different Kafka topics (different object streams):

type Edge = (String,String)

val a = KafkaUtils.createDirectStream[...](sc,"A",params).filter( nonEmpty ).map( toEdge )
val b = KafkaUtils.createDirectStream[...](sc,"B",params).filter( nonEmpty ).map( toEdge )
val c = KafkaUtils.createDirectStream[...](sc,"C",params).filter( nonEmpty ).map( toEdge )

val u = a union b union c
val source = u.window(Seconds(600), Seconds(10))

val z = KafkaUtils.createDirectStream[...](sc,"Z",params).filter( nonEmpty ).map( toEdge )

val joinResult = source.rightOuterJoin( z )
joinResult.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // save to result topic in kafka
  }
}

The 'window' function in the code above is constantly growing, no matter how many events appeared in the corresponding kafka topics, but if I change one line from

val source = u.window(Seconds(600), Seconds(10))

to

val partitioner = ssc.sparkContext.broadcast(new HashPartitioner(8))
val source = u.transform(_.partitionBy(partitioner.value) ).window(Seconds(600), Seconds(10))

everything works perfectly. Perhaps the problem was in WindowedDStream: I was forced to use PartitionerAwareUnionRDD (partitionBy with the same partitioner) instead of UnionRDD. Nonetheless I did not see any hints about such behaviour in the docs. Is it a bug or absolutely normal behaviour?

08.09.2015, 17:03, "Cody Koeninger" <c...@koeninger.org>:
> Can you provide more info (what version of spark, code example)?
>
> On Tue, Sep 8, 2015 at 8:18 AM, Alexey Ponkin <alexey.pon...@ya.ru> wrote:
>> Hi,
>>
>> I have an application with 2 streams, which are joined together.
>> Stream1 - is a simple DStream (relatively small batch chunks)
>> Stream2 - is a windowed DStream (with a duration of, for example, 60 seconds)
>>
>> Stream1 and Stream2 are Kafka direct streams.
>> The problem is that according to the logs the window operation is constantly
>> increasing (screenshot: http://en.zimagez.com/zimage/screenshot2015-09-0816-06-55.php).
>> And I also see a gap in the processing window (screenshot: http://en.zimagez.com/zimage/screenshot2015-09-0816-10-22.php);
>> in the logs there are no events in that period.
>> So what happens in that gap and why is the window constantly increasing?
>>
>> Thank you in advance
[streaming] DStream with window performance issue
Hi, I have an application with 2 streams, which are joined together. Stream1 is a simple DStream (relatively small batch chunks). Stream2 is a windowed DStream (with a duration of, for example, 60 seconds). Stream1 and Stream2 are Kafka direct streams. The problem is that according to the logs the window operation is constantly increasing (screenshot: http://en.zimagez.com/zimage/screenshot2015-09-0816-06-55.php). And I also see a gap in the processing window (screenshot: http://en.zimagez.com/zimage/screenshot2015-09-0816-10-22.php); in the logs there are no events in that period. So what happens in that gap and why is the window constantly increasing? Thank you in advance
[streaming] Using org.apache.spark.Logging will silently break task execution
Hi, I have the following code:

object MyJob extends org.apache.spark.Logging {
  ...
  val source: DStream[SomeType] = ...

  source.foreachRDD { rdd =>
    logInfo(s"""+++ForEachRDD+++""")
    rdd.foreachPartition { partitionOfRecords =>
      logInfo(s"""+++ForEachPartition+++""")
    }
  }
}

I was expecting to see both log messages in the job log. But unfortunately you will never see the string '+++ForEachPartition+++' in the logs, because the foreachPartition block never executes. And there is also no error message or anything similar in the logs. I wonder whether this is a bug or known behavior? I know that org.apache.spark.Logging is a DeveloperApi, but why does it fail silently with no messages? What should be used instead of org.apache.spark.Logging in spark-streaming jobs? P.S. running spark 1.4.1 (on yarn). Thanks in advance
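A sketch of the usual workaround, assuming slf4j is on the classpath (it is in a standard Spark distribution): obtain a logger inside the partition closure instead of mixing in the developer-only org.apache.spark.Logging trait, so nothing non-serializable is captured from the driver. The logger name is illustrative and `source` refers to the DStream from the code above.

import org.slf4j.LoggerFactory

source.foreachRDD { rdd =>
  LoggerFactory.getLogger("MyJob").info("+++ForEachRDD+++")   // runs on the driver
  rdd.foreachPartition { partitionOfRecords =>
    // Created on the executor, so no logger instance is serialized with the closure.
    val log = LoggerFactory.getLogger("MyJob")
    log.info("+++ForEachPartition+++")
  }
}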
Re: Getting number of physical machines in Spark
There's no canonical way to do this as I understand it. For instance, when running under YARN, you have completely no idea where your containers will be started. Moreover, if one of the containers fails, it might be restarted on another machine, so the machine count might change at runtime. To check the current number of machines you can do something like this (python):

import socket
machines = sc.parallelize(xrange(1000)).mapPartitions(lambda x: [socket.gethostname()]).distinct().collect()

On Fri, Aug 28, 2015 at 9:09 PM, Jason ja...@jasonknight.us wrote:

I've wanted similar functionality too: when network IO bound (for me I was trying to pull things from S3 to HDFS) I wish there was a `.mapMachines` api where I wouldn't have to try to guess at the proper partitioning of a 'driver' RDD for `sc.parallelize(1 to N, N).map( i => pull the i'th chunk from S3 )`.

On Thu, Aug 27, 2015 at 10:01 AM Young, Matthew T matthew.t.yo...@intel.com wrote:

What's the canonical way to find out the number of physical machines in a cluster at runtime in Spark? I believe SparkContext.defaultParallelism will give me the number of cores, but I'm interested in the number of NICs. I'm writing a Spark streaming application to ingest from Kafka with the Receiver API and want to create one DStream per physical machine for read parallelism's sake. How can I figure out at run time how many machines there are so I know how many DStreams to create?

-- Best regards, Alexey Grishchenko phone: +353 (87) 262-2154 email: programme...@gmail.com web: http://0x0fff.com
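For completeness, the same hostname-based trick sketched in Scala, with the same caveats noted above: it is an approximation, and the set of hosts can change between stages.

import java.net.InetAddress

// One task per partition reports the hostname it ran on; distinct gives the set of machines observed.
val hosts = sc.parallelize(1 to 1000, 1000)
  .mapPartitions(_ => Iterator(InetAddress.getLocalHost.getHostName))
  .distinct()
  .collect()

println(s"Machines observed: ${hosts.length}")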
Re: Help Explain Tasks in WebUI:4040
It really depends on the code. I would say that the easiest way is to restart the problematic action, find the straggler task and analyze what's happening with it with jstack / make a heap dump and analyze it locally. For example, it might be the case that your tasks are connecting to some external resource and this resource is timing out under the pressure. Also call toDebugString on the problematic RDD before calling an action that triggers the calculations; this would give you an understanding of what your execution tasks are really doing.

On Fri, Aug 28, 2015 at 7:47 PM, Muler mulugeta.abe...@gmail.com wrote:

I have a 7 node cluster running in standalone mode (1 executor per node, 100g/executor, 18 cores/executor). Attached is the task status for two of my nodes. I'm not clear why some of my tasks are taking too long:

1. [node sk5, green] task 197 took 35 mins while task 218 took less than 2 mins. But if you look at the output size/records they are almost the same. Even more strange, the size of shuffle spill for memory and disk is 0 for task 197 and yet it is taking a long time.

Same issue for my other node (sk3, red). Can you please explain what is going on? Thanks,

-- Alexey Grishchenko, http://0x0fff.com
Re: Any quick method to sample rdd based on one filed?
In my opinion aggregate+flatMap would work faster as it would make fewer passes through the data. It would work like this:

import random

def agg(x,y):
    x[0] += 1 if not y[1] else 0
    x[1] += 1 if y[1] else 0
    return x

# Source data
rdd = sc.parallelize(xrange(10), 5)
rdd2 = rdd.map(lambda x: (x, random.choice([True, False]))).cache()

# Calculate counts for True and False
counts = rdd2.aggregate([0,0], agg, lambda x,y: (x[0]+y[0],x[1]+y[1]))

# If filtering is needed
if counts[0]*10 > counts[1]:
    # Calculate sampling ratio
    prob0 = float(counts[1])/10.0 / float(counts[0])
    # Filter falses
    rdd2 = rdd2.flatMap(lambda x: [x] if (x[1] or random.random() < prob0) else [])

# Count True and False again for validation - falses should be 10% of trues
rdd2.aggregate([0,0], agg, lambda x,y: (x[0]+y[0],x[1]+y[1]))

On Fri, Aug 28, 2015 at 6:39 PM, Sonal Goyal sonalgoy...@gmail.com wrote:

Filter into true rdd and false rdd. Union true rdd and sample of false rdd.

On Aug 28, 2015 2:57 AM, Gavin Yue yue.yuany...@gmail.com wrote:

Hey, I have a RDD[(String,Boolean)]. I want to keep all Boolean: True rows and randomly keep some Boolean: False rows. And I hope that in the final result, the negative ones could be 10 times more than the positive ones. What would be the most efficient way to do this? Thanks,

-- Best regards, Alexey Grishchenko phone: +353 (87) 262-2154 email: programme...@gmail.com web: http://0x0fff.com
Re: Calculating Min and Max Values using Spark Transformations?
If the data is already in an RDD, the easiest way to calculate min/max for each column would be the aggregate() function. It takes 2 functions as arguments - the first is used to aggregate RDD values into your accumulator, the second is used to merge two accumulators. This way both min and max for all the columns in your RDD are calculated in a single pass over it. Here's an example in Python:

def agg1(x,y):
    if len(x) == 0:
        x = [y,y]
    return [map(min,zip(x[0],y)),map(max,zip(x[1],y))]

def agg2(x,y):
    if len(x) == 0:
        x = y
    return [map(min,zip(x[0],y[0])),map(max,zip(x[1],y[1]))]

rdd = sc.parallelize(xrange(10), 5)
rdd2 = rdd.map(lambda x: ([random.randint(1,100) for _ in xrange(15)]))
rdd2.aggregate([], agg1, agg2)

What personally I would do in your case depends on what else you want to do with the data. If you plan to run some more business logic on top of it and you're more comfortable with SQL, it might be worth registering this DataFrame as a table and generating a SQL query against it (generate a string with a series of min-max calls). But to solve your specific problem I'd load your file with textFile(), use a map() transformation to split each string by comma and convert it into an array of doubles, and call aggregate() on top of it just like I've shown in the example above.

On Fri, Aug 28, 2015 at 6:15 PM, Burak Yavuz brk...@gmail.com wrote:

Or you can just call describe() on the dataframe? In addition to min-max, you'll also get the mean, and count of non-null and non-NA elements as well. Burak

On Fri, Aug 28, 2015 at 10:09 AM, java8964 java8...@hotmail.com wrote:

Or RDD.max() and RDD.min() won't work for you? Yong

Subject: Re: Calculating Min and Max Values using Spark Transformations? To: as...@wso2.com CC: user@spark.apache.org From: jfc...@us.ibm.com Date: Fri, 28 Aug 2015 09:28:43 -0700

If you already loaded csv data into a dataframe, why not register it as a table, and use Spark SQL to find max/min or any other aggregates? SELECT MAX(column_name) FROM dftable_name ... seems natural. JESSE CHEN Big Data Performance | IBM Analytics Office: 408 463 2296 Mobile: 408 828 9068 Email: jfc...@us.ibm.com

From: ashensw as...@wso2.com To: user@spark.apache.org Date: 08/28/2015 05:40 AM Subject: Calculating Min and Max Values using Spark Transformations?

Hi all, I have a dataset which consists of a large number of features (columns). It is in csv format, so I loaded it into a spark dataframe. Then I converted it into a JavaRDD<Row>. Then using a spark transformation I converted that into JavaRDD<String[]>, and then again into a JavaRDD<double[]>. So now I have a JavaRDD<double[]>. Is there any method to calculate the max and min values of each column in this JavaRDD<double[]>? Or is there any way to access the array if I store the max and min values in an array inside the spark transformation class? Thanks.
-- Best regards, Alexey Grishchenko phone: +353 (87) 262-2154 email: programme...@gmail.com web: http://0x0fff.com
Unsupported major.minor version 51.0
I found some discussions online, but they all come down to the advice to use JDK 1.7 (or 1.8). Well, I use JDK 1.7 on OS X Yosemite. Both

java -version
// java version "1.7.0_80" Java(TM) SE Runtime Environment (build 1.7.0_80-b15) Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)

and

echo $JAVA_HOME
// /Library/Java/JavaVirtualMachines/jdk1.7.0_80.jdk/Contents/Home

show JDK 1.7. But for Spark 1.4.1 (and for Spark 1.2.2, downloaded 07/10/2015) I get the same error when building with maven (as sudo mvn -DskipTests -X clean package > abra.txt):

Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/maven/cli/MavenCli : Unsupported major.minor version 51.0

Please help me figure out how to build the thing. Thanks, Alexey
Can't build Spark 1.3
I downloaded the latest Spark (1.3) from github. Then I tried to build it. First for scala 2.10 (and hadoop 2.4):

build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package

That resulted in a hangup after printing a bunch of lines like:

[INFO] Dependency-reduced POM written at ...

Then I tried for scala 2.11:

mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package

That resulted in multiple compilation errors. What I actually want is:

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-0.12.0 -Phive-thriftserver -DskipTests clean package

Is it only me who can't build Spark 1.3? And is there any site to download Spark prebuilt for Hadoop 2.5 and Hive? Thank you for any help. Alexey
Re: Is it possible to use json4s 3.2.11 with Spark 1.3.0?
I figured out that Logging is a DeveloperApi and it should not be used outside Spark code, so everything is fine now. Thanks again, Marcelo.

On 24 Mar 2015, at 20:06, Marcelo Vanzin van...@cloudera.com wrote:

From the exception it seems like your app is also repackaging Scala classes somehow. Can you double check that and remove the Scala classes from your app if they're there?

On Mon, Mar 23, 2015 at 10:07 PM, Alexey Zinoviev alexey.zinov...@gmail.com wrote:

Thanks Marcelo, these options solved the problem (I'm using 1.3.0), but it works only if I remove "extends Logging" from the object; with "extends Logging" it returns:

Exception in thread "main" java.lang.LinkageError: loader constraint violation in interface itable initialization: when resolving method "App1$.logInfo(Lscala/Function0;Ljava/lang/Throwable;)V" the class loader (instance of org/apache/spark/util/ChildFirstURLClassLoader) of the current class, App1$, and the class loader (instance of sun/misc/Launcher$AppClassLoader) for interface org/apache/spark/Logging have different Class objects for the type scala/Function0 used in the signature
 at App1.main(App1.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)
 at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
 at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
 at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Do you have any idea what's wrong with Logging?

PS: I'm running it with spark-1.3.0/bin/spark-submit --class App1 --conf spark.driver.userClassPathFirst=true --conf spark.executor.userClassPathFirst=true $HOME/projects/sparkapp/target/scala-2.10/sparkapp-assembly-1.0.jar

Thanks, Alexey

On Tue, Mar 24, 2015 at 5:03 AM, Marcelo Vanzin van...@cloudera.com wrote:

You could build a fat jar for your application containing both your code and the json4s library, and then run Spark with these two options:

spark.driver.userClassPathFirst=true
spark.executor.userClassPathFirst=true

Both only work in 1.3. (1.2 has spark.files.userClassPathFirst, but that only works for executors.)

On Mon, Mar 23, 2015 at 2:12 PM, Alexey Zinoviev alexey.zinov...@gmail.com wrote:

Spark has a dependency on json4s 3.2.10, but this version has several bugs and I need to use 3.2.11. I added the json4s-native 3.2.11 dependency to build.sbt and everything compiled fine. But when I spark-submit my JAR it provides me with 3.2.10.
build.sbt:

import sbt.Keys._

name := "sparkapp"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"

libraryDependencies += "org.json4s" %% "json4s-native" % "3.2.11"

plugins.sbt:

logLevel := Level.Warn

resolvers += Resolver.url("artifactory", url("http://scalasbt.artifactoryonline.com/scalasbt/sbt-plugin-releases"))(Resolver.ivyStylePatterns)

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

App1.scala:

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.{Logging, SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object App1 extends Logging {
  def main(args: Array[String]) = {
    val conf = new SparkConf().setAppName("App1")
    val sc = new SparkContext(conf)
    println(s"json4s version: ${org.json4s.BuildInfo.version.toString}")
  }
}

sbt 0.13.7, sbt-assembly 0.13.0, Scala 2.10.4

Is it possible to force 3.2.11 version usage?

Thanks, Alexey

-- Marcelo
Re: Is it possible to use json4s 3.2.11 with Spark 1.3.0?
Thanks Ted, I'll try; I hope there are no transitive dependencies on 3.2.10.

On Tue, Mar 24, 2015 at 4:21 AM, Ted Yu yuzhih...@gmail.com wrote:

Looking at core/pom.xml:

<dependency>
  <groupId>org.json4s</groupId>
  <artifactId>json4s-jackson_${scala.binary.version}</artifactId>
  <version>3.2.10</version>
</dependency>

The version is hard coded. You can rebuild Spark 1.3.0 with json4s 3.2.11. Cheers

On Mon, Mar 23, 2015 at 2:12 PM, Alexey Zinoviev alexey.zinov...@gmail.com wrote:

Spark has a dependency on json4s 3.2.10, but this version has several bugs and I need to use 3.2.11. I added the json4s-native 3.2.11 dependency to build.sbt and everything compiled fine. But when I spark-submit my JAR it provides me with 3.2.10.

build.sbt:

import sbt.Keys._

name := "sparkapp"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"

libraryDependencies += "org.json4s" %% "json4s-native" % "3.2.11"

plugins.sbt:

logLevel := Level.Warn

resolvers += Resolver.url("artifactory", url("http://scalasbt.artifactoryonline.com/scalasbt/sbt-plugin-releases"))(Resolver.ivyStylePatterns)

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

App1.scala:

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.{Logging, SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object App1 extends Logging {
  def main(args: Array[String]) = {
    val conf = new SparkConf().setAppName("App1")
    val sc = new SparkContext(conf)
    println(s"json4s version: ${org.json4s.BuildInfo.version.toString}")
  }
}

sbt 0.13.7, sbt-assembly 0.13.0, Scala 2.10.4

Is it possible to force 3.2.11 version usage?

Thanks, Alexey
Re: Is it possible to use json4s 3.2.11 with Spark 1.3.0?
Thanks Marcelo, these options solved the problem (I'm using 1.3.0), but it works only if I remove "extends Logging" from the object; with "extends Logging" it returns:

Exception in thread "main" java.lang.LinkageError: loader constraint violation in interface itable initialization: when resolving method "App1$.logInfo(Lscala/Function0;Ljava/lang/Throwable;)V" the class loader (instance of org/apache/spark/util/ChildFirstURLClassLoader) of the current class, App1$, and the class loader (instance of sun/misc/Launcher$AppClassLoader) for interface org/apache/spark/Logging have different Class objects for the type scala/Function0 used in the signature
 at App1.main(App1.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)
 at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
 at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
 at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Do you have any idea what's wrong with Logging?

PS: I'm running it with spark-1.3.0/bin/spark-submit --class App1 --conf spark.driver.userClassPathFirst=true --conf spark.executor.userClassPathFirst=true $HOME/projects/sparkapp/target/scala-2.10/sparkapp-assembly-1.0.jar

Thanks, Alexey

On Tue, Mar 24, 2015 at 5:03 AM, Marcelo Vanzin van...@cloudera.com wrote:

You could build a fat jar for your application containing both your code and the json4s library, and then run Spark with these two options:

spark.driver.userClassPathFirst=true
spark.executor.userClassPathFirst=true

Both only work in 1.3. (1.2 has spark.files.userClassPathFirst, but that only works for executors.)

On Mon, Mar 23, 2015 at 2:12 PM, Alexey Zinoviev alexey.zinov...@gmail.com wrote:

Spark has a dependency on json4s 3.2.10, but this version has several bugs and I need to use 3.2.11. I added the json4s-native 3.2.11 dependency to build.sbt and everything compiled fine. But when I spark-submit my JAR it provides me with 3.2.10.

build.sbt:

import sbt.Keys._

name := "sparkapp"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"

libraryDependencies += "org.json4s" %% "json4s-native" % "3.2.11"

plugins.sbt:

logLevel := Level.Warn

resolvers += Resolver.url("artifactory", url("http://scalasbt.artifactoryonline.com/scalasbt/sbt-plugin-releases"))(Resolver.ivyStylePatterns)

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

App1.scala:

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.{Logging, SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object App1 extends Logging {
  def main(args: Array[String]) = {
    val conf = new SparkConf().setAppName("App1")
    val sc = new SparkContext(conf)
    println(s"json4s version: ${org.json4s.BuildInfo.version.toString}")
  }
}

sbt 0.13.7, sbt-assembly 0.13.0, Scala 2.10.4

Is it possible to force 3.2.11 version usage?

Thanks, Alexey

-- Marcelo
Is it possible to use json4s 3.2.11 with Spark 1.3.0?
Spark has a dependency on json4s 3.2.10, but this version has several bugs and I need to use 3.2.11. I added the json4s-native 3.2.11 dependency to build.sbt and everything compiled fine. But when I spark-submit my JAR it provides me with 3.2.10.

build.sbt:

import sbt.Keys._

name := "sparkapp"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"

libraryDependencies += "org.json4s" %% "json4s-native" % "3.2.11"

plugins.sbt:

logLevel := Level.Warn

resolvers += Resolver.url("artifactory", url("http://scalasbt.artifactoryonline.com/scalasbt/sbt-plugin-releases"))(Resolver.ivyStylePatterns)

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

App1.scala:

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.{Logging, SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object App1 extends Logging {
  def main(args: Array[String]) = {
    val conf = new SparkConf().setAppName("App1")
    val sc = new SparkContext(conf)
    println(s"json4s version: ${org.json4s.BuildInfo.version.toString}")
  }
}

sbt 0.13.7, sbt-assembly 0.13.0, Scala 2.10.4

Is it possible to force 3.2.11 version usage?

Thanks, Alexey
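Another workaround sometimes used for dependency conflicts like this one is shading json4s inside the application's fat jar. The sketch below shows what that might look like in build.sbt; it assumes a newer sbt-assembly than the 0.13.0 used in this thread (shading rules were added in the 0.14.x line), so treat it as an illustration rather than a drop-in fix.

// build.sbt (sketch): rename json4s packages inside the assembled jar so the
// application's 3.2.11 classes cannot clash with Spark's bundled 3.2.10.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.json4s.**" -> "shaded.json4s.@1").inAll
)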
Re: Mathematical functions in spark sql
I have tried select ceil(2/3), but got "key not found: floor".

On Tue, Jan 27, 2015 at 11:05 AM, Ted Yu yuzhih...@gmail.com wrote:

Have you tried the floor() or ceil() functions? According to http://spark.apache.org/sql/, Spark SQL is compatible with Hive SQL. Cheers

On Mon, Jan 26, 2015 at 8:29 PM, 1esha alexey.romanc...@gmail.com wrote:

Hello everyone! I try to execute select 2/3 and I get 0. Is there any way to cast double to int or something similar? Also it would be cool to get the list of functions supported by spark sql. Thanks!
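For reference, a small sketch of how the integer-division issue can be worked around from Scala by casting explicitly in the SQL string; it assumes a SQLContext (or HiveContext) named sqlContext, as was typical for Spark 1.2:

// Integer division truncates, so cast one operand to double first;
// wrap the result in another CAST if an integer is ultimately needed.
val result = sqlContext.sql("SELECT CAST(2 AS DOUBLE) / 3")   // 0.666... instead of 0
result.collect().foreach(println)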
Re: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down
Any ideas? Has anyone hit the same error?

On Mon, Dec 1, 2014 at 2:37 PM, Alexey Romanchuk alexey.romanc...@gmail.com wrote:

Hello spark users! I found lots of strange messages in the driver log. Here is one:

2014-12-01 11:54:23,849 [sparkDriver-akka.actor.default-dispatcher-25] ERROR akka.remote.EndpointWriter[akka://sparkDriver/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FsparkExecutor%40data1.hadoop%3A17372-5/endpointWriter] - AssociationError [akka.tcp://sparkDriver@10.54.87.173:55034] - [akka.tcp://sparkExecutor@data1.hadoop:17372]: Error [Shut down address: akka.tcp://sparkExecutor@data1.hadoop:17372] [ akka.remote.ShutDownAssociation: Shut down address: akka.tcp://sparkExecutor@data1.hadoop:17372 Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down. ]

I get this message twice for every worker: first for driverPropsFetcher and then for sparkExecutor. It looks like spark shuts down the remote akka system incorrectly, or there is some race condition in this process and the driver sent some data to the worker while the worker's actor system was already shutting down. Except for this message everything works fine. But this is an ERROR level message and it shows up in my ERROR-only log. Do you have any idea whether this is a configuration issue, a bug in spark or akka, or something else? Thanks!
akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down
Hello spark users! I found lots of strange messages in the driver log. Here is one:

2014-12-01 11:54:23,849 [sparkDriver-akka.actor.default-dispatcher-25] ERROR akka.remote.EndpointWriter[akka://sparkDriver/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2FsparkExecutor%40data1.hadoop%3A17372-5/endpointWriter] - AssociationError [akka.tcp://sparkDriver@10.54.87.173:55034] - [akka.tcp://sparkExecutor@data1.hadoop:17372]: Error [Shut down address: akka.tcp://sparkExecutor@data1.hadoop:17372] [ akka.remote.ShutDownAssociation: Shut down address: akka.tcp://sparkExecutor@data1.hadoop:17372 Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down. ]

I get this message twice for every worker: first for driverPropsFetcher and then for sparkExecutor. It looks like spark shuts down the remote akka system incorrectly, or there is some race condition in this process and the driver sent some data to the worker while the worker's actor system was already shutting down. Except for this message everything works fine. But this is an ERROR level message and it shows up in my ERROR-only log. Do you have any idea whether this is a configuration issue, a bug in spark or akka, or something else? Thanks!
Delayed hotspot optimizations in Spark
Hello spark users and developers! I am using hdfs + spark sql + hive schema + parquet as the storage format. I have a lot of parquet files - one file fits one hdfs block, per day. The strange thing is a very slow first query in spark sql. To reproduce the situation I use only one core, and I get 97 sec for the first query and only 13 sec for all subsequent ones. Sure, I query different data, but it has the same structure and size. The situation can be reproduced after restarting the thrift server. Here is information about the parquet file reads from a worker node:

Slow one:
Oct 10, 2014 2:26:53 PM INFO: parquet.hadoop.InternalParquetRecordReader: Assembled and processed 1560251 records from 30 columns in 11686 ms: 133.51454 rec/ms, 4005.4363 cell/ms

Fast one:
Oct 10, 2014 2:31:30 PM INFO: parquet.hadoop.InternalParquetRecordReader: Assembled and processed 1568899 records from 1 columns in 1373 ms: 1142.6796 rec/ms, 1142.6796 cell/ms

As you can see, the second read is 10x faster than the first. Most of the query time is spent working with the parquet file. This problem is really annoying, because most of my spark tasks contain just 1 sql query plus data processing, and to speed up my jobs I put a special warmup query in front of every job. My assumption is that it is hotspot optimizations that kick in during the first read. Do you have any idea how to confirm/solve this performance problem? Thanks for any advice! p.s. I have billions of hotspot optimizations shown with -XX:+PrintCompilation but cannot figure out which are important and which are not.
Re: Delayed hotspot optimizations in Spark
Hey Sean and spark users! Thanks for the reply. I tried -Xcomp just now and the start time was a few minutes (as expected), but the first query is as slow as before:

Oct 10, 2014 3:03:41 PM INFO: parquet.hadoop.InternalParquetRecordReader: Assembled and processed 1568899 records from 30 columns in 12897 ms: 121.64837 rec/ms, 3649.451 cell/ms

and the next one:

Oct 10, 2014 3:05:03 PM INFO: parquet.hadoop.InternalParquetRecordReader: Assembled and processed 1568899 records from 1 columns in 1757 ms: 892.94196 rec/ms, 892.94196 cell/ms

I have no idea about caching or other such causes because the CPU load is 100% on the worker and jstack shows that the worker is reading from the parquet file. Any ideas? Thanks!

On Fri, Oct 10, 2014 at 2:55 PM, Sean Owen so...@cloudera.com wrote:

You could try setting -Xcomp for executors to force JIT compilation upfront. I don't know if it's a good idea overall but it might show whether the upfront compilation really helps. I doubt it. However, isn't this almost surely due to caching somewhere, in Spark SQL or HDFS? I really doubt hotspot makes a difference compared to these much larger factors.

On Fri, Oct 10, 2014 at 8:49 AM, Alexey Romanchuk alexey.romanc...@gmail.com wrote:

Hello spark users and developers! I am using hdfs + spark sql + hive schema + parquet as the storage format. I have a lot of parquet files - one file fits one hdfs block, per day. The strange thing is a very slow first query in spark sql. To reproduce the situation I use only one core, and I get 97 sec for the first query and only 13 sec for all subsequent ones. Sure, I query different data, but it has the same structure and size. The situation can be reproduced after restarting the thrift server. Here is information about the parquet file reads from a worker node:

Slow one:
Oct 10, 2014 2:26:53 PM INFO: parquet.hadoop.InternalParquetRecordReader: Assembled and processed 1560251 records from 30 columns in 11686 ms: 133.51454 rec/ms, 4005.4363 cell/ms

Fast one:
Oct 10, 2014 2:31:30 PM INFO: parquet.hadoop.InternalParquetRecordReader: Assembled and processed 1568899 records from 1 columns in 1373 ms: 1142.6796 rec/ms, 1142.6796 cell/ms

As you can see, the second read is 10x faster than the first. Most of the query time is spent working with the parquet file. This problem is really annoying, because most of my spark tasks contain just 1 sql query plus data processing, and to speed up my jobs I put a special warmup query in front of every job. My assumption is that it is hotspot optimizations that kick in during the first read. Do you have any idea how to confirm/solve this performance problem? Thanks for any advice! p.s. I have billions of hotspot optimizations shown with -XX:+PrintCompilation but cannot figure out which are important and which are not.
Re: Log hdfs blocks sending
Hello Andrew! Thanks for the reply. Which logs, and at what level, should I check? Driver, master or worker? I found this on the master node, but there is only the ANY locality requirement. Here is the driver (spark sql) log - https://gist.github.com/13h3r/c91034307caa33139001 and one of the workers' logs - https://gist.github.com/13h3r/6e5053cf0dbe33f2 Do you have any idea where to look? Thanks!

On Fri, Sep 26, 2014 at 10:35 AM, Andrew Ash and...@andrewash.com wrote:

Hi Alexey, You should see in the logs a locality measure like NODE_LOCAL, PROCESS_LOCAL, ANY, etc. If your Spark workers each have an HDFS data node on them and you're reading out of HDFS, then you should be seeing almost all NODE_LOCAL accesses. One cause I've seen for mismatches is if Spark uses short hostnames and Hadoop uses FQDNs -- in that case Spark doesn't think the data is local and does remote reads, which really kills performance. Hope that helps! Andrew

On Thu, Sep 25, 2014 at 12:09 AM, Alexey Romanchuk alexey.romanc...@gmail.com wrote:

Hello again spark users and developers! I have a standalone spark cluster (1.1.0) and spark sql running on it. My cluster consists of 4 datanodes and the replication factor of files is 3. I use the thrift server to access spark sql and have 1 table with 30+ partitions. When I run a query on the whole table (something simple like select count(*) from t), spark produces a lot of network activity, filling all of the available 1gb link. It looks like spark sends data over the network instead of reading it locally. Is there any way to log which blocks were accessed locally and which were not? Thanks!
Log hdfs blocks sending
Hello again spark users and developers! I have a standalone spark cluster (1.1.0) and spark sql running on it. My cluster consists of 4 datanodes and the replication factor of files is 3. I use the thrift server to access spark sql and have 1 table with 30+ partitions. When I run a query on the whole table (something simple like select count(*) from t), spark produces a lot of network activity, filling all of the available 1gb link. It looks like spark sends data over the network instead of reading it locally. Is there any way to log which blocks were accessed locally and which were not? Thanks!