Re: question on SPARK_WORKER_CORES
I found in some previous CDH versions that Spark starts only one executor per Spark slave, and DECREASING the executor-cores in standalone makes the total # of executors go up. Just my 2¢.

On Fri, Feb 17, 2017 at 5:20 PM, kant kodali wrote:

> Hi Satish,
>
> I am using Spark 2.0.2. And no, I have not passed those variables because
> I didn't want to shoot in the dark. According to the documentation it looks
> like SPARK_WORKER_CORES is the one which should do it. If not, can you
> please explain how these variables interplay?
>
> --num-executors
> --executor-cores
> --total-executor-cores
> SPARK_WORKER_CORES
>
> Thanks!
>
> On Fri, Feb 17, 2017 at 5:13 PM, Satish Lalam wrote:
>
>> Have you tried passing --executor-cores or --total-executor-cores as
>> arguments, depending on the Spark version?
>>
>> *From:* kant kodali [mailto:kanth...@gmail.com]
>> *Sent:* Friday, February 17, 2017 5:03 PM
>> *To:* Alex Kozlov
>> *Cc:* user @spark
>> *Subject:* Re: question on SPARK_WORKER_CORES
>>
>> Standalone.
>>
>> On Fri, Feb 17, 2017 at 5:01 PM, Alex Kozlov wrote:
>>
>> What Spark mode are you running the program in?
>>
>> On Fri, Feb 17, 2017 at 4:55 PM, kant kodali wrote:
>>
>> When I submit a job using spark shell I get something like this:
>>
>> [Stage 0:>(36814 + 4) / 220129]
>>
>> Now all I want is to increase the number of parallel tasks running
>> from 4 to 16, so I exported an env variable called SPARK_WORKER_CORES=16 in
>> conf/spark-env.sh. I thought that should do it but it doesn't. It still
>> shows me 4. Any idea?
>>
>> Thanks much!

--
Alex Kozlov
(408) 507-4987
(650) 887-2135 efax
ale...@gmail.com
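The interplay of the settings asked about in this thread can be sketched with a little arithmetic. This is a simplified model under stated assumptions, not Spark's scheduler code: it assumes standalone mode, ignores executor memory, treats SPARK_WORKER_CORES as the cores each worker offers, --total-executor-cores as a per-application cap, and --executor-cores as the size of each executor. --num-executors is left out because, to my understanding, spark-submit lists it as a YARN-only option in Spark 2.0.x. The object and method names are made up for illustration.

```scala
// Simplified model of task slots in Spark standalone mode (memory limits ignored).
object StandaloneSlots {
  /** Cores one application can obtain: what the workers offer, capped by --total-executor-cores. */
  def grantedCores(workerCores: Seq[Int], totalExecutorCores: Option[Int]): Int = {
    val offered = workerCores.sum
    totalExecutorCores.fold(offered)(cap => math.min(cap, offered))
  }

  /** Executors launched when --executor-cores is set: each needs a full block of cores on one worker. */
  def executors(workerCores: Seq[Int], executorCores: Int, totalExecutorCores: Option[Int]): Int = {
    val fitOnWorkers = workerCores.map(_ / executorCores).sum
    val allowedByCap = grantedCores(workerCores, totalExecutorCores) / executorCores
    math.min(fitOnWorkers, allowedByCap)
  }

  /** Tasks that can run at once: one per executor core when spark.task.cpus is 1. */
  def concurrentTasks(executors: Int, executorCores: Int, taskCpus: Int = 1): Int =
    executors * (executorCores / taskCpus)
}
```

In this model, one worker offering 16 cores with 4 cores per executor yields 4 executors and 16 task slots; dropping to 2 cores per executor yields 8 executors, which matches the observation that decreasing executor-cores raises the executor count. A cap of 4 total executor cores would leave a single 4-core executor. Note also that a value exported in conf/spark-env.sh is only read when the worker process starts, so the worker has to be restarted for it to apply.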
Re: question on SPARK_WORKER_CORES
What Spark mode are you running the program in?

On Fri, Feb 17, 2017 at 4:55 PM, kant kodali wrote:

> When I submit a job using spark shell I get something like this:
>
> [Stage 0:>(36814 + 4) / 220129]
>
> Now all I want is to increase the number of parallel tasks running from
> 4 to 16, so I exported an env variable called SPARK_WORKER_CORES=16 in
> conf/spark-env.sh. I thought that should do it but it doesn't. It still
> shows me 4. Any idea?
>
> Thanks much!

--
Alex Kozlov
Re: Processing millions of messages in milliseconds -- Architecture guide required
This is too big a topic. For starters, what latency is acceptable between obtaining the data and having it available for analysis? Obviously, if this is < 5 minutes, you probably need a streaming solution. How fast do the "micro batches of seconds" need to be available for analysis? Can the problem be easily partitioned, and how flexible are you in the # of machines for your solution? Are you OK with availability fat tails?

Another question: how big is an individual message in bytes? XML/JSON are extremely inefficient, and with "10 mils of messages" you might hit other bottlenecks like network unless you convert them into something more machine-friendly like Protobuf, Avro or Thrift.

From the top, look at Kafka, Flume, Storm.

To "serve the historical data in milliseconds (may be upto 30 days of data)" you'll need to cache data in memory. The question, again, is how often the data changes. You might look into Lambda architectures.

On Mon, Apr 18, 2016 at 10:21 PM, Prashant Sharma wrote:

> Hello Deepak,
>
> It is not clear what you want to do. Are you talking about Spark Streaming?
> It is possible to process historical data in Spark batch mode too. You
> can add a timestamp field in xml/json. Spark documentation is at
> spark.apache.org. Spark has good inbuilt features to process json and
> xml[1] messages.
>
> Thanks,
> Prashant Sharma
>
> 1. https://github.com/databricks/spark-xml
>
> On Tue, Apr 19, 2016 at 10:31 AM, Deepak Sharma wrote:
>
>> Hi all,
>> I am looking for an architecture to ingest 10 mils of messages in the
>> micro batches of seconds.
>> If anyone has worked on a similar kind of architecture, can you please
>> point me to any documentation around the same, like what the architecture
>> should be, which components/big data ecosystem tools I should
>> consider, etc.
>> The messages have to be in xml/json format, with a preprocessor engine or
>> message enhancer and then finally a processor.
>> I thought about using a data cache as well for serving the data.
>> The data cache should have the capability to serve the historical data
>> in milliseconds (may be upto 30 days of data)
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com

--
Alex Kozlov
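To make the point about message size concrete, here is a minimal sketch comparing a JSON rendering of one record with a fixed-width binary layout. The `Reading` record and its fields are hypothetical, and the binary layout is hand-rolled with `java.nio.ByteBuffer` purely for illustration; it is not the Protobuf, Avro or Thrift wire format, which add schema handling and variable-length encodings on top of the same basic idea.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical message shape, for illustration only.
final case class Reading(sensorId: Int, timestampMs: Long, value: Double)

object MessageSize {
  /** Textual encoding: the field names travel with every single message. */
  def toJson(r: Reading): Array[Byte] =
    s"""{"sensorId":${r.sensorId},"timestampMs":${r.timestampMs},"value":${r.value}}""".getBytes(UTF_8)

  /** Fixed-width binary encoding: 4 + 8 + 8 = 20 bytes, no field names on the wire. */
  def toBinary(r: Reading): Array[Byte] =
    ByteBuffer.allocate(20).putInt(r.sensorId).putLong(r.timestampMs).putDouble(r.value).array()

  /** Reads the fields back in the same order they were written. */
  def fromBinary(bytes: Array[Byte]): Reading = {
    val buf = ByteBuffer.wrap(bytes)
    Reading(buf.getInt(), buf.getLong(), buf.getDouble())
  }
}
```

Multiplying the per-message difference by the expected message rate gives a rough idea of whether the network becomes the bottleneck before the processing engine does.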
Re: sparkR issues ?
Hi Roni, you can probably rename the as.data.frame in $SPARK_HOME/R/pkg/R/DataFrame.R and re-install SparkR by running install-dev.sh

On Tue, Mar 15, 2016 at 8:46 AM, roni wrote:

> Hi,
> Is there a workaround for this?
> Do I need to file a bug for this?
> Thanks
> -R
>
> On Tue, Mar 15, 2016 at 12:28 AM, Sun, Rui wrote:
>
>> It seems as.data.frame() defined in SparkR covers the versions in the R base
>> package.
>>
>> We can try to see if we can change the implementation of as.data.frame()
>> in SparkR to avoid such covering.
>>
>> *From:* Alex Kozlov [mailto:ale...@gmail.com]
>> *Sent:* Tuesday, March 15, 2016 2:59 PM
>> *To:* roni
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: sparkR issues ?
>>
>> This seems to be a very unfortunate name collision. SparkR defines its
>> own DataFrame class which shadows what seems to be your own definition.
>>
>> Is DataFrame something you define? Can you rename it?
>>
>> On Mon, Mar 14, 2016 at 10:44 PM, roni wrote:
>>
>> Hi,
>>
>> I am working with bioinformatics and trying to convert some scripts to
>> sparkR to fit into other spark jobs.
>>
>> I tried a simple example from a bioinf lib and as soon as I start the sparkR
>> environment it does not work.
>>
>> code as follows -
>>
>> countData <- matrix(1:100,ncol=4)
>> condition <- factor(c("A","A","B","B"))
>> dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), ~ condition)
>>
>> Works if I don't initialize the sparkR environment.
>>
>> If I do library(SparkR) and sqlContext <- sparkRSQL.init(sc) it gives the
>> following error:
>>
>> > dds <- DESeqDataSetFromMatrix(countData, as.data.frame(condition), ~ condition)
>> Error in DataFrame(colData, row.names = rownames(colData)) :
>>   cannot coerce class "data.frame" to a DataFrame
>>
>> I am really stumped. I am not using any Spark function, so I would
>> expect it to work as simple R code.
>>
>> Why does it not work?
>>
>> Appreciate the help
>> -R

--
Alex Kozlov
Re: sparkR issues ?
This seems to be a very unfortunate name collision. SparkR defines its own DataFrame class which shadows what seems to be your own definition.

Is DataFrame something you define? Can you rename it?

On Mon, Mar 14, 2016 at 10:44 PM, roni wrote:

> Hi,
> I am working with bioinformatics and trying to convert some scripts to
> sparkR to fit into other spark jobs.
>
> I tried a simple example from a bioinf lib and as soon as I start the sparkR
> environment it does not work.
>
> code as follows -
> countData <- matrix(1:100,ncol=4)
> condition <- factor(c("A","A","B","B"))
> dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), ~ condition)
>
> Works if I don't initialize the sparkR environment.
> If I do library(SparkR) and sqlContext <- sparkRSQL.init(sc) it gives the
> following error:
>
> > dds <- DESeqDataSetFromMatrix(countData, as.data.frame(condition), ~ condition)
> Error in DataFrame(colData, row.names = rownames(colData)) :
>   cannot coerce class "data.frame" to a DataFrame
>
> I am really stumped. I am not using any Spark function, so I would expect
> it to work as simple R code.
> Why does it not work?
>
> Appreciate the help
> -R

--
Alex Kozlov
Re: Spark on RAID
Parallel disk IO? But the effect should be less noticeable compared to Hadoop, which reads/writes a lot. Much depends on how often Spark persists to disk. It depends on the specifics of the RAID controller as well. If you write to HDFS as opposed to the local file system, this may be a big factor as well.

On Tue, Mar 8, 2016 at 8:34 AM, Eddie Esquivel wrote:

> Hello All,
> In the Spark documentation under "Hardware Requirements" it very clearly
> states:
>
> We recommend having *4-8 disks* per node, configured *without* RAID (just
> as separate mount points)
>
> My question is why not RAID? What is the argument/reason for not using
> RAID?
>
> Thanks!
> -Eddie

--
Alex Kozlov
Re: How to query a hive table from inside a map in Spark
While this is possible via JDBC calls, it is not the best practice: you should probably use broadcast variables <http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables> instead.

On Sun, Feb 14, 2016 at 8:40 PM, SRK wrote:

> Hi,
>
> Is it possible to query a Hive table, which has data stored in the form of a
> Parquet file, from inside map/partitions in Spark? My requirement is that I
> have a User table in Hive/HDFS and, for each record inside a sessions RDD, I
> should be able to query the User table and, if the User table already has a
> record for that userId, query the record and do further processing.
>
> Thanks!

--
Alex Kozlov
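A minimal sketch of the broadcast approach suggested above, assuming the User table is small enough to be collected on the driver and held in memory on every executor. The `User` and `Session` case classes, the object name and the sample data are hypothetical, since the thread does not show the real schemas; `SparkContext.broadcast` and `Broadcast.value` are the actual Spark calls.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical record shapes: the thread does not show the real schemas.
final case class User(userId: String, name: String)
final case class Session(userId: String, page: String)

object BroadcastLookup {
  /** Pairs each session with its user, dropping sessions whose userId is unknown. */
  def knownSessions(sc: SparkContext, users: Map[String, User], sessions: RDD[Session]): RDD[(Session, User)] = {
    // Ship the lookup table to each executor once instead of querying it per record.
    val usersBc = sc.broadcast(users)
    sessions.flatMap(s => usersBc.value.get(s.userId).map(u => (s, u)))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-lookup").setMaster("local[*]"))
    try {
      // Stand-in data; in practice the map would be built once on the driver from the User table.
      val users = Map("u1" -> User("u1", "Ann"))
      val sessions = sc.parallelize(Seq(Session("u1", "/home"), Session("u2", "/cart")))
      knownSessions(sc, users, sessions).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

This only makes sense while the lookup table fits comfortably in executor memory; for a large User table a regular join between the two datasets is the usual alternative.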
Re: Best practises of share Spark cluster over few applications
Praveen, the mode in which you run Spark (standalone, YARN, Mesos) is determined when you create the SparkContext <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext>. You are right that spark-submit and spark-shell create different SparkContexts.

In general, resource sharing is the task of the cluster scheduler, and there are only 3 choices by default right now.

On Sun, Feb 14, 2016 at 9:13 PM, praveen S wrote:

> Even I was trying to launch Spark jobs from a web service.
>
> But I thought you could run Spark jobs in YARN mode only through
> spark-submit. Is my understanding not correct?
>
> Regards,
> Praveen
> On 15 Feb 2016 08:29, "Sabarish Sasidharan" <sabarish.sasidha...@manthan.com> wrote:
>
>> Yes, you can look at using the capacity scheduler or the fair scheduler
>> with YARN. Both allow using the full cluster when idle. And both allow
>> considering CPU plus memory when allocating resources, which is sort of
>> necessary with Spark.
>>
>> Regards
>> Sab
>> On 13-Feb-2016 10:11 pm, "Eugene Morozov" wrote:
>>
>>> Hi,
>>>
>>> I have several instances of the same web service that run some ML
>>> algos on Spark (both training and prediction) and do some Spark-unrelated
>>> work. Each web-service instance creates its own JavaSparkContext, thus
>>> they're seen as separate applications by Spark, thus they're configured
>>> with separate limits of resources such as cores (I'm not concerned about
>>> the memory as much as about cores).
>>>
>>> With this setup, say 3 web-service instances, each of them has just 1/3
>>> of the cores. But it might happen that only one instance is going to use
>>> Spark, while the others are busy with Spark-unrelated work. I'd like in this
>>> case all Spark cores to be available for the one that's in need.
>>>
>>> Ideally I'd like Spark cores to just be available in total, and the first
>>> app that needs them takes as much as required from what is available at the
>>> moment. Is it possible? I believe Mesos is able to set resources free if
>>> they're not in use. Is it possible with YARN?
>>>
>>> I'd appreciate it if you could share your thoughts or experience on the
>>> subject.
>>>
>>> Thanks.
>>> --
>>> Be well!
>>> Jean Morozov

--
Alex Kozlov
Re: How can I disable logging when running local[*]?
Hmm, clearly the parameter is not passed to the program. This should be an activator issue. I wonder how you specify the other parameters, like driver memory, num cores, etc.? Just out of curiosity, can you run a program:

import org.apache.spark.SparkConf
val out=new SparkConf(true).get("spark.driver.extraJavaOptions")

in your env and see what the output is?

Also, make sure spark-defaults.conf is on your classpath.

On Tue, Oct 6, 2015 at 11:19 AM, Jeff Jones wrote:

> Here’s an example. I echoed JAVA_OPTS so that you can see what I’ve got.
> Then I call ‘activator run’ in the project directory.
>
> jjones-mac:analyzer-perf jjones$ echo $JAVA_OPTS
> -Xmx4g -Xmx4g -Dlog4j.configuration=file:/Users/jjones/src/adaptive/adaptiveobjects/analyzer-perf/conf/log4j.properties
> jjones-mac:analyzer-perf jjones$ activator run
> [info] Loading project definition from /Users/jjones/src/adaptive/adaptiveobjects/analyzer-perf/project
> [info] Set current project to analyzer-perf (in build file:/Users/jjones/src/adaptive/adaptiveobjects/analyzer-perf/)
> [info] Running com.adaptive.analyzer.perf.AnalyzerPerf
> 11:15:24.066 [run-main-0] INFO org.apache.spark.SparkContext - Running Spark version 1.4.1
> 11:15:24.150 [run-main-0] DEBUG o.a.h.m.lib.MutableMetricsFactory - field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with annotation @org.apache.hadoop.metrics2.annotation.Metric(about=, always=false, sampleName=Ops, type=DEFAULT, value=[Rate of successful kerberos logins and latency (milliseconds)], valueName=Time)
> 11:15:24.156 [run-main-0] DEBUG o.a.h.m.lib.MutableMetricsFactory - field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with annotation @org.apache.hadoop.metrics2.annotation.Metric(about=, always=false, sampleName=Ops, type=DEFAULT, value=[Rate of failed kerberos logins and latency (milliseconds)], valueName=Time)
>
> As I mentioned below, but repeated here for completeness, I also have this in
> my code.
>
> import org.apache.log4j.PropertyConfigurator
>
> PropertyConfigurator.configure("conf/log4j.properties")
> Logger.getRootLogger().setLevel(Level.OFF)
> Logger.getLogger("org").setLevel(Level.OFF)
> Logger.getLogger("akka").setLevel(Level.OFF)
>
> And here’s my log4j.properties (note, I’ve also tried setting the level to
> OFF):
>
> # Set everything to be logged to the console
> log4j.rootCategory=WARN
> log4j.appender.console=org.apache.log4j.ConsoleAppender
> log4j.appender.console.layout=org.apache.log4j.PatternLayout
> log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
>
> # Change this to set Spark log level
> log4j.logger.org.apache.spark=WARN
>
> # Silence akka remoting
> log4j.logger.Remoting=WARN
>
> # Ignore messages below warning level from Jetty, because it's a bit verbose
> log4j.logger.org.eclipse.jetty=WARN
>
> spark.log.threshold=OFF
> spark.root.logger=OFF,DRFA
>
> From: Alex Kozlov
> Date: Tuesday, October 6, 2015 at 10:50 AM
> To: Jeff Jones
> Cc: "user@spark.apache.org"
> Subject: Re: How can I disable logging when running local[*]?
>
> Try
>
> JAVA_OPTS='-Dlog4j.configuration=file:/'
>
> Internally, this is just spark.driver.extraJavaOptions, which you should
> be able to set in conf/spark-defaults.conf
>
> Can you provide more details on how you invoke the driver?
>
> On Tue, Oct 6, 2015 at 9:48 AM, Jeff Jones wrote:
>
>> Thanks. Any chance you know how to pass this to a Scala app that is run
>> via TypeSafe activator?
>>
>> I tried putting it in $JAVA_OPTS but I get:
>>
>> Unrecognized option: --driver-java-options
>> Error: Could not create the Java Virtual Machine.
>> Error: A fatal exception has occurred. Program will exit.
>>
>> I tried a bunch of different quoting but nothing produced a good result.
>> I also tried passing it directly to activator using –jvm but it still
>> produces the same results with verbose logging. Is there a way I can tell
>> if it’s picking up my file?
>>
>> From: Alex Kozlov
>> Date: Monday, October 5, 2015 at 8:34 PM
>> To: Jeff Jones
>> Cc: "user@spark.apache.org"
>> Subject: Re: How can I disable logging when running local[*]?
>>
>> Did you try “--driver-java-options
>> '-Dlog4j.configuration=file:/'” and setting the
>> log4j.rootLog
Re: How can I disable logging when running local[*]?
Try

JAVA_OPTS='-Dlog4j.configuration=file:/'

Internally, this is just spark.driver.extraJavaOptions, which you should be able to set in conf/spark-defaults.conf

Can you provide more details on how you invoke the driver?

On Tue, Oct 6, 2015 at 9:48 AM, Jeff Jones wrote:

> Thanks. Any chance you know how to pass this to a Scala app that is run
> via TypeSafe activator?
>
> I tried putting it in $JAVA_OPTS but I get:
>
> Unrecognized option: --driver-java-options
> Error: Could not create the Java Virtual Machine.
> Error: A fatal exception has occurred. Program will exit.
>
> I tried a bunch of different quoting but nothing produced a good result. I
> also tried passing it directly to activator using –jvm but it still
> produces the same results with verbose logging. Is there a way I can tell
> if it’s picking up my file?
>
> From: Alex Kozlov
> Date: Monday, October 5, 2015 at 8:34 PM
> To: Jeff Jones
> Cc: "user@spark.apache.org"
> Subject: Re: How can I disable logging when running local[*]?
>
> Did you try “--driver-java-options
> '-Dlog4j.configuration=file:/'” and setting the
> log4j.rootLogger=FATAL,console?
>
> On Mon, Oct 5, 2015 at 8:19 PM, Jeff Jones wrote:
>
>> I’ve written an application that hosts the Spark driver in-process using
>> “local[*]”. I’ve turned off logging in my conf/log4j.properties file. I’ve
>> also tried putting the following code prior to creating my SparkContext.
>> These were cobbled together from various posts I’ve read. None of these steps
>> have worked. I’m still getting a ton of logging to the console. Anything
>> else I can try?
>>
>> Thanks,
>> Jeff
>>
>> private def disableLogging(): Unit = {
>>   import org.apache.log4j.PropertyConfigurator
>>
>>   PropertyConfigurator.configure("conf/log4j.properties")
>>   Logger.getRootLogger().setLevel(Level.OFF)
>>   Logger.getLogger("org").setLevel(Level.OFF)
>>   Logger.getLogger("akka").setLevel(Level.OFF)
>> }
>>
>> This message (and any attachments) is intended only for the designated
>> recipient(s). It may contain confidential or proprietary information, or
>> have other limitations on use as indicated by the sender. If you are not a
>> designated recipient, you may not review, use, copy or distribute this
>> message. If you received this in error, please notify the sender by
>> reply e-mail and delete this message.
>
> --
> Alex Kozlov
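One way to answer the question of whether the option is being picked up is to print what the driver JVM actually received. The sketch below only reads the standard system property `log4j.configuration`; the object and method names are made up for illustration, and it says nothing about whether log4j later managed to parse the file. As far as I know, log4j 1.x also prints where it loads its configuration from when the JVM is started with `-Dlog4j.debug`.

```scala
object Log4jConfigCheck {
  /** The -Dlog4j.configuration value visible to this JVM, if the option reached it. */
  def configuredLocation(props: collection.Map[String, String] = sys.props): Option[String] =
    props.get("log4j.configuration")

  def main(args: Array[String]): Unit =
    println(configuredLocation().getOrElse("log4j.configuration is not set in this JVM"))
}
```

If this prints that the property is not set when started through the same launcher as the application, the option is being dropped before it reaches the JVM, which points at the launcher rather than at the properties file.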
Re: How can I disable logging when running local[*]?
Did you try “--driver-java-options '-Dlog4j.configuration=file:/'” and setting the log4j.rootLogger=FATAL,console?

On Mon, Oct 5, 2015 at 8:19 PM, Jeff Jones wrote:

> I’ve written an application that hosts the Spark driver in-process using
> “local[*]”. I’ve turned off logging in my conf/log4j.properties file. I’ve
> also tried putting the following code prior to creating my SparkContext.
> These were cobbled together from various posts I’ve read. None of these steps
> have worked. I’m still getting a ton of logging to the console. Anything
> else I can try?
>
> Thanks,
> Jeff
>
> private def disableLogging(): Unit = {
>   import org.apache.log4j.PropertyConfigurator
>
>   PropertyConfigurator.configure("conf/log4j.properties")
>   Logger.getRootLogger().setLevel(Level.OFF)
>   Logger.getLogger("org").setLevel(Level.OFF)
>   Logger.getLogger("akka").setLevel(Level.OFF)
> }

--
Alex Kozlov
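For reference, a self-contained variant of the programmatic approach quoted above, with the `Logger` and `Level` imports that the snippet in the thread leaves out. It is a sketch that assumes log4j 1.2 is on the classpath and is the backend that actually receives the log calls; if SLF4J is bound to a different backend, these calls have no effect on the console output. The object name is hypothetical.

```scala
import org.apache.log4j.{Level, Logger}

object QuietLogging {
  /** Turns off the root logger and the noisiest log4j 1.x loggers; call before creating the SparkContext. */
  def disable(): Unit = {
    Logger.getRootLogger().setLevel(Level.OFF)
    Seq("org", "akka").foreach(name => Logger.getLogger(name).setLevel(Level.OFF))
  }
}
```

Calling `QuietLogging.disable()` as the first statement of the application keeps the change in one place and avoids depending on a relative path to a properties file.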
Re: Parquet Array Support Broken?
Thank you - it works if the file is created in Spark

On Mon, Sep 7, 2015 at 3:06 PM, Ruslan Dautkhanov wrote:

> Read the response from Cheng Lian on Aug/27th - it
> looks like the same problem.
>
> Workarounds
> 1. write that parquet file in Spark;
> 2. upgrade to Spark 1.5.
>
> --
> Ruslan Dautkhanov
>
> On Mon, Sep 7, 2015 at 3:52 PM, Alex Kozlov wrote:
>
>> No, it was created in Hive by CTAS, but any help is appreciated...
>>
>> On Mon, Sep 7, 2015 at 2:51 PM, Ruslan Dautkhanov wrote:
>>
>>> That parquet table wasn't created in Spark, was it?
>>>
>>> There was a recent discussion on this list that complex data types in
>>> Spark prior to 1.5 are often incompatible with Hive, for example, if I
>>> remember correctly.
>>> On Mon, Sep 7, 2015, 2:57 PM Alex Kozlov wrote:
>>>
>>>> I am trying to read an (array typed) parquet file in spark-shell (Spark
>>>> 1.4.1 with Hadoop 2.6):
>>>>
>>>> {code}
>>>> $ bin/spark-shell
>>>> log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
>>>> log4j:WARN Please initialize the log4j system properly.
>>>> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
>>>> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
>>>> 15/09/07 13:45:22 INFO SecurityManager: Changing view acls to: hivedata
>>>> 15/09/07 13:45:22 INFO SecurityManager: Changing modify acls to: hivedata
>>>> 15/09/07 13:45:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hivedata); users with modify permissions: Set(hivedata)
>>>> 15/09/07 13:45:23 INFO HttpServer: Starting HTTP Server
>>>> 15/09/07 13:45:23 INFO Utils: Successfully started service 'HTTP class server' on port 43731.
>>>> Welcome to >>>> __ >>>> / __/__ ___ _/ /__ >>>> _\ \/ _ \/ _ `/ __/ '_/ >>>>/___/ .__/\_,_/_/ /_/\_\ version 1.4.1 >>>> /_/ >>>> >>>> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java >>>> 1.8.0) >>>> Type in expressions to have them evaluated. >>>> Type :help for more information. >>>> 15/09/07 13:45:26 INFO SparkContext: Running Spark version 1.4.1 >>>> 15/09/07 13:45:26 INFO SecurityManager: Changing view acls to: hivedata >>>> 15/09/07 13:45:26 INFO SecurityManager: Changing modify acls to: >>>> hivedata >>>> 15/09/07 13:45:26 INFO SecurityManager: SecurityManager: authentication >>>> disabled; ui acls disabled; users with view permissions: Set(hivedata); >>>> users with modify permissions: Set(hivedata) >>>> 15/09/07 13:45:27 INFO Slf4jLogger: Slf4jLogger started >>>> 15/09/07 13:45:27 INFO Remoting: Starting remoting >>>> 15/09/07 13:45:27 INFO Remoting: Remoting started; listening on >>>> addresses :[akka.tcp://sparkDriver@10.10.30.52:46083] >>>> 15/09/07 13:45:27 INFO Utils: Successfully started service >>>> 'sparkDriver' on port 46083. >>>> 15/09/07 13:45:27 INFO SparkEnv: Registering MapOutputTracker >>>> 15/09/07 13:45:27 INFO SparkEnv: Registering BlockManagerMaster >>>> 15/09/07 13:45:27 INFO DiskBlockManager: Created local directory at >>>> /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/blockmgr-bd1b8498-9f6a-47c4-ae59-8800563f97d0 >>>> 15/09/07 13:45:27 INFO MemoryStore: MemoryStore started with capacity >>>> 265.1 MB >>>> 15/09/07 13:45:27 INFO HttpFileServer: HTTP File server directory is >>>> /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/httpd-3fbe0c9d-c0c5-41ef-bf72-4f0ef59bfa21 >>>> 15/09/07 13:45:27 INFO HttpServer: Starting HTTP Server >>>> 15/09/07 13:45:27 INFO Utils: Successfully started service 'HTTP file >>>> server' on port 38717. >>>> 15/09/07 13:45:27 INFO SparkEnv: Registering OutputCommitCoordinator >>>> 15/09/07 13:45:27 WARN Utils: Service 'SparkUI' could not bind on port >>>> 4040. Attempting port 4041. 
>>>> 15/09/07 13:45:27 INFO Utils: Successfully started service 'SparkUI' on >>>> port 4041. >>>> 15/09/07 13:45:27 INFO SparkUI: Started SparkUI at >>>> http://10.10.30.52:404
Re: Parquet Array Support Broken?
No, it was created in Hive by CTAS, but any help is appreciated...

On Mon, Sep 7, 2015 at 2:51 PM, Ruslan Dautkhanov wrote:

> That parquet table wasn't created in Spark, was it?
>
> There was a recent discussion on this list that complex data types in
> Spark prior to 1.5 are often incompatible with Hive, for example, if I
> remember correctly.
> On Mon, Sep 7, 2015, 2:57 PM Alex Kozlov wrote:
>
>> I am trying to read an (array typed) parquet file in spark-shell (Spark
>> 1.4.1 with Hadoop 2.6):
>>
>> {code}
>> $ bin/spark-shell
>> log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
>> log4j:WARN Please initialize the log4j system properly.
>> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
>> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
>> 15/09/07 13:45:22 INFO SecurityManager: Changing view acls to: hivedata
>> 15/09/07 13:45:22 INFO SecurityManager: Changing modify acls to: hivedata
>> 15/09/07 13:45:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hivedata); users with modify permissions: Set(hivedata)
>> 15/09/07 13:45:23 INFO HttpServer: Starting HTTP Server
>> 15/09/07 13:45:23 INFO Utils: Successfully started service 'HTTP class server' on port 43731.
>> Welcome to
>>       ____              __
>>      / __/__  ___ _____/ /__
>>     _\ \/ _ \/ _ `/ __/  '_/
>>    /___/ .__/\_,_/_/ /_/\_\   version 1.4.1
>>       /_/
>>
>> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0)
>> Type in expressions to have them evaluated.
>> Type :help for more information.
Re: Parquet Array Support Broken?
The same error if I do:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val results = sqlContext.sql("SELECT * FROM stats")

but it does work from Hive shell directly...

On Mon, Sep 7, 2015 at 1:56 PM, Alex Kozlov wrote:

> I am trying to read an (array typed) parquet file in spark-shell (Spark
> 1.4.1 with Hadoop 2.6):
>
> {code}
> $ bin/spark-shell
> log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> 15/09/07 13:45:22 INFO SecurityManager: Changing view acls to: hivedata
> 15/09/07 13:45:22 INFO SecurityManager: Changing modify acls to: hivedata
> 15/09/07 13:45:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hivedata); users with modify permissions: Set(hivedata)
> 15/09/07 13:45:23 INFO HttpServer: Starting HTTP Server
> 15/09/07 13:45:23 INFO Utils: Successfully started service 'HTTP class server' on port 43731.
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 1.4.1
>       /_/
>
> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0)
> Type in expressions to have them evaluated.
> Type :help for more information.
Parquet Array Support Broken?
I am trying to read an (array typed) parquet file in spark-shell (Spark 1.4.1 with Hadoop 2.6):

{code}
$ bin/spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/09/07 13:45:22 INFO SecurityManager: Changing view acls to: hivedata
15/09/07 13:45:22 INFO SecurityManager: Changing modify acls to: hivedata
15/09/07 13:45:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hivedata); users with modify permissions: Set(hivedata)
15/09/07 13:45:23 INFO HttpServer: Starting HTTP Server
15/09/07 13:45:23 INFO Utils: Successfully started service 'HTTP class server' on port 43731.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0)
Type in expressions to have them evaluated.
Type :help for more information.
15/09/07 13:45:26 INFO SparkContext: Running Spark version 1.4.1
15/09/07 13:45:26 INFO SecurityManager: Changing view acls to: hivedata
15/09/07 13:45:26 INFO SecurityManager: Changing modify acls to: hivedata
15/09/07 13:45:26 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hivedata); users with modify permissions: Set(hivedata)
15/09/07 13:45:27 INFO Slf4jLogger: Slf4jLogger started
15/09/07 13:45:27 INFO Remoting: Starting remoting
15/09/07 13:45:27 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@10.10.30.52:46083]
15/09/07 13:45:27 INFO Utils: Successfully started service 'sparkDriver' on port 46083.
15/09/07 13:45:27 INFO SparkEnv: Registering MapOutputTracker
15/09/07 13:45:27 INFO SparkEnv: Registering BlockManagerMaster
15/09/07 13:45:27 INFO DiskBlockManager: Created local directory at /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/blockmgr-bd1b8498-9f6a-47c4-ae59-8800563f97d0
15/09/07 13:45:27 INFO MemoryStore: MemoryStore started with capacity 265.1 MB
15/09/07 13:45:27 INFO HttpFileServer: HTTP File server directory is /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/httpd-3fbe0c9d-c0c5-41ef-bf72-4f0ef59bfa21
15/09/07 13:45:27 INFO HttpServer: Starting HTTP Server
15/09/07 13:45:27 INFO Utils: Successfully started service 'HTTP file server' on port 38717.
15/09/07 13:45:27 INFO SparkEnv: Registering OutputCommitCoordinator
15/09/07 13:45:27 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
15/09/07 13:45:27 INFO Utils: Successfully started service 'SparkUI' on port 4041.
15/09/07 13:45:27 INFO SparkUI: Started SparkUI at http://10.10.30.52:4041
15/09/07 13:45:27 INFO Executor: Starting executor ID driver on host localhost
15/09/07 13:45:27 INFO Executor: Using REPL class URI: http://10.10.30.52:43731
15/09/07 13:45:27 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 60973.
15/09/07 13:45:27 INFO NettyBlockTransferService: Server created on 60973
15/09/07 13:45:27 INFO BlockManagerMaster: Trying to register BlockManager
15/09/07 13:45:27 INFO BlockManagerMasterEndpoint: Registering block manager localhost:60973 with 265.1 MB RAM, BlockManagerId(driver, localhost, 60973)
15/09/07 13:45:27 INFO BlockManagerMaster: Registered BlockManager
15/09/07 13:45:28 INFO SparkILoop: Created spark context..
Spark context available as sc.
15/09/07 13:45:28 INFO HiveContext: Initializing execution hive, version 0.13.1
15/09/07 13:45:28 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/09/07 13:45:29 INFO ObjectStore: ObjectStore, initialize called
15/09/07 13:45:29 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/09/07 13:45:29 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/09/07 13:45:29 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/09/07 13:45:29 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/09/07 13:45:36 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
15/09/07 13:45:36 INFO MetaStoreDirectSql: MySQL check failed, assuming we are not on mysql: Lexical error at line 1, column 5. Encountered: "@" (64), after : "".
15/09/07 13:45:37 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
15/09/07 13:45:37 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model