Re: spark sql - reading data from sql tables having space in column names
You can use backticks to quote the column names. Cheng On 6/3/15 2:49 AM, David Mitchell wrote: I am having the same problem reading JSON. There does not seem to be a way of selecting a field that has a space, such as 'Executor Info' from the Spark logs. I suggest that we open a JIRA ticket to address this issue. On Jun 2, 2015 10:08 AM, ayan guha guha.a...@gmail.com wrote: I would think the easiest way would be to create a view in the DB with column names that have no spaces. In fact, you can pass a SQL query in place of a real table. From the documentation: The JDBC table that should be read. Note that anything that is valid in a `FROM` clause of a SQL query can be used. For example, instead of a full table you could also use a subquery in parentheses. Kindly let the community know if this works. On Tue, Jun 2, 2015 at 6:43 PM, Sachin Goyal sachin.go...@jabong.com wrote: Hi, We are using Spark SQL (1.3.1) to load data from Microsoft SQL Server using JDBC (as described in https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases). It is working fine except when there is a space in column names (we can't modify the schemas to remove the spaces as it is a legacy database). Sqoop is able to handle such scenarios by enclosing column names in '[ ]' - the method recommended by Microsoft SQL Server. (https://github.com/apache/sqoop/blob/trunk/src/java/org/apache/sqoop/manager/SQLServerManager.java - line no 319) Is there a way to handle this in Spark SQL? Thanks, sachin -- Best Regards, Ayan Guha
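For illustration, a minimal sketch of both suggestions against Spark 1.3.x; the table, column names, and connection URL below are invented:

// Backticks quote a column name that contains a space (e.g. a JSON/log field):
sqlContext.sql("SELECT `Executor Info` FROM logs")   // assumes a table registered as "logs"

// For JDBC, anything valid in a FROM clause can be passed as "dbtable", so the
// spaced column can be aliased away inside a SQL Server subquery using [ ]:
val jdbcDF = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:sqlserver://dbhost:1433;databaseName=legacy",
  "dbtable" -> "(SELECT [Order Id] AS order_id FROM dbo.Orders) AS t"))
jdbcDF.printSchema()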
Re: Problem reading Parquet from 1.2 to 1.3
Thanks Cheng, we have a workaround in place for Spark 1.3 (remove .metadata directory), good to know it will be resolved in 1.4. -Don On Sun, Jun 7, 2015 at 8:51 AM, Cheng Lian lian.cs@gmail.com wrote: This issue has been fixed recently in Spark 1.4 https://github.com/apache/spark/pull/6581 Cheng On 6/5/15 12:38 AM, Marcelo Vanzin wrote: I talked to Don outside the list and he says that he's seeing this issue with Apache Spark 1.3 too (not just CDH Spark), so it seems like there is a real issue here. On Wed, Jun 3, 2015 at 1:39 PM, Don Drake dondr...@gmail.com wrote: As part of upgrading a cluster from CDH 5.3.x to CDH 5.4.x I noticed that Spark is behaving differently when reading Parquet directories that contain a .metadata directory. It seems that in spark 1.2.x, it would just ignore the .metadata directory, but now that I'm using Spark 1.3, reading these files causes the following exceptions: scala val d = sqlContext.parquetFile(/user/ddrak/parq_dir) SLF4J: Failed to load class org.slf4j.impl.StaticLoggerBinder. SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. scala.collection.parallel.CompositeThrowable: Multiple exceptions thrown during a parallel computation: java.lang.RuntimeException: hdfs://nameservice1/user/ddrak/parq_dir/.metadata/schema.avsc is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [116, 34, 10, 125] parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427) parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398) org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276) org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275) scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658) scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54) scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53) scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53) scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56) scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650) . . . java.lang.RuntimeException: hdfs://nameservice1/user/ddrak/parq_dir/.metadata/schemas/1.avsc is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [116, 34, 10, 125] parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427) parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398) org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276) org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275) scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658) scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54) scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53) scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53) scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56) scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650) . . . java.lang.RuntimeException: hdfs://nameservice1/user/ddrak/parq_dir/.metadata/descriptor.properties is not a Parquet file. 
expected magic number at tail [80, 65, 82, 49] but found [117, 101, 116, 10] parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427) parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398) org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276) org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275) scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658) scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54) scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53) scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53) scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56) scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650) . . . at scala.collection.parallel.package$$anon$1.alongWith(package.scala:87) at scala.collection.parallel.Task$class.mergeThrowables(Tasks.scala:86) at scala.collection.parallel.mutable.ParArray$Map.mergeThrowables(ParArray.scala:650) at scala.collection.parallel.Task$class.tryMerge(Tasks.scala:72) at scala.collection.parallel.mutable.ParArray$Map.tryMerge(ParArray.scala:650) at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:190) at
Re: Not understanding manually building EC2 cluster
- Remove localhost from the conf/slaves file, add the slaves private ip. - Make sure master and slave machines are on the same security group (this way all ports will be accessible to all machines) - In conf/spark-env.sh file, place export SPARK_MASTER_IP=MASTER-NODES-PUBLIC-OR-PRIVATE-IP and remove SPARK_LOCAL_IP These changes should get you started with spark cluster. If not, look in the logs file for more detailed information. bjameshunter wrote Hi, I've tried a half a dozen times to build a spark cluster on EC2, without using the ec2 scripts or EMR. I'd like to eventually get an IPython notebook server running on the master, and the ec2 scripts and EMR don't seem accommodating for that. I build an Ubuntu-spark-ipython machine. I setup the Ipython server. Ipython and spark work together. I make an image of the machine, spin up two of them. I can ssh between the original (master) and two new slaves without password (put master id_rsa.pub on slaves). I add slave public IPs to $SPARK_HOME/conf/slaves, underneath localhost I execute $SPARK_HOME/sbin/start-all.sh Slaves and master start. The master GUI shows only one slave - itself. --- Here's where, I think the documentation ends for me and I start trying random stuff. setting SPARK_MASTER_IP to the EC2 public IP on all machines. Setting SPARK_LOCAL_IP to 127.0.01. Changing hostname on master to the public IP, changing it to my domain name prefixed with spark-master and routing it through my digital ocean account, etc. All combinations of the above steps have been tried, and then some. Any clue what I don't understand here? Thanks, Ben https://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Not-understanding-manually-building-EC2-cluster-tp23194p23195.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Monitoring Spark Jobs
Hi Sam, Have a look at Sematext's SPM for your Spark monitoring needs. If the problem is CPU, IO, Network, etc. as Akhil mentioned, you'll see that in SPM, too. As for the number of jobs running, you can see a chart with that at http://sematext.com/spm/integrations/spark-monitoring.html Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/ On Sun, Jun 7, 2015 at 6:37 AM, SamyaMaiti samya.maiti2...@gmail.com wrote: Hi All, I have a Spark SQL application to fetch data from Hive, on top I have an Akka layer to run multiple queries in parallel. *Please suggest a mechanism, so as to figure out the number of Spark jobs running in the cluster at a given instant of time. * I need to do the above as, I see the average response time increasing with increase in number of requests, in spite of increasing the number of cores in the cluster. I suspect there is a bottleneck somewhere else. Regards, Sam -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Monitoring-Spark-Jobs-tp23193.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Running SparkSql against Hive tables
On 6/6/15 9:06 AM, James Pirz wrote: I am pretty new to Spark, and using Spark 1.3.1, I am trying to use 'Spark SQL' to run some SQL scripts, on the cluster. I realized that for a better performance, it is a good idea to use Parquet files. I have 2 questions regarding that: 1) If I wanna use Spark SQL against *partitioned bucketed* tables with Parquet format in Hive, does the provided spark binary on the apache website support that or do I need to build a new spark binary with some additional flags ? (I found a note https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables in the documentation about enabling Hive support, but I could not fully get it as what the correct way of building is, if I need to build) Yes, Hive support is enabled by default now for the binaries on the website. However, currently Spark SQL doesn't support buckets yet. 2) Does running Spark SQL against tables in Hive downgrade the performance, and it is better that I load parquet files directly to HDFS or having Hive in the picture is harmless ? If you're using Parquet, then it should be fine since by default Spark SQL uses its own native Parquet support to read Parquet Hive tables. Thnx
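For reference, querying a Hive-managed Parquet table with the pre-built binaries only needs a HiveContext; a minimal sketch with a made-up database and table name:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// Partition pruning applies to the partitioned Hive table, and the Parquet files
// themselves are read through Spark SQL's native Parquet support.
val df = hiveContext.sql("SELECT col_a, col_b FROM mydb.events WHERE year = 2015")
df.show()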
Re: hiveContext.sql NullPointerException
Hi, How can I expect to work on HiveContext on the executor? If only the driver can see HiveContext, does it mean I have to collect all datasets (very large) to the driver and use HiveContext there? It will be memory overload on the driver and fail. BR, Patcharee On 07. juni 2015 11:51, Cheng Lian wrote: Hi, This is expected behavior. HiveContext.sql (and also DataFrame.registerTempTable) is only expected to be invoked on driver side. However, the closure passed to RDD.foreach is executed on executor side, where no viable HiveContext instance exists. Cheng On 6/7/15 10:06 AM, patcharee wrote: Hi, I try to insert data into a partitioned hive table. The groupByKey is to combine dataset into a partition of the hive table. After the groupByKey, I converted the iterable[X] to DB by X.toList.toDF(). But the hiveContext.sql throws NullPointerException, see below. Any suggestions? What could be wrong? Thanks! val varWHeightFlatRDD = varWHeightRDD.flatMap(FlatMapUtilClass().flatKeyFromWrf).groupByKey() .foreach( x = { val zone = x._1._1 val z = x._1._2 val year = x._1._3 val month = x._1._4 val df_table_4dim = x._2.toList.toDF() df_table_4dim.registerTempTable(table_4Dim) hiveContext.sql(INSERT OVERWRITE table 4dim partition (zone= + zone + ,z= + z + ,year= + year + ,month= + month + ) + select date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim); }) java.lang.NullPointerException at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:100) at no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:113) at no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:103) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798) at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Caching parquet table (with GZIP) on Spark 1.3.1
Is it possible that some Parquet files of this data set have different schema as others? Especially those ones reported in the exception messages. One way to confirm this is to use [parquet-tools] [1] to inspect these files: $ parquet-schema path-to-file Cheng [1]: https://github.com/apache/parquet-mr/tree/master/parquet-tools On 5/26/15 3:26 PM, shsh...@tsmc.com wrote: we tried to cache table through hiveCtx = HiveContext(sc) hiveCtx.cacheTable(table name) as described on Spark 1.3.1's document and we're on CDH5.3.0 with Spark 1.3.1 built with Hadoop 2.6 following error message would occur if we tried to cache table with parquet format GZIP though we're not sure if this error message has anything to do with the table format since we can execute SQLs on the exact same table, we just hope to use cachTable so that it might speed-up a little bit since we're querying on this table for several times. Any advise is welcomed! Thanks! 15/05/26 15:21:32 WARN scheduler.TaskSetManager: Lost task 227.0 in stage 0.0 (TID 278, f14ecats037): parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://f14ecat/tmp/tchart_0501_final/part-r-1198.parquet at parquet.hadoop.InternalParquetRecordReader.nextKeyValue (InternalParquetRecordReader.java:213) at parquet.hadoop.ParquetRecordReader.nextKeyValue (ParquetRecordReader.java:204) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext (NewHadoopRDD.scala:143) at org.apache.spark.InterruptibleIterator.hasNext (InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon $1.hasNext(InMemoryColumnarTableScan.scala:153) at org.apache.spark.storage.MemoryStore.unrollSafely (MemoryStore.scala:248) at org.apache.spark.CacheManager.putInBlockManager (CacheManager.scala:172) at org.apache.spark.CacheManager.getOrCompute (CacheManager.scala:79) at org.apache.spark.rdd.RDD.iterator(RDD.scala:242) at org.apache.spark.rdd.MapPartitionsRDD.compute (MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute (MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute (MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute (MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute (MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ShuffleMapTask.runTask (ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask (ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run (Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: parquet.io.ParquetDecodingException: The requested schema is not compatible with the file 
schema. incompatible types: optional binary dcqv_val (UTF8) != optional double dcqv_val at parquet.io.ColumnIOFactory $ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:105) at parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit (ColumnIOFactory.java:97) at parquet.schema.PrimitiveType.accept(PrimitiveType.java:386) at parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren (ColumnIOFactory.java:87) at parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit (ColumnIOFactory.java:61) at parquet.schema.MessageType.accept(MessageType.java:55) at parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:148) at parquet.hadoop.InternalParquetRecordReader.checkRead (InternalParquetRecordReader.java:125) at parquet.hadoop.InternalParquetRecordReader.nextKeyValue (InternalParquetRecordReader.java:193) ... 31 more 15/05/26 15:21:32 INFO scheduler.TaskSetManager: Starting task 74.2 in stage 0.0 (TID 377, f14ecats025, NODE_LOCAL, 2153 bytes) 15/05/26 15:21:32 INFO scheduler.TaskSetManager: Lost task 56.2 in stage 0.0
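Besides parquet-tools, the same check can be done from the Spark shell by loading suspect files individually and comparing their printed schemas (the second path below is invented):

sqlContext.parquetFile("hdfs://f14ecat/tmp/tchart_0501_final/part-r-1198.parquet").printSchema()
sqlContext.parquetFile("hdfs://f14ecat/tmp/tchart_0501_final/part-r-2.parquet").printSchema()
// If one file reports dcqv_val as binary (UTF8) and another as double, the files were
// written with different schemas, which matches the exception above.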
Monitoring Spark Jobs
Hi All, I have a Spark SQL application to fetch data from Hive, on top I have a akka layer to run multiple Queries in parallel. *Please suggest a mechanism, so as to figure out the number of spark jobs running in the cluster at a given instance of time. * I need to do the above as, I see the average response time increasing with increase in number of requests, in-spite of increasing the number of cores in the cluster. I suspect there is a bottleneck somewhere else. Regards, Sam -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Monitoring-Spark-Jobs-tp23193.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Avro or Parquet ?
Usually Parquet can be more efficient because of its columnar nature. Say your table has 10 columns but your join query only touches 3 of them: Parquet only reads those 3 columns from disk, while Avro must load all the data. Cheng On 6/5/15 3:00 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: We currently have data in Avro format and we do joins between Avro and sequence file data. Will storing these datasets in Parquet make joins any faster? The dataset sizes are between 500 and 1000 GB. -- Deepak - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
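A rough sketch of what that pruning looks like in practice (paths and column names are invented):

val users = sqlContext.parquetFile("hdfs:///data/users.parquet")
val events = sqlContext.parquetFile("hdfs:///data/events.parquet")
// Only user_id, country and event_type are read from disk; every other column
// of both tables is skipped entirely thanks to the columnar layout.
users.join(events, users("user_id") === events("user_id"))
  .select(users("country"), events("event_type"))
  .count()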
Re: hiveContext.sql NullPointerException
Spark SQL supports Hive dynamic partitioning, so one possible workaround is to create a Hive table partitioned by zone, z, year, and month dynamically, and then insert the whole dataset into it directly. In 1.4, we also provides dynamic partitioning support for non-Hive environment, and you can do something like this: df.write.partitionBy(zone, z, year, month).format(parquet).mode(overwrite).saveAsTable(tbl) Cheng On 6/7/15 9:48 PM, patcharee wrote: Hi, How can I expect to work on HiveContext on the executor? If only the driver can see HiveContext, does it mean I have to collect all datasets (very large) to the driver and use HiveContext there? It will be memory overload on the driver and fail. BR, Patcharee On 07. juni 2015 11:51, Cheng Lian wrote: Hi, This is expected behavior. HiveContext.sql (and also DataFrame.registerTempTable) is only expected to be invoked on driver side. However, the closure passed to RDD.foreach is executed on executor side, where no viable HiveContext instance exists. Cheng On 6/7/15 10:06 AM, patcharee wrote: Hi, I try to insert data into a partitioned hive table. The groupByKey is to combine dataset into a partition of the hive table. After the groupByKey, I converted the iterable[X] to DB by X.toList.toDF(). But the hiveContext.sql throws NullPointerException, see below. Any suggestions? What could be wrong? Thanks! val varWHeightFlatRDD = varWHeightRDD.flatMap(FlatMapUtilClass().flatKeyFromWrf).groupByKey() .foreach( x = { val zone = x._1._1 val z = x._1._2 val year = x._1._3 val month = x._1._4 val df_table_4dim = x._2.toList.toDF() df_table_4dim.registerTempTable(table_4Dim) hiveContext.sql(INSERT OVERWRITE table 4dim partition (zone= + zone + ,z= + z + ,year= + year + ,month= + month + ) + select date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim); }) java.lang.NullPointerException at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:100) at no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:113) at no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:103) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798) at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
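With the string quoting restored (and a placeholder table name), that 1.4-style write would read roughly:

df.write
  .partitionBy("zone", "z", "year", "month")
  .format("parquet")
  .mode("overwrite")
  .saveAsTable("tbl")   // "tbl" is a placeholder table name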
Re: Spark ML decision list
What is a decision list? An in-order traversal (or some other traversal) of a fitted decision tree? On Jun 5, 2015 1:21 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote: Is there an existing way in SparkML to convert a decision tree to a decision list? On Thu, Jun 4, 2015 at 10:50 PM, Reza Zadeh r...@databricks.com wrote: The closest algorithm to decision lists that we have is decision trees https://spark.apache.org/docs/latest/mllib-decision-tree.html On Thu, Jun 4, 2015 at 2:14 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote: Hi, I have used the Weka machine learning library for generating a model for my training set. I have used the PART algorithm (decision lists) from Weka. Now, I would like to use Spark ML for the PART algorithm on my training set and could not seem to find a parallel. Could anyone point out the corresponding algorithm, or even whether it's available in Spark ML? Thanks, Sateesh
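For reference, a minimal MLlib decision-tree call (the training RDD and parameters are placeholders); its toDebugString output, a set of nested if/else predicates, is the closest thing to a rule list Spark currently produces:

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.regression.LabeledPoint

// training: RDD[LabeledPoint], prepared elsewhere
val model = DecisionTree.trainClassifier(training, numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](), impurity = "gini",
  maxDepth = 5, maxBins = 32)
println(model.toDebugString)   // prints the fitted tree as nested if/else rules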
Re: Does Apache Spark maintain a columnar structure when creating RDDs from Parquet or ORC files?
For the following code: val df = sqlContext.parquetFile(path) `df` remains columnar (actually it just reads from the columnar Parquet file on disk). For the following code: val cdf = df.cache() `cdf` is also columnar but that's different from Parquet. When a DataFrame is cached, Spark SQL turns it into a private in-memory columnar format. So for your last question, the answer is: yes. Cheng On 6/3/15 10:58 PM, lonikar wrote: When spark reads parquet files (sqlContext.parquetFile), it creates a DataFrame RDD. I would like to know if the resulting DataFrame has columnar structure (many rows of a column coalesced together in memory) or its a row wise structure that a spark RDD has. The section Spark SQL and DataFrames http://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory says you need to call sqlContext.cacheTable(tableName) or df.cache() to make it columnar. What exactly is this columnar structure? To be precise: What does the row represent in the expression df.cache().map{row = ...}? Is it a logical row which maintains an array of columns and each column in turn is an array of values for batchSize rows? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-Apache-Spark-maintain-a-columnar-structure-when-creating-RDDs-from-Parquet-or-ORC-files-tp23139.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
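As a quick illustration (the path is hypothetical):

val df = sqlContext.parquetFile("hdfs:///data/table.parquet")   // columnar Parquet on disk
val cdf = df.cache()   // will be stored in Spark SQL's in-memory columnar format
cdf.count()            // the first action actually materializes the cached column batches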
Re: SparkSQL: How to specify replication factor on the persisted parquet files?
Were you using HiveContext.setConf()? dfs.replication is a Hadoop configuration, but setConf() is only used to set Spark SQL specific configurations. You can set it in your Hadoop core-site.xml instead. Cheng On 6/2/15 2:28 PM, Haopu Wang wrote: Hi, I'm trying to save a Spark SQL DataFrame to a persistent Hive table using the default parquet data source. I don't know how to change the replication factor of the generated parquet files on HDFS. I tried to set dfs.replication on HiveContext but that didn't work. Any suggestions are appreciated very much! - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
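If the change should only apply to a single application, a possible alternative (a sketch, not verified against this exact setup) is to set the property on the SparkContext's Hadoop configuration before writing:

// Files written through this context's Hadoop configuration get 2 replicas.
sc.hadoopConfiguration.set("dfs.replication", "2")
df.saveAsTable("my_table")   // hypothetical DataFrame and table name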
Re: NullPointerException SQLConf.setConf
Are you calling hiveContext.sql within an RDD.map closure or something similar? In this way, the call actually happens on executor side. However, HiveContext only exists on the driver side. Cheng On 6/4/15 3:45 PM, patcharee wrote: Hi, I am using Hive 0.14 and spark 0.13. I got java.lang.NullPointerException when inserted into hive. Any suggestions please. hiveContext.sql(INSERT OVERWRITE table 4dim partition (zone= + ZONE + ,z= + zz + ,year= + YEAR + ,month= + MONTH + ) + select date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim where z= + zz); java.lang.NullPointerException at org.apache.spark.sql.SQLConf.setConf(SQLConf.scala:196) at org.apache.spark.sql.SQLContext.setConf(SQLContext.scala:74) at org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:251) at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:250) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:95) at no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$main$3$$anonfun$apply$1.apply(LoadWrfIntoHiveOptReduce1.scala:110) at no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$main$3$$anonfun$apply$1.apply(LoadWrfIntoHiveOptReduce1.scala:107) at scala.collection.immutable.Range.foreach(Range.scala:141) at no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$main$3.apply(LoadWrfIntoHiveOptReduce1.scala:107) at no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$main$3.apply(LoadWrfIntoHiveOptReduce1.scala:107) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:806) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:806) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Best, Patcharee - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Monitoring Spark Jobs
It could be a CPU, IO, or network bottleneck; you need to figure out where exactly it's choking. You can use monitoring utilities (like top) to understand it better. Thanks Best Regards On Sun, Jun 7, 2015 at 4:07 PM, SamyaMaiti samya.maiti2...@gmail.com wrote: Hi All, I have a Spark SQL application to fetch data from Hive, on top I have an Akka layer to run multiple queries in parallel. *Please suggest a mechanism, so as to figure out the number of Spark jobs running in the cluster at a given instant of time. * I need to do the above as, I see the average response time increasing with increase in number of requests, in spite of increasing the number of cores in the cluster. I suspect there is a bottleneck somewhere else. Regards, Sam -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Monitoring-Spark-Jobs-tp23193.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Problem reading Parquet from 1.2 to 1.3
This issue has been fixed recently in Spark 1.4 https://github.com/apache/spark/pull/6581 Cheng On 6/5/15 12:38 AM, Marcelo Vanzin wrote: I talked to Don outside the list and he says that he's seeing this issue with Apache Spark 1.3 too (not just CDH Spark), so it seems like there is a real issue here. On Wed, Jun 3, 2015 at 1:39 PM, Don Drake dondr...@gmail.com mailto:dondr...@gmail.com wrote: As part of upgrading a cluster from CDH 5.3.x to CDH 5.4.x I noticed that Spark is behaving differently when reading Parquet directories that contain a .metadata directory. It seems that in spark 1.2.x, it would just ignore the .metadata directory, but now that I'm using Spark 1.3, reading these files causes the following exceptions: scala val d = sqlContext.parquetFile(/user/ddrak/parq_dir) SLF4J: Failed to load class org.slf4j.impl.StaticLoggerBinder. SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. scala.collection.parallel.CompositeThrowable: Multiple exceptions thrown during a parallel computation: java.lang.RuntimeException: hdfs://nameservice1/user/ddrak/parq_dir/.metadata/schema.avsc is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [116, 34, 10, 125] parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427) parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398) org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276) org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275) scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658) scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54) scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53) scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53) scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56) scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650) . . . java.lang.RuntimeException: hdfs://nameservice1/user/ddrak/parq_dir/.metadata/schemas/1.avsc is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [116, 34, 10, 125] parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427) parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398) org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276) org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275) scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658) scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54) scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53) scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53) scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56) scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650) . . . java.lang.RuntimeException: hdfs://nameservice1/user/ddrak/parq_dir/.metadata/descriptor.properties is not a Parquet file. 
expected magic number at tail [80, 65, 82, 49] but found [117, 101, 116, 10] parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:427) parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:398) org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:276) org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$5.apply(newParquet.scala:275) scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658) scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54) scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53) scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53) scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56) scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650) . . . at scala.collection.parallel.package$$anon$1.alongWith(package.scala:87) at scala.collection.parallel.Task$class.mergeThrowables(Tasks.scala:86) at scala.collection.parallel.mutable.ParArray$Map.mergeThrowables(ParArray.scala:650) at scala.collection.parallel.Task$class.tryMerge(Tasks.scala:72) at scala.collection.parallel.mutable.ParArray$Map.tryMerge(ParArray.scala:650) at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:190) at
Re: columnar structure of RDDs from Parquet or ORC files
Interesting, just posted on another thread asking exactly the same question :) My answer there quoted below: For the following code: val df = sqlContext.parquetFile(path) `df` remains columnar (actually it just reads from the columnar Parquet file on disk). For the following code: val cdf = df.cache() `cdf` is also columnar but that's different from Parquet. When a DataFrame is cached, Spark SQL turns it into a private in-memory columnar format. So for your last question, the answer is: yes. Some more details about the in-memory columnar structure: it's columnar, but much simpler than the one Parquet uses. The columnar byte arrays are split into batches with a fixed row count (configured by spark.sql.inMemoryColumnarStorage.batchSize). Also, each column is compressed with a compression scheme chose according to the data type and statistics information of that column. Supported compression schemes include RLE, DeltaInt, DeltaLong, BooleanBitSet, and DictionaryEncoding. You may find the implementation here: https://github.com/apache/spark/tree/master/sql/core/src/main/scala/org/apache/spark/sql/columnar Cheng On 6/3/15 10:40 PM, kiran lonikar wrote: When spark reads parquet files (sqlContext.parquetFile), it creates a DataFrame RDD. I would like to know if the resulting DataFrame has columnar structure (many rows of a column coalesced together in memory) or its a row wise structure that a spark RDD has. The section Spark SQL and DataFrames http://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory says you need to call sqlContext.cacheTable(tableName) or df.cache() to make it columnar. What exactly is this columnar structure? To be precise: What does the row represent in the expression df.cache().map{row = ...}? Is it a logical row which maintains an array of columns and each column in turn is an array of values for batchSize rows? -Kiran
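For example, the batch size and the compression toggle are exposed as SQL configuration properties; the values below are only illustrative:

// 20000 rows per in-memory columnar batch instead of the default 10000.
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "20000")
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
val cached = sqlContext.parquetFile("hdfs:///data/table.parquet").cache()   // hypothetical path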
Re: Spark Streaming Stuck After 10mins Issue...
What is the code used to set up the kafka stream? On Sat, Jun 6, 2015 at 3:23 PM, EH eas...@gmail.com wrote: And here is the Thread Dump, where seems every worker is waiting for Executor #6 Thread 95: sparkExecutor-akka.actor.default-dispatcher-22 (RUNNABLE) to be complete: Thread 41: BLOCK_MANAGER cleanup timer (WAITING) Thread 42: BROADCAST_VARS cleanup timer (WAITING) Thread 44: shuffle-client-0 (RUNNABLE) Thread 45: shuffle-server-0 (RUNNABLE) Thread 47: Driver Heartbeater (TIMED_WAITING) Thread 48: Executor task launch worker-0 (RUNNABLE) Thread 56: threadDeathWatcher-2-1 (TIMED_WAITING) Thread 81: sparkExecutor-akka.actor.default-dispatcher-18 (WAITING) Thread 95: sparkExecutor-akka.actor.default-dispatcher-22 (RUNNABLE) ** sun.management.ThreadImpl.dumpThreads0(Native Method) sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:446) org.apache.spark.util.Utils$.getThreadDump(Utils.scala:1777) org.apache.spark.executor.ExecutorActor$$anonfun$receiveWithLogging$1.applyOrElse(ExecutorActor.scala:38) scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:53) org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42) scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42) akka.actor.Actor$class.aroundReceive(Actor.scala:465) org.apache.spark.executor.ExecutorActor.aroundReceive(ExecutorActor.scala:34) akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) akka.actor.ActorCell.invoke(ActorCell.scala:487) akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) akka.dispatch.Mailbox.run(Mailbox.scala:220) akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) ** Thread 112: sparkExecutor-akka.actor.default-dispatcher-25 (WAITING) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Stuck-After-10mins-Issue-tp23189p23190.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Accumulator map
Another approach would be to use a zookeeper. If you have zookeeper running somewhere in the cluster you can simply create a path like */dynamic-list* in it and then write objects/values to it, you can even create/access nested objects. Thanks Best Regards On Fri, Jun 5, 2015 at 7:06 PM, Cosmin Cătălin Sanda cosmincata...@gmail.com wrote: Hi, I am trying to gather some statistics from an RDD using accumulators. Essentially, I am counting how many times specific segments appear in each row of the RDD. This works fine and I get the expected results, but the problem is that each time I add a new segment to look for, I have to explicitly create an Accumulator for it and explicitly use the Accumulator in the foreach method. Is there a way to use a dynamic list of Accumulators in Spark? I want to control the segments from a single place in the code and the accumulators to be dynamically created and used based on the metrics list. BR, *Cosmin Catalin SANDA* Software Systems Engineer Phone: +45.27.30.60.35
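If staying inside Spark's own API is preferable, one more option is a single map-valued Accumulable, so adding a segment never requires declaring another accumulator. A sketch with made-up names (lines is an RDD[String], segmentsOfInterest a driver-side Seq[String]):

import org.apache.spark.AccumulableParam
import scala.collection.mutable.{Map => MMap}

// Merges per-task maps of segment -> count into one result on the driver.
object SegmentCountParam extends AccumulableParam[MMap[String, Long], String] {
  def zero(init: MMap[String, Long]) = MMap.empty[String, Long]
  def addAccumulator(acc: MMap[String, Long], seg: String) = {
    acc(seg) = acc.getOrElse(seg, 0L) + 1L; acc
  }
  def addInPlace(a: MMap[String, Long], b: MMap[String, Long]) = {
    b.foreach { case (k, v) => a(k) = a.getOrElse(k, 0L) + v }; a
  }
}

val segmentCounts = sc.accumulable(MMap.empty[String, Long])(SegmentCountParam)
lines.foreach { line =>
  segmentsOfInterest.filter(s => line.contains(s)).foreach(segmentCounts += _)
}
println(segmentCounts.value)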
Re: Spark Streaming Stuck After 10mins Issue...
Which consumer are you using? If you can paste the complete code then may be i can try reproducing it. Thanks Best Regards On Sun, Jun 7, 2015 at 1:53 AM, EH eas...@gmail.com wrote: And here is the Thread Dump, where seems every worker is waiting for Executor #6 Thread 95: sparkExecutor-akka.actor.default-dispatcher-22 (RUNNABLE) to be complete: Thread 41: BLOCK_MANAGER cleanup timer (WAITING) Thread 42: BROADCAST_VARS cleanup timer (WAITING) Thread 44: shuffle-client-0 (RUNNABLE) Thread 45: shuffle-server-0 (RUNNABLE) Thread 47: Driver Heartbeater (TIMED_WAITING) Thread 48: Executor task launch worker-0 (RUNNABLE) Thread 56: threadDeathWatcher-2-1 (TIMED_WAITING) Thread 81: sparkExecutor-akka.actor.default-dispatcher-18 (WAITING) Thread 95: sparkExecutor-akka.actor.default-dispatcher-22 (RUNNABLE) ** sun.management.ThreadImpl.dumpThreads0(Native Method) sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:446) org.apache.spark.util.Utils$.getThreadDump(Utils.scala:1777) org.apache.spark.executor.ExecutorActor$$anonfun$receiveWithLogging$1.applyOrElse(ExecutorActor.scala:38) scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:53) org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42) scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42) akka.actor.Actor$class.aroundReceive(Actor.scala:465) org.apache.spark.executor.ExecutorActor.aroundReceive(ExecutorActor.scala:34) akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) akka.actor.ActorCell.invoke(ActorCell.scala:487) akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) akka.dispatch.Mailbox.run(Mailbox.scala:220) akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) ** Thread 112: sparkExecutor-akka.actor.default-dispatcher-25 (WAITING) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Stuck-After-10mins-Issue-tp23189p23190.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Driver crash at the end with InvocationTargetException when running SparkPi
Hi spark users: After I submitted a SparkPi job to spark, the driver crashed at the end of the job with the following log: WARN EventLoggingListener: Event log dir file:/d:/data/SparkWorker/work/driver-20150607200517-0002/logs/event does not exists, will newly create one. Exception in thread main java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:59) at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala) Caused by: java.lang.NullPointerException at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010) at org.apache.hadoop.util.Shell.runCommand(Shell.java:445) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650) at org.apache.hadoop.util.Shell.execCommand(Shell.java:739) at org.apache.hadoop.util.Shell.execCommand(Shell.java:722) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633) at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:467) at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:135) at org.apache.spark.SparkContext.init(SparkContext.scala:401) at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:28) at org.apache.spark.examples.SparkPi.main(SparkPi.scala) From the log, I can see that the driver has added jars from HDFS, connected to master, scheduled executors and all the executors were running. And then this error occurred. The command I use to submit job(I'm running spark 1.3.1 with standalone mode on windows): ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master spark://localhost:7077 \ --deploy-mode cluster Hdfs://localhost:443/spark-examples-1.3.1-hadoop2.4.0.jar \ 1000 Any ideas about the error? I've found a similar error in JIRA https://issues.apache.org/jira/browse/SPARK-1407 but It only occurred at FileLogger when using yarn and eventlog set to HDFS. In my case, I use standalone mode and event log set to local, and my error is caused by Hadoop.util.Shell.runCommand. Best Regards Dong Lei
FlatMap in DataFrame
Hi, I'm trying to write a custom transformer in Spark ML and since that uses DataFrames, am trying to use flatMap function in DataFrame class in Java. Can you share a simple example of how to use the flatMap function to do word count on single column of the DataFrame. Thanks. Dimple -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/FlatMap-in-DataFrame-tp23199.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
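A minimal sketch in Scala (a Java version would typically go through df.javaRDD().flatMap(...) with a FlatMapFunction); the DataFrame df and its "text" column are hypothetical:

// DataFrame.flatMap hands each Row to the function and returns a plain RDD.
val wordCounts = df.select("text")
  .flatMap(row => row.getString(0).split("\\s+"))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)
wordCounts.collect().foreach(println)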
Re: columnar structure of RDDs from Parquet or ORC files
Thanks for replying twice :) I think I sent this question by email and somehow thought I had not sent it, hence created the other one on the web interface. Let's retain this thread since you have provided more details here. Great, it confirms my intuition about DataFrame. It's similar to Shark's columnar layout, with the addition of compression. There it used Java NIO's ByteBuffer to hold the actual data. I will go through the code you pointed to. I have another question about DataFrame: The RDD operations are divided into two groups: *transformations*, which are lazily evaluated and return a new RDD, and *actions*, which evaluate the lineage defined by the transformations and return results. What about DataFrame operations like join, groupBy, agg, unionAll etc., which are all transformations on an RDD? Are they lazily evaluated or immediately executed?
Examples of flatMap in dataFrame
Hi, I'm trying to write a custom transformer in Spark ML and since that uses DataFrames, am trying to use flatMap function in DataFrame class in Java. Can you share a simple example of how to use the flatMap function to do word count on single column of the DataFrame. Thanks Dimple
Optimization module in Python mllib
Am I right in thinking that Python mllib does not contain the optimization module? Are there plans to add this to the Python api? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Optimization-module-in-Python-mllib-tp23191.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: hiveContext.sql NullPointerException
Hi, This is expected behavior. HiveContext.sql (and also DataFrame.registerTempTable) is only expected to be invoked on driver side. However, the closure passed to RDD.foreach is executed on executor side, where no viable HiveContext instance exists. Cheng On 6/7/15 10:06 AM, patcharee wrote: Hi, I try to insert data into a partitioned hive table. The groupByKey is to combine dataset into a partition of the hive table. After the groupByKey, I converted the iterable[X] to DB by X.toList.toDF(). But the hiveContext.sql throws NullPointerException, see below. Any suggestions? What could be wrong? Thanks! val varWHeightFlatRDD = varWHeightRDD.flatMap(FlatMapUtilClass().flatKeyFromWrf).groupByKey() .foreach( x = { val zone = x._1._1 val z = x._1._2 val year = x._1._3 val month = x._1._4 val df_table_4dim = x._2.toList.toDF() df_table_4dim.registerTempTable(table_4Dim) hiveContext.sql(INSERT OVERWRITE table 4dim partition (zone= + zone + ,z= + z + ,year= + year + ,month= + month + ) + select date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim); }) java.lang.NullPointerException at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:100) at no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:113) at no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:103) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798) at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org